pydantic / pydantic-core

Core validation logic for pydantic written in rust
MIT License

Performance questions? #35

Closed samuelcolvin closed 2 years ago

samuelcolvin commented 2 years ago

I'm keen to "run onto the spike" and find any big potential performance improvements in pydantic-core while the API can be changed easily.

I'd therefore love anyone with experience of rust and/or pyo3 to have a look through the code and see if I'm doing anything dumb.

Particular concerns:

I'll add to this list if anything else comes to me.

More generally I wonder if there are performance improvements that I'm not even aware of? "What you don't know, you can't optimise"

@pauleveritt @robcxyz

samuelcolvin commented 2 years ago

I tried doing some performance profiling, see

https://github.com/samuelcolvin/pydantic-core/blob/7c9058e641549e816614a097148019c11feae95e/benchmarks/minimal.py#L3-L15

But I didn't find anything particularly interesting.

Stranger6667 commented 2 years ago

Hi :) First of all, thank you for Pydantic!

Here are some ideas about allocations:

truncate - as it allocates in format!, but it could be avoided:

use std::fmt::Write;

macro_rules! truncate {
    ($out: expr, $value: expr) => {
        if $value.len() > 50 {
            // NB: byte-indexed slicing panics if it splits a multi-byte UTF-8
            // character; real code should slice on char boundaries.
            write!($out, "{}...{}", &$value[0..25], &$value[$value.len() - 24..]).unwrap();
        } else {
            $out.push_str(&$value);
        }
    };
}

Similarly, some push_str calls together with &format! could be replaced by write! (clippy should warn about it from Rust 1.61 btw):

output.push_str(&format!(", input_type={}", type_));

to

write!(output, ", input_type={}", type_);

There are a few more to_string calls that could be avoided:

let loc = self
    .location
    .iter()
    .map(|i| i.to_string())
    .collect::<Vec<String>>()
    .join(" -> ");

to:

let mut first = true;
for item in &self.location {
    if !first {
        output.push_str(" -> ");
    }
    first = false;
    match item {
        LocItem::S(s) => output.push_str(s),
        LocItem::I(i) => {
            output.push_str(itoa::Buffer::new().format(*i))
        },
    }
}

Requires the itoa crate though - it would be helpful in a few more places (ryu would help with floats as well)

In StrConstrainedValidator::_validation_logic, if strip_whitespace and to_lower are both true, it will allocate twice, but only the latter actually requires allocating a new string.

Also, the signature implies an allocation, but it is not needed if e.g. min_length / max_length validation fails, so it might be better to use Cow instead (maybe also use Cow for Validator::get_name?)
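To make the Cow suggestion concrete, here is a minimal sketch of what a Cow-returning get_name could look like. The trait and validator names below are illustrative stand-ins, not pydantic-core's actual API:

```rust
use std::borrow::Cow;

// Illustrative trait, not the real Validator trait.
trait NamedValidator {
    fn get_name(&self) -> Cow<'_, str>;
}

struct StrValidator;

impl NamedValidator for StrValidator {
    // Static name: borrow it, no allocation at all.
    fn get_name(&self) -> Cow<'_, str> {
        Cow::Borrowed("str")
    }
}

struct ConstrainedStrValidator {
    min_length: usize,
}

impl NamedValidator for ConstrainedStrValidator {
    // Computed name: allocate only in this case.
    fn get_name(&self) -> Cow<'_, str> {
        Cow::Owned(format!("constrained-str[min_length={}]", self.min_length))
    }
}
```

Callers that only need `&str` can use the Cow transparently, so the common static-name case never touches the heap.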

py_error! uses format!, so I assume that calling to_string on its argument should not be necessary (like this one).

Are you ok with me submitting a PR for such things?

Static vs dynamic dispatch: is there any particular reason for using &'data dyn Input in the Validator trait? I'd assume it is possible to use &impl Input instead and avoid the vtable indirection, at the cost of some compilation time.

The code for generating models here seems to be pretty slow compared to other validators, can anything be done?

inside with_prefix_location:

self.location = [location.clone(), self.location].concat();

One allocation could be avoided:

let mut new_location = location.clone();
new_location.extend_from_slice(&self.location);
self.location = new_location;

I'll take a look at other places during the weekend :)

samuelcolvin commented 2 years ago

Thanks so much, this is really interesting.

I didn't know allocation was particularly slow, I'll research it.

My initial guess is that most of these things will have a pretty small effect, e.g. the truncate stuff is only called if you get an error and then print the repr. I don't want to add more dependencies unless they allow a significant improvement.

The most significant thing is dyn Input vs impl Input - if that makes a difference, it could make a big change.

PR very welcome.

samuelcolvin commented 2 years ago

Unfortunately using impl Input on Validators doesn't work because the trait needs to be object safe:

error[E0038]: the trait `Validator` cannot be made into an object
   --> src/validators/union.rs:12:22
    |
12  |     choices: Vec<Box<dyn Validator>>,
    |                      ^^^^^^^^^^^^^ `Validator` cannot be made into an object
    |
note: for a trait to be "object safe" it needs to allow building a vtable to allow the call to be resolvable dynamically; for more information visit <https://doc.rust-lang.org/reference/items/traits.html#object-safety>
   --> src/validators/mod.rs:189:8
    |
181 | pub trait Validator: Send + Sync + fmt::Debug {
    |           --------- this trait cannot be made into an object...
...
189 |     fn validate<'s, 'data>(
    |        ^^^^^^^^ ...because method `validate` has generic type parameters
...
196 |     fn validate_strict<'s, 'data>(
    |        ^^^^^^^^^^^^^^^ ...because method `validate_strict` has generic type parameters
    = help: consider moving `validate` to another trait
    = help: consider moving `validate_strict` to another trait
danielSanchezQ commented 2 years ago

I would like to help too (I've been keen to help this project for some time 😃). I'll take a look and if I see some low-hanging fruit I'll post (small) PRs, is that ok? Are stylistic changes welcome too (things like iterators vs loops etc)?

samuelcolvin commented 2 years ago

Amazing yes please.

Are stylistic changes welcome too (things like iterators vs loops etc)?

yes, but my third priority (after safety and speed) is readability, particularly for python developers - e.g. me. So if changes make the code technically more correct rust but harder to read for novices, and have no other impact, I might be inclined to refuse them.

Best to create some small PRs and I'll comment.

mejrs commented 2 years ago

Are stylistic changes welcome too (things like iterators vs loops etc)?

That's not just a stylistic choice - avoiding bound checks means iterators can often be faster than index based loops.

Similarly, could we use a PyObject instead of PyAny or vice versa and improve performance?

PyObject is just the owned version of PyAny - use the former whenever you need to keep around references to Python objects.

generally copying and cloning values - am I copying stuff where I don't have to? In particular, is input or parts of input (in the case of a dict/list/tuple/set etc.) copied when it doesn't need to be?

Where is this?

here and in a number of other implementations of ListInput and DictInput we do a totally unnecessary map, is this avoidable? Is this having a performance impact? Is there another way to give a general interface to the underlying datatypes that's more performant?

If you just want to provide an interface, consider using a visitor rather than returning an iterator. This also avoids the boxing, but it doesn't really matter for performance I think.

For example, you have:

impl<'data> ListInput<'data> for &'data PyList {
    fn input_iter(&self) -> Box<dyn Iterator<Item = &'data dyn Input> + 'data> {
        Box::new(self.iter().map(|item| item as &dyn Input))
    }
    // ...
}

But you could also accept a closure to run over each item:

impl<'data> ListInput<'data> for &'data PyList {
    fn visit(&self, visitor: impl FnMut(???))  {
        self.iter().for_each(visitor)
    }
}

Or if you prefer something more typed:

trait Validator{
    fn validate(???)
}

impl<'data> ListInput<'data> for &'data PyList {
    fn visit(&self, visitor: impl Validator)  {
        self.iter().for_each(|item| visitor.validate(item))
    }
}
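For reference, a minimal compiling version of the closure-based visitor idea, with Vec<i64> standing in for PyList (the trait and method names here are illustrative stand-ins):

```rust
// Illustrative stand-in for the ListInput trait from the discussion above.
trait ListInput {
    // `impl FnMut` in argument position is sugar for a generic parameter,
    // so dispatch is static: no Box allocation, no vtable call.
    fn visit(&self, visitor: impl FnMut(&i64));
}

impl ListInput for Vec<i64> {
    fn visit(&self, visitor: impl FnMut(&i64)) {
        // Run the caller-supplied closure over each item.
        self.iter().for_each(visitor)
    }
}
```

Usage is e.g. `let mut sum = 0; vec![1, 2, 3].visit(|x| sum += *x);`. Note that taking `impl FnMut` makes the trait non-object-safe, which is exactly the trade-off discussed elsewhere in this thread.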

More generally I wonder if there are performance improvements that I'm not even aware of? "What you don't know, you can't optimise"

I mainly see some papercuts, not any actual performance crimes.

For example, in apply_validator, you have:

let loc = if key_loc {
    vec![key.to_loc(), "[key]".to_loc()]
} else {
    vec![key.to_loc()]
};
for err in line_errors {
    errors.push(err.with_prefix_location(&loc));
}

But if you fix the with_prefix_location to take a reference to a slice, rather than a reference to a Vec, you can avoid creating the vecs altogether:

if key_loc {
    let locs = [key.to_loc(), "[key]".to_loc()];
    for err in line_errors {
        errors.push(err.with_prefix_location(&locs));
    }

} else {
    let loc = key.to_loc();
    for err in line_errors {
        errors.push(err.with_prefix_location(std::slice::from_ref(&loc)));
    }
};

Note that the above comment holds true in general: &Vec adds a layer of indirection and forces the caller to provide a Vec rather than any contiguous sequence, so you don't want this

fn foo(blahs: &Vec<Blah>){}

but this:

fn foo(blahs: &[Blah]){}
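A hypothetical helper showing why the slice signature is more flexible: a &[T] parameter accepts a whole Vec (via deref coercion), a sub-slice, or a fixed-size array, with no extra indirection:

```rust
// Hypothetical function, purely to illustrate &[T] vs &Vec<T> parameters.
fn total_len(blahs: &[String]) -> usize {
    blahs.iter().map(|s| s.len()).sum()
}
```

Callers can then write `total_len(&v)` for a `Vec<String>` as well as `total_len(&v[..1])` for a sub-slice; with a `&Vec<String>` parameter only the first of those would compile.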

Recursive models are slower than I had expected, I thought it might be the use of RwLock that was causing the performance problems, but I managed to remove that (albeit in a slightly unsafe way) in https://github.com/samuelcolvin/pydantic-core/pull/32 and it didn't make a difference. Is something else the problem?

    fn set_ref(&mut self, name: &str, validator_weak: ValidatorWeak) -> PyResult<()> {
        unsafe { Arc::get_mut_unchecked(&mut self.validator_arc).set_ref(name, validator_weak) }
    }

Functions like these are incredibly dangerous, they enable unsynchronized shared mutability. At least make set_ref unsafe. You could try using parking_lot's synchronization primitives - they are a bit faster.

Could we remove Arc completely?

Arc is just for sharing things. If you want to share things, you will need it or Rc.
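As a tiny illustration of why Rc/Arc is the tool for sharing: cloning one copies a pointer and bumps a refcount, rather than deep-copying the data (function name here is illustrative):

```rust
use std::rc::Rc;

// Cloning an Rc (or Arc) is cheap: both handles end up pointing at the
// same heap allocation, and only a reference count is incremented.
fn share() -> (Rc<Vec<i32>>, Rc<Vec<i32>>) {
    let a = Rc::new(vec![1, 2, 3]);
    let b = Rc::clone(&a); // refcount bump, no copy of the Vec
    (a, b)
}
```

Arc is the atomically-refcounted version for use across threads, which is why a Send + Sync Validator needs Arc rather than Rc.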

lifetimes get pretty complicated, I haven't even checked if we get a memory leak from running repeat validation, should we/can we change any lifetimes?

Do you mean Rust's lifetime annotations?

mejrs commented 2 years ago

Also, I recommend doing benchmarking as part of your CI workflow, so you can identify performance regressions (or improvements :) ) quickly.

adriangb commented 2 years ago

cc @woodruffw

samuelcolvin commented 2 years ago

I got some really helpful suggestions from @hoodmane, noting them here before I forget:

woodruffw commented 2 years ago

This is a small thing, but getattr with constant strings can be slightly optimized by using intern!: https://docs.rs/pyo3/latest/pyo3/prelude/struct.Py.html#method.getattr

For example, this:

https://github.com/samuelcolvin/pydantic-core/blob/7c9058e641549e816614a097148019c11feae95e/src/validators/model_class.rs#L29

could become:

let new_method = class.getattr(intern!(py, "__new__"))?;

...to save an allocation. I can make a small PR for these, if you'd like.

Edit: Actually looks like one of these is already interned, so this would be a single line change 🙂

Stranger6667 commented 2 years ago

I didn't know allocation was particularly slow, I'll research it. My initial guess is that most of these things will have a pretty small effect, e.g. so the truncate stuff is only called if you get an error and if print repr. I don't want to add more dependencies unless they allow a significant improvement.

Those are indeed micro-optimizations for some specific scenarios, but sometimes they can yield a nice cumulative effect :) That is, if there is an opportunity to avoid heap allocations, I usually prefer to take it.

Here is another small idea that may help a bit in some scenarios (at some cost of slowing down the others): using a bitvector (e.g. a u64) instead of a set for storing which fields were hit inside ModelValidator::validate (the fields are ordered there). It would work for validators with <65 fields (which I think is quite common); for others, it could fall back to a regular set. The check_extra branch will still require an actual set, as those field names are arbitrary, and the final Ok branch will require creating a set. It might not improve the happy path when there are no validation errors, but it could be considered if those other scenarios need some extra performance (bitwise operations should be orders of magnitude cheaper than set operations).
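A rough sketch of the bitvector idea (names are illustrative; a real version would need the fallback to a set for models with 64+ fields, as described above):

```rust
/// Tracks which of up to 64 ordered fields have been seen, in a single u64.
/// Illustrative sketch only, not pydantic-core's actual implementation.
struct FieldsSeen(u64);

impl FieldsSeen {
    fn new() -> Self {
        FieldsSeen(0)
    }

    /// Mark field `index` as seen: one OR, no hashing, no allocation.
    fn mark(&mut self, index: usize) {
        debug_assert!(index < 64);
        self.0 |= 1 << index;
    }

    fn contains(&self, index: usize) -> bool {
        self.0 & (1 << index) != 0
    }

    /// Names of fields never marked, e.g. to build "missing field" errors.
    fn missing<'a>(&self, names: &[&'a str]) -> Vec<&'a str> {
        names
            .iter()
            .enumerate()
            .filter(|(i, _)| !self.contains(*i))
            .map(|(_, name)| *name)
            .collect()
    }
}
```

The happy path (all fields present) reduces to N shifted ORs plus one comparison against a precomputed "all fields" mask, versus N hash insertions for a set.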

samuelcolvin commented 2 years ago

@woodruffw thanks so much for the input. I think on this occasion this is superseded by #43.

I'll look at all the other suggestions here when I have time.

jorgecarleitao commented 2 years ago

posted on r/rust for visibility: https://www.reddit.com/r/rust/comments/ugniwv/creator_of_pydantic_is_looking_for_help_in_moving/

jcdyer commented 2 years ago

I didn't know allocation was particularly slow, I'll research it.

An individual allocation is pretty quick, but it can add up. You know how in python, you try to avoid patterns like:

s = "Something"
for line in lines():
    s += "\n" + line

and instead do something like:

s = "Something\n" + "\n".join(get_lines())

The main reason the first pattern is slow is because each iteration through the loop allocates a new string for s in addition to allocating each of the lines. The second one only allocates a single extra string for the whole join function. A single allocation is cheap, but if the number of allocations increases linearly, it can end up adding a big chunk to your runtime.

In rust, you can mutate strings, and strings amortize reallocations (in the same way dicts do in python) which means that

let mut s = String::from("Something\n");
for line in get_lines() {
    s.push_str(&line);
    s.push('\n')
}

is relatively performant, as it only has to reallocate roughly every time the string doubles in size.

If you know the size of all your inputs, you can often make it even faster by allocating the right size up front for the string:

let lines = get_lines();
let mut s = String::with_capacity(
    "Something\n".len() + lines.iter().map(|s| s.len() + 1).sum::<usize>(),
);
s.push_str("Something\n");
for line in lines {
    s.push_str(&line);
    s.push('\n');
}

In this case the string never needs to reallocate, since it never outgrows its initial allocation. As you said, readability for python programmers is a high concern, so the performance boost of the with_capacity version may not be worth the added bookkeeping. Your call. The point is that rust gives you more tools for managing allocations, which can help with performance.
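The never-reallocates claim is directly observable via String::capacity (function name and inputs here are illustrative):

```rust
// Shows that a String stays on its original allocation as long as it
// never outgrows the capacity it was created with.
fn build(lines: &[&str]) -> String {
    let header = "Something\n";
    // +1 per line accounts for the '\n' pushed after each one.
    let needed = header.len() + lines.iter().map(|s| s.len() + 1).sum::<usize>();
    let mut s = String::with_capacity(needed);
    let initial_capacity = s.capacity();
    s.push_str(header);
    for line in lines {
        s.push_str(line);
        s.push('\n');
    }
    // Still the original allocation: the capacity never changed.
    assert_eq!(s.capacity(), initial_capacity);
    s
}
```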

afetisov commented 2 years ago

impl<'a> ValLineError<'a> {
    pub fn with_prefix_location(mut self, location: &Location) -> Self {
        if self.location.is_empty() {
            self.location = location.clone();
        } else {
            // TODO we could perhaps instead store "reverse_location" in the ValLineError, then reverse it in
            // `PyLineError` so we could just extend here.
            self.location = [location.clone(), self.location].concat();
        }
        self
    }
}

This one uses needless clones. Looking at the code, it seems that almost all usages involve taking a vec![single_item], with a single vec![first, second] in apply_validator.rs. I suggest doing away with taking a slice entirely. Instead, make it a function which takes a single LocItem by value:

impl<'a> ValLineError<'a> {
    pub fn with_prefix_location(mut self, location: LocItem) -> Self {
        self.location.insert(0, location);
        self
    }
}

The old code involved unconditionally cloning a vector, which is a strong hint that you should have taken it by value (cloning a vector also involves cloning each of its elements, and LocItem can contain a String, so that's actually 2-3 needless allocations per call). The new approach eliminates those temporary vectors entirely.

If you expect that in the future you may need vectors longer than 2, or even ones of indeterminate length, you can consider taking an ArrayVec instead (see arrayvec, smallvec or tinyvec crates, I recommend tinyvec since it has no unsafe code).

While I have not benchmarked the crate and don't know of specific issues, in general excessive allocations can significantly harm the performance. For that reason I would try to avoid all allocations unless necessary: collecting into vectors, cloning Box'es, needlessly returning String's.

For example, Validator::get_name returns a String, even though all its uses just take its return value by reference, so could use a &str. Moreover, many impls of that method just create a String out of a static slice, needlessly allocating. I suggest making that method return Cow<'_, str>.

impl Clone for Box<dyn Validator> is a huge performance trap. Not only does it allocate new memory for the validator (this can actually be reasonably fast on modern allocators), but it also calls clone recursively on the validator (which in turn calls clone on inner validators, causing new allocations and clones). This is exponential in tree depth and should be strongly avoided. I would suggest using Arc<dyn Validator> where you are using Box<dyn Validator>, but unfortunately Validator::set_ref takes &mut self, which prevents calling it on Arc<dyn Validator>. While we could use Arc<RwLock<dyn Validator>> and acquire locks as needed, it would still likely be incorrect, since currently all cloned validators are entirely independent, while with the Arc approach modifying a validator would also modify everything that references it.

Unfortunately, there is no simple solution to this problem. It's something that should have been considered during the design phase.

You are also returning a lot of boxed trait objects. Besides extra allocations, those also cause extra indirection on method calls. Some of those trait objects probably can't be removed entirely, but some (where there are just a few implementing types, e.g. DictInput) can be converted into enumerations. This would make it possible to do away with both the excessive boxing and the indirect trait calls, and it would also somewhat simplify the API.
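The enum idea in miniature (types here are stand-ins for illustration, not the crate's actual input types):

```rust
// With an enum over the few concrete input kinds, dispatch is a plain
// match: no Box allocation and no vtable call. A real version would wrap
// e.g. a Python int and a JSON string fragment.
enum IntInput<'a> {
    Owned(i64),
    Str(&'a str),
}

impl<'a> IntInput<'a> {
    fn as_int(&self) -> Result<i64, String> {
        match self {
            IntInput::Owned(i) => Ok(*i),
            IntInput::Str(s) => s.parse().map_err(|_| format!("invalid int: {s}")),
        }
    }
}
```

The trade-off is that every variant must be known up front, which is fine when the set of input types is closed (Python objects and JSON, here).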

If you want, I can make a PR with my suggestions.

nnethercote commented 2 years ago

For general performance ideas The Rust Performance Book may be useful. It's pretty short, and full of suggestions about things like how to avoid unnecessary allocations.

karolzlot commented 2 years ago

From readme:

https://github.com/samuelcolvin/pydantic-core/blob/e50eecc6853047adc478d82b7c2992fdc1a6c3d3/README.md?plain=1#L103-L104

This is not true, at least for jsonschema-rs: it can load a dict and also a JSON string (the lib has python bindings).

jsonschema-rs is also quite performant, so I think it may be worth checking how it achieves that.


Also please take a look at this issue from jsonschema-rs repo: simd support? (many ideas in it!)

Stranger6667 commented 2 years ago

jsonschema-rs is also quite performant, so I think it may worth it to check how it achieves it.

As the author of jsonschema-rs, I am inclined to think that fast validation is about having a packed, cache-friendly data layout that enables fast iteration and pre-computes as much as possible. The current approach in jsonschema-rs is building a serde_json::Value-like structure that contains all data needed for validation (e.g. unwrapping Value::Number, using bitvecs for type combos, etc). The downside is that having such a fragmented layout with many boxes leads to cache misses. The current version doesn't resolve references upfront, which leads to synchronization overhead when locks are used :( I'd expect that a WIP version I haven't published yet might yield nice improvements because it stores the graph in the compressed sparse row format (so, 3 vectors total) and basically gives a nondeterministic finite automaton to execute the validation process. All references are also resolved upfront + partially inlined (the idea is borrowed from ajv)

Not sure how much it is applicable to this library, though it might be helpful.

samuelcolvin commented 2 years ago

Thanks both for your input, I'll definitely review the simd-support issue and think more about what you've said here @Stranger6667.

@karolzlot, in terms of using jsonschema-rs (or any other JSON Schema library) directly in pydantic-core, I don't think it would ever be possible for the other reasons listed in the readme:

https://github.com/samuelcolvin/pydantic-core/blob/e50eecc6853047adc478d82b7c2992fdc1a6c3d3/README.md?plain=1#L96-L101

Most of all:

karolzlot commented 2 years ago

@samuelcolvin Yes, I agree with those points you mentioned, directly using JSON Schema is not a good idea.

What I mean is that jsonschema-rs may have implemented this well ("load dict and also load json string"), so it may be worth checking the code.

I don't know rust well enough to give a more specific tip, unfortunately.

samuelcolvin commented 2 years ago

Yes, agreed, thank you.

Stranger6667 commented 2 years ago

What I mean is that it's possible that jsonschema-rs implemented this well ( "load dict and also load json string"), so it may be worth to check the code.

I'd say that part is suboptimal there - it converts a Python object into serde_json::Value, which sometimes takes up to 70% of all validation time. Having inputs behind an opaque wrapper that implements some trait should be much better, i.e. something like Input, which doesn't have that conversion overhead.

samuelcolvin commented 2 years ago

I'm going to close this issue as the codebase has moved a long way since this discussion.

Thank you all for your input, it's been really useful and interesting.

Any more feedback (probably best via new issues or PRs) is very welcome.

A lot of the suggestions here have now been implemented, and from some quick checks the codebase is around 2x faster than when I created this issue (as well as significantly more complex and powerful).