laravel / ideas

Issues board used for Laravel internals discussions.

Validator is showing its age and needs an update #2545

Open donnysim opened 3 years ago

donnysim commented 3 years ago

I think it's worth starting a discussion around the existing Laravel validator, as it's showing its age and feels kind of monkey-patched at this point to support new features and solve different problems. A few things to note:

Loose rule parameter types

Because rules are basically strings, it's really hard to tell what the end result will be. Take required_unless:field,null - will it check against a null value, or the text "null"? If it's field,true, will it check against the boolean type? Does 1 also count?
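To make the ambiguity concrete, a sketch ('field' and 'other' are just placeholder attribute names; the point is that the string syntax leaves the interpretation unclear):

use Illuminate\Support\Facades\Validator;

Validator::make(['field' => null], [
    'other' => 'required_unless:field,null', // a real null, or the literal string "null"?
]);

Validator::make(['field' => 1], [
    'other' => 'required_unless:field,true', // does integer 1 count as "true"?
]);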

Inconsistent

There are way too many ways to write or extend rules, and they end up scattered all over the place:

Different access levels on closure rules, Rule classes, and extensions - e.g. you have no access to the Validator instance in a Rule class unless you pass it in through the constructor (see the sketch below).
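For instance, a sketch of the contrast, assuming the classic Rule contract and a facade-registered extension (MyRule is an illustrative name):

use Illuminate\Support\Facades\Validator;

// An extension closure receives the validator as its fourth argument...
Validator::extend('custom', function ($attribute, $value, $parameters, $validator) {
    return isset($validator->getData()['other']); // validator is available here
});

// ...while a Rule class only sees the attribute and value.
class MyRule implements \Illuminate\Contracts\Validation\Rule
{
    public function passes($attribute, $value)
    {
        return $value !== null; // no validator here unless injected via the constructor
    }

    public function message()
    {
        return 'The :attribute is invalid.';
    }
}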

Confusing

It does not respect the order of rules. You can have ['string', 'max:100', 'required'] and it will be validated as required first. While in this case that seems fine, now take:

Validator::make(['a' => '5'], ['a' => ['string', 'min:5', 'numeric']])

and, thinking logically, you'd expect (at least I did, multiple times) it to be validated as a string of at least 5 characters that is also numeric, and to fail - but it passes, because the min rule checks whether numeric is in the rules list and switches its comparison logic: '5' is treated as the number 5, which satisfies min:5, instead of as a string of length 1.

There's also arguably some confusion around how * works, though I haven't found that to be the case myself. There's also no separation between a missing value and null when dealing with custom rules. In addition, some rules accept other fields as parameters, but there's no distinction between a field reference and a literal value parameter - though that's also arguable.

No type checking

With strict types becoming more and more a part of the PHP language, we really need them in validation too - not only for PHP's sake, but because their absence leaves us with some nasty bugs.

Take the string rule format and the provided Rule class, e.g. writing in:1,2,3 versus Rule::in([1, 2, 3]). In the first case it looks like a string array; in the second it's an int array. But the class variant is also cast back to a string expression, which leaves you with the same confusing question of what it will accept - only strings, or only ints? If you answered strings or ints, you're wrong: it will accept both. But wait, there's more - it will also accept true in place of 1:

Validator::make(['a' => true], ['a' => 'integer']);
Validator::make(['a' => 1], ['a' => Rule::in([true])]);

Still passes as valid code. Now try passing that value into a typed function and you have a big oof. Because of all these missing type checks, and the random data we receive as a result, we still have to do a lot of manual work. The boolean rule, for example, might let through anything - true, false, 1, 0, "1", or "0" - so you can't just rely on some $data['confirmed'] === 1 check; you have to go cast the value.
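For example, a minimal sketch of the manual cast this forces on you, assuming 'confirmed' already passed the boolean rule but may still arrive as any of the values above:

// Normalize whatever the boolean rule let through into a real bool.
$confirmed = filter_var($data['confirmed'], FILTER_VALIDATE_BOOLEAN);

if ($confirmed === true) {
    // strict comparison is only safe after the cast
}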

Slow

While speed is not necessarily a big point for validators - their job is to ensure data validity - the Laravel validator is way slower than it could be because it constantly deals with string rules: repeated attribute format checking, rule conversion and parsing, parameter parsing, rule seeking, and so on.


As for an update, here are my thoughts:

[
    'comment' => ['nullable'],
    'note' => ['nullable'],
    ClientRules::make('client'),
]

// where ClientRules::make('client') expands to ('client' being $prefix):

"{$prefix}.client_id" => ['nullable'],
"{$prefix}.name" => ['required'],
"{$prefix}.address" => ['nullable'],

// etc.

This would allow rule composition and reuse.
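A hypothetical sketch of what such a rule group could look like - ClientRules here is an imagined class, not an existing Laravel API:

class ClientRules
{
    public static function make(string $prefix): array
    {
        return [
            "{$prefix}.client_id" => ['nullable'],
            "{$prefix}.name" => ['required'],
            "{$prefix}.address" => ['nullable'],
        ];
    }
}

// The group is then merged into the outer rule set:
$rules = array_merge(
    [
        'comment' => ['nullable'],
        'note' => ['nullable'],
    ],
    ClientRules::make('client'),
);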

For example, when validating

'roles' => ['array', 'max:10'],
'roles.*.id' => ['exists:roles,id'],

It would be best if roles.*.id were executed only if roles passes. This would short-circuit validation of the nested data, where hundreds or thousands of entries could be passed.
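Under the current validator, a rough two-pass workaround sketch, assuming you're willing to run validation twice to get the short-circuit:

use Illuminate\Support\Facades\Validator;

$outer = Validator::make($data, [
    'roles' => ['array', 'max:10'],
]);

if ($outer->fails()) {
    return $outer->errors(); // skip the expensive wildcard pass entirely
}

$inner = Validator::make($data, [
    'roles.*.id' => ['exists:roles,id'],
]);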


So what do you think would be the correct approach? Or is it even worth talking about further, given it may never be changed? What's your imagined flow for using the validator?

tpetry commented 3 years ago

I am personally against breaking changes, as a lot of applications depend on the current Validator logic, but there are still some things which could be extended.

donnysim commented 3 years ago

Yeah, changing the validator would be quite a big breaking change and is not likely to happen. The lack of type safety is really getting to me lately, though, as I can't trust the values I receive anymore once they end up in typed function calls.

As for the existing Validator's performance problems, probably one of the biggest culprits is that it doesn't pre-parse the rules - it parses the same rules for every value encountered in the data, and to make matters worse, it does so 5 times per rule. So a single '*' => 'integer' parses and normalizes the rule 5 times for a single item, 25k times for 5k items, and so on. Having required|integer makes it even worse: around 70k parses, which is roughly 14 per entry in the array.
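A sketch of the pre-parsing idea using the parser the framework already ships with - caching the output up front like this is the proposal, not current behavior:

use Illuminate\Validation\ValidationRuleParser;

// Parse each rule exactly once and reuse the result for every value,
// instead of re-parsing per data entry as the validator does today.
$parsed = array_map(
    fn ($rule) => ValidationRuleParser::parse($rule), // [$name, $parameters]
    ['required', 'integer']
);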

bzelba commented 3 years ago

I hit the same symptoms today as well. We have an endpoint that accepts some nested arrays, and the Laravel validator is super slow - easily eating up roughly 70% of the response time for simple structures. Isn't there a better way to handle these validations?

donnysim commented 3 years ago

I did some digging around, and if we take this as a test:

$trans = new \Illuminate\Translation\Translator(new \Illuminate\Translation\ArrayLoader(), 'en');

$validator = new \Illuminate\Validation\Validator($trans, [
    'array' => range(1, 50000),
], [
    'array.*' => ['required', 'integer'],
]);

It takes around 80s to validate. Now, we could try improving rule parsing by preparing the rules beforehand, only once - but because there are so many ways to pass rules, including some stringable objects like the Unique and Exists rules that have hardcoded logic inside the validator, it's very complicated to do without breaking one thing or the other. By preparing them once we could bring it down to around 45s. I did a rough implementation, but a few tests, like those for Unique, are failing; fixing that would need a different approach, or the "hardcoded" object rules would need to change. That makes it almost 2x faster, but at the cost of higher memory usage, because rules are duplicated for every data entry - we're no longer passing strings around but some kind of wrapper around the parsed rule. There are probably more areas to improve, but I find the Validator codebase extremely hard to navigate, with all these type checks, conversions, and edge cases, so I'm not really willing to dig further. It simply needs to change the way it handles things to work faster.
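For reference, a minimal timing sketch for reproducing numbers like these (absolute figures will vary by machine and PHP version):

$start = hrtime(true);

$validator->passes();

// hrtime returns nanoseconds, so divide to get seconds
printf("%.2fs\n", (hrtime(true) - $start) / 1e9);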

tpetry commented 3 years ago

Any reason why preparing them once only reduces it to 45s? I expected the rule parsing and init logic to take >90% of the time, because 50,000 required and integer checks should not really take a second.

donnysim commented 3 years ago

When validating wildcard rules, it collapses the data using Arr::dot, which in itself is fast (<1s), but then comes explodeWildcardRules, which hogs around 7s, probably because of string and regex checks, array merges, etc. Getting rid of the need to collapse data, duplicate rules, and parse keys would improve performance and drastically improve memory usage, but that would be the hardest to achieve. Another case is shouldStopValidating, which is called after each validated rule - so ['required', 'integer'] calls it twice per data entry - and it also does some string manipulation and rule checks. Removing it would bring the time from 45s down to around 24s with pre-parsed rules, but that's probably not an option; limiting it to one call per data entry might bring it down to maybe 30s. That would still be a 15s gain.

I didn't dig deeper than that, but these are some of the bigger culprits to watch for.

donnysim commented 3 years ago

Technically, I'm not sure it's even possible to achieve a validation time of under 1-2s for 50k entries with any kind of validator. I experimented with writing a validator from scratch a while ago, using the same rules; it runs in around 7s for 50k entries, and with some micro-optimizations it might be possible to bring that down to 5-6s, but then you're on the verge of sacrificing functionality or flexibility for speed. All times were measured on a Windows machine, so they might be lower on Linux/Mac, but I didn't see any reason to stress it further.