go-ozzo / ozzo-validation

An idiomatic Go (golang) validation package. Supports configurable and extensible validation rules (validators) using normal language constructs instead of error-prone struct tags.
MIT License
3.77k stars 223 forks source link

limit validation errors or spit out only first one #92

Open geekflyer opened 4 years ago

geekflyer commented 4 years ago

Hi,

first of all thanks for this fabulous library. Much nicer to use than struct tags!

Unfortunately I encountered a memory and CPU problem when using ozzo to validate large structs or slices. The problem stems from the fact that ozzo spits out all validation errors, which can be a lot in a large map or slice.

In particular we ran into an issue in production, where we were using ozzo to validate json payloads of a REST API. The expected payload in this API is a nested struct which contains slices typically containing more than 100,000 items. If the caller of this API sends a payload where each array item has a validation error (even if it's practically the same error for each item), ozzo will create 100,000 validation errors. Since each validation error in ozzo is a map, it means ozzo creates 100,000 maps which use quite a lot of memory and also spit out rather noisy and unreadable, super long validation error messages. Since validation errors are typically of very ephemeral nature, the just-created-errors will become very soon eligible for garbage collection. In our case this meant suddenly our app's CPU usage pegged to 100% and most of it was spent on creating validation errors and garbage collecting them afterwards. This actually caused some semi outage in one of our production apps after introducing ozzo.

In summary, it would be nice if ozzo has a way to limit the creation of validation errors (i.e. max 10) or some sort of short circuiting once the first validation error in a structure was encountered.

My PR #93 introduces a new rule EachWithFirstErrorOnly which introduces this capability for large maps and slices, but I think having the capability to limit validation errors a first class option in ozzo (instead of just a rule variation) would be better.

Thanks.

qiangxue commented 4 years ago

Glad to hear you like this library!

For the problem you encountered, if possible I would love to add the capability of limiting the validations as a first class option. However, this may involve API signature change and the implementation is also not trivial because the validation is done recursively.

So I'd like to limit the support to the scenario that is most likely causing the problem you described. Can we say that the problem mainly occurs when validating a big array/slice or map? And does your problem only occur when using Each?

geekflyer commented 4 years ago

Yeah, well we've only started to use ozzo recently and the problem for now only occured with large slices . I guess in theory it's possible for this to also occur with large nested structs, but not as likely as with slices.

I think enhancing Each as done in my PR solves the problem for now, but maybe long term (i.e. next major of ozzo validation) it may be worth to think about an API change and making limiting errors a first class citizen.

Btw what do you think of the API for the enhanced Each in my PR? Also I just renamed EachWithFirstErrorOnly to EachUntilFirstError.

geekflyer commented 4 years ago

actually just had a second look at the code and realized that just introducing a variation of Each won't fix my problem since it only short circuits the validation rules passed to validation.EachUntilFirstError(rules...) but not the rules implemented on the slice element type via Validate (which are actually the more expensive ones typically).

This probably requires some more changes here https://github.com/geekflyer/ozzo-validation/blob/083f0b20911750dfaf3fa0b23b3ef55bd37a3fd9/validation.go#L198

🤔

geekflyer commented 4 years ago

here's a small repro for the problem:

package main

import (
    "fmt"

    validation "github.com/go-ozzo/ozzo-validation/v3"
)

type MyItem struct {
    Name string
}

func (mi *MyItem) Validate() error {
    return validation.ValidateStruct(mi,
        validation.Field(&mi.Name, validation.Required),
    )
}

type MyData struct {
    Items []*MyItem
}

func (md *MyData) Validate() error {
    return validation.ValidateStruct(md,
        validation.Field(&md.Items, validation.NotNil, validation.Each(validation.NotNil)),
    )
}

func main() {

    const numItems = 150000

    faultyItem := &MyItem{Name: ""}

    items := make([]*MyItem, numItems)

    for i := 0; i < numItems; i++ {
        items[i] = faultyItem
    }

    myData := MyData{Items: items}

    validationErrors := myData.Validate()

    fmt.Println(validationErrors)

}

run this via:

time go run main.go

This takes on my machine (MBP 2.4 GHz Intel Core i5 quad core, 16GB RAM) about 1 minute and 45 seconds to run.

geekflyer commented 4 years ago

actually it turns out most of the memory and CPU is actually wasted when turning large validation errors into a string due to the way the validation error message is constructed.

I just rewrote the validation.Errors.Error() receiver method to use a strings.Builder and now my repro runs much faster (like a few seconds).:

// Error returns the error string of Errors.
func (es Errors) Error() string {

    if len(es) == 0 {
        return ""
    }

    keys := []string{}
    for key := range es {
        keys = append(keys, key)
    }
    sort.Strings(keys)

    var stringBuilder strings.Builder

    for i, key := range keys {
        if i > 0 {
            stringBuilder.WriteString("; ")
        }
        if errs, ok := es[key].(Errors); ok {
            fmt.Fprintf(&stringBuilder,"%v: (%v)", key, errs)
        } else {
            fmt.Fprintf(&stringBuilder, "%v: %v", key, es[key].Error())
        }
    }
    stringBuilder.WriteString(".")
    return stringBuilder.String()
}

The error message is unfortunately still not very useful and a lot of time is seemingly still spend on writing to error message to stdout. Here's a small subset of the error message:

 99972: (Name: cannot be blank.); 99973: (Name: cannot be blank.); 99974: (Name: cannot be blank.); 99975: (Name: cannot be blank.); 99976: (Name: cannot be blank.); 99977: (Name: cannot be blank.); 99978: (Name: cannot be blank.); 99979: (Name: cannot be blank.); 9998: (Name: cannot be blank.); 99980: (Name: cannot be blank.); 99981: (Name: cannot be blank.); 99982: (Name: cannot be blank.); 99983: (Name: cannot be blank.); 99984: (Name: cannot be blank.); 99985: (Name: cannot be blank.); 99986: (Name: cannot be blank.); 99987: (Name: cannot be blank.); 99988: (Name: cannot be blank.); 99989: (Name: cannot be blank.); 9999: (Name: cannot be blank.); 99990: (Name: cannot be blank.); 99991: (Name: cannot be blank.); 99992: (Name: cannot be blank.); 99993: (Name: cannot be blank.); 99994: (Name: cannot be blank.); 99995: (Name: cannot be blank.); 99996: (Name: cannot be blank.); 99997: (Name: cannot be blank.); 99998: (Name: cannot be blank.); 99999: (Name: cannot be blank.).).

That being said, if it's not easy to limit the creation of validation errors in the first place, maybe another way to tackle this problem is to use a string.Builder as above and also add a way to limit the length of the error message or have some way to recursively filter a validation error to only get / print a subset of the errors?

For reference io-ts (validation library for TypeScript) has the concept of reporters which contain the logic on how validation errors are serialized into an error message (https://github.com/gcanti/io-ts#error-reporters). Perhaps having a reporter concept in ozzo where one of the reporters may limit the number of errors being reported would be a way to solve this without changing too much on the API surface.

qiangxue commented 4 years ago

Thanks for the study! So the performance bottleneck is mainly on the error message formatting, instead of the validation rule execution? Do you have any suggestion on how to limit the length of the error message or some filtering mechanism on Errors? I will take a look at io-ts.

FYI, I just checked in the optimization that you suggested above. Thanks!

geekflyer commented 4 years ago

Hi,

yes the performance bottleneck is mainly in the error message formatting, however I think if one has even bigger slices (let's say 1 million items with 1 million errors) the error creation itself becomes also quite expensive and take multiple seconds. For now to get this quickly done I guess just fixing the formatting is a step in the right direction.

Regarding the error messages. Here's some example I came up with:

package main

import (
    "fmt"
    "sort"
    "strconv"
    "strings"

    validation "github.com/go-ozzo/ozzo-validation/v3"
)

type MyItem struct {
    Name string `json:"name"`
}

func (mi *MyItem) Validate() error {
    return validation.ValidateStruct(mi,
        validation.Field(&mi.Name, validation.Required),
    )
}

type MyData struct {
    Items []*MyItem `json:"items"`
}

func (md *MyData) Validate() error {
    return validation.ValidateStruct(md,
        validation.Field(&md.Items, validation.NotNil, validation.Each(validation.NotNil)),
    )
}

func main() {
    const numItems = 150000

    faultyItem := &MyItem{Name: ""}

    items := make([]*MyItem, numItems)

    for i := 0; i < numItems; i++ {
        items[i] = faultyItem
    }

    myData := MyData{Items: items}

    validationErrors := myData.Validate()

    if validationErrors != nil {
        fmt.Println(FirstErrorJSONPathReporter(validationErrors))
    }

}

type ValidationErrorReporter func(validationErrors error) string

func FirstErrorJSONPathReporter(validationError error) string {

    var pathParts []string

    var getFirstLeafErrorMessage func(nestedError error) string
    getFirstLeafErrorMessage = func(nestedError error) string {
        if innerErrors, ok := nestedError.(validation.Errors); ok {
            var keys []string
            for key := range innerErrors {
                keys = append(keys, key)
            }
            sort.Strings(keys)
            for _, key := range keys {
                if innerError := innerErrors[key]; innerError != nil {
                    pathParts = append(pathParts, key)
                    return getFirstLeafErrorMessage(innerError)
                }
            }
        }
        return nestedError.Error()
    }

    leafErrorMsg := getFirstLeafErrorMessage(validationError)

    // if there's no error nesting, simply return the first error part without any fancy formatting
    if len(pathParts) == 0 {
        return leafErrorMsg
    }

    var jsonPath strings.Builder

    for i, pathPart := range pathParts {
        if _, err := strconv.Atoi(pathPart); err == nil {
            // path parts which can be converted to an integer are wrapped with [ ] to be json-path compliant.
            fmt.Fprintf(&jsonPath, "[%v]", pathPart)
        } else if i == 0 {
            fmt.Fprintf(&jsonPath, "%v", pathPart)
        } else {
            fmt.Fprintf(&jsonPath, ".%v", pathPart)
        }
    }

    finalMessage := jsonPath.String() + fmt.Sprintf(": %v", leafErrorMsg)

    return finalMessage
}

As you can see I defined the type type ValidationErrorReporter func(validationErrors error) string . This could also be an interface but you get the idea.

Now FirstErrorJSONPathReporter implements this type and as the name indicates it will report the first error in the validation.Errors, or in other words will turn an instance of validation.Errors into a string.

In this particular implementation it does a DFS (easier to implement than BFS) and reports the first found leaf error together with it's json-path formatted location in the nested struct.

Concretely instead of printing a super large error message for the 150k validation errors it will print instead simply this:

items[0].name: cannot be blank

So altogether it would be nice to have a reporter type / interface and a few standard reporters like this one in ozzo :)

Flipped199 commented 2 weeks ago

Hey, is there a good solution to this problem now?