Open mtratsiuk opened 4 years ago
`validate_manifest_uris` validates the format of the files referenced by the `groundtruth_uri` and `taskdata_uri` fields of a manifest. Currently it fetches each file in full and only then applies validation.
Those files can be quite large, so we can improve validation performance and memory consumption by using a streaming request and passing the chunks to a streaming JSON parser. Here is a potential solution using the `ijson` library: https://github.com/hCaptcha/hmt-basemodels/blob/30-add-gt-models/basemodels/streaming_json.py
@gaieges
Neat. Let's get the validation rolled out and see how much of a pain point the non-streaming approach is.
cc: @e271828- as well.