hCaptcha / hmt-basemodels

Basemodels for manifest data used by hmt-escrow
MIT License

Use streaming request & parser API in validate_manifest_uris #38

Open mtratsiuk opened 4 years ago

mtratsiuk commented 4 years ago

validate_manifest_uris validates the formats of the files pointed to by the groundtruth_uri and taskdata_uri fields in a manifest. Currently it fetches each file in full and only then applies validation.

Those files can be quite large, so we can improve validation performance and memory consumption by using a streaming request and passing the chunks into a streaming JSON parser. Here is a potential solution using the ijson library: https://github.com/hCaptcha/hmt-basemodels/blob/30-add-gt-models/basemodels/streaming_json.py
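To make the idea concrete, here is a rough stdlib-only sketch of chunk-by-chunk validation. It is not the linked ijson implementation. It assumes the file is a top-level JSON array and that the chunks arrive as already-decoded text. The names iter_array_items, validate_entries and has_datapoint_uri, and the datapoint_uri key check, are hypothetical and only there for illustration.

```python
import json
from typing import Any, Callable, Iterable, Iterator

_WHITESPACE = " \t\r\n"
_DELIMITERS = ",]" + _WHITESPACE


def iter_array_items(chunks: Iterable[str]) -> Iterator[Any]:
    """Yield elements of a top-level JSON array as soon as each is complete.

    Hypothetical helper: only one pending element is buffered at a time,
    so memory use does not grow with the size of the whole file.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = finished = False
    expect_value = False  # True right after '[' or ','
    seen_item = False
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip(_WHITESPACE)
            if not buf:
                break
            if finished:
                raise ValueError("unexpected data after closing ']'")
            head = buf[0]
            if not started:
                if head != "[":
                    raise ValueError("expected a top-level JSON array")
                started, expect_value, buf = True, True, buf[1:]
            elif head == "]":
                if expect_value and seen_item:
                    raise ValueError("trailing comma before ']'")
                finished, buf = True, buf[1:]
            elif head == ",":
                if expect_value:
                    raise ValueError("unexpected ','")
                expect_value, buf = True, buf[1:]
            else:
                if not expect_value:
                    raise ValueError("missing ',' between elements")
                try:
                    item, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # element may be incomplete: wait for more data
                if end == len(buf) or buf[end] not in _DELIMITERS:
                    break  # e.g. "12" could still become "123" or "12.5"
                buf = buf[end:]
                expect_value, seen_item = False, True
                yield item
    if not finished:
        raise ValueError("incomplete or malformed JSON array")


def validate_entries(chunks: Iterable[str], is_valid: Callable[[Any], bool]) -> int:
    """Validate entries one by one; return how many were checked."""
    count = 0
    for index, item in enumerate(iter_array_items(chunks)):
        if not is_valid(item):
            raise ValueError(f"invalid entry at index {index}")
        count += 1
    return count


def has_datapoint_uri(entry: Any) -> bool:
    # Assumed shape of an entry, for illustration only.
    return isinstance(entry, dict) and "datapoint_uri" in entry
```

With a streaming HTTP response, the chunks would be the decoded pieces of the body, so validation can fail fast on the first bad entry without downloading the rest. A dedicated streaming parser such as ijson handles more cases (nested prefixes, byte input) and is likely the better choice in practice.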

@gaieges

gaieges commented 4 years ago

Neat. Let's get the validation rolled out and see how much of a pain point the non-streaming approach is.

cc: @e271828- as well.