If metadata is required for datafiles, datasets, and manifests, it can take quite a while to instantiate datasets of many files. This is leading to users changing their data schema to avoid file validation via the twine.
Here are the results of profiling the instantiation of 5000 datafiles from @time-trader:
I have several ideas for speeding this up:
[ ] Add a flag to allow skipping of all validation or just a subset (e.g. file tag templates)
[x] Check for level > 0 before checking for metadata the filename in Dataset._instantiate_from_local_directory
[ ] In twined and the SDK: casting datasets and datafiles to primitives during manifest validation
[ ] Hash the twine and add it to manifest primitive and object to indicate previous successful validation so re-validation agains the same twine can be avoided
[ ] Check the efficiency of the Twine.validate_manifest method
[ ] Use a faster implementation of JSON validation
[x] Cache the contents of each local metadata file for a dataset to avoid it being loaded separately for each datafile
[x] Allow ignoring metadata at Manifest instantiation
If metadata is required for datafiles, datasets, and manifests, it can take quite a while to instantiate datasets of many files. This is leading to users changing their data schema to avoid file validation via the twine.
Here are the results of profiling the instantiation of 5000 datafiles from @time-trader:
I have several ideas for speeding this up:
level > 0
before checking for metadata the filename inDataset._instantiate_from_local_directory
Dataset._instantiate_from_local_directory
twined
generallytwined
and the SDK: casting datasets and datafiles to primitives during manifest validationTwine.validate_manifest
methodManifest
instantiation