jpmckinney opened 4 years ago
This would reduce memory but not running time. We don't presently have an issue with memory (except in rare cases when someone uploads a huge file to the DRT).
Re-opening, as we do in fact have an issue with memory in Kingfisher Process, if we were to attempt to validate packages rather than individual releases/records (https://github.com/open-contracting/kingfisher-process/issues/392).
Presently, the entire package needs to be loaded into memory to be validated, which consumes a lot of memory for larger files. https://github.com/open-contracting/lib-cove-oc4ids/issues/23
An alternative is to read the input twice: once to rebuild the package metadata without the releases/records/etc., and a second time to yield each release/record iteratively for validation.
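A minimal sketch of that two-pass idea, assuming a release package and the ijson streaming parser; the function names here are illustrative, not existing code:

```python
import ijson
from ijson.common import ObjectBuilder


def package_metadata(path):
    """First pass: rebuild the package object, skipping the releases array."""
    builder = ObjectBuilder()
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            # Drop every event inside the (potentially huge) releases array.
            if prefix == 'releases' or prefix.startswith('releases.'):
                continue
            builder.event(event, value)
    return builder.value


def releases(path):
    """Second pass: yield one release at a time for validation."""
    with open(path, 'rb') as f:
        yield from ijson.items(f, 'releases.item')
```

The same pattern would apply to record packages by swapping `releases` for `records`.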
To avoid rewriting a lot of code, we could perhaps stitch the results for individual releases/records back together, so that errors are still reported as being about releases/0, releases/1, etc. even though each was validated separately.
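A hedged sketch of that stitching step, assuming each release is validated separately with jsonschema's Draft4Validator (OCDS schemas are JSON Schema draft 4); the helper and its names are hypothetical:

```python
import jsonschema


def validate_releases(release_iter, release_schema):
    """Validate each release on its own, but report errors as if the whole
    package had been validated (releases/0/..., releases/1/..., ...)."""
    validator = jsonschema.Draft4Validator(release_schema)
    for i, release in enumerate(release_iter):
        for error in validator.iter_errors(release):
            # Re-anchor the error's path under the package's releases array.
            path = '/'.join(['releases', str(i)] + [str(p) for p in error.absolute_path])
            yield path, error.message
```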
In any case, streaming the input like this is the only way for memory usage not to scale with input size.