mccorkle opened this issue 2 years ago
Hey! Thanks for the report. Yeah, datastation/dsq does sampling to get reasonable performance. It might make sense to take a larger sample for bigger files, but then performance gets much worse. Overall I don't yet have a great strategy for dealing with very large files.
Before I discovered DataStation, the way I had imagined building my own tool was to stream-read the file and, whenever I hit an array, read only its first 3 children into memory, then count but discard the remaining elements until I had captured the last 3.
The flaw in that plan is that any array element in the middle that doesn't conform to the structure of the first and last 3 would be left out of my schema report -- but it would still have found the schema element that datastation/dsq is currently missing.
Perhaps a hybrid of your approach and mine could be activated by an --array_depth=3 argument? A rough sketch of the sampling idea is below.
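To make the idea concrete, here is a minimal Go sketch of that head-and-tail sampling. It is not dsq's actual code: sampleArray and keepN are made-up names, and it only handles a top-level array rather than recursing into nested values to build a full schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

const keepN = 3 // the proposed --array_depth value

// sampleArray consumes one JSON array from the decoder (the opening '[' has
// already been read) and returns the first keepN and last keepN elements,
// plus a count of everything in between that was discarded.
func sampleArray(dec *json.Decoder) (head, tail []json.RawMessage, skipped int, err error) {
	for dec.More() {
		var elem json.RawMessage
		if err = dec.Decode(&elem); err != nil {
			return
		}
		switch {
		case len(head) < keepN:
			head = append(head, elem)
		case len(tail) < keepN:
			tail = append(tail, elem)
		default:
			// Slide the tail window; whatever falls out is only counted.
			tail = append(tail[1:], elem)
			skipped++
		}
	}
	_, err = dec.Token() // consume the closing ']'
	return
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)
	tok, err := dec.Token()
	if err != nil {
		panic(err)
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		panic("expected a top-level JSON array")
	}

	head, tail, skipped, err := sampleArray(dec)
	if err != nil {
		panic(err)
	}
	fmt.Printf("kept %d head + %d tail elements, skipped %d\n",
		len(head), len(tail), skipped)
}
```

Because it streams token by token, memory stays bounded at roughly 2*keepN elements per array no matter how large the file is, which is the part I'd hope could be combined with your existing sampling.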
Describe the bug and expected behavior
In my testing with large datasets, at least one array of objects is not reported by --schema; the array begins on line 1,326,612,715 of the 1,495,055,188 lines in an 11GB file.
Is it possible that --schema only reviews the first X lines or bytes of a file? If so, is there any way to override that?
Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json
Versions
dsq 0.20.2 (installed from apt)