multiprocessio / dsq

Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

dsq --schema missing array in 11GB file #87

Open · mccorkle opened this issue 2 years ago

mccorkle commented 2 years ago

Describe the bug and expected behavior

In my testing with large datasets, at least one array of objects is not reported by --schema. The missing array begins on line 1,326,612,715 of the 1,495,055,188 lines in the 11GB file.

Is it possible that schema only reviews the first X lines or bytes of a file? If so, is there any way that I can override that?

Reproduction steps

With an 11GB (or larger) file: dsq --schema --pretty LARGE_FILE.json

Versions

eatonphil commented 2 years ago

Hey! Thanks for the report. Yeah, datastation/dsq samples the file to keep performance reasonable. It might make sense to sample more of a larger file, but then performance is going to get much worse. Overall I don't yet have a great strategy for dealing with very large files.
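
To make the tradeoff concrete, here is a minimal Go sketch of sample-based schema inference, assuming "sampling" means inspecting only the first N values of a top-level JSON array. The names (sampleSize and so on) are hypothetical; this is not dsq's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

const sampleSize = 1000 // values inspected; everything after this is never seen

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)
	// Expect a top-level JSON array: [ {...}, {...}, ... ]
	if _, err := dec.Token(); err != nil { // consume the opening '['
		panic(err)
	}

	schema := map[string]string{} // field name -> inferred type
	for i := 0; i < sampleSize && dec.More(); i++ {
		var row map[string]any
		if err := dec.Decode(&row); err != nil {
			panic(err)
		}
		for k, v := range row {
			schema[k] = fmt.Sprintf("%T", v)
		}
	}

	for name, typ := range schema {
		fmt.Println(name, typ)
	}
}
```

Under a scheme like this, any field or nested array that first appears after the sample window (here, over a billion lines into the file) never reaches the inference loop, which would explain the missing entry.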

mccorkle commented 2 years ago

Before I discovered Datastation, the way I had imagined building my own was to stream-read the file and, whenever I see an array, read only the first 3 of the array's children into memory, counting but discarding every other object in the array until I capture the last 3.

The flaw in my plan was that an array child that didn't conform to the structure of the first and last 3 would not show up in my report's schema; it would, however, have found the schema element that datastation/dsq is missing here.
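
For concreteness, here is a minimal Go sketch of that first/last-K idea, streaming a top-level JSON array while keeping only the first 3 and last 3 children and counting the rest. All names are hypothetical; this is not dsq code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

const k = 3 // children kept at each end, like a hypothetical --array_depth=3

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		panic(err)
	}

	head := make([]json.RawMessage, 0, k)
	tail := make([]json.RawMessage, k) // ring buffer for the last k children
	total := 0
	for dec.More() {
		var el json.RawMessage
		if err := dec.Decode(&el); err != nil {
			panic(err)
		}
		if total < k {
			head = append(head, el) // keep the first k children
		} else {
			tail[total%k] = el // overwrite in place; middle children are discarded
		}
		total++
	}

	// Schema would be inferred from head plus the (up to k) children left
	// in tail; everything in between was counted but thrown away.
	fmt.Printf("streamed %d children, retained at most %d\n", total, 2*k)
}
```

Memory stays bounded by k regardless of file size, which is the appeal; the cost is exactly the flaw described above, since a child in the discarded middle with a different shape never contributes to the schema.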

Perhaps a hybrid of your approach and mine, activated by an --array_depth=3 argument (e.g. something like dsq --schema --array_depth=3 LARGE_FILE.json)?