mhkeller closed 1 week ago
Hi @mhkeller ,
stats is the heart of qsv and the main reason why I wrote it.
As you inferred, it's used by other commands to do metadata and schema inferencing, among other things.
It's also used by stats itself to cache stats calculations. So when you try to compute stats on a very large file with expensive settings like --everything and --infer-dates, it checks if a previous stats calculation is available and valid, and it does that by looking at those two tiny files: the .stats.csv is the latest stats result, and .stats.json is the metadata of the previous stats run.
In the field I work in, where we deal primarily with large, historical CSVs, this is helpful: the files are typically static once exported from transaction systems, so stats will return results instantaneously if valid cached stats results are available.
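To make the idea of a "valid" cache concrete, here is a minimal Python sketch of one possible freshness check. It is not qsv's actual logic: the function name `cached_stats_look_fresh`, the assumed cache file naming (input stem plus `.stats.csv` / `.stats.json`), and the modification-time comparison are all assumptions made for illustration.

```python
from pathlib import Path


def cached_stats_look_fresh(csv_path: str) -> bool:
    """Return True if both cache files exist and are at least as new as the CSV.

    Assumption: cache files sit next to the input and are named
    <stem>.stats.csv and <stem>.stats.json. qsv's real validity check
    may use different names and different criteria.
    """
    csv_file = Path(csv_path)
    stats_csv = csv_file.with_name(csv_file.stem + ".stats.csv")
    stats_json = csv_file.with_name(csv_file.stem + ".stats.json")

    # Without the input or either cache file there is nothing to reuse.
    if not (csv_file.exists() and stats_csv.exists() and stats_json.exists()):
        return False

    # A cache older than the input is treated as stale.
    csv_mtime = csv_file.stat().st_mtime
    return all(p.stat().st_mtime >= csv_mtime for p in (stats_csv, stats_json))
```

For static exports the input's modification time never changes, so under this sketch every run after the first would be able to reuse the cached result.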
Anyway, I'll add an option to suppress generating these cache files. I'll also add some logic to skip caching when the potential savings are too small (say, less than 5 seconds) to be worth it.
Thanks for the quick and thorough reply! I figured it had to do with some internal usage. That makes a lot of sense. An option to skip writing out the files would be great for my use case.
With the --cache-threshold strategy, I would set the threshold to a high number that it would likely never reach, I'm guessing?
Thanks in general for your work on this library. I have a more general question that I'll post over in Discussions.
No worries... big fan of the data journalism work you and your team are doing at NYTimes BTW... 💯
FYI, during the first few days of the pandemic, I wrote a Selenium scraper to retrieve data from NYTimes and petitioned to have it released as open data instead, to which the team responded quickly. 😄
Anyways, as for the new --cache-threshold option - it will have a default of 5000 ms. And when set to zero, it will suppress cache generation altogether, so you don't have to guesstimate a high threshold.
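The rule described here can be sketched as a small decision function. This is only an illustration of the stated behavior (default of 5000 ms, zero disables caching); the names `should_write_cache` and `DEFAULT_CACHE_THRESHOLD_MS` are hypothetical, and whether the real comparison is inclusive is an assumption.

```python
DEFAULT_CACHE_THRESHOLD_MS = 5000  # default stated in this thread


def should_write_cache(elapsed_ms: int,
                       threshold_ms: int = DEFAULT_CACHE_THRESHOLD_MS) -> bool:
    """Decide whether a stats run was slow enough to be worth caching.

    Covers only zero and positive thresholds. The inclusive comparison
    (>=) is an assumption, not confirmed qsv behavior.
    """
    if threshold_ms < 0:
        raise ValueError("negative thresholds are not covered by this sketch")
    if threshold_ms == 0:
        # Zero suppresses cache generation altogether.
        return False
    return elapsed_ms >= threshold_ms
```

With the default, a run that took 7000 ms would be cached and one that took 1200 ms would not; passing a threshold of 0 never caches, regardless of elapsed time.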
Ah thank you – that's so nice of you to say. And I'm glad you were able to get the data you were after – that tracking was a huge effort. (I was not involved but was a great admirer.)
That option for the flag makes sense and I'm looking forward to trying it out! I was looking for a fast, portable way to check csv types so qsv is perfect.
Thanks for merging this so quickly!
Heh, this is perfect timing for me. I'm working with a directory of csv files that sort of works like an auto-load folder. If a CSV file is dropped into the directory, a process picks it up and tries to load it. I was hoping to find a way to disable the cache files as I only need to run qsv stats once per file and likely won't need it ever again afterward, so the cache isn't all that helpful for me.
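For an auto-load folder like this, a loader process could call qsv with caching turned off. The sketch below assumes the `--cache-threshold 0` behavior described earlier in this thread and that a `qsv` binary is on the PATH; the helper names `stats_command` and `run_stats` are hypothetical.

```python
import subprocess


def stats_command(csv_path: str, typesonly: bool = False) -> list[str]:
    """Build a qsv stats invocation that should not leave cache files behind.

    Assumes --cache-threshold 0 suppresses cache generation, as described
    in this thread.
    """
    cmd = ["qsv", "stats", "--cache-threshold", "0"]
    if typesonly:
        cmd.append("--typesonly")
    cmd.append(csv_path)
    return cmd


def run_stats(csv_path: str) -> str:
    """Run qsv stats once for a newly dropped file and return its CSV output."""
    result = subprocess.run(
        stats_command(csv_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes it easy to check the arguments without having qsv installed.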
That's good to know @chadbaldwin !
You may be interested to know that I added a new negative setting to --cache-threshold for your use case.
For example, if you set --cache-threshold -10005, stats will automatically create an index (which unlocks parallel processing and makes stats run at least 2-3x faster) when the input file size is greater than 10,005 bytes.
Further, after the stats run, it will auto-delete the index and the stats cache files because the --cache-threshold value ends with 5.
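The negative setting packs two meanings into one number, which is easier to see spelled out. Here is a rough Python sketch of how such a value could be interpreted, based only on the description above; `NegativeThreshold`, `parse_negative_threshold`, and `should_auto_index` are hypothetical names, not part of qsv.

```python
from dataclasses import dataclass


@dataclass
class NegativeThreshold:
    min_size_bytes: int      # auto-index inputs larger than this many bytes
    delete_after_run: bool   # remove index and cache files after the run


def parse_negative_threshold(value: int) -> NegativeThreshold:
    """Interpret a negative --cache-threshold value as described in this thread."""
    if value >= 0:
        raise ValueError("this sketch only covers negative values")
    magnitude = abs(value)
    # A magnitude ending in 5 signals that the files are cleaned up afterward.
    return NegativeThreshold(
        min_size_bytes=magnitude,
        delete_after_run=(magnitude % 10 == 5),
    )


def should_auto_index(file_size_bytes: int, setting: NegativeThreshold) -> bool:
    """An index is created only when the file is strictly larger than the size."""
    return file_size_bytes > setting.min_size_bytes
```

Under this reading, -10005 means "index files over 10,005 bytes and clean up afterward", while -10000 would index files over 10,000 bytes and keep the index and cache files.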
Describe the bug
When running qsv stats --typesonly my_file.csv, I get the stats in stdout, but it also writes two files next to the file I am reading in. I would prefer to not write any files.
To Reproduce
Steps to reproduce the behavior:
qsv stats iris.csv --typesonly
Expected behavior
I'm not sure if this is a bug but the behavior is surprising and it would be great if there were an option to not write out any files.
The docs describe an --output flag to write output. I would expect this function to only create output if set via a flag. If these files are necessary for other qsv commands, it would be helpful to include a flag to optionally not write them.
Screenshots/Backtrace/Sample Data
If applicable, add screenshots/backtraces/sample data to help explain your problem.
Desktop (please complete the following information):
qsv 0.127.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.622;python-3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)];to;polars-0.39.2;self_update-10-10;12.80 GiB-1016.88 MiB-4.13 GiB-16.00 GiB (aarch64-apple-darwin compiled with Rust 1.78) compiled