jqnatividad / qsv

CSVs sliced, diced & analyzed.
The Unlicense

`stats` command writes output file even when `--output` is not set #1794

Closed mhkeller closed 1 week ago

mhkeller commented 1 week ago

Describe the bug

When running `qsv stats --typesonly my_file.csv`, I get the stats on stdout, but it also writes two files next to the file I am reading in:

`my_file.stats.csv`
`my_file.stats.csv.json`

I would prefer to not write any files.

To Reproduce

Steps to reproduce the behavior:

  1. Download this file: iris.csv
  2. Run `qsv stats iris.csv --typesonly`
  3. See that the files have been written to disk

Expected behavior

I'm not sure if this is a bug, but the behavior is surprising, and it would be great if there were an option to not write any files.

The docs describe an `--output` flag for writing output. I would expect the command to create output files only when that flag is set.

If these files are necessary for other qsv commands, it would be helpful to include a flag to optionally not write them.

Screenshots/Backtrace/Sample Data

If applicable, add screenshots/backtraces/sample data to help explain your problem.

Desktop (please complete the following information):

qsv 0.127.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.622;python-3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)];to;polars-0.39.2;self_update-10-10;12.80 GiB-1016.88 MiB-4.13 GiB-16.00 GiB (aarch64-apple-darwin compiled with Rust 1.78) compiled

jqnatividad commented 1 week ago

Hi @mhkeller, `stats` is the heart of qsv and the main reason why I wrote it.

As you inferred, it's used by other commands to do metadata and schema inferencing, among other things.

It's also used by `stats` itself to cache stats calculations. So when you try to compute stats on a very large file with expensive settings like `--everything` and `--infer-dates`, it first checks whether a previous, still-valid stats calculation is available, and it does that by looking at those two tiny files: `.stats.csv` is the latest stats result, and `.stats.csv.json` is the metadata of the previous stats run.
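Roughly, the validity check can be pictured like this. This is a hypothetical Python sketch of the idea, not qsv's actual Rust implementation, and the metadata field names (`size`, `mtime`) are illustrative assumptions:

```python
import json
import os

def cached_stats_valid(csv_path: str) -> bool:
    """Sketch: decide whether a previous stats run can be reused.

    Assumes the sidecar JSON records the source file's size and
    mtime -- hypothetical field names, not qsv's actual schema.
    """
    # my_file.csv -> my_file.stats.csv and my_file.stats.csv.json
    stats_path = csv_path.rsplit(".csv", 1)[0] + ".stats.csv"
    meta_path = stats_path + ".json"
    if not (os.path.exists(stats_path) and os.path.exists(meta_path)):
        return False
    with open(meta_path) as f:
        meta = json.load(f)
    st = os.stat(csv_path)
    # The cache is only valid if the source file hasn't changed since.
    return meta.get("size") == st.st_size and meta.get("mtime") == st.st_mtime
```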

In the field I work in, we deal primarily with large, historical CSVs that are typically static once exported from transaction systems. The cache helps there: `stats` returns results instantaneously when valid cached results are available.

Anyway, I'll add an option to suppress generating these cache files. I'll also add some logic to skip caching when the potential savings are too small (say, less than 5 seconds) to be worth it.

mhkeller commented 1 week ago

Thanks for the quick and thorough reply! I figured it had to do with some internal usage. That makes a lot of sense. An option to skip writing out the files would be great for my use case.

With the `--cache-threshold` strategy, I would set the threshold to a high number that it would likely never reach, I'm guessing?

Thanks in general for your work on this library. I have a more general question that I'll post over in Discussions.

jqnatividad commented 1 week ago

No worries... big fan of the data journalism work you and your team are doing at NYTimes BTW... 💯

FYI, during the first few days of the pandemic, I wrote a Selenium scraper to retrieve data from NYTimes and petitioned to have it released as open data instead, to which the team responded quickly. 😄

Anyways, as for the new `--cache-threshold` option: it will have a default of 5000 ms, and when set to zero it will suppress cache generation altogether, so you don't have to guesstimate a high threshold.
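The decision boils down to something like this (a hypothetical Python sketch of the semantics just described, not qsv's Rust source):

```python
def should_write_cache(elapsed_ms: int, cache_threshold_ms: int = 5000) -> bool:
    """Sketch of the positive --cache-threshold semantics.

    - 0 suppresses cache generation entirely
    - otherwise, cache only when the run took at least the threshold,
      so reusing the result would save meaningful time
    """
    if cache_threshold_ms == 0:
        return False
    return elapsed_ms >= cache_threshold_ms
```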

mhkeller commented 1 week ago

Ah thank you – that's so nice of you to say. And I'm glad you were able to get the data you were after – that tracking was a huge effort. (I was not involved but was a great admirer.)

That option for the flag makes sense, and I'm looking forward to trying it out! I was looking for a fast, portable way to check CSV types, so qsv is perfect.

mhkeller commented 1 week ago

Thanks for merging this so quickly!

chadbaldwin commented 1 week ago

Heh, this is perfect timing for me. I'm working with a directory of csv files that sort of works like an auto-load folder. If a CSV file is dropped into the directory, a process picks it up and tries to load it. I was hoping to find a way to disable the cache files as I only need to run qsv stats once per file and likely won't need it ever again afterward, so the cache isn't all that helpful for me.

jqnatividad commented 3 days ago

That's good to know @chadbaldwin !

You may be interested to know that I added a new negative-value setting to `--cache-threshold` for your use case.

https://github.com/jqnatividad/qsv/blob/15d00727866aeb3596280fa838d56170be0e7b47/src/cmd/stats.rs#L143-L153

For example, if you set `--cache-threshold -10005`, `stats` will automatically create an index (which unlocks parallel processing and makes `stats` run at least 2-3x faster) when the input file size is greater than 10,005 bytes.

Further, after the stats run, it will auto-delete the index and the stats cache files, because the `--cache-threshold` value ends in 5.
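In other words, the negative-value convention reads like this (a hypothetical Python sketch of the behavior described above; the real logic lives in `stats.rs`):

```python
def interpret_negative_threshold(cache_threshold: int, file_size: int):
    """Sketch of the negative --cache-threshold convention.

    A negative value -N means: auto-create an index when the input file
    is larger than N bytes; if N ends in 5, also auto-delete the index
    and the stats cache files after the run.
    """
    assert cache_threshold < 0
    size_limit = -cache_threshold       # e.g. -10005 -> 10005 bytes
    autoindex = file_size > size_limit
    autodelete = size_limit % 10 == 5   # a trailing 5 requests cleanup
    return autoindex, autodelete
```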