jqnatividad / qsv

CSVs sliced, diced & analyzed.
https://qsv.dathere.com
The Unlicense

`infer`: new command to infer additional dataset characteristics based on summary stats/frequency table #2184

Open jqnatividad opened 5 days ago

jqnatividad commented 5 days ago

Using the same approach as above (looking at the summary stats min, max, median, and modes), also infer:

Date/Datetime formats

Location

Email

Also add a -F, --infer-all-formats convenience option.
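The summary-stats approach could be sketched roughly as follows. This is only an illustration in Python, not qsv's implementation: the candidate pattern list, the loose email regex, and the `infer_format` helper are all assumptions made for the example. The idea is that a format is accepted only if every representative value (min, max, modes) matches it.

```python
import re
from datetime import datetime

# Hypothetical candidate date/datetime patterns; the issue does not specify a list.
CANDIDATE_DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S"]
# Deliberately loose email check, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def matches_date_format(value, fmt):
    """Return True if value parses under the given strptime pattern."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_format(representatives):
    """Return a format label if every representative value matches it, else None."""
    values = [v for v in representatives if v]
    if not values:
        return None
    for fmt in CANDIDATE_DATE_FORMATS:
        if all(matches_date_format(v, fmt) for v in values):
            return fmt
    if all(EMAIL_RE.match(v) for v in values):
        return "email"
    return None


# Representative values as they might appear in a column's min, max and modes.
print(infer_format(["2021-01-05", "2024-12-31", "2023-07-04"]))
print(infer_format(["ann@example.com", "zed@example.org"]))
```

Because only a handful of values per column are checked, this stays cheap, but it can be fooled by columns whose extremes happen to look well formed.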

If the CSV is indexed and the --format-sample <sample_size> option is used, randomly sample the CSV to verify that the format inferred from the summary stats is correct.
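A minimal sketch of the sampling check, again in Python and under stated assumptions: an in-memory list stands in for random access into an indexed CSV, and `verify_format_on_sample` is a hypothetical helper, not an existing qsv function. It returns the share of sampled values that parse under the inferred date pattern.

```python
import random
from datetime import datetime


def verify_format_on_sample(values, fmt, sample_size, seed=None):
    """Return the fraction of randomly sampled non-empty values that match fmt."""
    rng = random.Random(seed)
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    # Never ask for more items than are available.
    sample = rng.sample(non_empty, min(sample_size, len(non_empty)))
    hits = 0
    for value in sample:
        try:
            datetime.strptime(value, fmt)
            hits += 1
        except ValueError:
            pass
    return hits / len(sample)


column = ["2024-01-01", "2024-03-15", "not a date", "2024-12-31"]
print(verify_format_on_sample(column, "%Y-%m-%d", sample_size=4))
```

How high the matching share must be before the inferred format is accepted is a design choice that the issue leaves open.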

jqnatividad commented 5 days ago

For qsv pro, add the option to infer custom formats using Luau or Python scripts. These scripts will have the added ability to look up reference data maintained in https://data.dathere.com (e.g. ISO code tables, congressional districts, school districts, Census GEOIDs, etc.), other CKAN instances, and internal databases/data sources.
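To make the idea concrete, here is a hedged sketch of what a custom-format script might do. The script interface for qsv pro is not described in this thread, so the function name, the threshold, and the return label are all hypothetical, and a small hard-coded subset of ISO 3166-1 alpha-2 codes stands in for reference data that would really be fetched from a lookup source.

```python
# Stand-in for a reference table; a real script would load this from a lookup source.
ISO_COUNTRY_CODES = {"US", "CA", "MX", "PH", "DE", "FR", "JP"}


def infer_custom_format(values, reference, label, threshold=0.9):
    """Return label if at least `threshold` of the non-empty values are in reference."""
    non_empty = [v.strip().upper() for v in values if v and v.strip()]
    if not non_empty:
        return None
    share = sum(1 for v in non_empty if v in reference) / len(non_empty)
    return label if share >= threshold else None


print(infer_custom_format(["us", "CA", "ph", "DE"], ISO_COUNTRY_CODES, "iso3166-alpha2"))
```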

jqnatividad commented 3 days ago

Make this a new "smart" command instead, so we don't overload stats with options.

Still, the format command will just add a format column to the stats output or the stats cache.
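As a rough illustration of "just add a format column", the Python sketch below appends a `format` column to stats-style CSV text. The `add_format_column` helper and the sample rows are assumptions for the example; it keys on a `field` column, and how the stats cache would actually be updated is not covered here.

```python
import csv
import io


def add_format_column(stats_csv, inferred):
    """Return stats_csv with a trailing `format` column filled from `inferred`."""
    reader = csv.DictReader(io.StringIO(stats_csv))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(reader.fieldnames) + ["format"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Columns without an inferred format get an empty cell.
        row["format"] = inferred.get(row["field"], "")
        writer.writerow(row)
    return out.getvalue()


stats = "field,type\ncreated,Date\nname,String\n"
print(add_format_column(stats, {"created": "%Y-%m-%d"}))
```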