Open jqnatividad opened 5 days ago
For qsv pro, add the option to infer custom formats using luau or python scripts. These scripts will also be able to look up reference data maintained in https://data.dathere.com (e.g. ISO code tables, congressional district, school district, Census geoid, etc.), other CKAN instances, and internal databases/data sources.
Make this a new "smart" command instead, so we don't overload `stats` with options.
Still, the `format` command will just add a `format` column to `stats`'s output or the stats cache.
Date/Datetime formats
- When `--infer-date` is enabled, `format` should be set to the format used.
- Look at `min`, `max`, `median` and `modes` to see if they match one of the 19 date formats recognized by qsv-dateparser.

Location
- `--infer-location` flag
- Look at `min`, `max`, `median` and `modes` to see if they match common location formats - https://www.maptools.com/tutorials/lat_lon/formats
  - `latitude` format
  - `longitude` format

Email
- `--infer-email` flag
- Look at `min`, `max`, `median` and `modes` to see if they match common email formats using the email_address crate

Using the same approach above (looking at summary stats min, max, median, modes), also infer:
- `--infer-hostnames` option
- `--infer-ipaddress` option, for both `ipv4` and `ipv6` formats
- `--infer-phoneno` option
- `--infer-currency` option, adding currency symbol metadata to the format entry - e.g. "currency - USD ($)", "currency - JPY (¥)", "currency - PHP (₱)", "currency - ? ($)", etc. As some currency symbols like $ are used in several countries, it will use "?" instead of the three-letter ISO 4217 code if it cannot infer it.
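
To make the summary-stats approach concrete, here is a rough Python sketch of how a format could be inferred from `min`, `max`, `median` and `modes` alone. Everything in it is an assumption for illustration: the function names, the shape of the `summary` dict, the three-entry `DATE_FORMATS` list (a stand-in for the 19 formats qsv-dateparser recognizes), the simplified email regex (a stand-in for the email_address crate), and the small currency symbol table. It is not how qsv implements or would implement this.

```python
import ipaddress
import re
from datetime import datetime

# Hypothetical stand-ins for the real recognizers (assumptions, not qsv code).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S"]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# "$" maps to "?" because the symbol alone does not identify one currency.
CURRENCY_SYMBOLS = {"¥": "JPY", "₱": "PHP", "$": "?"}


def match_date(value):
    """Return a label naming the first date format that parses the value."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return f"date - {fmt}"
        except ValueError:
            continue
    return None


def match_ip(value):
    """Return 'ipaddress - ipv4' or 'ipaddress - ipv6', else None."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return None
    return f"ipaddress - ipv{ip.version}"


def match_email(value):
    return "email" if EMAIL_RE.match(value) else None


def match_currency(value):
    """Recognize a leading currency symbol followed by a plain amount."""
    value = value.strip()
    if not value:
        return None
    symbol = value[0]
    code = CURRENCY_SYMBOLS.get(symbol)
    if code is None:
        return None
    try:
        float(value[1:].replace(",", "").strip())
    except ValueError:
        return None
    return f"currency - {code} ({symbol})"


MATCHERS = [match_date, match_ip, match_email, match_currency]


def infer_format(summary):
    """Infer a format only if every non-empty summary value agrees on it."""
    values = [summary.get(key) for key in ("min", "max", "median")]
    values += summary.get("modes", [])
    values = [v for v in values if v]  # skip stats that are empty for this column
    if not values:
        return None
    for matcher in MATCHERS:
        labels = {matcher(v) for v in values}
        if len(labels) == 1 and None not in labels:
            return labels.pop()
    return None
```

For example, `infer_format({"min": "$1.50", "max": "$1,200.00", "modes": ["$9.99"]})` yields `"currency - ? ($)"`, since the `$` symbol on its own does not pin down an ISO 4217 code. Requiring all summary values to agree keeps the check cheap, but it can still be fooled by columns whose extremes happen to look well-formed.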
Also add a `-F, --infer-all-formats` convenience option.

If a CSV is indexed and the `--format-sample <sample_size>` option is used, randomly sample the CSV to further verify that the format inferred from the summary stats is correct.
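
The sampling check could look something like the Python sketch below. It is only an illustration under stated assumptions: an in-memory list stands in for random access into an indexed CSV, and `ip_label`, `verify_by_sampling` and the returned match ratio are hypothetical names and choices, not part of qsv.

```python
import ipaddress
import random


def ip_label(value):
    """Hypothetical matcher: label a value as ipv4/ipv6, or None if neither."""
    try:
        return f"ipaddress - ipv{ipaddress.ip_address(value).version}"
    except ValueError:
        return None


def verify_by_sampling(values, matcher, expected, sample_size, seed=None):
    """Return the fraction of randomly sampled values that match `expected`.

    `values` stands in for a column of an indexed CSV; sampling is done
    without replacement and capped at the number of available values.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    if not sample:
        return 0.0
    hits = sum(1 for v in sample if matcher(v) == expected)
    return hits / len(sample)
```

A caller might accept the inferred format only when the ratio is 1.0, or above some threshold if a few malformed values are tolerable; which policy is appropriate is left open here.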