simonw opened this issue 1 year ago (status: Open)
CSV/TSV support would be a very worthy, and perhaps essential, feature. It raises a few design questions, such as whether we should convert fields to rich types where possible, and how to handle formats that inherently encode multiple records.
For `sqlite-utils insert`, the way I handle type detection from CSV is with a class called `TypeTracker`: https://sqlite-utils.datasette.io/en/stable/python-api.html#detecting-column-types-using-typetracker

As the CSV file is processed, it tracks the possible type of each column (looking for `\d+`, or the equivalent pattern for floating point numbers, for example). Then, at the end, if every value in a column matched the integer pattern, it runs a database schema change to convert that column to `INTEGER`.
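The core idea can be sketched in pure Python. This is a simplified illustration of the technique, not sqlite-utils' actual `TypeTracker` implementation, and the class and regex names here are invented for the example:

```python
import re

# Regexes approximating "could this value be an integer / a float?"
INTEGER_RE = re.compile(r"^-?\d+$")
FLOAT_RE = re.compile(r"^-?\d+(\.\d+)?$")

class SimpleTypeTracker:
    """Track which types remain possible for each column as rows stream by."""

    def __init__(self):
        self.types = {}  # column name -> set of still-possible types

    def track_row(self, row):
        for column, value in row.items():
            possible = self.types.setdefault(
                column, {"INTEGER", "FLOAT", "TEXT"}
            )
            # Narrow the set: once any value fails a pattern,
            # that type is ruled out for the whole column.
            if not INTEGER_RE.match(value):
                possible.discard("INTEGER")
            if not FLOAT_RE.match(value):
                possible.discard("FLOAT")

    def best_type(self, column):
        # Prefer the most specific type that every value matched.
        possible = self.types.get(column, {"TEXT"})
        for candidate in ("INTEGER", "FLOAT", "TEXT"):
            if candidate in possible:
                return candidate

tracker = SimpleTypeTracker()
for row in [{"id": "1", "score": "2.5"}, {"id": "2", "score": "3"}]:
    tracker.track_row(row)
```

After the loop, `best_type("id")` is `"INTEGER"` and `best_type("score")` is `"FLOAT"`, because `"2.5"` ruled out the integer pattern for that column.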
Does this work by loading the entire data into memory before converting it? If so, then this approach should work fine; if it's a streaming tool, then you'd have to do two passes over the file to detect the types first.
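The two-pass streaming approach could look something like this. A hedged sketch: the function names and the integer-only detection are illustrative, not anything this tool actually implements:

```python
import csv
import re

INTEGER_RE = re.compile(r"^-?\d+$")

def detect_integer_columns(path):
    # Pass 1: stream the file once, recording which columns
    # contained only integer-looking values.
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        integer_columns = set(reader.fieldnames or [])
        for row in reader:
            for column in list(integer_columns):
                if not INTEGER_RE.match(row[column]):
                    integer_columns.discard(column)
    return integer_columns

def convert_rows(path, integer_columns):
    # Pass 2: stream again, converting values in the detected columns.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                column: int(value) if column in integer_columns else value
                for column, value in row.items()
            }

# Hypothetical usage with a small sample file:
import os
import tempfile

with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    f.write("id,name\n1,Alice\n2,Bob\n")
    sample = f.name
columns = detect_integer_columns(sample)
rows = list(convert_rows(sample, columns))
os.unlink(sample)
```

Memory stays bounded either way; the cost is reading the input twice, which is why the in-memory approach is simpler when the data fits.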
As for complex nested objects in CSV... I've been serializing those as JSON. I figure a correctly escaped JSON string embedded in a CSV value is Good Enough.
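That round-trip works cleanly with the standard library, since the `csv` module handles quoting the embedded quotes and commas (the column names here are just for illustration):

```python
import csv
import io
import json

# Serialize a nested value as a JSON string inside one CSV cell.
rows = [{"id": 1, "tags": json.dumps(["a", "b"])}]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "tags"])
writer.writeheader()
writer.writerows(rows)

# Reading it back recovers the nested structure intact.
parsed = list(csv.DictReader(io.StringIO(buffer.getvalue())))
tags = json.loads(parsed[0]["tags"])
```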
I find myself wanting to convert data from CSV into other formats pretty often; it's one of the features in my `sqlite-utils memory` tool, but this tool looks like it might be faster and more appropriate.

The simplest mapping from CSV to JSON is to produce an array of JSON objects, one per row, using the headings from the first line of the CSV as the object properties.
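That mapping is essentially what `csv.DictReader` plus `json.dumps` gives you. A minimal sketch; note every value comes out as a string unless type detection is applied afterwards:

```python
import csv
import io
import json

csv_text = "name,age\nAlice,30\nBob,25\n"

# One JSON object per row, keyed by the header line.
objects = list(csv.DictReader(io.StringIO(csv_text)))
json_output = json.dumps(objects)
```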
Going from JSON to CSV only really makes sense if the JSON is an array of objects where each object has the same keys.
This is hard, because you have to decide what to do if a later object introduces a key that hasn't been seen before. You can load the entire file into memory first, or do an initial sequential scan just to collect the keys, but neither answer is particularly elegant.
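The load-everything approach can be sketched as follows: collect the union of keys in a first pass over the in-memory list, then write rows with missing keys left blank. Key ordering is a design choice; this sketch keeps first-seen order, and the function name is invented for the example:

```python
import csv
import io

def json_objects_to_csv(objects):
    # Pass 1 (in memory): collect every key, preserving first-seen
    # order, so a key introduced by a later object still gets a column.
    fieldnames = []
    for obj in objects:
        for key in obj:
            if key not in fieldnames:
                fieldnames.append(key)
    # Pass 2: write rows; restval="" fills columns missing from a row.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(objects)
    return buffer.getvalue()

output = json_objects_to_csv([
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
])
```

A streaming variant would instead need that second sequential scan over the file just to discover the keys, which is the inelegance described above.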