timClicks / thisthat

Data conversion utility
Other
37 stars 2 forks source link

CSV and TSV support? #2

Open simonw opened 1 year ago

simonw commented 1 year ago

I find myself wanting to convert data from CSV into other formats pretty often - it's one of the features in my sqlite-utils memory tool, but this tool looks like it might be faster and more appropriate.

The simplest mapping from CSV to JSON is to produce an array of JSON objects - one per row, using the headings from the first line of CSV as the object properties.

Going from JSON to CSV only really makes sense if the JSON is an array of objects where each object has the same keys.

This is hard, because you have to decide what to do if a later object introduces a key that hasn't been seen before. You can load the entire file into memory first or do a sequential scan to check for different keys, but neither answer is particularly elegant.

timClicks commented 1 year ago

CSV/TSV will be a very worthy, and perhaps essential, feature. It raises a few design questions, such whether we should convert fields to rich types where possible and how to handle for formats that inherently encode multiple records.

simonw commented 1 year ago

For sqlite-utils insert the way I handle type detection from CSV is with a class called TypeTracker: https://sqlite-utils.datasette.io/en/stable/python-api.html#detecting-column-types-using-typetracker

As the CSV file is processed it tracks the possible type of each column (looking for \d+ or the equivalent for floating point numbers for example) - then at the end if a column was ALL values that matched just integers it runs a database schema change to convert the column to INTEGER.

Does thisthat work by loading the entire data into memory before converting it? If so then this approach should work fine - if it's a streaming tool then you'd have to do two passes to detect the types first.

As for complex nested objects in CSV... I've been serializing those as JSON. I figure a correctly escaped JSON string embedded in a CSV value is Good Enough.