johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

Miller does not handle multiple CSVs with different headers well #1200

Open aragilar opened 1 year ago

aragilar commented 1 year ago

It seems Miller isn't able to concatenate CSV files with a varying number of columns. The best description of the problem is this Stack Overflow question: https://stackoverflow.com/questions/68090301/merging-multiple-csvs-with-different-columns

What appears to happen is that, if Year2002.csv comes first, the headers of the later files are emitted partway through the output, as if you had run cat, rather than the Year2002.csv rows being padded with blank columns.

aragilar commented 1 year ago

Ah, the trick is to use unsparsify. It would be good if this were more widely mentioned (and possibly offered as an option when passing multiple files).
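For anyone landing here later, the invocation is presumably something like `mlr --csv unsparsify *.csv`. As I understand it, unsparsify takes the union of field names across all records and fills in the missing values (with an empty string by default). Here is a rough Python sketch of that idea, not Miller's actual implementation; the `unsparsify` function and the sample data standing in for the yearly files are made up for illustration.

```python
import csv
import io


def unsparsify(tables, fill=""):
    """Return (header, rows) over the union of field names.

    Hypothetical helper: mimics the idea behind Miller's unsparsify verb,
    not its implementation. Field order is first-seen order.
    """
    records = []
    for text in tables:
        # Each input is parsed with its own header line.
        records.extend(csv.DictReader(io.StringIO(text)))

    # Build the union of field names, keeping first-seen order.
    header = []
    for rec in records:
        for key in rec:
            if key not in header:
                header.append(key)

    # Pad every record out to the full header.
    rows = [[rec.get(key, fill) for key in header] for rec in records]
    return header, rows


# Made-up stand-ins for two yearly files with different columns.
year2002 = "id,name\n1,alice\n"
year2003 = "id,name,email\n2,bob,bob@example.com\n"

header, rows = unsparsify([year2002, year2003])
print(",".join(header))
for row in rows:
    print(",".join(row))
```

Like the real verb, this has to read all of the input before it can emit anything, since the full set of field names is not known until the last record has been seen.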

aborruso commented 1 year ago

> Ah, the trick is to use unsparsify, it would be good if this was more widely mentioned (and possibly if this was an option when passing multiple files)

Hi @aragilar, it's not a trick: it's a standard feature, and you can already use it with multiple files.

Miller handles record heterogeneity natively, and its default format is not rectangular.

aborruso commented 1 year ago

@aragilar, I think you can close this.

aragilar commented 1 year ago

I think the main issue is that this is something of a footgun: the order of the input files changes how they are parsed. Ideally there would be a warning that the headers are inconsistent, with a suggestion to use unsparsify to clean up the initial set of files. Failing that, a callout in the CSV sections of the docs (and presumably for other formats that assume homogeneity) would be better than leaving users flailing about and questioning whether Miller is working correctly.
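To make the kind of warning I have in mind concrete, here is a small Python sketch of a pre-check that compares each file's header with the first one. This is not an existing Miller feature. The function name, the message wording, and the sample data (including the Year2003.csv name) are all assumptions for illustration.

```python
import csv
import io


def inconsistent_headers(named_tables):
    """Return the names of inputs whose header differs from the first input's.

    Hypothetical pre-check, not part of Miller. Each item is a
    (name, csv_text) pair; only the first line of each input is inspected.
    """
    first_header = None
    mismatched = []
    for name, text in named_tables:
        # An empty input yields an empty header.
        header = next(csv.reader(io.StringIO(text)), [])
        if first_header is None:
            first_header = header
        elif header != first_header:
            mismatched.append(name)
    return mismatched


# Made-up sample inputs; in practice these would be read from disk.
inputs = [
    ("Year2002.csv", "id,name\n1,alice\n"),
    ("Year2003.csv", "id,name,email\n2,bob,bob@example.com\n"),
]

bad = inconsistent_headers(inputs)
if bad:
    print("warning: headers differ from the first file in:", ", ".join(bad))
    print("hint: consider running the files through unsparsify first")
```

A check like this only needs the first line of each file, so it would be cheap to run before any records are processed.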

aborruso commented 1 year ago

I suggested closing it because you had found out how to do it.

OK, but since you are thinking of this as a feature request, I'm tagging @johnkerl.