pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Reading/writing of W3C-style embedded metadata in CSV, TSV files #25379

Open nedclimaterisk opened 5 years ago

nedclimaterisk commented 5 years ago

Related to #2485

The W3C Tabular Data Model recommendation describes embedded metadata that can include arbitrary text data, as well as column-specific metadata, such as column data types.

It would be very nice if Pandas could read metadata like this. There is a section with an example of CSV/TSV header metadata that might make a good starting point. The full recommendation seems somewhat vague, but perhaps that means that Pandas could help to define some more specific standards.

Perhaps a YAML header behind # characters, where some known variable names (e.g. datatype) are captured for use in reading the rest of the file, and any remaining unused YAML data is added to a df.metadata dictionary?
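
A minimal sketch of how such a commented header might be handled, under several assumptions: the header is limited to flat `key: value` lines (a tiny subset of YAML, parsed by hand here so the example needs no YAML library), `datatype` is a comma-separated list in column order, and the function names and `KNOWN_KEYS` are hypothetical. Since `df.metadata` is only a proposal, the leftover metadata is returned alongside the DataFrame rather than attached to it.

```python
import csv
import io

KNOWN_KEYS = {"datatype"}  # hypothetical: keys the reader would act on


def split_commented_header(text):
    """Split leading '#' lines from the CSV body.

    Returns (known, metadata, body). Only flat 'key: value' lines are
    understood, which is an assumption of this sketch; a real reader
    would presumably pass the header to a proper YAML parser.
    """
    known, metadata = {}, {}
    lines = text.splitlines(keepends=True)
    n_header = 0
    for line in lines:
        if not line.startswith("#"):
            break
        n_header += 1
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # plain comment line, not 'key: value'
        key, value = key.strip(), value.strip()
        if key in KNOWN_KEYS:
            known[key] = [v.strip() for v in value.split(",")]
        else:
            metadata[key] = value
    return known, metadata, "".join(lines[n_header:])


def read_csv_with_header(text):
    """Read the body with pandas, applying any declared datatypes."""
    import pandas as pd  # imported lazily; only this function needs pandas

    known, metadata, body = split_commented_header(text)
    # Column names come from the first row of the body.
    columns = next(csv.reader(io.StringIO(body)), [])
    dtypes = dict(zip(columns, known.get("datatype", [])))
    df = pd.read_csv(io.StringIO(body), dtype=dtypes or None)
    return df, metadata
```

With a file whose header reads `# title: Demo` and `# datatype: int64, float64`, the first function would yield the datatypes separately from the free-form `title` entry, and the second would pass them to `pandas.read_csv` through its `dtype` parameter.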

WillAyd commented 5 years ago

Thanks - I wasn't even aware of this. I think this is an interesting idea and would agree that the datatype annotations seem like a logical starting point.

PRs are always welcome if you have an idea on how to implement this.

jbrockmendel commented 4 years ago

how common is this format in the wild?

naught101 commented 4 years ago

Probably not very common at all, but it's a recommended spec, CSV metadata management is a real PITA, and this seems to solve it. Getting it added to the most popular CSV manipulation library around would really help make it more common, I reckon.

There are also potential side-benefits, for example the #datatype declaration would allow immediate inference of datatypes without having to scan the first 100 lines of the CSV.
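
To illustrate that side-benefit, here is a rough sketch that reads a `#datatype` row and turns it into a `dtype` mapping before any data rows are touched. The layout (metadata rows whose first cell starts with `#`, followed by the column-name row) is modelled loosely on the embedded-metadata example in the recommendation. The `TYPE_MAP` translation to pandas dtype strings and both function names are assumptions made for illustration, not part of pandas or the spec.

```python
import csv
import io

# Hypothetical mapping from declared type names to pandas dtype strings.
TYPE_MAP = {
    "string": "object",
    "integer": "int64",
    "number": "float64",
    "boolean": "bool",
}


def dtypes_from_datatype_row(text):
    """Return ({column: dtype}, n_meta) from the leading '#' rows.

    n_meta counts metadata rows; it equals the number of physical lines
    to skip as long as no metadata cell contains an embedded newline.
    """
    declared, header, n_meta = [], [], 0
    for row in csv.reader(io.StringIO(text)):
        if row and row[0].startswith("#"):
            n_meta += 1
            if row[0] == "#datatype":
                declared = row[1:]
        else:
            header = row  # first non-metadata row holds the column names
            break  # stop here: no data rows are scanned
    # Types missing from TYPE_MAP are left to pandas' usual inference.
    dtypes = {c: TYPE_MAP[t] for c, t in zip(header, declared) if t in TYPE_MAP}
    return dtypes, n_meta


def read_annotated_csv(text):
    """Read the table with pandas using the declared datatypes."""
    import pandas as pd  # imported lazily; only this function needs pandas

    dtypes, n_meta = dtypes_from_datatype_row(text)
    return pd.read_csv(io.StringIO(text), skiprows=n_meta, dtype=dtypes or None)
```

Only the metadata rows and the column-name row are read before the dtypes are known, so the reader would not need to sample data rows to guess types for the declared columns.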