swiftcsv / SwiftCSV

CSV parser for Swift
MIT License

Support comments/metadata #135

armcknight closed 7 months ago

armcknight commented 7 months ago

I need to store some schema versioning information in my CSVs, and I found this document published by the W3C that recommends using `#`-prefixed comments to store such metadata. I don't know if that's the "standard" for CSVs; I also found RFC 4180, which states:

Surprisingly, while this format is very common, it has never been formally documented

So I don't know whether this is really a standard way of storing metadata, or whether there are other ways to do it.

As of now, when I parse a CSV like the following as an EnumeratedCSV and call enumerateAsDict:

#comment:hi!
Field1,Field2
1,2
3,4
5,6

the first key-value pair I get is "#comment:hi!": "Field1", whereas I expected to get "Field1": "1" and so on. So it appears the parser is treating the metadata line as the header row. I'd expect it to skip any lines starting with #.
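
To make the expected behavior concrete, here is a minimal sketch that separates leading `#` lines from the CSV body. `splitMetadata` is a hypothetical helper, not part of SwiftCSV, and it assumes each metadata line has the `#key:value` shape used in the sample above.

```swift
import Foundation

/// Hypothetical helper (not part of SwiftCSV): separates leading
/// `#key:value` lines from the CSV body that follows them.
func splitMetadata(from text: String) -> (metadata: [String: String], body: String) {
    var metadata: [String: String] = [:]
    // Split on any newline character, including "\r\n", keeping empty lines.
    let lines = text.split(omittingEmptySubsequences: false, whereSeparator: \.isNewline)
    var bodyStart = lines.startIndex

    for line in lines {
        // Stop at the first line that is not a comment; that is the header row.
        guard line.hasPrefix("#") else { break }
        // Split at the first colon only, so values may themselves contain colons.
        let parts = line.dropFirst().split(separator: ":", maxSplits: 1, omittingEmptySubsequences: false)
        if parts.count == 2 {
            metadata[String(parts[0])] = String(parts[1])
        }
        bodyStart += 1
    }

    let body = lines[bodyStart...].joined(separator: "\n")
    return (metadata, body)
}
```

With the sample above, `metadata` would hold the `comment` entry and `body` would begin at the `Field1,Field2` line, which is what I'd want the parser to see as the header.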

lardieri commented 7 months ago

I've never seen CSV files that had comments, nor have I heard of any such thing until reading the W3C document you linked.

These comment lines seem meant for data envelopes that wrap CSV and are read by humans, rather than for a standalone CSV file consumed purely by programs.

If you are working with datasets that include such comments, I recommend that you write Swift code to preprocess the input before passing it to the parser.
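
A minimal sketch of such preprocessing, using only Foundation, might look like the following. The name `stripCommentLines` and its `prefix` parameter are assumptions made for illustration; nothing here is part of SwiftCSV.

```swift
import Foundation

/// Hypothetical helper (not part of SwiftCSV): drops every line whose
/// first character is the comment prefix and rejoins the rest with "\n".
func stripCommentLines(from text: String, prefix: Character = "#") -> String {
    text
        // Split on any newline character, including "\r\n", keeping empty lines.
        .split(omittingEmptySubsequences: false, whereSeparator: \.isNewline)
        // Keep only lines that do not start with the prefix.
        .filter { $0.first != prefix }
        .joined(separator: "\n")
}

let raw = "#comment:hi!\nField1,Field2\n1,2\n3,4\n5,6"
let cleaned = stripCommentLines(from: raw)
```

The cleaned string can then be given to EnumeratedCSV in place of the raw file contents. One caveat: because this works line by line, it would also drop a line inside a quoted multi-line field if that line happened to start with `#`.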

Also, you mentioned that you want to annotate the file with schema versioning information. A CSV file's schema is generally described completely by its header line, with the obvious caveat that type information (e.g. dates) must be inferred. If column "Foo" was introduced in schema version 2, and you're looking at a file whose header line doesn't include column "Foo," well then it must be version 1.
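
As a rough sketch of that inference, assuming you already have the header names as an array of strings: the `columnsIntroduced` table and the `inferSchemaVersion` function are hypothetical names, and "Foo" is the example column from above.

```swift
/// Hypothetical table: the schema version in which each column first appeared.
let columnsIntroduced: [(version: Int, column: String)] = [
    (version: 2, column: "Foo")
]

/// Returns the highest schema version whose marker column is present
/// in the header, or 1 if none of the later columns appear.
func inferSchemaVersion(header: [String]) -> Int {
    var version = 1
    for entry in columnsIntroduced where header.contains(entry.column) {
        version = max(version, entry.version)
    }
    return version
}
```

This only works if every schema change adds a column. A change that alters the meaning or format of an existing column would not be visible in the header.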

Finally, if you need a very precise representation of your data, and its schema needs to be self-documenting, you may need to look at other data formats. XML, for example, may seem old-fashioned to younger programmers, but it is specifically designed to store structured data with arbitrary attributes, and techniques for describing schemas and validating conformance are well established.

armcknight commented 7 months ago

I appreciate your detailed response. It sounds like preprocessing the contents is the way to go! I don't have anything against XML, and I agree with everything you said about it, but in my use case I want to be able to open the CSV file directly in a spreadsheet program, which Numbers.app does nicely.