johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

supporting double quotes #4

Closed ftrotter closed 9 years ago

ftrotter commented 9 years ago

I love the idea of Miller. It is clearly a needed tool that is missing from the standard unix toolbox.

However, you really cannot say you have a tool that is designed to support csv, without supporting csv.

CSV is a standard file format, and has an RFC: https://tools.ietf.org/html/rfc4180

Not supporting double quotes is the same thing as saying that you do not support csv, since double quotes are central to the way that the standard handles other characters... comma being just one example. Your tool is young enough that supporting the standard now will make later development much simpler. This will prevent the situation years from now where you have a 'normal mode' and a 'standards mode'. If you make the change now you can just have the one correct mode.
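As a quick illustration of the RFC's quoting rules (sketched with Python's csv module, which follows them; this is not Miller code):

```python
import csv
import io

# RFC 4180 quoting rules in action: a field containing the delimiter or a
# double quote must be wrapped in double quotes, and embedded double
# quotes are escaped by doubling them.
buf = io.StringIO()
csv.writer(buf).writerow(['plain', 'has,comma', 'has "quote"'])
print(buf.getvalue())
# plain,"has,comma","has ""quote"""
```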

You have an ambitious work-list, but I would suggest taking a pause and thinking about how you will support the RFC version of the file format.

People like me (open data advocates) spend a lot of time trying to ensure that organizations that release csv do so under the standard format, rather than releasing unparsable garbage. Having a library like yours that supported the standard too would be a huge boon. I could say things like:

"See by using the RFC for your data output, all kinds of open tools will work out of the box on your data... like Miller (link)"

Thank you for working on such a clever tool...

Regards, -FT

johnkerl commented 9 years ago

Nice. Thanks for the RFC link. For me in my daily work -- where Miller is pretty stable featurewise -- CSV is a secondary format (hence its not being the default). But I suspected (and it now seems rightly so) that CSV will be the primary for most folks.

ve3ied commented 9 years ago

I just found out about Miller over the weekend. A Swiss Army knife! Looking forward to getting to know it!

I would suggest that if supporting double quotes comes at a performance cost, you offer a lightweight CSV option (the current behavior) alongside full CSV support. 99.9% of the time I don't need double-quote support, and I'm sure many others are in the same boat. I mostly need it for processing numbers from spreadsheet colleagues, and the current behavior is just fine as is. Thanks!!

vapniks commented 9 years ago

johnkerl: how easy/difficult would it be to add support for double quotes? Which bits of the code need to be changed?

SikhNerd commented 9 years ago

I came here to request the same, adding support for quoted fields would make this tool awesomely usable for a lot of use cases.

johnkerl commented 9 years ago

ftrotter is spot-on -- RFC-compliant CSV is the way to go. Not only on a tool-by-tool basis but because it encourages good data citizenship on the part of producers as well as consumers.

ve3ied is (I think) also right -- I suspect a fully compliant CSV parser will run a bit slower than the handles-simple-cases code I have now (which suffices for my personal use cases). I think the right thing to do is make --csv handle RFC-compliant CSV. Then, unless the performance is absolutely as good as the current csv-lite code (which would surprise me), have a --csvlite option which will invoke the current reader.

That said, RFC-compliant CSV is going to be a bit of a project, while double quotes per se, as a small step toward that goal, are going to be easy. The RFC states that if double quotes are present, they must wrap the entire field. So it'll be pretty simple to (a) add a state to the state machines in the c/input/lrec_reader*csv.c files to ignore FS (e.g. comma) inside quotes; (b) having delimited the field, flag an error if the field starts with a double quote but doesn't end with one; (c) ++ the start-of-field pointer and null-poke the end-of-field char; (d) wrap the field in double quotes on output if it contains an instance of FS.
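For illustration, steps (a) and (c) of that sketch can be modeled in a few lines of Python; this is only a toy model, not the actual C reader:

```python
def split_csv_line(line, fs=','):
    """Toy sketch of steps (a) and (c): ignore the field separator while
    inside double quotes, and drop the wrapping quote characters.
    Illustrative only; Miller's real reader is a C state machine in
    c/input/lrec_reader*csv.c."""
    fields, field, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            # Toggle the in-quotes state; FS is treated literally inside.
            in_quotes = not in_quotes
        elif ch == fs and not in_quotes:
            fields.append(''.join(field))
            field = []
        else:
            field.append(ch)
    fields.append(''.join(field))
    return fields

print(split_csv_line('1,"2,3",4'))
# ['1', '2,3', '4']
```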

It's clear I've led a charmed life by having CSV data which is almost entirely double-quote-free. In light of multiple feedbacks, the minimal double-quote-handling feature sketched above is definitely top-of-list.

johnkerl commented 9 years ago

I've split out https://github.com/johnkerl/miller/issues/16 for the more general case; I'll close this when double-quote handling is in place.

johnkerl commented 9 years ago

Primary question at this point: how/when to double-quote on output?

I'm OK with both options. The question is whether there's any desire for some sort of middle ground, e.g. if someone wants fields which were quoted on input to remain quoted on output. That's a bit more awkward. For transform operations which pass records through, e.g. mlr cat or mlr sort, one could put a was_quoted_on_input bit in each record and pass that through. But it gets messier for mlr put operations, e.g. doing $c = $a . $b when field a was quoted and field b was not. Should c be quoted?

In short I'm happy to implement output-quote-when-needed and output-quote-always modes; I'm not eager to do output-quote-tracked-from-input-quotes unless it's a really important use-case.
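For concreteness, the two modes look like this in Python's csv module terms (a sketch for discussion, not Miller code):

```python
import csv
import io

def render(row, quoting):
    # Write one row with the given quoting policy and return the text.
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    return buf.getvalue()

row = ['1', 'a,b', 'plain']
print(render(row, csv.QUOTE_MINIMAL), end='')  # quote-when-needed
print(render(row, csv.QUOTE_ALL), end='')      # quote-always
# 1,"a,b",plain
# "1","a,b","plain"
```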

SikhNerd commented 9 years ago

Perhaps @ftrotter knows of a more proper answer, but I just did a quick survey of some of the more common tools I use and how they handle quoting on output. I hope this adds to the conversation:

Python's csv module

Provides four options and defaults to QUOTE_MINIMAL.

With the defaults $c = $a . $b is quoted if a delimiter exists.

Perl's Text::CSV

Also provides four options; the default behaviour is to quote fields only when they need it, e.g. when they contain the separator.

With the defaults $c = $a . $b is quoted if a delimiter exists.

Csvfix

Provides two options on quoting, and defaults to quoting all fields.

I did a quick test and the $c = $a . $b case here is quoted by default, quoted with -smq, and you can force it to be fully unquoted with -sqf.

CsvKit

Provides four options for quoting, defaults to quoting only when a separator is present.

I did not get time to test the $c = $a . $b case here, sorry. I expect it would quote if a delimiter exists based on the documentation.

My personal preferences are exactly the two options you laid out (quote minimally and quote everything) and to have $c quoted only when a separator is present.
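To make the $c = $a . $b case concrete under Python's defaults (a sketch; the field values are made up):

```python
import csv
import io

a = '2,3'   # was quoted on input, since it contains the separator
b = '4'     # was not quoted on input
c = a + b   # the $c = $a . $b case

buf = io.StringIO()
csv.writer(buf).writerow([a, b, c])  # QUOTE_MINIMAL is the default
print(buf.getvalue(), end='')
# "2,3",4,"2,34"  <- c is quoted because its value contains a separator
```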

Thank you for working on this :)

ftrotter commented 9 years ago

Actually @SikhNerd has just pointed out the "real" reason to support double quotes all the time on input, rather than supporting any 'modes'. So many of the standard CSV output systems will default to using no quotes for field exports that contain only numbers, but then automatically include quotes when something "turns alphanumeric" or "turns alphanumeric with commas". For instance a standard input file might go for years with a structure of

id,phone_number
1,1112223333
2,1112223334
....
n,7778889999

but then suddenly someone enters a phone number with an extension and then the identical export will get

id,phone_number
1,1112223333
2,"1112223334x509"
....
n,7778889999

or worse:

id,phone_number
1,"1112223333"
2,"1112223334x509"
....
n,"7778889999"

In the same way, you might have an export process that works like this:

id,sentence
1,The quick brown fox jumps over the lazy dog
2,"The quick, brown fox jumps over the lazy dog"

In this case the quotes are only exercised when a comma appears in the string. So to stably use a system like Miller, you really need to handle all of those cases really smoothly. Otherwise, later changes in data contents or data structure could produce confusing, hard-to-find bugs.
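The point can be checked mechanically: to an RFC-compliant reader, a quoted and an unquoted rendering of the same values are indistinguishable (sketched here with Python's csv module; the phone-number data is made up):

```python
import csv
import io

# The exporter may or may not quote the phone number over time; a
# compliant reader parses both renderings to the same values.
quoted   = io.StringIO('id,phone_number\r\n1,"1112223333"\r\n')
unquoted = io.StringIO('id,phone_number\r\n1,1112223333\r\n')
assert list(csv.reader(quoted)) == list(csv.reader(unquoted))
```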

Again thanks for working on such a great tool!!

-FT

johnkerl commented 9 years ago

@ftrotter thank you! you've got a lot more experience with dirty CSV than I do. :)

johnkerl commented 9 years ago

Rethink; I'm duping https://github.com/johnkerl/miller/issues/16 to this one. No need for a two-task approach.

johnkerl commented 9 years ago

Had a nice chunk of time this evening to sketch out the new CSV parser. Proof of concept works. Now just a matter of integrating into the rest of the framework, and plenty of unit-test & regression-test cases, & checking against several found-in-the-wild CSV files. There are several in miller issues, I think enough to work with.

SikhNerd commented 9 years ago

If you can push up the branch you're working on, I'd be very interested in watching what you're doing, and will try and play around with it to help if I find time :)

johnkerl commented 9 years ago

it's all in master -- this evening's pushes -- every bit compiles; there are unit-tests for some stable things; all in backward-compatibility mode so nothing is reachable from the mlr executable as yet. c/experimental/csv0.c is the proof-of-concept file.

johnkerl commented 9 years ago

Progress point:

$ cat b.tmp
a,b,c
1,"2,3",4

$ mlr --csvex --odkvp --ofs ';' cat b.tmp
a=1;b=2,3;c=4

(--csvex is transitional until --csv becomes --csvlite and --csvex becomes --csv)

johnkerl commented 9 years ago

At present this CSV reader is about 3.5x slower than the CSV-lite reader. Some performance optimizations can be made after the functionality is delivered but I very much doubt it will ever be as fast as CSV-lite.

$ time mlr --csvlite cat ../data/big.csv > /dev/null

real 0m1.456s
user 0m1.407s
sys  0m0.044s

$ time mlr --csvex cat ../data/big.csv > /dev/null

real 0m6.738s
user 0m6.593s
sys  0m0.136s

$ wc -l ../data/big.csv
 1000001 ../data/big.csv

johnkerl commented 9 years ago

OK, iterating. If you use --csvex as of the most recent push you'll get the RFC-CSV reader/writer.

jungle-boogie commented 9 years ago

Hi @johnkerl,

So is this an example of your query working from master compiled mlr?:

mlr --csvlite --opprint count-distinct -f ST,ClosingDate

or

mlr --csvex --opprint count-distinct -f ST,ClosingDate

johnkerl commented 9 years ago

At the moment:

In a day or two once I do some code cleanup & add some test cases:

jungle-boogie commented 9 years ago

:+1:

ds108j commented 9 years ago

Hello.

I thank you for this update. I have CSV files where the field separator is a space, and even though --csvex works with "," as the field separator, it doesn't seem to work when you want to change the FS.

Example:

$ cat test
1,2,3,4
ab,ac,"a,d",ae

$ mlr --csvex --odkvp cat test
1=ab,2=ac,3=a,d,4=ae

This case is OK, but if I change my FS to space:

$ cat test2
1 2 3 4
ab ac "a d" ae

$ mlr --csvex -p --odkvp cat test2
1=1 2=2 3=3 4=4
1=ab 2=ac 3="a 4=d" 5=ae

How can I use the --csvex with space separator ?

Thanks.

johnkerl commented 9 years ago

Right. --csvlite was, and is, the same not-really-CSV as I originally released, with programmable RS/FS (e.g. you can do TSV, or spaces), whereas --csvex (soon to be --csv) will be only RFC-4180 CSV. RS will be hardcoded to CRLF and FS will be hardcoded to comma. This is to deliver, as soon as possible, RFC-compliant CSV since @ftrotter hit that nail on the head. (As I noted above, read performance is significantly slower than CSV-lite.) Using semantic versioning, this will be Miller v2.0.0.

In the next minor, v2.1.0, I'll do a bit more:

johnkerl commented 9 years ago

I'll put the above info in the 2.0.0 release note as well.

johnkerl commented 9 years ago

Please see https://github.com/johnkerl/miller/releases/tag/v2.0.0. If you find any issues, please open a new issue as a bugfix request.

Thank you thank you thank you for the very informative feedback!!! :)

johnkerl commented 9 years ago

@ftrotter and @SikhNerd tagging you here for closure.

SikhNerd commented 9 years ago

Looks good :100: I haven't been able to break it yet.

johnkerl commented 9 years ago

:D

indera commented 8 years ago

What would you recommend to do when the input file has a broken line, like line 3 below?

A, B, C
1, 2, 3
"x"","abc", "cde"

Is there a way to fix it using miller commands?

johnkerl commented 8 years ago

Good question.

On @ftrotter's advice I made the CSV parser RFC-compliant. I didn't put any effort into handling non-compliant CSV (although there's a lot of it out there in the world).

I don't know how non-compliant your data is beyond the above 3-line example, so I'm unsure how to point you at a general solution, but a few ways to handle the above 3-line example are:

  1. cat yourfile.csv | sed 's/"//g' | mlr --csv ...
  2. cat yourfile.csv | sed 's/""/"/g' | mlr --csv ...
  3. mlr --csvlite put '$A =~ "\"" { $A=gsub($A,"\"", "") }' yourfile.csv

For option 3 I'm suggesting the csvlite parser, which isn't RFC-compliant and treats quotes like any other character.

Also note that for option 3 you'll need head Miller, since I introduced the necessary "\"", as well as the pattern-action syntax, after the most recent release (3.4.0).
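A rough Python analogue of option 1 (strip every double quote, then parse; lossy if a field legitimately contains an embedded comma, but fine for the 3-line example, which is the data used below):

```python
import csv
import io

raw = 'A, B, C\n1, 2, 3\n"x"","abc", "cde"\n'
cleaned = raw.replace('"', '')           # option 1: drop all double quotes
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows[2])
# ['x', 'abc', ' cde']
```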

johnkerl commented 8 years ago

4th option: mlr --csvlite put -S '$A=gsub($A,"\"", "")' yourfile.csv

indera commented 8 years ago

@johnkerl this is awesome :+1:

--------------
mlr --version
Miller 3.4.0
---------------
mlr --icsv --opprint --rs lf cat ha.csv
mlr: syntax error: unwrapped double quote at line 2.
-----------------
mlr --csvlite put -S '$A=gsub($A,"\"", "")' ha.csv
A, B, C
1, 2, 3
x,"abc", "cde"

johnkerl commented 8 years ago

:)

ftrotter commented 8 years ago

Have not visited this thread in a while and I am thrilled to see the progress.

I do think that handling non-standard CSV is a good feature, but barfing with an error message is almost a better one.. that way I can go to CSV producers and say "see... get your shit together"...

Would also like to point out that the performance mentioned here is excellent for standards-compliant CSV processing; just because it is "worse" does not mean it cannot also be "excellent".

It is quite a compliment to see such a complex feature that I suggested get implemented and closed. I will do my best to promote this to all the right people. Moving to Twitter now!!

-FT