Closed ftrotter closed 9 years ago
Nice. Thanks for the RFC link. For me in my daily work -- where Miller is pretty stable featurewise -- CSV is a secondary format (hence its not being the default). But I suspected (and it now seems rightly so) that CSV will be the primary for most folks.
I just found out about Miller over the weekend. It's a Swiss Army knife, and I'm looking forward to getting to know it!
I would suggest that if supporting double quotes comes at a performance cost, you offer a lightweight CSV option (the current behavior) alongside full CSV support. 99.9% of the time, I don't need double-quote support and I'm sure many others are in the same boat. I mostly need it for processing numbers from spreadsheet colleagues, and the current behavior is just fine as is. Thanks!!
johnkerl: how easy/difficult would it be to add support for double quotes? Which bits of the code need to be changed?
I came here to request the same, adding support for quoted fields would make this tool awesomely usable for a lot of use cases.
ftrotter is spot-on -- RFC-compliant CSV is the way to go. Not only on a tool-by-tool basis but because it encourages good data citizenship on the part of producers as well as consumers.
ve3ied is (I think) also right -- I suspect a fully compliant CSV parser will run a bit slower than the handles-simple-cases code I have now (which suffices for my personal use cases). I think the right thing to do is make `--csv` handle RFC-compliant CSV. Then, unless the performance is absolutely as good as the current csv-lite code (which would surprise me), have a `--csvlite` option which will invoke the current reader.
That said, RFC-compliant CSV is going to be a bit of a project, while double quotes per se, as a small step toward that goal, will be easy. The RFC states that if double quotes are present, they must wrap the entire field. So it'll be pretty simple to (a) add a state to the state machines in the `c/input/lrec_reader*csv.c` files to ignore FS (e.g. comma) inside quotes; (b) having delimited the field, flag an error if the field starts with a double quote but doesn't end with one; (c) increment the start-of-field pointer and null-poke the end-of-field character; (d) wrap the field in double quotes on output if it contains an instance of FS.
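The four steps above can be sketched in miniature. This is an illustrative Python sketch only -- not Miller's actual C code -- and the function names (`split_csv_line`, `join_csv_line`) are invented for the example:

```python
# Illustrative sketch of steps (a)-(d) above; not Miller's C implementation.
def split_csv_line(line, fs=","):
    """Split one CSV line, ignoring FS inside double quotes."""
    fields, field, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes       # (a) track quote state
        elif ch == fs and not in_quotes:
            fields.append("".join(field))   # FS outside quotes delimits a field
            field = []
        else:
            field.append(ch)
    if in_quotes:
        # (b) a field that opens with a double quote must close with one
        raise ValueError("unterminated double quote")
    fields.append("".join(field))           # (c) wrapping quotes already dropped
    return fields

def join_csv_line(fields, fs=","):
    # (d) wrap a field in double quotes on output if it contains FS
    return fs.join('"%s"' % f if fs in f else f for f in fields)
```

For example, `split_csv_line('1,"2,3",4')` yields `['1', '2,3', '4']`, and `join_csv_line` round-trips it back to `1,"2,3",4`.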
It's clear I've led a charmed life by having CSV data which is almost entirely double-quote-free. In light of feedback from multiple people, the minimal double-quote-handling feature sketched above is definitely top of the list.
I've split out https://github.com/johnkerl/miller/issues/16 for the more general case; I'll close this when double-quote handling is in place.
Primary question at this point: how/when to double-quote on output?
A `--quote-all-on-output` flag or something. I'm OK with both options. The question is whether there's any desire for some sort of middle ground, e.g. if someone wants fields which were quoted on input to remain quoted on output. That's a bit more awkward. For transform operations which pass records through, e.g. `mlr cat` or `mlr sort`, one could put a `was_quoted_on_input` bit in each record and pass that through. But it gets messier for `mlr put` operations, e.g. doing `$c = $a . $b` when field `a` was quoted and field `b` was not. Should `c` be quoted?
In short I'm happy to implement output-quote-when-needed and output-quote-always modes; I'm not eager to do output-quote-tracked-from-input-quotes unless it's a really important use-case.
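For what it's worth, those two modes correspond to what Python's csv module calls `QUOTE_MINIMAL` and `QUOTE_ALL`. A quick sketch, using Python purely as an illustration of the two behaviors (not Miller code):

```python
import csv
import io

row = ["1", "2,3", "4"]

# Quote-when-needed: only the field containing the delimiter gets quoted.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
minimal = buf.getvalue()      # '1,"2,3",4\r\n'

# Quote-always: every field gets quoted.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
all_quoted = buf.getvalue()   # '"1","2,3","4"\r\n'
```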
Perhaps @ftrotter knows of a more proper answer, but I just did a quick survey of some of the more common tools I use and how they handle quoting on output. I hope this adds to the conversation:
Provides four options, and defaults to `QUOTE_MINIMAL`. With the defaults, `$c = $a . $b` is quoted if a delimiter exists.
Also provides four options, and the default behaviour is to quote fields only when they need to be - e.g. when they contain the separator. With the defaults, `$c = $a . $b` is quoted if a delimiter exists.
Provides two options on quoting, and defaults to quoting all fields. I did a quick test: the `$c = $a . $b` case here is quoted by default, quoted with `-smq`, and you can force it to be fully unquoted with `-sqf`.
Provides four options for quoting, and defaults to quoting only when a separator is present. I did not get time to test the `$c = $a . $b` case here, sorry. I expect it would quote if a delimiter exists, based on the documentation.
My personal preferences are exactly the two options you laid out (quote minimally and quote everything), and to have `$c` quoted only when a separator is present.
Thank you for working on this :)
Actually @SikhNerd has just pointed out the "real" reason to support double quotes all the time on input, rather than supporting any 'modes'. So many of the standard CSV output systems will default to using no quotes for field exports that contain only numbers, but then automatically include quotes when something "turns alphanumeric" or "turns alphanumeric with commas". For instance a standard input file might go for years with a structure of
id,phone_number
1,1112223333
2,1112223334
....
n,7778889999
but then suddenly someone enters a phone number with an extension and then the identical export will get
id,phone_number
1,1112223333
2,"1112223334x509"
....
n,7778889999
or worse:
id,phone_number
1,"1112223333"
2,"1112223334x509"
....
n,"7778889999"
In the same way, you might have an export process that works like this:
id,sentence
1,The quick brown fox jumps over the lazy dog
2,"The quick, brown fox jumps over the lazy dog"
In this case the quotes are only exercised when a comma appears in the string. So to stably use a system like Miller, you really need to handle all of those cases smoothly. Otherwise, later changes in data contents or data structure could produce confusing, hard-to-find bugs.
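A compliant reader makes those producer-side quoting fluctuations invisible. A quick illustration, using Python's csv module as a stand-in for any RFC-4180 reader (illustrative only, not Miller code):

```python
import csv
import io

# The same logical data, exported with and without quotes.
plain  = "id,phone_number\r\n1,1112223333\r\n"
quoted = 'id,phone_number\r\n1,"1112223333"\r\n'

rows_plain  = list(csv.reader(io.StringIO(plain)))
rows_quoted = list(csv.reader(io.StringIO(quoted)))

# A compliant reader sees identical values either way.
assert rows_plain == rows_quoted
```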
Again thanks for working on such a great tool!!
-FT
@ftrotter thank you! you've got a lot more experience with dirty CSV than I do. :)
Rethink; I'm duping https://github.com/johnkerl/miller/issues/16 to this one. No need for a two-task approach.
Had a nice chunk of time this evening to sketch out the new CSV parser. Proof of concept works. Now just a matter of integrating into the rest of the framework, and plenty of unit-test & regression-test cases, & checking against several found-in-the-wild CSV files. There are several in miller issues, I think enough to work with.
If you can push up the branch you're working on, I'd be very interested in watching what you're doing, and will try and play around with it to help if I find time :)
It's all in master -- this evening's pushes -- every bit compiles; there are unit tests for some stable things; all in backward-compatibility mode so nothing is reachable from the `mlr` executable as yet. `c/experimental/csv0.c` is the proof-of-concept file.
Progress point:
$ cat b.tmp
a,b,c
1,"2,3",4
$ mlr --csvex --odkvp --ofs ';' cat b.tmp
a=1;b=2,3;c=4
(`--csvex` is transitional until `--csv` becomes `--csvlite` and `--csvex` becomes `--csv`.)
At present this CSV reader is about 3.5x slower than the CSV-lite reader. Some performance optimizations can be made after the functionality is delivered but I very much doubt it will ever be as fast as CSV-lite.
$ time mlr --csvlite cat ../data/big.csv > /dev/null
real 0m1.456s
user 0m1.407s
sys 0m0.044s
$ time mlr --csvex cat ../data/big.csv > /dev/null
real 0m6.738s
user 0m6.593s
sys 0m0.136s
$ wc -l ../data/big.csv
1000001 ../data/big.csv
OK, iterating. If you use `--csvex` as of the most recent push you'll get the RFC-CSV reader/writer.
Next: rename `--csv` to `--csvlite` and `--csvex` to `--csv`. Then we'll be able to say that Miller's CSV format is real CSV. There are also four output-quoting options: `--quote-all`, `--quote-none`, `--quote-minimal`, and `--quote-numeric`. Not in the online help as of yet but soon.
Hi @johnkerl,
So is this an example of your query working with mlr compiled from master?
mlr --csvlite --opprint count-distinct -f ST,ClosingDate
or
mlr --csvex --opprint count-distinct -f ST,ClosingDate
At the moment:
`--csvlite` == `--csv` == the as-is stuff from the release announcement, i.e. not RFC-CSV.
`--csvex` == the new experimental RFC-CSV stuff, needing some more unit-test cases etc.
In a day or two, once I do some code cleanup & add some test cases:
`--csvlite` will be the as-is stuff from the release announcement.
`--csv` will be the new RFC-CSV. :+1:
Hello.
Thank you for this update. I have CSV files where the field separator is a space, and even though `--csvex` works with "," as the field separator, it doesn't seem to work when you want to change the FS.
Example:
$ cat test
1,2,3,4
ab,ac,"a,d",ae
$ mlr --csvex --odkvp cat test
1=ab,2=ac,3=a,d,4=ae
This case is OK, but if I change my FS to a space:
$ cat test2
1 2 3 4
ab ac "a d" ae
$ mlr --csvex -p --odkvp cat test2
1=1 2=2 3=3 4=4
1=ab 2=ac 3="a 4=d" 5=ae
How can I use `--csvex` with a space separator?
Thanks.
Right. `--csvlite` was, and is, the same not-really-CSV as I originally released, with programmable RS/FS (e.g. you can do TSV, or spaces), whereas `--csvex` (soon to be `--csv`) will be only RFC-4180 CSV. RS will be hardcoded to CRLF and FS will be hardcoded to comma. This is to deliver, as soon as possible, RFC-compliant CSV, since @ftrotter hit that nail on the head. (As I noted above, read performance is significantly slower than CSV-lite.) Using semantic versioning, this will be Miller v2.0.0.
In the next minor release, v2.1.0, I'll do a bit more: `--csv` will still be compliant by default, but RS/FS will be programmable (as in your note above). I'll put the above info in the 2.0.0 release note as well.
Please see https://github.com/johnkerl/miller/releases/tag/v2.0.0. If you find any issues, please open a new issue as a bugfix request.
Thank you thank you thank you for the very informative feedback!!! :)
@ftrotter and @SikhNerd tagging you here for closure.
Looks good :100: I haven't been able to break it yet.
:D
What would you recommend doing when the input file has a broken line like line 3 below?
A, B, C
1, 2, 3
"x"","abc", "cde"
Is there a way to fix it using miller commands?
Good question.
On @ftrotter's advice I made the CSV parser RFC-compliant. I didn't put any effort into handling non-compliant CSV (although there's a lot of it out there in the world).
I don't know how non-compliant your data is beyond the above 3-line example, so I'm unsure how to point you at a general solution, but a few ways to handle the above 3-line example are:
cat yourfile.csv | sed 's/"//g' | mlr --csv ...
cat yourfile.csv | sed 's/""/"/g' | mlr --csv ...
mlr --csvlite put '$A =~ "\"" { $A=gsub($A,"\"", "") }' yourfile.csv
For the third option I'm suggesting the csvlite parser, which isn't RFC-compliant and treats quotes as characters like any others.
Also note that for option 3 you'll need head Miller, since I introduced the necessary `"\""` escape, as well as the pattern-action syntax, after the most recent release (3.4.0).
4th option: mlr --csvlite put -S '$A=gsub($A,"\"", "")' yourfile.csv
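If a preprocessing script is acceptable, the quote-stripping idea behind options 1 and 4 can also be sketched outside mlr. This is an illustrative Python sketch (the helper name is invented); like the sed approach, it discards legitimate quoting, so it only suits data where the quotes are known to be noise:

```python
# Illustrative sketch: strip all double quotes before handing lines to an
# RFC-compliant parser. This loses legitimate quoting, so it only suits
# data where quotes are known to be broken, as in the 3-line example above.
def strip_quotes(line):
    return line.replace('"', '')

broken = '"x"","abc", "cde"'
fixed = strip_quotes(broken)   # 'x,abc, cde'
```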
@johnkerl this is awesome :+1:
--------------
mlr --version
Miller 3.4.0
---------------
mlr --icsv --opprint --rs lf cat ha.csv
mlr: syntax error: unwrapped double quote at line 2.
-----------------
mlr --csvlite put -S '$A=gsub($A,"\"", "")' ha.csv
A, B, C
1, 2, 3
x,"abc", "cde"
:)
Have not visited this thread in a while and I am thrilled to see the progress.
I do think that handling non-standard CSV is a good feature, but I almost think that barfing with an error message is a better one... that way I can go to CSV producers and say "see... get your shit together"...
Would also like to point out that the performance mentioned here is excellent for standards compliant CSV processing, just because it is "worse" does not mean it cannot also be "excellent".
It is quite a compliment to see a feature as complex as the one I suggested reach closure. I will do my best to promote this to all the right people. Moving to Twitter now!!
-FT
I love the idea of Miller. It is clearly a needed tool that is missing from the standard unix toolbox.
However, you really cannot say you have a tool that is designed to support csv, without supporting csv.
CSV is a standard file format, and has an RFC: https://tools.ietf.org/html/rfc4180
Not supporting double quotes is the same thing as saying that you do not support csv, since double quotes are central to the way that the standard handles other characters... comma being just one example. Your tool is young enough that supporting the standard now will make later development much simpler. This will prevent the situation years from now where you have a 'normal mode' and a 'standards mode'. If you make the change now you can just have the one correct mode.
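To make that concrete, here's what RFC 4180 quoting does with awkward field contents, using Python's csv module as one off-the-shelf compliant writer (illustrative only; this is not Miller's code):

```python
import csv
import io

# RFC 4180: a field containing the delimiter or a double quote is wrapped
# in double quotes, and embedded double quotes are doubled.
buf = io.StringIO()
csv.writer(buf).writerow(['he said "hi"', 'a,b', 'plain'])
out = buf.getvalue()   # '"he said ""hi""","a,b",plain\r\n'
```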
You have an ambitious work-list, but I would suggest taking a pause and thinking about how you will support the RFC version of the file format.
People like me (open data advocates) spend a lot of time trying to ensure that organizations that release CSV do so under the standard format, rather than releasing unparsable garbage. Having a library like yours that supports the standard would be a huge boon. I could say things like:
"See by using the RFC for your data output, all kinds of open tools will work out of the box on your data... like Miller (link)"
Thank you for working on such a clever tool...
Regards, -FT