rgamble / libcsv

Fast and flexible CSV library written in pure ANSI C that can read and write CSV data.
GNU Lesser General Public License v2.1

How to disable quote processing entirely? #29

Open chrisinmtown opened 2 years ago

chrisinmtown commented 2 years ago

I parse pipe-separated data using libcsv. The data may contain quote characters, and those should be completely ignored. What is the recommended practice? Should I perhaps set the quote character to the value 1, which I believe is the code for control-A?

    csv_set_quote(&cp, quote)

I'm asking because the following pipe-separated input line hit us recently, and when libcsv processed it, the library yielded an impossibly long output record: 1,864,280,683 bytes, to be precise:

    ABC|jkkdf|1664550195943489|28|0|"wxyz.th|"wxyz.th|::|||17301

I consider this a bug. Making this even trickier to debug: when my program is compiled on OSX with Apple clang, libcsv processes an input file containing this line as I expect, with no oversized output. When my program is compiled on Debian bullseye with gcc, that's when we see the bad behavior.

Maybe you feel I have misused the library? Here's how I am using it. First, I'm using default (not strict) checking of quotes:

    struct csv_parser cp;
    int rc = csv_init(&cp, 0);

Second, I left the parser's quote character at its default; I did not set it explicitly.
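
For reference, here is a minimal sketch of the kind of setup I'm describing; the callbacks, the error handling, and the hard-coded sample line are illustrative only, not my exact code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <csv.h>

    /* Illustrative callbacks: print each field and mark each end of record. */
    static void field_cb(void *field, size_t len, void *ctx) {
        printf("field: %.*s\n", (int)len, (const char *)field);
    }

    static void record_cb(int term, void *ctx) {
        puts("-- end of record --");
    }

    int main(void) {
        struct csv_parser cp;
        /* A well-formed sample of the pipe-separated input. */
        char buf[] = "ABC|jkkdf|1664550195943489|28|0|wxyz.th|wxyz.th|::|||17301\n";

        if (csv_init(&cp, 0) != 0) {          /* 0 = default, non-strict options */
            fprintf(stderr, "csv_init failed\n");
            return EXIT_FAILURE;
        }
        csv_set_delim(&cp, '|');              /* fields are pipe-separated */

        if (csv_parse(&cp, buf, sizeof buf - 1, field_cb, record_cb, NULL)
                != sizeof buf - 1) {
            fprintf(stderr, "parse error: %s\n", csv_strerror(csv_error(&cp)));
        }
        csv_fini(&cp, field_cb, record_cb, NULL);
        csv_free(&cp);
        return 0;
    }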

Thanks in advance.

p.s. I revised this issue to ask my question, no longer trying to report a bug. I am still struggling to produce a sanitized data file that reproduces the problem. The line shown above is the content that begins the impossibly long line in the output, but when processed alone in a one-line file, the library signals error immediately.

jleffler commented 2 years ago

When I see the CSV data in Gmail, I see a part where there is a pipe symbol, a double quote, then wxyz.th, a pipe, another double quote, another wxyz.th, and another pipe. Is that what you intended to show? The alphabetic parts are highlighted as if they are part of a Thai domain, which is why I ask.

If that is what you intended, then that sequence is a single malformed field. It's malformed because the closing double quote isn't at the end of the field.

Now, I am not defending a gigabyte of data being returned; that is a bug of some sort. But I would be happier with a reproduction that I don't yet see. On Stack Overflow, it used to be called an MCVE (Minimal Complete Verifiable Example). Elsewhere, it is known as an SSCCE (https://sscce.org), a Short, Self-Contained, Complete Example. Can you provide such an example? The CSV string can be hard-coded in the program, for example.

chrisinmtown commented 2 years ago

Thanks @jleffler for the follow-up. Please see https://github.com/rgamble/libcsv/issues/23 for a minimal example of code. I have not yet figured out if the line I posted above from the data file is sufficient to trigger the behavior, or if one of the successive lines in that file is also required. And unfortunately the file is both huge (2 GB) and proprietary.

jleffler commented 2 years ago

The program from issue #23 seems to need some tweaking before it will work on your pipe-separated (rather than comma-separated) data. The error reporting leaves quite a lot to be desired, too; I'm not sure whether that's an issue in the CSV library or in the use of it. There is no obvious way of identifying and reporting where the problem is detected or what the problem really is.

I think it would be helpful if you can whittle the problem data down from multi-gigabyte size to, say, less than 10 KiB. I have a program called "garble" which will replace vowels at random with vowels, consonants with consonants, and digits with digits, preserving case and leaving punctuation untouched. You could apply that program to your data, which would remove the proprietary information while still (one hopes) reproducing the problem. Shout if you want a copy of "garble" to play with.

chrisinmtown commented 2 years ago

Thanks for the offer of a scramble program. I am trying frantically to chop the file down. I tested parts of the file until I found that records 1..23,993,658 are processed fine; the failure occurs when record 23,993,659 is present. Unfortunately for me that last record has absolutely nothing special about it. I tried processing that record alone, the last 1 million records, the last 10 million etc. without triggering the problem. I'm baffled.

bobhairgrove commented 2 years ago

Just a shot in the dark... does your input sometimes contain text generated on an IBM mainframe using EBCDIC encoding? Then you might need to install a custom record-terminator function (see the libcsv docs for csv_set_term_func(), as well as this page for details: https://en.wikipedia.org/wiki/Newline).

Otherwise, csv_parse() will look for 0x0A (LF) and/or 0x0D (CR), and, not seeing either, will treat all following bytes as one single line up to the end of the file.
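
For example, something along these lines might work (an untested sketch; check csv.h for the exact callback signature):

    #include <csv.h>

    /* Treat LF, CR, and (for EBCDIC-derived text converted to Latin-1) NEL 0x85
       as record terminators. */
    static int my_is_term(unsigned char c) {
        return c == '\n' || c == '\r' || c == 0x85;
    }

    /* ... after csv_init(&cp, 0): */
    csv_set_term_func(&cp, my_is_term);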

bobhairgrove commented 2 years ago

If the newline character is not the issue, there might be some kind of overflow issue. Make sure the total number of bytes in your file will fit in the buffer passed to csv_parse(); if not, you'll have to break up the import file into chunks that can be processed one at a time.
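
A rough sketch of chunked parsing, assuming field and record callbacks like the ones in the issue #23 example (csv_parse() keeps its state between calls, so the buffer only needs to hold one chunk at a time):

    #include <stdio.h>
    #include <csv.h>

    /* Feed a file to an already-initialized parser in fixed-size chunks. */
    static int parse_file(const char *path, struct csv_parser *cp,
                          void (*field_cb)(void *, size_t, void *),
                          void (*record_cb)(int, void *)) {
        char buf[64 * 1024];
        size_t n;
        FILE *fp = fopen(path, "rb");
        if (!fp) return -1;

        while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
            if (csv_parse(cp, buf, n, field_cb, record_cb, NULL) != n) {
                fprintf(stderr, "parse error: %s\n",
                        csv_strerror(csv_error(cp)));
                fclose(fp);
                return -1;
            }
        }
        csv_fini(cp, field_cb, record_cb, NULL);  /* flush the last field/record */
        fclose(fp);
        return 0;
    }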

chrisinmtown commented 2 years ago

Wow @bobhairgrove, I have not heard EBCDIC mentioned for a long time! I don't suspect newline problems; there is no Big Blue iron in this loop anywhere :)

I think you're much closer to the mark when you suggest the CSV library will take all bytes through the end of the file. I concocted the following example and fed it to the example code I posted in issue #23:

    ABC,xyydd,001,2,0,wxyz.th,wxyz.th,::,,,1
    ABC,xyydd,002,2,0,"wxyz.th,"wxyz.th,::,,,2
    ABC,xyydd,003,2,0,wxyz.th,wxyz.th,::,,,3
    ABC,xyydd,004,2,0,wxyz.th,wxyz.th,::,,,4

You can see that it emits the first record as expected; then, in the second record, field number 5 contains all the content through the end of that little file:

Row 0:
    Field 0: ABC
    Field 1: xyydd
    Field 2: 001
    Field 3: 2
    Field 4: 0
    Field 5: wxyz.th
    Field 6: wxyz.th
    Field 7: ::
    Field 8: 
    Field 9: 
    Field 10: 1
Row 1:
    Field 0: ABC
    Field 1: xyydd
    Field 2: 002
    Field 3: 2
    Field 4: 0
    Field 5: wxyz.th,"wxyz.th,::,,,2
ABC,xyydd,003,2,0,wxyz.th,wxyz.th,::,,,3
ABC,xyydd,004,2,0,wxyz.th,wxyz.th,::,,,4

I generated this example output from code compiled on my Mac with clang. I believe libcsv searches for the closing quote, reaches the end of the file before finding it, and still behaves acceptably. The next step is to build the example program on Ubuntu.

jleffler commented 2 years ago

You said:

I am trying frantically to chop the file down. I tested parts of the file until I found that records 1..23,993,658 are processed fine; the failure occurs when record 23,993,659 is present. Unfortunately for me that last record has absolutely nothing special about it. I tried processing that record alone, the last 1 million records, the last 10 million etc. without triggering the problem. I'm baffled.

That immediately elicits an "Ugh!" response. It isn't obvious what could be wrong when that happens.

One possibility is that there's a memory problem. Have you got Valgrind? If so, use it. Can you use gcc with -fsanitize=address and/or -fsanitize=leak? If so, try them. But 23M operations succeeding and then failing is an exasperating sort of problem. Is your processing program accumulating information from the records, or is it processing one at a time and discarding the evidence of processing each record after it is done?

I assume you included a few thousand records after record 23,993,658 — just in case your reporting is running into buffered messaging, or something like that.

I don't know how feasible it is for you, but have you tried editing out one or more of the columns from the data, and then rerunning the code (or an appropriately modified version of the code) to see whether that changes where the problem occurs, or eliminates the problem altogether?

Have you looked for unexpected characters in the data? Is it supposed to be pure ASCII? If so, look for characters outside the 7-bit ASCII range, and for control characters other than tab and newline. If it is UTF-8, have you validated that there are no misencoded characters? Etc.

I hope I'm not teaching Grandma to suck eggs here. I'm just trying to touch all the bases I can think of to help. I know it's easy in the fury of the fight to overlook some test that might help solve the problem.

chrisinmtown commented 2 years ago

Right @jleffler, "ugh" is what I thought too, perhaps with a few stronger words! My file has 27 million records, and the problem occurs as long as the file is at least 23,993,660 records long. The content is pure ASCII, not even a single high code-page UTF-8 character; I checked via grep --color='auto' -P -n "[\x80-\xFF]" myfile. I have not tried discarding a column. My little preprocessor that uses libcsv simply dumps out each record as it goes, plucks out the interesting columns, and saves absolutely nothing.

I think what's going on is very simple: after the problematic mis-quoted line occurs (on line 199,611, if you care), libcsv reads ahead millions of bytes trying to find the closing quotation mark, until eventually something bad happens. The problem does not occur when the data size is small. When a few other fires are burning less brightly I will return to this and try the things you suggest, like valgrind and the gcc sanitizer flags.

Until then, perhaps you could please answer my revised question: what is the safest way to disable quote processing entirely in libcsv? My interim solution of setting the quote char to decimal 1 (I think that's control-a) is working for now.

jleffler commented 2 years ago

If it works for you, and the data is pure ASCII, then setting 'quote' to control-A (decimal 1) is effective. Any other unused character could be used as the quote.
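
For the record, a minimal sketch of that workaround (assuming the data never contains a 0x01 byte; the variable name matches the earlier snippet):

    csv_set_delim(&cp, '|');    /* fields are pipe-separated */
    csv_set_quote(&cp, 0x01);   /* control-A never appears in the data, so
                                   quote handling is effectively disabled */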

Does the problem occur when EOF is more than 2 GiB (2^31 - 1 bytes) beyond the end of the problematic record (line 199,611)? If so, could a 4-byte integer be in use, instead of an 8-byte integer, for some key piece of data relating to memory size? May I assume you're using a 64-bit machine where size_t is a 64-bit quantity?

I wonder if libcsv needs a heuristic "maximum quoted field limit" of, say, 1 MiB, configurable if you really are working with files containing larger quoted strings. If the code that looks for the matching quote exceeds that limit, it reports an error on the line where the quote starts, finds the next newline, and resumes parsing after that (or, optionally, fails altogether). That's fiddly to deal with.
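
Such a limit would have to live inside the library to allow resynchronizing at the next newline, but a cruder user-side guard is possible today: check the field length in the field callback. A hypothetical sketch (the 1 MiB cap is arbitrary):

    #include <stdio.h>

    #define MAX_FIELD_LEN (1024u * 1024u)   /* arbitrary 1 MiB cap */

    static void field_cb(void *field, size_t len, void *ctx) {
        if (len > MAX_FIELD_LEN) {
            fprintf(stderr, "suspiciously long field (%zu bytes); "
                            "probably an unbalanced quote upstream\n", len);
            /* set a flag in ctx, or abort, as appropriate */
            return;
        }
        /* ... normal field handling ... */
    }

Note that this fires only once the field finally terminates (possibly not until csv_fini()), so it limits the damage rather than preventing the read-ahead itself.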

jleffler commented 2 years ago

If my guess about 2 GiB is correct, then manufacturing an example should be easier.

You need one malformed line, maybe as line 2 in the CSV data, with an unbalanced quote in it. You then need enough extra well-formed, quote-free lines to add more than 2 GiB of further data. I'd probably make the lines different, but even that isn't really necessary. You just need the volume.

    awk 'BEGIN { OFS="|"; for (i = 1; i <= 20000000; i++) print i, "codswallop", "20 digit number", "other information", "::", "etc"; }' > junk-20000000.data

Adapt to suit — get the field count right, and the general shape (right types in each field).

chrisinmtown commented 2 years ago

@jleffler I'm working strictly in Docker containers; I'm pretty sure everything is 64-bit. My problematic file is 2.3 GB. It could very well be that something deep inside libcsv is using an int instead of a long; I have not dug in to check.

I think you made an excellent suggestion for a new defensive feature: limit the forward scan for the next double-quote character to some reasonable, configurable value, and back off to the most recent newline if the closing quote is not found within that limit.

bobhairgrove commented 2 years ago

It's mostly size_t (which is unsigned) for things like buffer lengths, etc. int is used mostly for return codes, etc. And the source code is published, after all... :)

However, size_t is defined differently in C89 vs. C99, although mostly it is a typedef for long unsigned int these days... at least that is what it says on Ubuntu 18.04 using GCC 7.5.0. It might only hold 32 bits, although then you must be using a VERY old compiler and OS. There is an interesting discussion about it here.

You might want to see what value your compiler reports for sizeof(size_t).
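
For example, something like this trivial check program:

    #include <stdio.h>
    #include <stddef.h>

    int main(void) {
        /* 8 on typical 64-bit systems, 4 on 32-bit ones */
        printf("sizeof(size_t) = %zu\n", sizeof(size_t));
        return 0;
    }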

chrisinmtown commented 2 years ago

@bobhairgrove no fear of an old compiler or OS: I'm using a Debian bullseye base Docker image with gcc v10, which reports as follows:

    root@17270b6b54eb:/# gcc --version
    gcc (Debian 10.2.1-6) 10.2.1 20210110
    Copyright (C) 2020 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    root@17270b6b54eb:/# echo '#include <stddef.h>' | gcc -xc -E - | grep size_t
    typedef long unsigned int size_t;