vincentlaucsb / csv-parser

A high-performance, fully-featured CSV parser and serializer for modern C++.

Incorrect values read for first column in new chunk #180

Open CrustyAuklet opened 3 years ago

CrustyAuklet commented 3 years ago

Background

I have a CSV file that is 10,090,688 bytes, and the first column of the first row in a new chunk is read incorrectly. In my case this column is null (it is "" throughout the CSV), yet is_null() returns false for this one row, and reading it returns a value from halfway through the line.

For context: this first column is the dataset name, and in theory there can be multiple datasets per file (though we never do that). If the name is null, the name should instead be the sensor serial number from column three.

This bug causes the null check to fail on that first row in a new chunk, and when I then try to read the value I get the bad result. All other values in the row seem to read fine, and all other rows in the chunk read correctly. The net effect is that all of the data ends up in the one properly named dataset, plus one lone dataset with a single value and a strange name.
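A minimal sketch of how I am reading the file, in case it helps. The column layout and row count here are illustrative, not our real schema; the point is just that the file must be large enough to span more than one parser chunk:

#include <sstream>
#include <string>
#include <iostream>
#include "csv.hpp"  // single-include header

int main() {
    // Build a CSV big enough to span more than one chunk.
    // The first column (the dataset name) is always empty, i.e. null.
    std::stringstream ss;
    ss << "name,value,serial\n";
    for (int i = 0; i < 500000; ++i)
        ss << ",42,SN" << i << "\n";

    csv::CSVReader reader(ss);  // stream-based constructor
    size_t n = 0;
    for (csv::CSVRow& row : reader) {
        // Expected: is_null() is true on every row. Observed: it is false
        // for the first row of each new chunk, and reading the field then
        // returns text from partway through the line.
        if (!row[0].is_null())
            std::cout << "row " << n << ": name = "
                      << row[0].get<std::string>() << "\n";
        ++n;
    }
}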

Investigation

Using the debugger, I have tracked it down to the CSVRow::get_field function, on line 7683 of the single-include header, in this section:

CSV_INLINE csv::string_view CSVRow::get_field(size_t index) const
{
        // lines omitted for brevity..

        const size_t field_index = this->fields_start + index;
        auto& field = this->data->fields[field_index];
        auto field_str = csv::string_view(this->data->data).substr(this->data_start + field.start);

For the offending row, when accessing the first column, the field struct retrieved from this->data->fields has an incorrect value in its start member. In this case it is 138 when it should be 0, though the exact value is not consistent when I change the chunk size to make the problem appear more often. The result is the wrong substring in field_str. Everything else is working as expected as far as I can tell.
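Spelling out the arithmetic with the members from the snippet above (138 is the bad value I observed; the correct value for a first field is 0):

// field_str = string_view(this->data->data).substr(this->data_start + field.start)
//
// correct first field:   field.start == 0   -> view begins at the row start
// offending first field: field.start == 138 -> view begins 138 bytes into the
//                        row, i.e. the "value halfway through the line" above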

I have also found that if I change constexpr size_t ITERATION_CHUNK_SIZE = 10000000; to a small value, I get many more single-value datasets with random names. The number of them grows in proportion to how far I shrink that value.
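For anyone trying to reproduce this with a smaller file, the tweak is just that constant in the single-include header (the replacement value below is an arbitrary small choice):

// Original:
// constexpr size_t ITERATION_CHUNK_SIZE = 10000000;

// Small value so that even a modest test file spans many chunks:
constexpr size_t ITERATION_CHUNK_SIZE = 1000;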

Platforms:

MSVC 19.29.30132.0 on Windows 10
GCC 10.3.0 on Ubuntu 20.04

csv version

Started with the 2.1.0 release, but I have been working with the "single include" header from master since I found the bug.

CrustyAuklet commented 3 years ago

After some more experiments, I have discovered that this only happens when constructing the reader with a std::stringstream, and not if I use the memory-mapped version by passing in a filename.

If I test with a small value for ITERATION_CHUNK_SIZE, though, the mio version also fails: it throws in CSVRow::get_field for an out-of-bounds index (the index is correct, but the CSVRow data seems to start with a newline).
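For reference, these are the two construction paths I am comparing ("data.csv" is a placeholder filename):

#include <fstream>
#include <sstream>
#include "csv.hpp"

void compare_paths() {
    // Memory-mapped (mio) path: reads correctly at the default chunk size.
    csv::CSVReader mmap_reader("data.csv");

    // Stream path: the same bytes fed through a std::stringstream.
    // This is the variant where the bad first fields show up.
    std::ifstream file("data.csv");
    std::stringstream buffer;
    buffer << file.rdbuf();
    csv::CSVReader stream_reader(buffer);
}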

If you have any idea where I should poke around, I am happy to take a stab at this.

jimbeveridge commented 2 years ago

+1. I am seeing this same bug under Visual Studio 2019 with a recent update. Same behavior: works with memory-mapped files, breaks with a stringstream. Also a large-ish file (at least several megabytes).

MichaelSteffens commented 2 years ago

+1. Same with 2.1.3, g++ 9.3.0, and parsing a file stream. The issue is not exposed if a new chunk boundary falls immediately after a delimiter, but it appears in all other cases.
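To illustrate (the boundary positions here are made up):

...,alpha,bet|a,gamma,...    chunk boundary lands mid-field        -> corrupted
...,alpha,|beta,gamma,...    chunk boundary right after delimiter  -> reads fine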