vincentlaucsb / csv-parser

A high-performance, fully-featured CSV parser and serializer for modern C++.
MIT License
901 stars 150 forks source link

Infinite loop when parsing lines longer than 10MB / ITERATION_CHUNK_SIZE #218

Open sjoubert opened 9 months ago

sjoubert commented 9 months ago

If a line is longer than ITERATION_CHUNK_SIZE (corner case, but unfortunately large dataset happen in the wild), CSVReader::read_row will loop forever, spawning read_csv thread infinitely.

For now I've patched this using:

diff --git a/single_include/csv.hpp b/single_include/csv.hpp
index 9cebfc2..e24ddff 100644
--- a/single_include/csv.hpp
+++ b/single_include/csv.hpp
@@ -7517,6 +7517,7 @@ namespace csv {
      *
      */
     CSV_INLINE bool CSVReader::read_row(CSVRow &row) {
+        bool readRequested = false;
         while (true) {
             if (this->records->empty()) {
                 if (this->records->is_waitable())
@@ -7530,7 +7531,15 @@ namespace csv {
                     if (this->read_csv_worker.joinable())
                         this->read_csv_worker.join();

+                    // More reading was already requested and not successful
+                    if (readRequested && this->records->empty())
+                        // Other error, e.g. line is too large for ITERATION_CHUNK_SIZE
+                        throw std::runtime_error(
+                            "End of file not reached and no more records parsed, this may be due to a line longer than "
+                            + std::to_string(internals::ITERATION_CHUNK_SIZE) + " bytes.");
+
                     this->read_csv_worker = std::thread(&CSVReader::read_csv, this, internals::ITERATION_CHUNK_SIZE);
+                    readRequested = true;
                 }
             }
             else if (this->records->front().size() != this->n_cols &&

This avoids spawning a new thread if the previous one was unsuccessful (here because the line is too long for the chunk and can't be parsed).

It looks like this should be handle better/elsewhere. But for now this seems to work for me.

vincentlaucsb commented 6 months ago

I'll see what I can do. I appreciate the example patch you've provided.

I didn't expect people to try to use this library on embedded devices (where ITERATION_CHUNK_SIZE is too large) nor did I expect people to try to parse CSVs where there are single rows that that up more space than ITERATION_CHUNK_SIZE.