ruby / csv

CSV Reading and Writing
https://ruby.github.io/csv/
BSD 2-Clause "Simplified" License

Fix case where an \r\n row separator will be split when reading a chunk #221

Closed jeremyevans closed 2 years ago

jeremyevans commented 2 years ago

When reading a chunk would split a \r\n row separator, read one more character.

This is a suboptimal fix: it only handles two-character row separators that start with \r. A better fix would handle all multi-character row separators. However, since \r\n is one of the most common row separators, I think it's worth merging this until a more general solution is developed.
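
To make the failure mode concrete, here is a minimal sketch of how fixed-size chunked reads can split a \r\n pair. It uses StringIO and a hypothetical helper named read_chunks for illustration only; it is not the parser's actual code path.

```ruby
require "stringio"

# Hypothetical helper (not part of the csv library): read the whole input
# in chunks of at most chunk_size bytes, with no row separator given.
def read_chunks(data, chunk_size)
  input = StringIO.new(data)
  chunks = []
  while (chunk = input.gets(nil, chunk_size))
    chunks << chunk
  end
  chunks
end

chunks = read_chunks("a,b\r\nc,d\r\n", 4)
# The first chunk ends with "\r" and the second starts with "\n",
# so neither chunk contains the complete "\r\n" separator.
p chunks # ["a,b\r", "\nc,d", "\r\n"]
```

Any chunk size can hit this when a row happens to end right at the chunk boundary; the small size here just makes it easy to reproduce.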

Fixes [Bug #18245]

kou commented 2 years ago

Thanks! But this is not a real fix for this problem. The underlying problem is in keep_start/keep_drop across a @scanner switch. I'll fix it later.

Anyway, we do need to improve this case. In the meantime, we can pass the row separator explicitly when reading each chunk:

diff --git a/lib/csv/parser.rb b/lib/csv/parser.rb
index 0d8a157..2d76316 100644
--- a/lib/csv/parser.rb
+++ b/lib/csv/parser.rb
@@ -85,9 +85,10 @@ class CSV
     # If there is no more data (eos? = true), it returns "".
     #
     class InputsScanner
-      def initialize(inputs, encoding, chunk_size: 8192)
+      def initialize(inputs, encoding, row_separator, chunk_size: 8192)
         @inputs = inputs.dup
         @encoding = encoding
+        @row_separator = row_separator
         @chunk_size = chunk_size
         @last_scanner = @inputs.empty?
         @keeps = []
@@ -233,7 +234,7 @@ class CSV
           @last_scanner = @inputs.empty?
           true
         else
-          chunk = input.gets(nil, @chunk_size)
+          chunk = input.gets(@row_separator, @chunk_size)
           if chunk
             raise InvalidEncoding unless chunk.valid_encoding?
             @scanner = StringScanner.new(chunk)
@@ -737,6 +738,7 @@ class CSV
         chunk_size = ENV["CSV_PARSER_SCANNER_TEST_CHUNK_SIZE"] || "1"
         InputsScanner.new(inputs,
                           @encoding,
+                          @row_separator,
                           chunk_size: Integer(chunk_size, 10))
       end
     else
@@ -763,7 +765,7 @@ class CSV
             StringIO.new(sample)
           end
           inputs << @input
-          InputsScanner.new(inputs, @encoding)
+          InputsScanner.new(inputs, @encoding, @row_separator)
         end
       end
     end

kou commented 2 years ago

I pushed the row separator fix. It fixes the problem with the reported data, but it's not a real fix. We should fix it properly later: #230
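
For reference, a small sketch of how gets behaves when given both a separator and a byte limit, as in the change above. It uses StringIO and a hypothetical helper named read_rows; it only illustrates the gets semantics and does not describe the keep_start/keep_drop issue.

```ruby
require "stringio"

# Hypothetical helper (not part of the csv library): read chunks that end
# at row_separator or after chunk_size bytes, whichever comes first.
def read_rows(data, row_separator, chunk_size)
  input = StringIO.new(data)
  chunks = []
  while (chunk = input.gets(row_separator, chunk_size))
    chunks << chunk
  end
  chunks
end

data = "a,b\r\nc,d\r\n"

# Without a separator, a 9-byte limit cuts between "\r" and "\n".
no_separator = read_rows(data, nil, 9)
p no_separator # ["a,b\r\nc,d\r", "\n"]

# With the separator, each chunk stops right after a complete "\r\n".
with_separator = read_rows(data, "\r\n", 9)
p with_separator # ["a,b\r\n", "c,d\r\n"]

# The limit still wins when it is reached before the separator ends.
tight_limit = read_rows(data, "\r\n", 4)
p tight_limit # ["a,b\r", "\nc,d", "\r\n"]
```

So passing the separator helps whenever a whole row fits within the limit, which matches the reported data, but the byte limit can still cut a separator in half.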