ruby / csv

CSV Reading and Writing
https://ruby.github.io/csv/
BSD 2-Clause "Simplified" License
181 stars 114 forks source link

Duplicated last line in CSV.foreach #279

Closed GabrielNagy closed 1 year ago

GabrielNagy commented 1 year ago

I found a bug that only reproduces with a very specific set of prerequisites (all of the following must be true):

The following example (where original.csv is a file containing the line AAAA1234567890 ~2500 times):

CSV.foreach('original.csv', strip: true, skip_lines: /\A,+\n?\z/) do |data|
  puts data[0]
end

will print the last line duplicated:

AAAA1234567890
AAAA1234567890
AAAA1234567890
AAAA1234567890
AAAA1234567890AAAA1234567890
...

As a workaround I used CSV.parse(File.read, ...) with the same options, but I still wanted to flag this issue.

tmatsuura1 commented 1 year ago

@kou I could reproduce this bug at hand.

kou commented 1 year ago

Thanks for your report. It may be a bug of internal chunk based stream parser: https://github.com/ruby/csv/blob/22e62bc6e07a83f2aef11158138e046191473731/lib/csv/parser.rb#L299-L308

@abcdefg-1234567 Great! Do you want to work on fixing this problem? Could you share a script that produces a CSV that reproduces this problem as the first step?

tmatsuura1 commented 1 year ago

@kou Yes, I would like to. I used a CSV file made by a tool for spreadsheet and IDE to reproduce this problem. So, I will make script that produces a CSV.

tmatsuura1 commented 1 year ago

@kou I came up with the following code that creates a csv.

CSV.open('test.csv', 'w') do |csv|
  2500.times do
    csv << ['AAAA1234567890']
  end
end

However, I do not know how to make the following state by code.

I have confirmed that the bug is reproduced when the following conditions are set manually using the IDE. Could you please give me any ideas?

GabrielNagy commented 1 year ago

hey @abcdefg-1234567 you can use the following:

File.open('test.csv', 'w') do |f|
  2499.times do
    f.print("AAAA1234567890\r\n")
  end
  f.print("AAAA1234567890")
end
tmatsuura1 commented 1 year ago

@GabrielNagy Thank you!

kou commented 1 year ago

OK. Let's reduce the reproducible CSV size as much as possible as the next step for easy to debug. If it's difficult, we can start debugging with the reproducible CSV. https://github.com/ruby/csv/issues/279#issuecomment-1547004607 may help you.

tmatsuura1 commented 1 year ago

I have confirmed that the bug will not reproduce if the csv is less than 2048 rows.

tmatsuura1 commented 1 year ago

I have confirmed the following.

The result of "value = parse_column_value (line 1030 of parser.rb)" when @ lineno=2048 is "AAAA1234567890AAAAA1234567890".

I am also wondering if changes are needed around the adjust_last_keep method. @kou Could you please explain the role of this method?

kou commented 1 year ago

Sure.

#adjust_last_keep was introduced for fixing https://bugs.ruby-lang.org/issues/18245 .

InputsScanner acts as logically one StringScanner with multiple inputs. (StringScanner can't work with multiple strings.)

CSV::Parser may want to push back read data. For example, if skip_lines is specified, CSV::Parser may push back read data. CSV::Parser reads a line from its scanner (CSV::Parser::Scanner or CSV::Parser::InputsScanner) to check whether the line should be skipped. If the read line isn't skip target, CSV::Parser pushes back the read line and parses the line as a CSV line. keep_start/keep_drop/keep_back/keep_end are for it.

adjust_last_keep is related to these keep_* methods.InputsScanner processes multiple inputs. So the target data (for example, one line for skip_lines) may exist in multiple inputs. For example, "# a", "bc" and "\n" are one line but they are 3 inputs. adjust_last_keep is for the situation. If we need to concatenate data from multiple inputs, adjust_last_keep does it.

I hope that this explanation helps you.

tmatsuura1 commented 1 year ago

Thank you for your detailed explanation! I will refer to this and continue the investigation.

kou commented 1 year ago

Including line number in line contents will helpful:

File.open('/tmp/test.csv', 'w') do |f|
  lines = 2500.times.collect do |i|
    "A%013d" % i
  end
  f.print(lines.join("\r\n"))
end

Output with the test file:

...
A0000000002497
A0000000002498
A0000000002499A0000000002499

It seems that the last line was used twice.

kou commented 1 year ago

I cloud reproduce this with the script:

ENV["CSV_PARSER_SCANNER_TEST"] = "yes"

require "csv"

csv = CSV.new("a\r\nb", row_sep: "\r\n", strip: true, skip_lines: /\A *\z/)
csv.each do |row|
  pp row
end
["a"]
["bb"]
GabrielNagy commented 1 year ago

Thanks for fixing this!