Closed GabrielNagy closed 1 year ago
@kou I could reproduce this bug at hand.
Thanks for your report. It may be a bug of internal chunk based stream parser: https://github.com/ruby/csv/blob/22e62bc6e07a83f2aef11158138e046191473731/lib/csv/parser.rb#L299-L308
@abcdefg-1234567 Great! Do you want to work on fixing this problem? Could you share a script that produces a CSV that reproduces this problem as the first step?
@kou Yes, I would like to. I used a CSV file made by a tool for spreadsheet and IDE to reproduce this problem. So, I will make script that produces a CSV.
@kou I came up with the following code that creates a csv.
CSV.open('test.csv', 'w') do |csv|
2500.times do
csv << ['AAAA1234567890']
end
end
However, I do not know how to make the following state by code.
I have confirmed that the bug is reproduced when the following conditions are set manually using the IDE. Could you please give me any ideas?
hey @abcdefg-1234567 you can use the following:
File.open('test.csv', 'w') do |f|
2499.times do
f.print("AAAA1234567890\r\n")
end
f.print("AAAA1234567890")
end
@GabrielNagy Thank you!
OK. Let's reduce the reproducible CSV size as much as possible as the next step for easy to debug. If it's difficult, we can start debugging with the reproducible CSV. https://github.com/ruby/csv/issues/279#issuecomment-1547004607 may help you.
I have confirmed that the bug will not reproduce if the csv is less than 2048 rows.
I have confirmed the following.
The result of "value = parse_column_value (line 1030 of parser.rb)" when @ lineno=2048 is "AAAA1234567890AAAAA1234567890".
I am also wondering if changes are needed around the adjust_last_keep method. @kou Could you please explain the role of this method?
Sure.
#adjust_last_keep
was introduced for fixing https://bugs.ruby-lang.org/issues/18245 .
InputsScanner
acts as logically one StringScanner
with multiple inputs. (StringScanner
can't work with multiple strings.)
CSV::Parser
may want to push back read data. For example, if skip_lines
is specified, CSV::Parser
may push back read data. CSV::Parser
reads a line from its scanner (CSV::Parser::Scanner
or CSV::Parser::InputsScanner
) to check whether the line should be skipped. If the read line isn't skip target, CSV::Parser
pushes back the read line and parses the line as a CSV line. keep_start
/keep_drop
/keep_back
/keep_end
are for it.
adjust_last_keep
is related to these keep_*
methods.InputsScanner
processes multiple inputs. So the target data (for example, one line for skip_lines
) may exist in multiple inputs. For example, "# a"
, "bc"
and "\n"
are one line but they are 3 inputs. adjust_last_keep
is for the situation. If we need to concatenate data from multiple inputs, adjust_last_keep
does it.
I hope that this explanation helps you.
Thank you for your detailed explanation! I will refer to this and continue the investigation.
Including line number in line contents will helpful:
File.open('/tmp/test.csv', 'w') do |f|
lines = 2500.times.collect do |i|
"A%013d" % i
end
f.print(lines.join("\r\n"))
end
Output with the test file:
...
A0000000002497
A0000000002498
A0000000002499A0000000002499
It seems that the last line was used twice.
I cloud reproduce this with the script:
ENV["CSV_PARSER_SCANNER_TEST"] = "yes"
require "csv"
csv = CSV.new("a\r\nb", row_sep: "\r\n", strip: true, skip_lines: /\A *\z/)
csv.each do |row|
pp row
end
["a"]
["bb"]
Thanks for fixing this!
I found a bug that only reproduces with a very specific set of prerequisites (all of the following must be true):
CSV.foreach
strip: true
skip_lines: /\A,+\n?\z/
The following example (where
original.csv
is a file containing the lineAAAA1234567890
~2500 times):will print the last line duplicated:
As a workaround I used
CSV.parse(File.read, ...)
with the same options, but I still wanted to flag this issue.