tilo / smarter_csv

Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults and auto-discovery of column and row separators. It imports CSV files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking off batch jobs with Sidekiq, parallel processing, or uploading data to S3. Writing CSV files is equally easy.
MIT License

Unbalanced quote_char does not raise an error #283

Open rex-remind101 opened 1 month ago

rex-remind101 commented 1 month ago

Actual output:

$ cat cats.csv
first name,last name,dogs,cats,birds,fish
     Dan,McAllister,2,1,"2,4
     Lucy,Laweless,,5,,
     Miles,O'Brian,,,,21
     Nancy,Homes,2,,1,

x = SmarterCSV.process('/Users/Me/Documents/parentsquare/parent_square/cats.csv', options: {verbose: true})
EOFError: end of file reached
from /Users/Me/.rbenv/versions/3.0.6/lib/ruby/gems/3.0.0/gems/smarter_csv-1.2.8/lib/smarter_csv/smarter_csv.rb:141:in `readline'

Expected output: EOFError: end of file reached on line 1 (assumes 0 counting)

rex-remind101 commented 1 month ago

I appear to have been on an older version, but now I'm seeing a different issue.

Actual output:

$ cat cats.csv
first name,last name,dogs,cats,birds,fish
     Dan,McAllister,2,1,2,"4
     Lucy,Laweless "Lawless",,5,,
     Miles,O'Brian,,,,21
     Nancy,Homes,2,,1,

pry(main)> x = SmarterCSV.process('/Users/Me/Documents/parentsquare/parent_square/cats.csv', options: {verbose: true})
=> [{:first_name=>"Dan",
  :last_name=>"McAllister",
  :dogs=>2,
  :cats=>1,
  :birds=>2,
  :fish=>"\"4\n     Lucy,Laweless \"Lawless\",,5,,\n     Miles,O'Brian,,,,21\n     Nancy,Homes,2,,1,"}]

Expected output: EOFError: end of file reached on line 1 (assumes 0 counting)

tilo commented 1 month ago

@rex-remind101 have you seen this in actual CSV data, or is this just a theoretical example? What program generated the CSV file you saw in production?

All the double-quote characters in the input file should be escaped. The file is malformed.

require 'smarter_csv'
reader = SmarterCSV::Reader.new('/tmp/cats2.csv')

reader.process
reader
 =>
#<SmarterCSV::Reader:0x00000003027f8c08
 @chunk_count=0,
 @csv_line_count=2,
 @enforce_utf8=false,
 @errors={},
 @file_line_count=5,
 @has_acceleration=true,
 @has_rails=false,
 @headerA=[:first_name, :last_name, :dogs, :cats, :birds, :fish],
 @headers=[:first_name, :last_name, :dogs, :cats, :birds, :fish],
 @input="/tmp/cats2.csv",
 @options=
  {:acceleration=>true,
   :auto_row_sep_chars=>500,
   :chunk_size=>nil,
   :col_sep=>",",
   :comment_regexp=>nil,
   :convert_values_to_numeric=>true,
   :downcase_header=>true,
   :duplicate_header_suffix=>"",
   :file_encoding=>"utf-8",
   :force_simple_split=>false,
   :force_utf8=>false,
   :headers_in_file=>true,
   :invalid_byte_sequence=>"",
   :keep_original_headers=>false,
   :key_mapping=>nil,
   :quote_char=>"\"",
   :remove_empty_hashes=>true,
   :remove_empty_values=>true,
   :remove_unmapped_keys=>false,
   :remove_values_matching=>nil,
   :remove_zero_values=>false,
   :required_headers=>nil,
   :required_keys=>nil,
   :row_sep=>"\n",
   :silence_missing_keys=>false,
   :skip_lines=>nil,
   :strings_as_keys=>false,
   :strip_chars_from_headers=>nil,
   :strip_whitespace=>true,
   :user_provided_headers=>nil,
   :value_converters=>nil,
   :verbose=>false,
   :with_line_numbers=>false},
 @raw_header="first name,last name,dogs,cats,birds,fish\n",
 @result=
  [{:first_name=>"Dan",
    :last_name=>"McAllister",
    :dogs=>2,
    :cats=>1,
    :birds=>2,
    :fish=>"\"4\n     Lucy,Laweless \"Lawless\",,5,,\n     Miles,O'Brian,,,,21\n     Nancy,Homes,2,,1,"}],
 @verbose=false,
 @warnings={}>

tilo commented 1 month ago

@rex-remind101 when the double quote characters are properly escaped, parsing is no problem:

 cat  /tmp/cats3.csv
first name,last name,dogs,cats,birds,fish
     Dan,McAllister,2,1,2,\"4
     Lucy,Laweless \"Lawless\",,5,,
     Miles,O'Brian,,,,21
     Nancy,Homes,2,,1,

data = SmarterCSV.process('/tmp/cats3.csv')     
ap data
=> 

[
    {
        :first_name => "Dan",
         :last_name => "McAllister",
              :dogs => 2,
              :cats => 1,
             :birds => 2,
              :fish => "\\\"4"
    },
    {
        :first_name => "Lucy",
         :last_name => "Laweless \\\"Lawless\\\"",
              :cats => 5
    },
    {
        :first_name => "Miles",
         :last_name => "O'Brian",
              :fish => 21
    },
    {
        :first_name => "Nancy",
         :last_name => "Homes",
              :dogs => 2,
             :birds => 1
    }
]

tilo commented 1 month ago

@rex-remind101 if this happens, then the source of your CSV files does not properly escape the double-quote characters!

It is best to fix such an issue at the source, because an unescaped double-quote is not valid.

There is a hacky workaround: you can set SmarterCSV's quote_char option to a different character:

 > data = SmarterCSV.process('/tmp/cats.csv', quote_char: '^')
 =>
 [
    {
        :first_name => "Dan",
         :last_name => "McAllister",
              :dogs => 2,
              :cats => 1,
             :birds => "\"2",
              :fish => 4
    },
    {
        :first_name => "Lucy",
         :last_name => "Laweless",
              :cats => 5
    },
    {
        :first_name => "Miles",
         :last_name => "O'Brian",
              :fish => 21
    },
    {
        :first_name => "Nancy",
         :last_name => "Homes",
              :dogs => 2,
             :birds => 1
    }
]
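One way to pick the replacement character for this workaround is to scan the file for candidates that never occur in it. This is a minimal sketch, not part of SmarterCSV: `unused_quote_char` and its candidate list are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper (not a SmarterCSV API): returns the first candidate
# character that appears in none of the given lines, or nil if all are used.
def unused_quote_char(lines, candidates: ['^', '~', '`'])
  remaining = candidates.dup
  lines.each do |line|
    # Drop every candidate that occurs somewhere in this line.
    remaining.reject! { |char| line.include?(char) }
    break if remaining.empty?
  end
  remaining.first
end

# Possible usage with the call shown above (paths are examples):
#   quote = unused_quote_char(File.foreach('/tmp/cats.csv'))
#   data  = SmarterCSV.process('/tmp/cats.csv', quote_char: quote) if quote
```

Note that with a different quote_char, double quotes are treated as ordinary data, so a correctly quoted field that contains the column separator would be split into several columns.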

tilo commented 1 month ago

This malformed CSV input should raise an error: SmarterCSV::MalformedCSV: unbalanced quote_char in line 2

(independent of the verbose option)
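Until the library raises such an error, a caller could approximate it with a pre-flight check before parsing. The sketch below is an assumption-laden illustration, not SmarterCSV's implementation: `UnbalancedQuoteError` and `check_balanced_quotes!` are hypothetical names, and the check simply counts quote characters per line, ignoring backslash-escaped ones.

```ruby
require 'stringio'

# Hypothetical error class for this sketch; not defined by SmarterCSV.
class UnbalancedQuoteError < StandardError; end

# Raises if a quoted field is opened but never closed; returns true otherwise.
# The reported line number is 1-based and includes the header line.
def check_balanced_quotes!(io, quote_char: '"')
  opened_on = nil # line on which the currently open quoted field started
  io.each_line.with_index(1) do |line, lineno|
    # Ignore backslash-escaped quote characters, as in the escaped example file.
    unescaped = line.gsub("\\" + quote_char, '')
    # An even count (including doubled quotes) leaves the open/closed state unchanged.
    next if unescaped.scan(quote_char).size.even?
    opened_on = opened_on ? nil : lineno
  end
  raise UnbalancedQuoteError, "unbalanced quote_char in line #{opened_on}" if opened_on
  true
end

# Demo with the first lines of the malformed file from this issue.
sample = StringIO.new(<<~CSV)
  first name,last name,dogs,cats,birds,fish
  Dan,McAllister,2,1,2,"4
  Lucy,Laweless "Lawless",,5,,
CSV

begin
  check_balanced_quotes!(sample)
rescue UnbalancedQuoteError => e
  puts e.message # prints "unbalanced quote_char in line 2"
end
```

For a file on disk, the same check could be run as `File.open(path) { |f| check_balanced_quotes!(f) }` before calling `SmarterCSV.process`.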

randall-coding commented 1 month ago

I'm using the latest version, 1.12.1, and instead of raising an EOFError my code hangs indefinitely when SmarterCSV.process is given a block.

SmarterCSV.process(temp_filename, {col_sep: @col_sep, chunk_size: 1000, headers_in_file: false, user_provided_headers: headers}) do |chunk|
     Rails.logger.info("Sanitizing chunk")
     records = chunk.map { |hash| map_row_to_record(hash.values) }
     Rails.logger.info("Insert_all")
     NdmMedicalClaim.insert_all(records)
end

This seems to be the same issue, because changing the quote_char as suggested by @tilo solved it. Hopefully that hack has no unintended side effects.
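As a stopgap against an indefinite hang, the import could be wrapped in a deadline using Ruby's standard `Timeout` module. This is only a sketch under assumptions: `ImportTimeout` and `with_import_deadline` are hypothetical names, and `Timeout` may not be able to interrupt code that is stuck inside a native extension.

```ruby
require 'timeout'

# Hypothetical error class for this sketch; not defined by SmarterCSV.
class ImportTimeout < StandardError; end

# Runs the block and raises ImportTimeout if it takes longer than `seconds`.
def with_import_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  raise ImportTimeout, "CSV import exceeded #{seconds}s; the input may contain an unbalanced quote_char"
end

# Possible usage around the call above (names taken from that snippet):
#   with_import_deadline(300) do
#     SmarterCSV.process(temp_filename, chunk_size: 1000) { |chunk| ... }
#   end
```

A deadline only turns the hang into a visible failure; it does not repair the malformed input.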

MattKitmanLabs commented 17 hours ago

@randall-coding We are also seeing this indefinite hang.