ruby / csv

CSV Reading and Writing
https://ruby.github.io/csv/
BSD 2-Clause "Simplified" License
177 stars 113 forks source link

feature request: Warn when reading BOM text with headers option #301

Closed JunichiIto closed 3 months ago

JunichiIto commented 3 months ago

When BOM exists in CSV text, you cannot read first column by field name:

require 'csv'

text = "\u{feff}id,name\n1,Alice"
csv = CSV.parse(text, headers: true)
csv[0]['id']   #=> nil
csv[0]['name'] #=> "Alice"

I understand there are some workarounds for this problem:

But the biggest problem is it's too hard to notice the existence of BOM. Actually, when I came across this issue, I had no idea why I failed to get the id value. I guess most people never wants to keep BOM in field name.

So I'd be happy if CSV.parse or some other reading methods would warn when input text contains BOM and headers: true is specified, like this:

text = "\u{feff}id,name\n1,Alice"
csv = CSV.parse(text, headers: true)
#=> warning: first header name starts with U+FEFF (zero width no-break space, BOM)
csv = CSV.read('with-bom.csv', headers: true)
#=> warning: first header name starts with U+FEFF (zero width no-break space, BOM)

# warning is not required when encoding like "bom|utf-8" is specified
csv = CSV.read('with-bom.csv', headers: true, encoding: 'bom|utf-8')
#=> (no warnings)

By the way, why do we put BOM so often? This is because Excel cannot open UTF-8 CSV correctly without BOM! Save CSV file in UTF-8 with BOM. Fix Korean text from being corrupted… | by Hyunbin | Medium

Related issue: https://github.com/ruby/csv/issues/43

kou commented 3 months ago

I can understand the motivation but I'm not sure that this is a proper approach.

How about enabling BOM detection in CSV.open by default instead?

JunichiIto commented 3 months ago

How about enabling BOM detection in CSV.open by default instead?

Do you mean BOM is removed automatically without specifying encoding?

# like this
csv = CSV.open('with-bom.csv', headers: true)
csv[0]['id'] #=> 1

It would be much better than warning!

kou commented 3 months ago

Yes.

How about this?

diff --git a/lib/csv.rb b/lib/csv.rb
index b016b8f..b5a4fe6 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -1581,7 +1581,14 @@ class CSV
     def open(filename, mode="r", **options)
       # wrap a File opened with the remaining +args+ with no newline
       # decorator
-      file_opts = options.dup
+      file_opts = {}
+      have_encoding_options = (options.key?(:encoding) or
+                               options.key?(:external_encoding) or
+                               mode.include?(":"))
+      if not have_encoding_options and Encoding.default_external == Encoding::UTF_8
+        file_opts[:encoding] = "bom|utf-8"
+      end
+      file_opts.merge!(options)
       unless file_opts.key?(:newline)
         file_opts[:universal_newline] ||= false
       end
JunichiIto commented 3 months ago

Great! That works fine. Is it possible to apply to CSV.parse too?

require './lib/csv'

text = "\u{feff}id,name\n1,Alice"
File.write('tmp/with-bom.csv', text)

csv = CSV.open('tmp/with-bom.csv', headers: true).read
p csv[0]['id']
p csv[0]['name']

csv = CSV.read('tmp/with-bom.csv', headers: true)
p csv[0]['id']
p csv[0]['name']

CSV.foreach('tmp/with-bom.csv', headers: true) do |row|
  p csv[0]['id']
  p csv[0]['name']
end

csv = CSV.parse(File.read('tmp/with-bom.csv'), headers: true)
p csv[0]['id']
p csv[0]['name']
"1"
"Alice"
"1"
"Alice"
"1"
"Alice"
nil
"Alice"
kou commented 3 months ago

Is it possible to apply to CSV.parse too?

No. It's an user's responsibility that BOM is handled before calling CSV.parse because it receives an opened IO or String.

Why do you want to use CSV.parse(File.read) not CSV.open (or something)?

JunichiIto commented 3 months ago

I used CSV.parse for testabilty. For example, like this:

require 'csv'

class MyClass
  def do_something
    CSV.parse(my_csv_text, headers: true).map do |row|
      row['name']
    end
  end

  # extract to a method to make it easier to stub
  def my_csv_text
    File.read('my_file.csv', encoding: 'bom|utf-8')
  end
end

RSpec.describe MyClass do
  let(:my_csv_text) do
    <<~CSV
      name,age
      Alice,20
      Bob,30
    CSV
  end
  it 'does something' do
    my_class = MyClass.new
    allow(my_class).to receive(:my_csv_text).and_return(my_csv_text)
    expect(my_class.do_something).to eq(['Alice', 'Bob'])
  end
end
kou commented 3 months ago

The example works well because it uses encoding: 'bom|utf-8' for File.read, right?

I this case, both of the current CSV.parse and your implementation don't have any problem, right?

JunichiIto commented 3 months ago

Yes, but my first implementation was like this:

require 'csv'

class MyClass
  def do_something
    CSV.parse(my_csv_text, headers: true).map do |row|
      row['name']
    end
  end

  # extract to a method to make it easier to stub
  def my_csv_text
    # my first implementation (it didn't work)
    File.read('my_file.csv')
  end
end

RSpec.describe MyClass do
  let(:my_csv_text) do
    <<~CSV
      name,age
      Alice,20
      Bob,30
    CSV
  end
  it 'does something' do
    my_class = MyClass.new
    allow(my_class).to receive(:my_csv_text).and_return(my_csv_text)
    expect(my_class.do_something).to eq(['Alice', 'Bob'])
  end
end

So I was wondering why I couldn't get values by column name.

kou commented 3 months ago

I think that it's your program's problem. If your program opens a file, your program is responsible for processing BOM.

(In this case, I think that it's better that you don't use a stub for better testing.)

JunichiIto commented 3 months ago

OK, I understand your idea.

But as a library user, it's hard to notice the behavior difference between parse and open. Developers might be confused when row['name'] will not return value only when using parse. So at least the specification should be well documented.

kou commented 3 months ago

Do you have a suggested documentation change?

JunichiIto commented 3 months ago

How about this?

Please make sure if your text contains BOM or not. CSV.parse will not remove BOM automatically. You might want to remove BOM before calling CSV.parse:

# remove BOM on calling File.open
csv_table = File.open(path, encoding: 'bom|utf-8') do |file|
  CSV.parse(file, headers: true)
end
kou commented 3 months ago

It looks good to me. Could you open a PR that adds it to the CSV.parse document? https://github.com/ruby/csv/blob/73b877dec93e68f9db1ed555299bede15b54ed5e/lib/csv.rb#L1620-L1731

JunichiIto commented 3 months ago

Thank you for your review. I created PR here: https://github.com/ruby/csv/pull/305

kou commented 3 months ago

I've applied https://github.com/ruby/csv/issues/301#issuecomment-2094964029 .

JunichiIto commented 3 months ago

Thank you! 😄