UTF-8 with BOM (character 65279 at file start)

tilo / smarter_csv

Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults and auto-discovery of column and row separators. It imports CSV files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking off batch jobs with Sidekiq, parallel processing, or uploading data to S3. Writing CSV files is equally easy.

MIT License

UTF-8 with BOM (character 65279 at file start) #27

Closed kaligrafy closed 1 year ago

kaligrafy commented 10 years ago

I was having weird problems when reading keys from a CSV file parsed by smarter_csv: the first key was never recognized. After a long time I thought maybe the file had a special character in it, and then: boom! I found the CSV file was in UTF-8 with BOM, with a Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF) (#65279) as its first character. So the first key always had this character at index [0], and nobody could see it...

Could you ignore BOM and delete this character if first line starts with it when reading the csv file? Thanks!

You could do this:

content = File.read('file.txt', mode: 'r:bom|utf-8')  # BOM-aware read; returns a String
content = content.delete_prefix("\uFEFF")             # fallback: drop a leading U+FEFF if one is still present
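
A minimal sketch of the stripping step on its own, using plain Ruby strings rather than anything inside smarter_csv. The helper name `strip_bom` and the sample header are assumptions for illustration only:

```ruby
# Hypothetical helper (not part of smarter_csv): remove a leading
# U+FEFF (the UTF-8 BOM once decoded) from a string, if present.
BOM = "\uFEFF"

def strip_bom(text)
  text.delete_prefix(BOM) # returns a copy; unchanged when no BOM is present
end

header = "\uFEFFroute_id,service_id" # made-up first line with a BOM
clean  = strip_bom(header)

puts clean                     # route_id,service_id
puts strip_bom(clean) == clean # true
```

`String#delete_prefix` needs Ruby 2.5 or newer; on older versions `text.sub(/\A\uFEFF/, '')` should behave the same way.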

tilo commented 10 years ago

can you please upload an example CSV-file somewhere, and add the link to it here?

kaligrafy commented 10 years ago

GTFS files from Montreal STM: http://www.stm.info/sites/default/files/gtfs/gtfs_stm.zip

tilo commented 10 years ago

thanks! How did you generate these CSV files? were they produced by exporting them from somewhere?

tilo commented 10 years ago

Please note that you can pass any open file handle instead of the filename!

This works nicely for me (i.e. it automatically strips the Unicode character):

  > irb
 require 'smarter_csv'
 f =  File.open('trips.txt',  "r:bom|utf-8")
 data = SmarterCSV.process( f )
 f.close

 data.size
 => 152414
 data.first
 =>  {:route_id=>1, :service_id=>"13N_S", :trip_id=>"13N_13N_S_1_1_0.22917", :trip_headsign=>"Station Honoré-Beaugrand"}

I think this is really a corner case, probably caused by the program that produced those CSV files. I don't think it needs to be fixed in the smarter_csv gem itself.

kaligrafy commented 10 years ago

These files are public and generated by the agency itself. I did not create them.

In your example, can you check that you can call data.first[:route_id] on it? because on my system, the first key (:route_id) does include the zero-width character, so data.first[:route_id] returns nil.
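
To see why the lookup comes back nil, here is a small sketch that simulates the situation with a plain Hash instead of smarter_csv output. The header string and values are made up for illustration:

```ruby
# Simulate a header row whose first field starts with an invisible BOM.
headers = "\uFEFFroute_id,service_id".split(",")
row     = headers.map(&:to_sym).zip([1, "13N_S"]).to_h

first_key = row.keys.first

puts row[:route_id].inspect          # nil: the stored key has a hidden prefix
puts first_key == :route_id          # false
puts first_key.to_s.codepoints.first # 65279, i.e. U+FEFF
```

Printing the hash looks normal because the extra character has zero width, so checking `codepoints` on the first key is a quick way to confirm the problem.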

ajw725 commented 6 years ago

@tilo i realize this is four years old, and your workaround works for me, but this does not seem to be a corner case. i'm seeing it now with any CSV generated by Excel on Windows.

kaligrafy commented 6 years ago

OK, thanks! Feel free to add the workaround if you think it could be useful.

ajw725 commented 6 years ago

as @tilo described / using his example:

without the workaround:

data = SmarterCSV.process('trips.txt')

data.first
=> {:route_id=>1, :service_id=>"13N_S", :trip_id=>"13N_13N_S_1_1_0.22917", :trip_headsign=>"Station Honoré-Beaugrand"}

data.first[:route_id]
=> nil

data.first.keys.first.to_s.chars
=> ["", "r", "o", ...]   # the first element holds the invisible U+FEFF character

with the workaround:

data = nil
File.open('trips.txt', 'r:bom|utf-8') { |f| data = SmarterCSV.process(f) }

data.first
=> {:route_id=>1, :service_id=>"13N_S", :trip_id=>"13N_13N_S_1_1_0.22917", :trip_headsign=>"Station Honoré-Beaugrand"}

data.first[:route_id]
=> 1

data.first.keys.first.to_s.chars
=> ["r", "o", ...]
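
If the workaround is needed in several places, it could be wrapped in a small helper. This is only a sketch: `open_bom_aware` is a hypothetical name, and the demo reads a temporary file with plain `gets` so it runs without the gem installed:

```ruby
require 'tempfile'

# Hypothetical helper: open a file in BOM-aware UTF-8 mode and yield the
# handle, so the caller never sees a leading U+FEFF.
def open_bom_aware(path)
  File.open(path, 'r:bom|utf-8') { |io| yield io }
end

first_line = nil
Tempfile.create(['trips', '.txt']) do |tmp|
  # Write a made-up CSV that starts with a BOM, like an Excel export.
  tmp.write("\uFEFFroute_id,service_id\n1,13N_S\n")
  tmp.flush
  first_line = open_bom_aware(tmp.path) { |io| io.gets.chomp }
end

puts first_line # route_id,service_id
```

With smarter_csv the call would presumably look like `data = open_bom_aware('trips.txt') { |f| SmarterCSV.process(f) }`, which is the same pattern as the workaround above.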

9mm commented 2 years ago

Shouldn't a smarter CSV library handle this automatically? I can't open any CSV that's been exported by Excel because these are all the column headers:

(screenshot not preserved)

On top of that, look at what it parses for the partner_id:

(screenshot not preserved)

9mm commented 2 years ago

This is definitely not a corner case, considering BOMs are the bane of working with CSVs, and Excel is the most popular CSV program. Handling this within the gem seems like a no-brainer.

tilo commented 1 year ago

looks like the previous fix for the BOM issue did not work, or there was a regression

$ hexdump -C /tmp/bom-issue.csv
00000000  ef bb bf 73 6f 6d 65 5f  69 64 2c 74 79 70 65 2c  |...some_id,type,|
00000010  66 75 7a 7a 62 6f 78 65  73 0d 0a 34 32 37 36 36  |fuzzboxes..42766|
00000020  38 30 35 2c 7a 69 7a 7a  6c 65 73 2c 31 32 33 34  |805,zizzles,1234|
00000030  0d 0a 33 38 37 35 39 31  35 30 2c 71 75 69 7a 7a  |..38759150,quizz|
00000040  65 73 2c 35 36 37 38 0d  0a                       |es,5678..|
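
The leading `ef bb bf` in the dump is the UTF-8 encoding of U+FEFF. A rough way to check for it from Ruby, sketched here with a hypothetical `utf8_bom?` helper and an in-memory copy of the first bytes rather than the actual file:

```ruby
# The three bytes that encode U+FEFF in UTF-8, as a binary string.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: true if the given bytes start with a UTF-8 BOM.
def utf8_bom?(bytes)
  bytes.byteslice(0, 3).b == UTF8_BOM
end

# Same leading bytes as the hexdump above, built in memory.
sample = "\xEF\xBB\xBFsome_id,type,fuzzboxes\r\n".b

puts utf8_bom?(sample)                        # true
puts utf8_bom?("some_id,type,fuzzboxes\r\n".b) # false
```

For a file on disk, something like `utf8_bom?(File.binread('/tmp/bom-issue.csv', 3))` should give the same answer without reading the whole file.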

tilo commented 1 year ago

fixed in https://github.com/tilo/smarter_csv/pull/220