janx / chardet2

Universal Encoding Detector
GNU Lesser General Public License v2.1

I got ArgumentError: invalid byte sequence in UTF-8 in Ruby 1.9.3 while trying to detect an 'ISO-8859-1' encoded CSV file. #7

Open orbanbotond opened 11 years ago

orbanbotond commented 11 years ago

ArgumentError: invalid byte sequence in UTF-8
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `=~'
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `feed'
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:46:in `chardet'

janx commented 11 years ago

Can you attach the csv file?

orbanbotond commented 11 years ago

The file is 35 MB. I will try to make it smaller.

janx commented 11 years ago

@orbanbotond I cannot reproduce on my ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux], here's my test script:

require 'UniversalDetector'

data = File.open('Insight_Extract_11-04-2013-a.csv', 'rb').read
p UniversalDetector.chardet(data)

The output is {"encoding"=>"ISO-8859-2", "confidence"=>0.7616471388020385}.
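One possible reason the script does not reproduce the error is that it reads the file in binary mode, so the string handed to the detector is tagged ASCII-8BIT rather than UTF-8. The sketch below illustrates that difference without the gem. It assumes the original caller passed a string tagged as UTF-8, which the thread does not confirm; the sample bytes and variable names are made up for illustration.

```ruby
# Bytes of "café" in ISO-8859-1: "é" is the single byte 0xE9.
latin1_bytes = "caf\xE9".b

# Assumption: a text-mode read under a UTF-8 default external encoding
# would tag these same bytes as UTF-8.
as_utf8 = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
utf8_valid = as_utf8.valid_encoding?

# Matching a regex against the mis-tagged string is the step that can raise.
regex_error = begin
  as_utf8 =~ /caf/
  nil
rescue ArgumentError => e
  e.message
end

# A binary-mode read yields an ASCII-8BIT string, in which any byte
# sequence counts as valid, so the same match goes through.
as_binary = latin1_bytes
binary_valid = as_binary.valid_encoding?
match_index = as_binary =~ /caf/

puts "UTF-8 tagged valid?  #{utf8_valid}"
puts "regex error:         #{regex_error.inspect}"
puts "binary tagged valid? #{binary_valid}"
puts "binary match index:  #{match_index.inspect}"
```

If this is the cause, reading the file in binary mode, or calling `force_encoding(Encoding::BINARY)` on the data before detection, may avoid the exception.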

orbanbotond commented 11 years ago

Well... with such a huge file it took forever to run, and I never got a result.

How long did it take you to get the detection result?

janx commented 11 years ago

I can't remember the exact number, 5-10 mins I guess.
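Given a run time of several minutes on a 35 MB file, one workaround is to run detection on only a prefix of the file. The following is a minimal sketch, not something the gem provides: `read_sample` and the 64 KiB default are hypothetical choices for illustration.

```ruby
# Hypothetical helper (not part of chardet2): read at most `limit` bytes
# from the start of the file, in binary mode.
def read_sample(path, limit = 64 * 1024)
  # File.binread returns nil when the file is empty and a length is given,
  # so fall back to an empty binary string.
  File.binread(path, limit) || ''.b
end
```

With the gem loaded, the sample could then be passed as `UniversalDetector.chardet(read_sample('Insight_Extract_11-04-2013-a.csv'))`. The trade-off is that a prefix may contain few or no non-ASCII bytes, so the result can differ from detection on the whole file.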