janx / chardet2

Universal Encoding Detector
GNU Lesser General Public License v2.1
Encoding::CompatibilityError: incompatible encoding regexp match #8

Open orbanbotond opened 11 years ago

orbanbotond commented 11 years ago

Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
 from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `=~'
 from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `feed'
 from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:46:in `chardet'
 from (irb):12
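
For context, Ruby raises this error when a regexp with a fixed binary (ASCII-8BIT) encoding is matched against a UTF-8 string that contains non-ASCII characters. Below is a minimal sketch of the failure and of one possible workaround, which is to match against a binary-tagged copy of the input. The `HIGH_BYTE` pattern and both helper names are made up for illustration and are not taken from UniversalDetector.rb.

```ruby
# Illustrative pattern only (an assumption, not the gem's actual regexp):
# the /n flag plus non-ASCII escapes yields a fixed ASCII-8BIT regexp.
HIGH_BYTE = /[\x80-\xFF]/n

# Returns true when matching `text` directly raises the error from this issue.
def compatibility_error?(text)
  HIGH_BYTE =~ text
  false
rescue Encoding::CompatibilityError
  true
end

# Workaround sketch: match against a copy tagged as binary, so the
# regexp and the string agree on encoding. Returns a byte offset or nil.
def first_high_byte(text)
  HIGH_BYTE =~ text.dup.force_encoding(Encoding::BINARY)
end

puts compatibility_error?("abc")       # ASCII-only input is compatible
puts compatibility_error?("a\u00E9")   # non-ASCII UTF-8 input is not
puts first_high_byte("a\u00E9").inspect
```

If the detector only inspects raw bytes, passing it `text.dup.force_encoding(Encoding::BINARY)` might sidestep the exception. That is an assumption and is untested against chardet2 itself.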

saneshark commented 10 years ago

Same issue when testing with:

UniversalDetector.chardet("∀,∈,≠,Ω,∑,∏,ɔ,⍴,€,ζ,π,ป่")

which should return UTF-8 as the encoding.
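
As a sanity check on that expectation, Ruby itself can confirm that the bytes form valid UTF-8 without any detector. This is a minimal sketch with a hypothetical helper name. It only answers "is this valid UTF-8?", which is much narrower than general encoding detection.

```ruby
# Hypothetical helper: true when the bytes of `text` form valid UTF-8.
def looks_like_utf8?(text)
  text.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

sample = "∀,∈,≠,Ω,∑,∏,ɔ,⍴,€,ζ,π,ป่"
puts looks_like_utf8?(sample)          # → true
puts looks_like_utf8?("\xE9t\xE9".b)   # ISO-8859-1 bytes for "été" → false
```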

mremond commented 10 years ago

Did you solve your issue?

orbanbotond commented 10 years ago

Hi,

The project is now deprecated, but the issue is still there: the library didn't return the proper encoding for me.

mremond commented 10 years ago

Thanks! I guess I'll have to find another way to detect the encoding, then.

orbanbotond commented 10 years ago

Well, no... I tried three other libraries and then decided to specify the encoding manually.
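
Specifying the encoding by hand usually means tagging the raw bytes with the encoding they are known to be in, then transcoding. This minimal sketch assumes the input is ISO-8859-1, which is an arbitrary choice for illustration, since the thread does not say which encoding was used. The helper name is hypothetical.

```ruby
# Hypothetical helper: interpret `bytes` as `source` and return a UTF-8 copy.
def to_utf8(bytes, source = Encoding::ISO_8859_1)
  bytes.dup.force_encoding(source).encode(Encoding::UTF_8)
end

latin1 = "caf\xE9".b         # the bytes of "café" in ISO-8859-1
utf8 = to_utf8(latin1)
puts utf8.encoding           # → UTF-8
puts utf8.valid_encoding?    # → true
```

When reading from disk, `File.read(path, encoding: "ISO-8859-1:UTF-8")` does the same tagging and transcoding in one step.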

orbanbotond commented 10 years ago

I think it was a hard case.

saneshark commented 10 years ago

I just patched rchardet, an older library. That said, I think one could simply write the string to a temp file and let the system's file utility detect the encoding:

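 File.binwrite('string.tmp', string)  # write the raw bytes to the temp file first
 # `file --mime-encoding` prints "string.tmp: utf-8"; awk keeps the second field
 encoding = `file --mime-encoding string.tmp | awk '{print $2}'`.strip.upcase
 string.force_encoding(encoding)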
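
A slightly more defensive version of the same idea is sketched below, assuming a Unix-like system with the `file` utility on the PATH. Both helper names are hypothetical. The main additions are a real temp file, no shell or awk, and a nil result when `file` reports a label Ruby does not know (such as "unknown-8bit"), since `force_encoding` would raise on it.

```ruby
require "tempfile"

# Hypothetical helper: map a label printed by `file` to a Ruby Encoding,
# or nil for labels Ruby does not recognise (e.g. "unknown-8bit").
def encoding_for_label(label)
  Encoding.find(label.strip)
rescue ArgumentError
  nil
end

# Hypothetical helper: ask file(1) for the encoding of `string`.
# Returns nil when `file` is unavailable or its answer is unusable.
def detect_with_file(string)
  Tempfile.create("chardet") do |tmp|
    tmp.binmode
    tmp.write(string)
    tmp.flush
    # --brief drops the "name:" prefix, so no awk is needed.
    label = IO.popen(["file", "--brief", "--mime-encoding", tmp.path], &:read)
    $?.success? ? encoding_for_label(label) : nil
  end
rescue SystemCallError
  nil
end

text = "d\u00E9j\u00E0 vu".b    # raw bytes with no useful encoding tag
enc = detect_with_file(text)
text.force_encoding(enc) if enc
```

Keep in mind that `file` relies on heuristics too, so short or ambiguous input can still be misreported.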