ruby-rdf / rdf-rdfa

Ruby RDFa reader/writer for RDF.rb.
http://ruby-rdf.github.com/rdf-rdfa
The Unlicense

Invalid byte sequence in UTF-8 #12

Closed: charly closed this issue 10 years ago

charly commented 10 years ago

Hi,

I've been using rdf.rb on dbpedia.fr with Ruby 2.0, and the error happens here, in the Format#detect method:

rdf-rdfa-1.0.3/lib/rdf/rdfa/format.rb:36:in `match'

The detect method receives a text sample and does some matching on it. The sample contains:

<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>...
<title>About: J. M. G. Le Cl\xE9zio</title>

I guess the culprit is \xE9. If I force the specified encoding before the match:

sample.force_encoding('ISO-8859-1') 

...it works. Otherwise it chokes.
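The behaviour described above can be sketched in plain Ruby, without rdf-rdfa. This is a minimal illustration, assuming the sample is tagged as UTF-8 while carrying an ISO-8859-1 byte; the helper `try_match` and the patterns are hypothetical and are not the code in format.rb.

```ruby
# A sample like the one in the report: the byte \xE9 is "é" in ISO-8859-1,
# but on its own it is not a valid UTF-8 sequence.
sample = "<title>About: J. M. G. Le Cl\xE9zio</title>".dup.force_encoding("UTF-8")

# Hypothetical helper: returns the MatchData, or the error message if
# matching raises because of the broken encoding.
def try_match(str, pattern)
  str.match(pattern)
rescue ArgumentError => e
  e.message
end

valid_before  = sample.valid_encoding?        # the UTF-8 tag does not fit the bytes
result_before = try_match(sample, /<title>/)  # may be an error message

# Relabel the same bytes with the encoding declared in the XML prolog.
fixed        = sample.dup.force_encoding("ISO-8859-1")
valid_after  = fixed.valid_encoding?
result_after = try_match(fixed, /<title>(.*)<\/title>/)

puts "valid before: #{valid_before}, valid after: #{valid_after}"
puts "title: #{result_after[1].encode('UTF-8')}" if result_after.is_a?(MatchData)
```

Note that `force_encoding` only changes the label on the string; the bytes themselves are untouched.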

gkellogg commented 10 years ago

Thanks, Ruby 1.9+ has made character encoding issues quite painful. Probably the right place to do this is in RDF::Format (in the RDF.rb gem), so that all gems get a consistent sample. However, the encoding should probably be UTF-8, rather than an ISO encoding.

A specific reproducible use case would be useful.
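One way to end up with a UTF-8 sample, rather than an ISO-labelled one, is to transcode after relabelling. The sketch below assumes the declared encoding is known (here ISO-8859-1, as in the XML prolog quoted above); `to_utf8` is a hypothetical helper, not part of RDF.rb.

```ruby
# Hypothetical helper: treat the bytes as the declared encoding,
# then transcode them to real UTF-8.
def to_utf8(bytes, declared = "ISO-8859-1")
  bytes.dup.force_encoding(declared).encode("UTF-8")
end

raw  = "Le Cl\xE9zio"   # one ISO-8859-1 byte for "é"
utf8 = to_utf8(raw)

puts utf8.encoding          # the result is tagged as UTF-8
puts utf8.valid_encoding?   # and its bytes now agree with that tag
```

Unlike `force_encoding` alone, `encode` rewrites the bytes, so the result is safe to match against UTF-8 patterns.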

charly commented 10 years ago

I'm scraping writers on dbpedia.fr/page because dbpedia.fr/resource sometimes gives me nothing, I don't know why. Anyway, you can reproduce it like this:

RDF::Graph.load("http://fr.dbpedia.org/page/J._M._G._Le_Cl%C3%A9zio")

When the sample arrives, its encoding is:

sample.encoding => #<Encoding:UTF-8>

Notice that if you go to the web page http://fr.dbpedia.org/page/Andr%C3%A9_Gide you'll see that the text is scrambled in the first paragraph. It appears DBpedia is doing something wrong, mixing UTF-8 and ISO...

gkellogg commented 10 years ago

If the page is labelled as UTF-8 but is actually in some other encoding, I don't really know how we can detect that. However, forcing the sample to ASCII might be enough to get detect to behave. But this sounds like a problem on DBpedia.

/resource will do a redirect; it could be that OpenURI.open_uri, which is relied on to open such files, doesn't honor the 303 redirect. Other libraries typically override this, but re-implementing using Net::HTTP might be necessary to get redirection/range-14 issues right.

gkellogg commented 10 years ago

This is fixed on the develop branch targeted for a 1.1 release. The fix is in RDF.rb, where samples are all cast to ASCII_8BIT before checking, along with numerous other fixes to character encodings.
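The idea of casting a sample to ASCII_8BIT before checking can be illustrated as follows. This is only a sketch of the technique, not the actual RDF.rb change: `binary_sample`, `looks_like_html?` and the pattern are hypothetical names chosen for the example.

```ruby
# Hypothetical helper: a binary copy of the sample. Every byte sequence
# is valid in ASCII-8BIT, so matching cannot fail on a bad byte.
def binary_sample(sample)
  sample.dup.force_encoding(Encoding::ASCII_8BIT)
end

# Hypothetical detection check using an ASCII-only pattern, which is
# compatible with a binary string.
def looks_like_html?(sample)
  !!(binary_sample(sample) =~ /<(?:html|title)\b/i)
end

sample = "<title>About: J. M. G. Le Cl\xE9zio</title>".dup.force_encoding("UTF-8")

puts looks_like_html?(sample)               # detection works despite \xE9
puts looks_like_html?("plain \xE9 text")    # no markup, so no match
```

Because the helper works on a copy, the caller's string keeps its original encoding label. The pattern must stay ASCII-only, since non-ASCII characters in a pattern would not be compatible with a binary string.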