Closed: charly closed this issue 10 years ago
Thanks, Ruby 1.9+ has made character encoding issues quite painful. Probably the right place to do this is in RDF::Format (in the RDF.rb gem), so that all gems get a consistent sample. However, the encoding should probably be UTF-8, rather than an ISO encoding.
A specific reproducible use case would be useful.
I'm scraping writers on dbpedia.fr/page because dbpedia.fr/resource sometimes gives me nothing, I don't know why... Anyway, you can reproduce it like this:
RDF::Graph.load("http://fr.dbpedia.org/page/J._M._G._Le_Cl%C3%A9zio")
When the sample arrives, its encoding is:
sample.encoding => #<Encoding:UTF-8>
Notice that if you go to the page http://fr.dbpedia.org/page/Andr%C3%A9_Gide you'll see that the text in the first paragraph is scrambled. It appears DBpedia is doing something wrong, mixing UTF-8 and ISO...
If the page is declared as UTF-8 but is actually in some other encoding, I don't really know how we can detect that. However, forcing the sample to ASCII might be enough to get detect to behave. Still, this sounds like a problem on DBpedia's side.
/resource will do a redirect; it could be that OpenURI.open_uri, which is relied on to open such files, doesn't honor the 303 redirect. Other libraries typically override this, but re-implementing using Net::HTTP might be necessary to get redirection/range-14 issues right.
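For illustration, here is a rough sketch of what following the redirect by hand with Net::HTTP could look like. The method name fetch_following_redirects, the redirect limit, and the Accept header value are assumptions made for this example; this is not what RDF.rb or OpenURI actually does.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: GET a URL and follow redirects (including 303 See Other)
# up to `limit` hops. The Accept value is an assumed default for RDF content.
def fetch_following_redirects(url, accept: 'text/turtle', limit: 5)
  raise ArgumentError, 'too many redirects' if limit <= 0

  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['Accept'] = accept

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end

  location = response['location']
  if response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the current URL.
    next_url = URI.join(url, location).to_s
    fetch_following_redirects(next_url, accept: accept, limit: limit - 1)
  else
    response
  end
end
```

With something like this, a request for a /resource URL would end up at whatever document the server redirects to, and the caller can inspect the final response's Content-Type header before handing the body to a reader.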
This is fixed on the develop branch targeted for a 1.1 release. The fix is in RDF.rb, where samples are all cast to ASCII_8BIT before checking, along with numerous other fixes to character encodings.
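As a minimal sketch of the idea (not the actual RDF.rb code), a detection check can work on a binary-labelled copy of the sample, so stray bytes no longer matter to the regexp engine. The method looks_like_html? and its pattern are hypothetical stand-ins for a format's detect logic.

```ruby
# Hypothetical detection check, for illustration only.
def looks_like_html?(sample)
  # Work on a copy labelled ASCII-8BIT (binary) so the caller's string is untouched.
  binary = sample.dup.force_encoding(Encoding::ASCII_8BIT)
  # An ASCII-only pattern matches against binary data without encoding errors.
  binary.match?(/<html/i)
end

# A sample labelled UTF-8 that contains a lone \xE9 byte (invalid in UTF-8).
sample = "<html><body>Cl\xE9zio".b.force_encoding(Encoding::UTF_8)
detected = looks_like_html?(sample)
```

The cast only changes the label on the bytes, not the bytes themselves, so it is cheap and cannot fail the way a transcoding step could.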
Hi,
I've been using rdf.rb on dbpedia.fr with Ruby 2.0, and the error happens in the Format#detect method. The detect method receives a text sample and does some matching on it. I guess the culprit is the \xE9 that the sample contains. If I force the specified encoding before the match, it works out. Otherwise it chokes.
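The failure can be sketched outside RDF.rb with a few lines of plain Ruby. The sample string below is made up, and treating the bytes as ISO-8859-1 is an assumption for the example; the encoding forced in the original report may have been different.

```ruby
# Bytes of "Clezio" with an e-acute as the single byte \xE9, as in ISO-8859-1.
raw = "Cl\xE9zio".b

# Label the bytes as UTF-8, the way the sample arrives in this report.
sample = raw.dup.force_encoding(Encoding::UTF_8)
valid = sample.valid_encoding?

# Matching a regexp against a string whose encoding is broken raises ArgumentError.
error = begin
  sample =~ /zio/
  nil
rescue ArgumentError => e
  e
end

# Assumption: the bytes really are ISO-8859-1. Relabel them, then transcode to UTF-8.
fixed = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
position = fixed =~ /zio/
```

Here valid is false and error holds the exception, while the match on fixed succeeds because the transcoded string is well-formed UTF-8.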