venukb / any23

Automatically exported from code.google.com/p/any23
Apache License 2.0

Characters are wrongly encoded in rdfxml output #129

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Open the file Soldering_iron_test.rdf in your browser and check that all 
characters are displayed correctly.
Look especially at the rdfs:label values in the different languages.
2. Go to any23.org.
3. Copy the file content into the content form.
4. Set the output to rdfxml.

What is the expected output? What do you see instead?
In the rdfxml output, the rdfs:label values are wrongly encoded.

What version of the product are you using?

Please provide any additional information below.

Original issue reported on code.google.com by danielcz...@gmail.com on 11 Mar 2011 at 5:21


GoogleCodeExporter commented 9 years ago
Brief exploration:

1. The attached file is indeed utf-8 encoded and correctly marked as such in 
the header

2. On the command line, parsing and re-serializing it with "any23 -f rdfxml" 
produces a correctly utf-8 encoded file, no encoding problems

3. I uploaded a copy of the file here: 
http://richard.cyganiak.de/2011/test/Soldering_iron_test.rdf

4. Parsing and re-serializing this uploaded file with any23.org produces a 
correctly utf-8 encoded response, no encoding problems:
http://any23.org/any23/?format=rdfxml&uri=http%3A%2F%2Frichard.cyganiak.de%2F2011%2Ftest%2FSoldering_iron_test.rdf

5. Copy-pasting the file's contents into the textarea on any23.org produces a 
broken response in which the text has been utf-8 encoded twice, as described by 
the reporter

So the problem seems to be related to the processing of a submitted textarea.

Hypothesis, without having looked at the any23 servlet's code: the textarea's 
content is correctly submitted and sent over the wire as utf-8, but the servlet 
messes up the encoding before sending it to the any23 parser.
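
To illustrate what "double utf-8 encoded" means under this hypothesis, here is a minimal Java sketch. It is not taken from the any23 code: the class name, method names and the sample label are made up for illustration. It encodes a string as utf-8, decodes the bytes as ISO-8859-1 (the suspected mistake), and shows that re-serializing the result as utf-8 yields more bytes than the original.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {
    // Simulates the suspected bug: bytes that are really utf-8
    // get decoded as if they were ISO-8859-1.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage. This works because ISO-8859-1 maps every
    // byte value to exactly one character, so no information is lost.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical label with one non-ASCII character (o with umlaut),
        // written as an escape so the source file encoding does not matter.
        String label = "L\u00f6tkolben";
        String garbled = misdecode(label);

        // The single non-ASCII character has turned into two characters.
        System.out.println("chars before: " + label.length());
        System.out.println("chars after:  " + garbled.length());

        // Serializing the garbled string as utf-8 encodes those two
        // characters again, which is the double encoding seen in the output.
        System.out.println("utf-8 bytes before: "
                + label.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("utf-8 bytes after:  "
                + garbled.getBytes(StandardCharsets.UTF_8).length);

        System.out.println("repairable: " + repair(garbled).equals(label));
    }
}
```

If the real output shows the same pattern (each non-ASCII character replaced by two Latin-1 characters), that would be consistent with the hypothesis, though it does not prove where in the servlet the misdecoding happens.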

This seems relevant:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

It states that by default, POST bodies are assumed to be ISO-8859-1. This can be 
overridden by declaring a charset in the Content-Type header of the HTTP request, 
but most browsers don't do that when submitting form posts, so it doesn't appear 
to be an option. The solution proposed there is to put a filter in front of the 
servlet that sets the request encoding before any parameters are read. 
Apparently, ready-made code for doing that could be lifted from Tomcat.
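
A rough sketch of why the charset used to read the POST body matters, using only the JDK so it runs without a servlet container. The class and method names are hypothetical, and the sample value is made up. In an actual filter, the corresponding step would presumably be a call to `request.setCharacterEncoding("UTF-8")` before the first parameter is read; that part is not shown here because it needs the servlet API on the classpath, and I have not checked how the any23 servlet reads its input.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormBodyDecodingDemo {
    // Roughly what a browser does for a form on a utf-8 page:
    // percent-encode the utf-8 bytes of the field value.
    static String encodeAsBrowser(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // utf-8 is always available
        }
    }

    // Roughly what the container does when it parses the body
    // under the given charset assumption.
    static String decodeAs(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String value = "L\u00f6tkolben"; // hypothetical textarea content
        String encoded = encodeAsBrowser(value);
        System.out.println(encoded); // prints L%C3%B6tkolben

        // Default assumption (ISO-8859-1): the value comes back garbled.
        System.out.println("ISO-8859-1 ok: "
                + decodeAs(encoded, "ISO-8859-1").equals(value));

        // With the encoding forced to utf-8, the value survives intact.
        System.out.println("UTF-8 ok:      "
                + decodeAs(encoded, "UTF-8").equals(value));
    }
}
```

The bytes on the wire are identical in both cases; only the charset assumed while decoding differs, which is why a filter in front of the servlet could be enough.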

Original comment by richard....@gmail.com on 11 Mar 2011 at 7:29