Characters are wrongly encoded in rdfxml output

Brief exploration:

1. The attached file is indeed utf-8 encoded and correctly marked as such in 
the header

2. On the command line, parsing and re-serializing it with "any23 -f rdfxml" 
produces a correctly utf-8 encoded file, no encoding problems

3. I uploaded a copy of the file here: 
http://richard.cyganiak.de/2011/test/Soldering_iron_test.rdf

4. Parsing and re-serializing this uploaded file with any23.org produces a 
correctly utf-8 encoded response, no encoding problems:
http://any23.org/any23/?format=rdfxml&uri=http%3A%2F%2Frichard.cyganiak.de%2F2011%2Ftest%2FSoldering_iron_test.rdf

5. Copy-pasting the file's contents into the textarea on any23.org produces a 
broken, doubly UTF-8-encoded response, as described by the reporter

So the problem seems to be related to the processing of a submitted textarea.

Hypothesis, without having looked at the any23 servlet's code: the textarea's 
content is correctly submitted and sent over the wire as utf-8, but the servlet 
messes up the encoding before sending it to the any23 parser.
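
To illustrate the hypothesis, here is a minimal sketch of how double encoding would arise if UTF-8 bytes were decoded as ISO-8859-1 somewhere along the way. The class and method names are made up for this example and have nothing to do with the actual any23 servlet code, which I have not looked at.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any23.
class DoubleEncodingDemo {

    // Simulates the suspected bug: UTF-8 bytes from the wire are decoded as
    // ISO-8859-1, so every byte becomes a character of its own.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage. This works because ISO-8859-1 maps every byte
    // to exactly one character, so no information is lost.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";            // "cafe" with e-acute
        String garbled = misdecode(original);     // e-acute turns into two characters
        // Serializing the garbled string as UTF-8 gives the double-encoded bytes.
        byte[] out = garbled.getBytes(StandardCharsets.UTF_8);
        System.out.println(out.length);                         // 7 bytes instead of the original 5
        System.out.println(repair(garbled).equals(original));   // true
    }
}
```

If this is what happens, the output of the textarea route should contain two characters (U+00C3 followed by U+00A9) wherever the input had a single U+00E9.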

This seems relevant:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

It states that by default, POST bodies are assumed to be ISO-8859-1. This can be 
overridden by a charset parameter in the request's Content-Type header, but most 
browsers don't send one when submitting forms, so that doesn't appear to be an 
option. The solution proposed there is to put a filter in front of the servlet 
that sets the request encoding. Apparently, ready-made code for doing that can 
be lifted from Tomcat.
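
A rough sketch of why the assumed charset matters when a form field is decoded. It uses only java.net.URLDecoder to mimic the container's decoding step; the sample value is made up, and the class is hypothetical. In a real servlet or filter, the equivalent of the "UTF-8" branch would presumably be a call to request.setCharacterEncoding("UTF-8") before any parameter is read, but I have not tested that against the any23 servlet.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; it only mimics how a container decodes a form field.
class FormDecodingDemo {

    // Percent-decodes one form value using the given charset name.
    static String decodeField(String rawValue, String charset) {
        try {
            return URLDecoder.decode(rawValue, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // A browser on a UTF-8 page submits e-acute as the two bytes C3 A9.
        String raw = "caf%C3%A9";
        // Default assumption: each byte becomes one character.
        String assumedLatin1 = decodeField(raw, "ISO-8859-1");
        // What forcing the request encoding would achieve.
        String forcedUtf8 = decodeField(raw, "UTF-8");
        System.out.println(assumedLatin1.length()); // 5
        System.out.println(forcedUtf8.length());    // 4
    }
}
```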

Original comment by richard....@gmail.com on 11 Mar 2011 at 7:29

Changed state: Accepted
