ruby-rdf / sparql-client

SPARQL client for Ruby.
http://rubygems.org/gems/sparql-client
The Unlicense

UTF-8 characters raise an error on sparql insert #85

Closed alexandergantikow closed 6 years ago

alexandergantikow commented 6 years ago

Dear developers,

I'm trying to upload some triples with SPARQL INSERT. For example, I tried to insert a triple describing a 'title'. If the title contains special characters or a German umlaut, the sparql client returns the following error:

C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/sparql-client-3.0.0/lib/sparql/client.rb:351:in `block in response': Error 500: 400: Unable to parse form content (SPARQL::Client::ServerError)
...
 Processing query INSERT DATA {
<http://kb.esit4sip.eu/learning-instances/dbc6b574906c57e6f531edcfa2df82ba> <http://kb.esit4sip.eu/learning/title> "!§$%&/()=?`*áé" .
}
from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/sparql-client-3.0.0/lib/sparql/client.rb:704:in `call'
from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/sparql-client-3.0.0/lib/sparql/client.rb:704:in `block in request'

Furthermore, my Fuseki (3.8.0) server returns this error:

[2018-08-02 13:51:33] Fuseki     INFO  [1] POST http://localhost:3030/esit4sip-test1/update
[2018-08-02 13:51:33] Fuseki     WARN  [1] RC = 500 : 400: Unable to parse form content
org.eclipse.jetty.http.BadMessageException: 400: Unable to parse form content
        at org.eclipse.jetty.server.Request.getParameters(Request.java:376)
        at org.eclipse.jetty.server.Request.getParameterValues(Request.java:1049)
        at javax.servlet.ServletRequestWrapper.getParameterValues(ServletRequestWrapper.java:221)

... ...

Caused by: org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte F5 in state 0
        at org.eclipse.jetty.util.Utf8Appendable.appendByte(Utf8Appendable.java:247)
        at org.eclipse.jetty.util.Utf8Appendable.append(Utf8Appendable.java:157)
        at org.eclipse.jetty.util.UrlEncoded.decodeUtf8To(UrlEncoded.java:522)
        at org.eclipse.jetty.util.UrlEncoded.decodeTo(UrlEncoded.java:572)
        at org.eclipse.jetty.server.Request.extractFormParameters(Request.java:525)
        at org.eclipse.jetty.server.Request.extractContentParameters(Request.java:457)
        at org.eclipse.jetty.server.Request.getParameters(Request.java:372)
        ... 50 more
[2018-08-02 13:51:33] Fuseki     INFO  [1] 500 400: Unable to parse form content (37 ms)

Here is the Ruby code I'm using:

sparql = SPARQL::Client.new(SPARQL_ENDPOINT)
graph = RDF::Graph.new
title = "a german umlaut: ä "
graph << [LEARNING_INSTANCES[resource_id], LEARNING['title'], title]
sparql.insert_data(graph)

As long as I use a title without special characters, everything works fine with the sparql client. Furthermore, if I use the Fuseki web GUI, the umlaut title is accepted. So it seems that the character encoding is causing the trouble. Because I'm not an expert in programming, I can't say whether this error comes from the sparql-client, Fuseki, or the Jetty server. My Google research didn't get me any further either. So feel free to say so if this error does not come from the client.

I'm using the following software:

Thank you, Alexander

gkellogg commented 6 years ago

The server is complaining that part of the string isn't valid UTF-8; the sparql-client gem doesn't do anything to modify these, so it may be that the source you're getting them from isn't valid UTF-8.

From your code it seems you're trying to insert the string "a german umlaut: ä ", but the error lists a different string: "!§$%&/()=?`*áé". What's here definitely seems to be valid UTF-8.

The only other possibility I could see is that the server is not treating the insert body as UTF-8, which could be a server configuration (albeit an odd one). This could be overridden by adding charset=utf-8 into the Content-Type header, which would require a patch.
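To illustrate what an explicit charset in the Content-Type header looks like, here is a minimal sketch that builds (but does not send) a SPARQL update request with Ruby's standard Net::HTTP. It is independent of sparql-client and is not the patch discussed here. The helper name build_update_request is hypothetical, the example triple uses placeholder example.org URIs, and application/sparql-update is the media type the SPARQL 1.1 Protocol defines for posting an update directly as the request body.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a POST request whose body is the update text
# itself and whose Content-Type states the charset explicitly.
# Nothing is sent over the network here.
def build_update_request(endpoint, update_text)
  uri = URI(endpoint)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/sparql-update; charset=utf-8'
  # Make sure the bytes really are UTF-8 before declaring them as such.
  request.body = update_text.encode(Encoding::UTF_8)
  request
end

update = "INSERT DATA { <http://example.org/s> <http://example.org/p> \"a german umlaut: \u00E4\" . }"
request = build_update_request('http://localhost:3030/esit4sip-test1/update', update)
puts request['Content-Type']
```

Declaring the charset only helps if the body bytes match it, which is why the sketch transcodes the text before assigning it.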

It would be worth seeing why the error reported is inconsistent with the data you're posting; perhaps it's from a different post.

alexandergantikow commented 6 years ago

Dear Mr. Kellogg,

thank you for your reply.

You are right: in my Ruby code I'm using "a german umlaut: ä", while my error displays the string "!§$%&/()=?`*áé". This comes from testing which characters are accepted and is a mistake in my issue posting. Sorry! Nonetheless, the error type is the same for both strings.

You mention that the source I'm getting my content from may not be valid UTF-8. That's what I thought too, so I experimented with "hand-typed" strings like "a german umlaut: ä". When I asked for their .encoding, Ruby returned UTF-8 as their internal representation.
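A check like the one described can be sketched as follows. The helper name describe_encoding is hypothetical; it reports the encoding label, whether the bytes are valid under that label, and the raw bytes, since a string can carry a UTF-8 label without holding valid UTF-8. Printing Encoding.default_external is included because that value depends on the environment and may differ from the encoding of string literals.

```ruby
# Hypothetical helper: report how Ruby sees a string, namely its encoding
# label, whether the bytes are valid under that label, and the bytes in hex.
def describe_encoding(str)
  {
    encoding: str.encoding.name,
    valid: str.valid_encoding?,
    bytes: str.bytes.map { |b| format('%02X', b) }.join(' ')
  }
end

# \u00E4 is "ä"; a literal containing a \u escape is always UTF-8.
title = "a german umlaut: \u00E4"
puts describe_encoding(title).inspect

# Environment-dependent default used for I/O; not necessarily UTF-8.
puts "default external: #{Encoding.default_external.name}"
```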

I thought about the POST header too and tried to modify it through the sparql client. For the insert_data method I couldn't find a way, so I tried passing options as a parameter when initializing the client, as described here. But it didn't work and doesn't seem to be the right way. Would a "patch" mean modifying the gem itself? You also suggest the server configuration as a possible source of the error. Would that mean my Fuseki server isn't configured correctly?

alexandergantikow commented 6 years ago

I think I found the origin of my UTF-8 encoding problem. I followed the processing of the client to its "InsertData" class in update.rb. Here the "to_s" method is used to add the statements of a graph to a string. RDF::NTriples::Writer.buffer forces the string into another encoding; in my case it's "CP850". So it seems that the problem comes from "writer.rb" of the RDF gem (?).

As a quick and dirty solution, I added query_text.encode!('UTF-8') to the client's update.rb.

class InsertData < Operation
  ...
  def to_s
    query_text = 'INSERT DATA {'
    query_text += ' GRAPH ' + SPARQL::Client.serialize_uri(self.options[:graph]) + ' {' if self.options[:graph]
    query_text += "\n"
    # puts query_text.encoding # ==> UTF-8
    # Append the graph's statements, serialized as N-Triples
    query_text += RDF::NTriples::Writer.buffer { |writer| @data.each { |d| writer << d } }
    # puts query_text.encoding # ==> CP850
    query_text += '}' if self.options[:graph]
    query_text += "}\n"
    # Temporary fix of encoding problem
    query_text.encode!('UTF-8')
  end
end
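
For a fix like the one above, it matters whether the string is transcoded (String#encode) or only relabelled (String#force_encoding). The thread does not establish whether the bytes here were genuinely CP850 or UTF-8 bytes carrying a CP850 label, so the following is only a sketch of the distinction. The helper normalize_to_utf8 is hypothetical, and its heuristic (relabel if the bytes already form valid UTF-8, otherwise transcode) can misjudge short non-UTF-8 strings whose bytes happen to be valid UTF-8.

```ruby
# Hypothetical helper: return a UTF-8 string without double-encoding.
# If the bytes already form valid UTF-8, only the label is changed;
# otherwise the string is transcoded from its current encoding.
def normalize_to_utf8(str)
  relabelled = str.dup.force_encoding(Encoding::UTF_8)
  return relabelled if relabelled.valid_encoding?

  str.encode(Encoding::UTF_8)
end

utf8 = "\u00E4"                                  # "ä" as UTF-8 (bytes C3 A4)

# Case 1: genuinely CP850 bytes. Transcoding is the right operation.
real_cp850 = utf8.encode('CP850')
puts normalize_to_utf8(real_cp850) == utf8

# Case 2: UTF-8 bytes that merely carry a CP850 label.
mislabelled = utf8.dup.force_encoding('CP850')
# Transcoding treats each byte as its own CP850 character, so one
# character becomes two.
puts mislabelled.encode('UTF-8').length
puts normalize_to_utf8(mislabelled) == utf8
```

Case 2 is one way a blanket encode!('UTF-8') can produce wrong output rather than an error.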

gkellogg commented 6 years ago

The writer is probably taking the encoding from the input data, as it's not set explicitly. The logic behind setting the encoding in the writer goes back a ways, but pretty much everyone expects the default encoding to be UTF-8. This could be specified using an encoding: :utf-8 parameter to Writer.buffer, but I think the time's come to simply add this as a default to RDF::Writer#buffer, and elsewhere.

gkellogg commented 6 years ago

@alexandergantikow The RDF.rb repo was updated to change the way in which the default encoding was found. It could be that your environment affected the way that it was set, but I couldn't reproduce it. Now, it uses Writer#encoding as the default, which was otherwise nil. Please give it a try to see if it solves this issue, and I'll release an update to the RDF.rb gem.

alexandergantikow commented 6 years ago

@gkellogg I went to the repo and updated my installed gem with the files you committed in August. Is this what you intended? Unfortunately, it didn't solve my issue.

I followed your tip from 7 August too. I went to the writer documentation, where "Serializing RDF statements into an NTriples string with escaped UTF-8" is described. Since I started using this example, my issue is gone. This is only a temporary solution too, but it works better than my "#Temporary fix of encoding problem" proposed above. The query_text.encode!('UTF-8') sometimes produced an unpredictable error too.
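
The idea behind "escaped UTF-8" can be sketched with the standard library alone: N-Triples allows non-ASCII characters to be written as \uXXXX or \UXXXXXXXX escapes, so the serialized text is pure ASCII and no longer depends on how the string is labelled or on what charset the server assumes. The helper escape_non_ascii below is hypothetical and is not RDF.rb's implementation; a real writer also escapes quotes, backslashes, and control characters, which this sketch leaves out.

```ruby
# Hypothetical helper: replace every non-ASCII character with an
# N-Triples style \uXXXX (or \UXXXXXXXX) escape. Quotes, backslashes,
# and control characters are NOT handled here.
def escape_non_ascii(str)
  str.encode(Encoding::UTF_8).each_char.map do |ch|
    code_point = ch.ord
    if code_point < 0x80
      ch
    elsif code_point <= 0xFFFF
      format('\\u%04X', code_point)
    else
      format('\\U%08X', code_point)
    end
  end.join
end

escaped = escape_non_ascii("a german umlaut: \u00E4")
puts escaped
puts escaped.ascii_only?
```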

gkellogg commented 6 years ago

If you’d like me to look into it further, please give me a script and Gemfile.lock which reproduces the problem.