Open vityok opened 11 years ago
when parsed without explicit encoding, are the string literals in the store then correct, or is the text corrupt?
Yep, the interesting thing is that without an explicit encoding it works just fine. The problem happens when an encoding is specified. Specifying the encoding can be avoided when reading from a file, but it must be specified in order to parse the byte array returned by Drakma's http-request with Flexi-streams.
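As a rough illustration of the decoding step mentioned above, here is a minimal sketch that turns an octet vector (such as the body Drakma returns when `:force-binary t` is passed) into a character stream. The helper names `decode-utf8-octets` and `call-with-decoded-stream` are made up for this example, and it assumes Quicklisp is available. Whether decoding up front like this sidesteps the Wilbur problem is untested here.

```lisp
;; Assumption: Quicklisp is installed; FLEXI-STREAMS is the only dependency.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :flexi-streams :silent t))

;; Hypothetical helper: decode a vector of (unsigned-byte 8) as UTF-8.
(defun decode-utf8-octets (octets)
  "Return the string obtained by decoding OCTETS as UTF-8."
  (flexi-streams:octets-to-string octets :external-format :utf-8))

;; Hypothetical helper: hand a character stream over the decoded text
;; to FUNCTION, so the consumer never sees an external format.
(defun call-with-decoded-stream (octets function)
  "Call FUNCTION with a character input stream over the decoded OCTETS."
  (with-input-from-string (stream (decode-utf8-octets octets))
    (funcall function stream)))
```

For example, `(call-with-decoded-stream body #'read-line)` would read the first line of an already-fetched `body`.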
The problematic code is in the xml-util.lisp file and is used in only one place: xml-parser.lisp.
The function performs kind of a string trimming:
(collapse-whitespace " a b c ") => "a b c"
(collapse-whitespace "aaaa a b c ") => "aaaa a b c"
P.S. It looks like the bit-wise operations are meant to detect the kind of Unicode character or character group (see here).
I think this code was written before Unicode/UTF support arrived in the primary Lisp implementations, so Ora had to employ these binary tricks. It could be written much more easily now...
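To make the "much easier now" point concrete, here is a minimal sketch of the trimming behaviour shown in the two examples above, written with ordinary character operations instead of bit manipulation. The name `collapse-whitespace-sketch` and the whitespace set are assumptions made for this example; this is not Wilbur's actual implementation, which may treat more characters as whitespace.

```lisp
;; Assumed, minimal whitespace set; the real function may use a different one.
(defun whitespace-char-p (char)
  "True if CHAR is one of a small set of ASCII whitespace characters."
  (member char '(#\Space #\Tab #\Newline #\Return #\Page #\Linefeed)))

;; Hypothetical replacement: trims both ends and squeezes each internal
;; whitespace run to a single space. All other characters, including
;; non-ASCII ones, are copied through unchanged.
(defun collapse-whitespace-sketch (string)
  "Trim STRING and replace each internal whitespace run with one space."
  (with-output-to-string (out)
    (let ((pending-space nil)   ; whitespace seen since the last word?
          (wrote-any nil))      ; any non-whitespace written so far?
      (loop for char across string
            do (cond ((whitespace-char-p char)
                      (setf pending-space t))
                     (t
                      ;; Emit one separator only between words, never
                      ;; before the first one.
                      (when (and pending-space wrote-any)
                        (write-char #\Space out))
                      (write-char char out)
                      (setf pending-space nil
                            wrote-any t)))))))
```

Because the loop works on Lisp characters rather than bytes, it relies on the implementation's reader or stream having already decoded the input.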
I had the same experience as @vityok.
I suspect that, given that Wilbur's whole external interface is weak and based on old limitations of libraries and Lisp implementations, rather than trying to fix Wilbur's Unicode support we should try to replace Wilbur's parser with the cl-rdfxml parser.
Something like the code below worked for me. I still have to improve it a lot and handle the blank-node instances created by cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).
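(defun puri-to-node (s)
  ;; Convert PURI URI objects to Wilbur nodes; pass anything else through.
  (if (typep s 'puri:uri)
      (w:node (puri:render-uri s nil))
      s))

(setf w:*db* (make-instance 'wilbur:db))

(defun parse-rdfxml (path)
  ;; The lambda receives the subject, predicate and object of each parsed triple.
  (cl-rdfxml:parse-document
   (lambda (a b c)
     (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
   path))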
What do you think? Is that a good direction? Of course we would add dependencies to Wilbur, but that, in my opinion, is good and follows the recent suggestion at http://fare.livejournal.com/169346.html
good evening, alex;
On 2013-01-10, at 21:19 , Alexandre Rademaker wrote:
> I suspect that if we consider that the whole Wilbur's external
> interface is weak and based on old limitations of libraries and
> lisp implementations, better than trying to fix the Wilbur's
> unicode support, we should try to replace the wilbur's parse with
> cl-rdfxml parse.
>
> Something like the code below worked for me. I still have to
> improve it a lot and handle the blank-nodes instances created by
> cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).
>
> (defun puri-to-node (s)
>   (if (eq (type-of s) 'puri:uri)
>       (w:node (puri:render-uri s nil))
>       s))
> (setf w:*db* (make-instance 'wilbur:db))
> (defun parse-rdfxml (path)
>   (cl-rdfxml:parse-document
>    (lambda (a b c)
>      (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
>    path))
>
> What you think? Is that a good direction? Of course we will add
> dependences to Wilbur but that, in my opinion, is good and follow
> recently suggestion http://fare.livejournal.com/169346.html
yes, in general fare is correct. the problem is, it is not always
clear which library is best.
i had tried to convince ora - way back then, that it would have been
better to use a common library, but he was not convinced.
i would suggest a different xml library to you, but it also has
dependencies and if cl-rdfxml actually supports the standard and
yields a coherent object model, then it would certainly be worth a
try. the minimum would be, that it use the current network libraries,
has portable or runtime unicode support, and permits to parse
straight to an rdf model without an intermediate dom.
what else?
Currently Wilbur works with in-memory RDF databases, but I've found that there are already efforts to create a persistence layer for Wilbur (see Wiki), and there is de.setf.resource, which offers some kind of persistence for RDF classes. I guess there are other Wilbur- or RDF-related persistence and query-processing projects to be found on GitHub (and probably more elsewhere on the web).
I guess that it would be very nice to bring some of them together to make a feature-rich RDF storage/processing engine.
P.S. Here, for example, is Twinql, a SPARQL engine built on top of Wilbur. But the project is not actively developed (according to its description), and it would be very unfortunate if it remained so...
Can we have a solution for this issue? Actually, for me it doesn't work with or without :external-format :utf-8.
Sorry @vityok, I just saw your PR https://github.com/lisp/de.setf.wilbur/pull/5 from 5 years ago. It looks like this repo is abandoned, so I will fork it. But how do we get Quicklisp updated? I opened an issue at https://github.com/quicklisp/quicklisp-projects/issues/1593
It looks like Wilbur has a problem with certain Unicode chars in certain circumstances.
Code to reproduce:
Produces an error on both CCL and SBCL:
But everything works fine if the external format is not specified:
Produces:
That can then be queried successfully.
The problem is even more evident when using flexi-streams.