lisp / de.setf.wilbur

a fork of net.sourceforge.wilbur updated for mcl and sbcl

Problem with some Unicode chars #4

Open vityok opened 11 years ago

vityok commented 11 years ago

It looks like Wilbur has a problem with certain Unicode chars in certain circumstances.

Code to reproduce:

  1. download RDF/XML data from DBpedia:
wget http://dbpedia.org/data/Semantic_Web.rdf
  2. parse with the external format explicitly defined:
(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input
             :external-format :utf-8))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

This produces an error on both CCL and SBCL:

> Error: Cannot decode this: (#\U+30BB #\U+30DE #\U+30F3 #\U+30C6 #\U+30A3 #\U+30C3 #\U+30AF #\U+30FB #\U+30A6 #\U+30A7 #\U+30D6)
> While executing: (:INTERNAL WILBUR::COLLAPSE WILBUR:COLLAPSE-WHITESPACE), in process listener(1).
debugger invoked on a SIMPLE-ERROR in thread
#<THREAD "main thread" RUNNING {AB2F861}>:
  Cannot decode this: (#\HANGUL_SYLLABLE_U #\HANGUL_SYLLABLE_KEU
                       #\HANGUL_SYLLABLE_RA #\HANGUL_SYLLABLE_I
                       #\HANGUL_SYLLABLE_NA)
(WILBUR:COLLAPSE-WHITESPACE "우크라이나")

But everything works fine if the external format is not specified:

(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

Produces:

#<TEMPORARY-PARSER-DB size 157 #x1862A5C6>

That then can be successfully queried.

The problem is even more evident when using flexi-streams.

lisp commented 11 years ago

when parsed without explicit encoding, are the string literals in the store then correct, or is the text corrupt?

vityok commented 11 years ago

Yep, the interesting thing is that without an explicit encoding it works just fine. The problem happens when the encoding is specified. Specifying the encoding can be avoided when reading from a file, but it must be specified in order to parse the byte array returned by Drakma's http-request with flexi-streams.

vityok commented 11 years ago

The problematic code is in the xml-util.lisp file and is used only in one place: xml-parser.lisp.

The function performs a kind of string trimming:

(collapse-whitespace "    a b c ")   => "a b c"

(collapse-whitespace "aaaa    a b c ") => "aaaa a b c"

P.S. It looks like the bit-wise operations are meant to detect the kind of Unicode character or character group, i.e. see here.
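One plausible reading of those bit tests, sketched here as an assumption and not quoted from xml-util.lisp, is that they treat each character code as a byte and classify it as a UTF-8 lead byte. The function name `utf-8-lead-byte-length` is hypothetical. If that reading is right, a stream that delivers one character per byte would pass the tests, while an already decoded character such as U+30BB matches no lead-byte pattern, which would be consistent with the "Cannot decode this" error above.

```lisp
;; Hypothetical illustration of UTF-8 lead-byte classification with bit masks.
;; This is NOT Wilbur's code; it only shows the general technique.
(defun utf-8-lead-byte-length (code)
  "Return the length of the UTF-8 sequence announced by CODE when CODE is
read as a lead byte, or NIL when CODE matches no lead-byte pattern."
  (cond ((zerop (logand code #b10000000)) 1)          ; 0xxxxxxx: ASCII
        ((= (logand code #b11100000) #b11000000) 2)   ; 110xxxxx
        ((= (logand code #b11110000) #b11100000) 3)   ; 1110xxxx
        ((= (logand code #b11111000) #b11110000) 4)   ; 11110xxx
        (t nil)))

;; (utf-8-lead-byte-length #xEC)   ; => 3, first byte of a three-byte sequence
;; (utf-8-lead-byte-length #x30BB) ; => NIL, a decoded code point, not a lead byte
```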

I think that this code was written before the primary Lisp implementations supported Unicode/UTF, and therefore Ora had to employ these binary tricks. It can be written much more simply now...
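A minimal sketch of what such a simpler version could look like, assuming the implementation already hands the parser decoded characters. The names `xml-whitespace-char-p` and `collapse-whitespace-sketch` are hypothetical, and this is not the code in xml-util.lisp nor a fix taken from any pull request. It only reproduces the trimming behaviour shown in the two examples above.

```lisp
;; Hypothetical sketch: works on characters only, with no byte-level decoding.
(defun xml-whitespace-char-p (char)
  "True for space, tab, line feed and carriage return."
  (member (char-code char) '(32 9 10 13)))

(defun collapse-whitespace-sketch (string)
  "Trim STRING and squeeze each internal run of whitespace to one space.
Every other character, including non-ASCII ones, is copied unchanged."
  (with-output-to-string (out)
    (let ((pending-space nil)   ; a whitespace run was seen after some text
          (seen-text nil))      ; at least one non-whitespace char was written
      (loop for char across string
            do (cond ((xml-whitespace-char-p char)
                      (when seen-text (setf pending-space t)))
                     (t
                      ;; Emit the delayed space only when more text follows,
                      ;; so trailing whitespace is dropped.
                      (when pending-space
                        (write-char #\Space out)
                        (setf pending-space nil))
                      (write-char char out)
                      (setf seen-text t)))))))

;; (collapse-whitespace-sketch "    a b c ")     ; => "a b c"
;; (collapse-whitespace-sketch "aaaa    a b c ") ; => "aaaa a b c"
```

Because the sketch never inspects bit patterns, a string such as "우크라이나" passes through untouched, whatever external format the stream was opened with.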

arademaker commented 11 years ago

I had the same experience as @vityok.

arademaker commented 11 years ago

I suspect that, given that Wilbur's whole external interface is weak and based on old limitations of libraries and Lisp implementations, we would do better to replace Wilbur's parser with the cl-rdfxml parser than to try to fix Wilbur's Unicode support.

Something like the code below worked for me. I still have to improve it a lot and handle the blank-node instances created by cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).


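;; Convert a PURI URI into a Wilbur node; pass anything else through unchanged.
(defun puri-to-node (s)
  (if (typep s 'puri:uri)
      (w:node (puri:render-uri s nil))
      s))

(setf w:*db* (make-instance 'wilbur:db))

;; Parse the RDF/XML at PATH with cl-rdfxml, adding each emitted triple to w:*db*.
(defun parse-rdfxml (path)
  (cl-rdfxml:parse-document
   (lambda (a b c)
     (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
   path))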

What do you think? Is that a good direction? Of course we will add dependencies to Wilbur, but that, in my opinion, is good and follows the recent suggestion in http://fare.livejournal.com/169346.html

lisp commented 11 years ago

good evening, alex;

On 2013-01-10, at 21:19 , Alexandre Rademaker wrote:

I suspect that, given that Wilbur's whole external interface is weak
and based on old limitations of libraries and Lisp implementations,
we would do better to replace Wilbur's parser with the cl-rdfxml
parser than to try to fix Wilbur's Unicode support.

Something like the code below worked for me. I still have to
improve it a lot and handle the blank-node instances created by
cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).

(defun puri-to-node (s)
  (if (typep s 'puri:uri)
      (w:node (puri:render-uri s nil))
      s))

(setf w:*db* (make-instance 'wilbur:db))

(defun parse-rdfxml (path)
  (cl-rdfxml:parse-document
   (lambda (a b c)
     (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
   path))

What do you think? Is that a good direction? Of course we will add
dependencies to Wilbur, but that, in my opinion, is good and follows
the recent suggestion in http://fare.livejournal.com/169346.html

yes, in general fare is correct. the problem is, it is not always
clear which library is best.

i had tried to convince ora, way back then, that it would have been
better to use a common library, but he was not convinced.

i would suggest a different xml library to you, but it also has
dependencies, and if cl-rdfxml actually supports the standard and
yields a coherent object model, then it would certainly be worth a
try.

the minimum would be that it uses the current network libraries,
has portable or runtime unicode support, and permits parsing
straight to an rdf model without an intermediate dom.

what else?


vityok commented 11 years ago

Currently Wilbur works with in-memory RDF databases, but I've found that there are already efforts to create a persistence layer for Wilbur (see Wiki) and there is de.setf.resource that offers some kind of persistence for RDF classes. I guess that there are other Wilbur or RDF-related persistence and query-processing projects that can be found even on GitHub (and probably there are more in the rest of the WWW).

I guess that it would be very nice to bring some of them together to make a feature-rich RDF storage/processing engine.

P.S. Here, for example, is Twinql, a SPARQL engine built on top of Wilbur. But the project is not actively developed (according to the description), and it would be very unfortunate if it remained so...

arademaker commented 6 years ago

Can we have a solution for this issue? Actually, for me it doesn't work with or without the :external-format :utf-8.

arademaker commented 6 years ago

Sorry @vityok, I just saw your PR https://github.com/lisp/de.setf.wilbur/pull/5 from 5 years ago. It looks like this repo is abandoned, so I will fork it. But how do we get Quicklisp updated? I opened an issue at https://github.com/quicklisp/quicklisp-projects/issues/1593