lisp / de.setf.wilbur

a fork of net.sourceforge.wilbur updated for mcl and sbcl

performance : Wilbur vs RDFLib #11

Closed arademaker closed 5 years ago

arademaker commented 5 years ago

Same task, similar code. To use RDFLib I had to convert the file to N-Triples first, but apart from that, Wilbur took ~1 hour and RDFLib did the same job in ~1 minute.

  1. https://gist.github.com/arademaker/fe6b31d25f12eb307ed6cbea4395a357 and
  2. https://gist.github.com/arademaker/dcedb56952f5aa014c6729211cdb2540

Any ideas? How can I investigate this difference?

```
$ rapper -c opennlp/Dissertation.pdf.rdf
rapper: Parsing URI file:///Users/ar/work/papers/opennlp/Dissertation.pdf.rdf with parser rdfxml
rapper: Parsing returned 865058 triples

$ rapper -o ntriples -i rdfxml opennlp/Dissertation.pdf.rdf > lixo.ntriples
$ time python3.7 rdf-to-json.py lixo.ntriples lixo.json

real    0m59.568s
user    0m58.309s
sys    0m0.830s

$ sbcl --noinform --noprint --eval "(load \"rdf-to-json.lisp\")" --eval "(main (nth 1 sb-ext:*posix-argv*) (nth 2 sb-ext:*posix-argv*))" --eval "(sb-ext:quit)" opennlp/Dissertation.pdf.rdf lixo.json

real    54m37.053s
user    54m18.341s
sys    0m7.938s
```

lisp commented 5 years ago

good morning;

you are comparing an historic artefact with an rdf environment which is under active development. a recent comment here was that this repository has been abandoned. that perspective is inappropriate. this library is archived here, in order that it not disappear, as it is one of the first rdf implementations. it is not a production-stage library.

arademaker commented 5 years ago

Hi, thank you for the comment. Just for the record: after changing one line of code I got much better results, by using the indexed-db instead of the default one.

https://github.com/lisp/de.setf.wilbur/blob/master/src/core/rdf-parser.lisp#L165-L167


```
CL-USER> (time (main "opennlp/Dissertation.pdf.rdf" "lixo.json"))
Evaluation took:
  89.355 seconds of real time
  89.076178 seconds of total run time (88.162266 user, 0.913912 system)
  [ Run times consist of 1.841 seconds GC time, and 87.236 seconds non-GC time. ]
  99.69% CPU
  259,485,147,800 processor cycles
  1,581,606,192 bytes consed
```

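
Why switching the db class can matter this much can be sketched outside Wilbur. The following is a minimal, language-neutral illustration in Python, not Wilbur's code: `ListStore`, `IndexedStore`, and `build` are hypothetical names, and it assumes the default db answers each query by scanning all triples while the indexed one keeps a hash table keyed by subject. Under that assumption, a conversion that issues one query per subject grows quadratically with the number of triples on the unindexed store and roughly linearly on the indexed one.

```python
from collections import defaultdict


class ListStore:
    """Hypothetical unindexed store: every query scans all triples."""

    def __init__(self):
        self.triples = []

    def add(self, s, p, o):
        self.triples.append((s, p, o))

    def query(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position.
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


class IndexedStore(ListStore):
    """Hypothetical indexed store: a subject index narrows the scan."""

    def __init__(self):
        super().__init__()
        self.by_subject = defaultdict(list)

    def add(self, s, p, o):
        super().add(s, p, o)
        self.by_subject[s].append((s, p, o))

    def query(self, s=None, p=None, o=None):
        if s is None:
            # No subject given: fall back to the full scan.
            return super().query(s, p, o)
        # .get() avoids creating empty index entries for unknown subjects.
        return [t for t in self.by_subject.get(s, ())
                if (p is None or t[1] == p)
                and (o is None or t[2] == o)]


def build(store_cls, n):
    """Fill a store with two made-up triples per subject."""
    store = store_cls()
    for i in range(n):
        store.add(f"s{i}", "rdf:type", "Token")
        store.add(f"s{i}", "ex:index", i)
    return store
```

Both stores return the same answers; only the cost per query differs. Timing `store.query(s=...)` for every subject on `build(ListStore, n)` versus `build(IndexedStore, n)` with growing `n` is one way to see the gap widen.
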
lisp commented 5 years ago

if you make a pull request for that, i can merge it.

gibsonf1 commented 5 years ago

I just wanted to mention that I am using Wilbur as a primary in-memory RDF db (drawing from AllegroGraph) for a production application in development: https://graphmetrix.net ( http://graphmetrix.com )

So in my case, the library is very much alive and I do appreciate any updates (and I may be offering some ideas as well as time goes on)

arademaker commented 5 years ago

@gibsonf1 thank you for letting me know. Are you using the temporary-parser-db or the indexed-db? Have you had any other issues, and did you have to adapt the code somehow? I am planning to fork this repo to start contributing. The first priority would be to add a more robust parser, possibly using the SAX parser from https://common-lisp.net/project/cxml/.

arademaker commented 5 years ago

@lisp as always, we never know if a Common Lisp project is abandoned or only feature complete! ;-)

lisp commented 5 years ago

if one of you works actively with this, it would make sense to move that work to a repository which you control. my attention is devoted to dydra, which is also lisp, and does rely on some of the repositories in this account, but not this one.

binghe commented 5 years ago

@lisp Hi, I hope you can still "control" all de.setf lisp packages; I personally find it convenient to find them on your (lisp) GitHub page. Besides, merging some PRs should NOT take you a lot of time.

lisp commented 5 years ago

in the case of the wilbur repository, it seems like there are other parties who should be more directly involved and in a better position to judge changes. in that case, they should control merges.

for other repositories - those which i have in active use, i certainly would like to have pull requests, but i very very very infrequently get any.

arademaker commented 5 years ago

@lisp I have a few more changes from the last few days, and after spending some time digging into it, the Wilbur code does not seem so complicated. But many parts certainly deserve modernization and modularization. As I said before, the SAX parser could be more robust, and the internal representation of triples could be more compact. I plan to have students working on that under my supervision, and I would be happy to take care of managing the eventual PRs. Currently, the main problem is the license, which may not be adequate for some projects.

Given all the above considerations, I could fork your repo and keep working on it, maybe at some point asking Zach to evaluate which repository should be maintained in the Quicklisp distribution (see https://github.com/quicklisp/quicklisp-projects/issues/1593), or you could transfer the repo to my account.

But we also have other points to consider. First, many people like to have your repositories as references because you got an excellent GitHub username! ;-) Second, you have created a new package from the Wilbur source with a new name, de.setf.wilbur, and made changes to Ora's source code that I haven't had a chance to investigate (I hope all changes are in the repo history). Some broken parts still intrigue me, like the function parse-db-from-file, which is exported from the Wilbur package but is not defined.

arademaker commented 5 years ago

Searching for wilbur on GitHub and filtering by the Common Lisp language, I found four repositories: https://github.com/search?l=Common+Lisp&q=wilbur&type=Repositories. At least one of them seems to have some ideas for performance improvement. Funny how the Lisp ecosystem works! ;-)

lisp commented 5 years ago

you can fork it or manage requests and marshal them to me, however you think best benefits its use. i do not expect to change the license as that is how i inherited it. (note that i have corrected the link to the original source.)