ruby-rdf / linkeddata

A metadistribution of RDF.rb including all parsing/serialization plugins.
http://rubygems.org/gems/linkeddata
The Unlicense

performance issue #13

Closed bertvannuffelen closed 4 years ago

bertvannuffelen commented 4 years ago

Hi,

We have a CircleCI build running, and for the past few days we have observed that conversion to Turtle and RDF/XML has become much slower. Conversion to N-Triples works fine.

In our CircleCI setup (see https://github.com/Informatievlaanderen/Data.Vlaanderen.be/blob/test/.circleci/config.yml), we do the following:

gem install linkeddata 
...
rdf serialize --input-format turtle --output-format rdfxml /tmp/workspace/ttl/${model}.ttl -o /tmp/workspace/voc/${model}.rdf

This installs the latest version, 3.1.0, of the linkeddata gem. In another variant of the CircleCI flow we convert JSON-LD to other formats, and there too we see a performance degradation when converting to Turtle and RDF/XML. For this variant, a list of 50 files is converted to N-Triples in 1 minute, whereas conversion to Turtle takes 90 minutes.

Is there a change in the serializers that could explain this big difference?

I tried to set up a configuration with an older version of the rdf-turtle gem, but I have not succeeded yet: the latest version, 3.1.0, is always present, so I assume that version is the one used for the Turtle conversion.

gkellogg commented 4 years ago

The only change to the serializers has been to make small changes to satisfy Ruby 2.7 calling conventions, which should not have an impact.

The expensive part of both Turtle and RDF/XML serializers is doing a topological sort of subjects to figure out the best presentation order; this hasn’t changed in years. If you provide an example file exhibiting the problem, I can try to profile.
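
To illustrate the kind of ordering step described above, here is a minimal sketch using Ruby's standard TSort module. It is not the writers' actual code: the `SubjectGraph` class and the sample subjects are hypothetical, and the real serializers work on RDF statements rather than a plain hash.

```ruby
require 'tsort'

# Hypothetical stand-in for a graph of subjects: each key is a subject and
# its value lists the subjects it references. Not the real writer's data model.
class SubjectGraph
  include TSort

  def initialize(references)
    @references = references
  end

  # TSort needs these two methods to walk the graph.
  def tsort_each_node(&block)
    @references.each_key(&block)
  end

  def tsort_each_child(node, &block)
    @references.fetch(node, []).each(&block)
  end
end

graph = SubjectGraph.new(
  'ex:address' => ['_:b0'],
  '_:b0'       => ['_:b1'],
  '_:b1'       => []
)

# Referenced subjects come before the subjects that reference them.
order = graph.tsort
puts order.inspect
# => ["_:b1", "_:b0", "ex:address"]
```

Every subject and every reference has to be visited, so any extra cost incurred per visit is multiplied across the whole graph.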

Note that there is a streaming Turtle serializer that is likely much faster, but it may not always embed blank node statements as satisfactorily.

It would be interesting to pin to the 3.0 versions to see whether the difference is consistent.

bertvannuffelen commented 4 years ago

I wrote a test case script, see below. The processing is done in the circleci/ruby docker image.

I ran the script on the circleci cloud by ssh-ing into the box and then executing the test script. The results are:

2020-01-10 11:37:27 (39.1 MB/s) - ‘adres.jsonld’ saved [35978/35978]

ntriples

real    0m0.928s
user    0m0.568s
sys     0m0.360s
turtle

real    1m18.002s
user    0m32.219s
sys     0m45.780s
rdfxml

real    1m23.815s
user    0m36.275s
sys     0m47.538s

The script is

#!/bin/bash

wget https://github.com/Informatievlaanderen/OSLO-Generated/raw/test-feature-checkout/doc/vocabularium/adres/ontwerpdocument/2020-01-06/voc/adres.jsonld

echo "ntriples"
time rdf serialize --input-format jsonld --processingMode json-ld-1.1 adres.jsonld --output-format ntriples -o adres.nt
echo "turtle"
time rdf serialize --input-format jsonld --processingMode json-ld-1.1 adres.jsonld --output-format turtle -o adres.turtle
echo "rdfxml"
time rdf serialize --input-format jsonld --processingMode json-ld-1.1 adres.jsonld --output-format rdfxml -o adres.rdf

gkellogg commented 4 years ago

You may need to provide some more information. When I run it on my Mac, I get the following times:

ntriples

real    0m1.001s
user    0m0.708s
sys     0m0.289s
turtle

real    0m3.509s
user    0m2.206s
sys     0m1.294s
rdfxml

real    0m3.898s
user    0m2.583s
sys     0m1.306s

In particular, what Ruby version? What are the gem versions? I ran with the following:

ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin19]

rdf 3.1.0
rdf-turtle 3.1.0
rdf-rdfxml 3.1.0

If I avoid much of the "rdf" command infrastructure, and use the script/parse file in the Turtle gem, it's even faster:

time script/parse --format ttl examples/adres.nt -o /dev/null

Parsed 464 statements in 0.177042 seconds @ 2620.847030648095 statements/second.

real    0m0.405s
user    0m0.341s
sys     0m0.060s

Some of the time may come from activating gems, presuming you've installed the 'linkeddata' gem. You might try either doing this in a loop with a warmup, or reducing your gem dependencies.
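
A loop with a warmup could look roughly like the sketch below. The `time_with_warmup` helper is hypothetical, and the workload is a placeholder; in practice it would be replaced by the serialization under test (for example loading the file into an `RDF::Graph` and dumping it as Turtle, assuming the linkeddata gem is installed).

```ruby
require 'benchmark'

# Hypothetical helper: run the block once untimed so that one-time costs
# (gem activation, autoloading) are paid up front, then time `runs` more calls.
def time_with_warmup(runs: 5)
  yield # warmup run, not measured
  Array.new(runs) { Benchmark.realtime { yield } }
end

timings = time_with_warmup(runs: 3) do
  # Placeholder workload; swap in the conversion being measured.
  (1..10_000).reduce(:+)
end

puts timings.map { |t| format('%.6fs', t) }.join(', ')
```

If the warmup run is much slower than the timed runs, the cost is in loading rather than in the serializer itself.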

bertvannuffelen commented 4 years ago

I tested the script in two Docker containers:

1. docker run -it -v /home/oslo/github/perftest:/data circleci/ruby:2.6.5-stretch bash
2. docker run -it -v /home/oslo/github/perftest:/data circleci/ruby bash

Container 1 yields Ruby 2.6.5; container 2 yields Ruby 2.7.0.

It turns out that version 2.7.0 is slower than 2.6.5.

gkellogg commented 4 years ago

I'll need to do a profile of the writer code in either Turtle or RDF/XML to see where the time is going, but it seems like TSort might be a culprit. Don't see anything in a search for 2.7 slowdowns, though. You'd expect that code like that would benefit from a JIT, but maybe not.

Although, the script/parse isn't slower for 2.7, so something else must be going on relating to the rdf command.

gkellogg commented 4 years ago

I can say that it seems to have something to do with the way that gems are activated in 2.7 and a pattern of autoloading vocabularies in the rdf-vocab gem. If you create a Gemfile containing gem 'linkeddata' and run this using "bundle exec", it is much faster.

ntriples

real    0m0.821s
user    0m0.485s
sys     0m0.140s
turtle

real    0m2.405s
user    0m1.673s
sys     0m0.725s
rdfxml

real    0m2.801s
user    0m2.061s
sys     0m0.734s

gkellogg commented 4 years ago

@bertvannuffelen It turns out that the issue wasn't autoload, but differences in how Ruby 2.7 loads files using require. The RDF::Vocab.each method override was loading classes on each call. Now it forces an autoload on the first call only, which seems to bring performance back to where it is when using Bundler.
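
The general pattern of loading on the first call only can be sketched as follows. This is an illustration, not the rdf-vocab implementation: `VocabRegistry`, `load_all!` and `load_count` are hypothetical names, and the expensive load is simulated by a counter.

```ruby
# Hypothetical registry; names are illustrative and are not rdf-vocab's API.
class VocabRegistry
  attr_reader :load_count

  def initialize(names)
    @names = names
    @load_count = 0
    @loaded = false
  end

  # Simulates the expensive step (requiring every vocabulary file).
  def load_all!
    @load_count += 1
    @loaded = true
  end

  # Triggers the expensive load on the first call only.
  def each(&block)
    load_all! unless @loaded
    @names.each(&block)
  end
end

registry = VocabRegistry.new(%w[foaf dc skos])
3.times { registry.each { |name| name } }
puts registry.load_count
# => 1
```

Without the `@loaded` guard, every call to `each` would repeat the load, which is the shape of the slowdown described above.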

If you can, please check it out on the features/slow_load branch of rdf-vocab. If it looks good, I'll merge and release an update.