kasei / attean

A Perl Semantic Web Framework

performance on 8.8k triples (508kb turtle) ontology #154

Open VladimirAlexiev opened 4 years ago

VladimirAlexiev commented 4 years ago

(Extracted from #153).

https://github.com/VladimirAlexiev/soml/tree/master/owl2soml/eg#schemaorg describes running the script https://github.com/VladimirAlexiev/soml/tree/master/owl2soml on schema.org renditions (508k ttl, 730k rdf, 808k jsonld). The script produces 428k of YAML and is slow: 4 minutes for the ttl input (I have not yet been able to make it run on jsonld and rdf).

time perl ../owl2soml.pl -voc schema schema.ttl    > schema1.yaml
real    4m9.203s
user    0m0.000s
sys     0m0.094s

My code doesn't use SPARQL (for now), just Model -> subjects/properties/objects/holds and Iter -> next/elements. What's the easiest way to profile this code?

I suspect significant time is spent converting between Attean::IRI and URI (#151). I use the lazy option to defer IRI parsing, but there is no such option for URI.

sub iri ($) {
  # convert string or URI (returned by URI::NamespaceMap $MAP) to Attean::IRI
  my $uri = shift or return;
  Attean::IRI->new (value => ref($uri) ? $uri->as_string : $uri, lazy => 1)
}

sub uri ($) {
  my $iri = shift;
  URI->new (ref($iri) ? $iri->as_string : $iri);
}

The Turtle file is 8.8k triples, and RIOT takes 6s to convert it to N-Triples:

time riot -out ntriples schema.ttl | wc -l
8858

real    0m6.228s
user    0m0.169s
sys     0m0.479s

I wonder how long Attean would take for such a conversion...

kasei commented 4 years ago

Doing some quick profiling suggests to me that the bulk of the time is not spent in IRI, but in Type::Tiny. This is an area where I don't have a lot of intuition behind the performance, but I'll try to take a look.
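
For readers who want to reproduce this kind of measurement: a statement-level profiler such as Devel::NYTProf (run as `perl -d:NYTProf script.pl`, then `nytprofhtml`) is the usual tool for Perl. As a lighter first pass, the sketch below times coarse phases with the core Time::HiRes module. The `timed` helper and the two phases are hypothetical stand-ins, not code from owl2soml.pl or Attean.

```perl
use strict;
use warnings;
use Time::HiRes ();

my %elapsed;   # accumulated wall-clock seconds per label

# Run a code block and add its wall-clock time to the named bucket.
# The block is always called in list context.
sub timed {
    my ($label, $code) = @_;
    my $start  = Time::HiRes::time();
    my @result = $code->();
    $elapsed{$label} += Time::HiRes::time() - $start;
    return wantarray ? @result : $result[0];
}

# Hypothetical stand-ins for the phases of a script (e.g. parse, then walk the model).
my @rows  = timed(parse => sub { map { [ $_, $_ * 2 ] } 1 .. 1000 });
my $total = timed(walk  => sub { my $sum = 0; $sum += $_->[1] for @rows; $sum });

# Report each phase on STDERR so it does not mix with the script's real output.
printf STDERR "%-6s %.6fs\n", $_, $elapsed{$_} for sort keys %elapsed;
```

Wrapping the model-loading and model-walking phases separately in this way would at least show whether the time goes into parsing or into store lookups before reaching for a full profiler.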

kasei commented 4 years ago

Apologies. The performance issues I was seeing in Type::Tiny are a result of my having more aggressive (opt-in) type checking turned on.

VladimirAlexiev commented 4 years ago

@kasei can I help turning some of this off, to see how much performance will improve?

kasei commented 4 years ago

Turning what off? I'd be happy to see PRs on IRI or the Attean parsers to improve performance.

kasei commented 4 years ago

That being said, I think any changes that completely bypass the IRI validity checks will have to be opt-in in an obvious way that helps to indicate that it may cause problems elsewhere in Attean or related modules.

kasei commented 4 years ago

Also, in profiling the code I noticed that a lot of the time is spent not in the serialization but in the memory model/store code. This is an area where Attean has lagged behind RDF::Trine. Improvements to this code, or a port (or new implementation) of something like RDF::Trine::Store::DBI (based on SQLite or other) might have a bigger impact on performance than trying to avoid IRI parsing...

VladimirAlexiev commented 4 years ago

"Turn off" the optional Type checks. The IRI constructor (unlike URI) has a lazy option that I use. Re DBI or SQLite: but the number of triples in this case is very small, so a simple in-memory store should be fastest?

kasei commented 4 years ago

> The IRI constructor (unlike URI) has a lazy option that I use.

The lazy option just defers IRI component parsing until something is done with the IRI object (such as calling any of its accessors). This helps in cases where an IRI is constructed but never used (as in query evaluation, where lots of intermediate results do not end up in the final result set), but I suspect it would not help in your case, where you are constructing IRIs and then accessing their contents to re-serialize.
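
The deferral described above can be illustrated with a minimal sketch. `LazyIRI` is a hypothetical class written for this example (it is not Attean::IRI, and its regex is a crude stand-in for real IRI parsing); it only shows that construction is cheap and that the parse cost is paid on the first accessor call.

```perl
use strict;
use warnings;

package LazyIRI;   # hypothetical illustration, not Attean::IRI

our $PARSE_COUNT = 0;   # counts how often the expensive step runs

sub new {
    my ($class, %args) = @_;
    # Only the raw string is stored; nothing is parsed yet.
    return bless { value => $args{value} }, $class;
}

sub _components {
    my ($self) = @_;
    # Reuse the cached parse if an accessor was already called.
    return $self->{components} if $self->{components};
    $PARSE_COUNT++;
    my ($scheme, $rest) = $self->{value} =~ m{^([A-Za-z][A-Za-z0-9+.-]*):(.*)$}s;
    return $self->{components} = { scheme => $scheme, rest => $rest };
}

sub scheme { $_[0]->_components->{scheme} }   # triggers parsing
sub value  { $_[0]->{value} }                 # raw string, no parsing in this sketch

package main;

my @iris = map { LazyIRI->new(value => "http://schema.org/Thing$_") } 1 .. 3;
my $after_construct = $LazyIRI::PARSE_COUNT;   # no accessor called yet

my @schemes = map { $_->scheme } @iris;        # every object is parsed here
my $after_access = $LazyIRI::PARSE_COUNT;
```

In this sketch, a workload that touches every object through an accessor ends up parsing every one of them, so laziness only postpones the cost rather than removing it.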

> Re DBI or SQLite: but the number of triples in this case is very small, so a simple in-memory store should be fastest?

Not necessarily. The memory store in Attean is a trivial implementation, but even though it's all in-memory and you are using a small dataset, something like SQLite might be faster just as a result of working with native datatypes (for example). I'm not guaranteeing that such an implementation would be faster, but work in this area (whether on a more optimized in-memory store, or on a bridge to something like SQLite) would certainly improve performance in this sort of use case.

VladimirAlexiev commented 3 years ago

I should add timings from a Java (rdf4j) reimplementation that we have. AFAIR it's 10x faster.

kasei commented 3 years ago

I've pushed a beta version of AtteanX::Store::LMDB to CPAN and, together with some minor performance improvements in Attean (unreleased, available via GitHub for now), saw a large performance improvement on your owl2soml.pl code. To make it act like a memory store, initialize it like this:

use File::Temp qw(tempdir);
my $path = tempdir(CLEANUP => 1);
my $store = Attean->get_store('LMDB')->new(filename => $path, initialize => 1);

There are probably still improvements to be had with an actually lazy implementation of IRI, but I'd be interested to hear how a more performant store impacts your use cases.

VladimirAlexiev commented 3 years ago

Thanks! I'll try it soon. I have another case: on 3.5Mb of IEC CIM (ENTSOE CGMES) ontologies the current version takes 80 min.

kasei commented 3 years ago

@VladimirAlexiev is that 3.5Mb file available somewhere? I'd be happy to give it a try and profile the run to see where else might benefit from improvements.

kasei commented 3 years ago

@VladimirAlexiev following up on the mention of the LMDB store, I just noticed that it requires manually installing LMDB as a system library. I had thought it was built into the LMDB_File module, but that seems not to be the case. I think it's still the best solution right now for performant use, but obviously it might be an issue in some environments. I'll try to have a look at some of the more portable store options for improving performance (either improving the memory store or porting the SQLite store from RDF::Trine).

kasei commented 3 years ago

@VladimirAlexiev It turns out I had the SQLite code sitting around unreleased, which I've now pushed to CPAN. So if the system library installation for LMDB is problematic, AtteanX::Store::DBI will probably be the next best option. To get a temporary SQLite store (in-memory only), do this:

our $store = Attean->get_store('DBI')->temporary_store();