opencaesar / owl-tools

A set of analysis tools for OWL
Apache License 2.0

Load OWL Database using tdbloader #58

Open dwagmuse opened 2 months ago

dwagmuse commented 2 months ago

User Story

As noted in [this ticket](https://github.com/opencaesar/owl-tools/issues/57), the loading performance of the owl-load task is rather poor when the database is large. Fuseki can load the data at startup, and experiments have shown that this is several orders of magnitude faster. So what we need is a new Gradle plugin that can load the database from a set of OWL files.

Detailed Description

The intent is that this step would replace owl-load in a workflow. The omlToOwl step already puts all of the OWL files in the build/owl folder. The Jena tdbloader can create a TDB database in the filesystem (before Fuseki starts) given a list of OWL files. So what we need this plugin adapter to do is take the build/owl folder as one argument and perhaps the .fuseki folder as the other. It needs to enumerate all of the OWL files in the given folder and then build the TDB database from those files.
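A rough sketch of the enumerate-then-load step described above, in Python for brevity (the real plugin would be a Gradle task). The function names are hypothetical, the `.owl` file extension is an assumption about what omlToOwl emits, and `tdbloader` is assumed to be on the PATH; `--loc` is the Jena TDB loader's option for the database directory. Union graph handling is not covered here.

```python
import subprocess
from pathlib import Path


def build_tdbloader_command(owl_dir, tdb_dir, loader="tdbloader"):
    """Return the argv list that loads every .owl file under owl_dir into tdb_dir.

    Hypothetical helper: the ".owl" extension and the loader name are assumptions.
    """
    # Walk the folder recursively and sort so the command is reproducible.
    owl_files = sorted(str(p) for p in Path(owl_dir).rglob("*.owl"))
    if not owl_files:
        raise FileNotFoundError(f"no .owl files found under {owl_dir}")
    # One invocation: database location first, then every input file.
    return [loader, "--loc", str(tdb_dir), *owl_files]


def load_database(owl_dir, tdb_dir):
    """Run the loader once; must happen before Fuseki opens the database."""
    subprocess.run(build_tdbloader_command(owl_dir, tdb_dir), check=True)
```

With the folders named in this ticket, `build_tdbloader_command("build/owl", ".fuseki")` would produce the single loader invocation; where exactly under .fuseki the database should live is left open.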

Our configuration uses a union graph, and Maged says we also need to load data into the union graph. If the loader can't get that detail from the Fuseki config file (fuseki.ttl), then it should simply be an option for the plugin. I don't want to have to call the loader twice; one invocation should do all of the loading.

(I expect that this would obsolete owl-load: I can't think of any use case where we would want the slow method if the fast method works, nor any where we would want to build the database and then load more OWL files.)

Acceptance Criteria

Sub-task List

dwagmuse commented 2 months ago

This works once we publish owl-tools 2.11, which should run under JRE 11 or JRE 17.

dwagmuse commented 1 month ago

Verified in the clipper workflow that this works.

Note that creating the TDB database in the build folder adds several gigabytes of binary file data to the build folder. We don't need or want to save this data to the normalized or auxiliary branch, so the simplest thing to do is delete it at the end of the build (and maybe also list it in .gitignore).

We also need to make sure that the container volumes are big enough to hold this data.
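For the two housekeeping points above, something along these lines could run around the load step. This is only a sketch using the Python standard library; the helper names are hypothetical, and the database path and size threshold are whatever the workflow chooses.

```python
import shutil
from pathlib import Path


def has_room_for_database(volume_path, required_bytes):
    """Return True if the volume holding volume_path has enough free space."""
    return shutil.disk_usage(volume_path).free >= required_bytes


def remove_database(tdb_dir):
    """Delete the generated database folder; return True if anything was removed."""
    tdb_path = Path(tdb_dir)
    if tdb_path.is_dir():
        shutil.rmtree(tdb_path)
        return True
    return False
```

Calling `has_room_for_database` before the load lets the build fail early on an undersized container volume, and `remove_database` at the end keeps the binary data out of whatever gets committed.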