marklogic / marklogic-jena

Adapter for using MarkLogic with the Jena RDF Framework

How to retrieve an RDFList? #77

Open AlexTo opened 5 years ago

AlexTo commented 5 years ago

Hi, I am using the Jena function model.createList(...) to create an RDFList.

That leaves me with the following triples in MarkLogic:

<:subj1>                  <has-list>      <_:bnode-some-uuid-1>
<_:bnode-some-uuid-1>    <rdf:first>      <:Alex>
<_:bnode-some-uuid-1>    <rdf:rest>      <_:bnode-some-uuid-2>
<_:bnode-some-uuid-2>    <rdf:first>      <:Bob>
<_:bnode-some-uuid-2>    <rdf:rest>      <_:bnode-some-uuid-3> 
....

Now, if I want to retrieve the list the way I normally do in Jena TDB2, I would query as follows:

val stmts = model.listStatements(subj, hasList, null)
val stmt = stmts.next()
val list = stmt.getObject.as(classOf[RDFList])

However, since the list head is a blank node, this does not work in MarkLogic and fails with the following exception:

org.apache.jena.shared.JenaException: Cannot convert node 99c61fc0120934c355bd035658b0cbad to RDFList

The UUID in the exception is different every time, so MarkLogic Jena probably generates a new identifier for the blank node on each query.

So my question is: how should I deal with RDFList in MarkLogic Jena? I am thinking of converting all the blank nodes to named resources before saving to MarkLogic, but I am not sure how to do that cleanly, because model.createList(...) simply creates the blank nodes and writes them straight to the MarkLogic graph.

Thank you

Regards

ehennum commented 5 years ago

Blank nodes are convenient for in-memory data, but they pose challenges for persistent data because (as you note) blank nodes by definition don't have stable identity, which means they can't be indexed or addressed in a durable and atomic way.

For that reason, I very much agree with your idea of modelling the data in a different way that works both for in memory and persistent contexts.

One approach would be to replace the list with multiple relations, one from the subject to each item that would otherwise have been a list member.

If order is important, you can create a resource for each item and give that resource one literal property holding the item's sequence number and a separate literal property holding the item's value.

That way, it's possible to find and update the value of the 2nd item (for example) independent of all other items.
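A minimal sketch of the sequence-number model in Scala with the Jena API, assuming Scala 2.13 and that the subject is an IRI resource. The `http://example.org/` namespace and the property names `hasItem`, `position` and `value` are made up for illustration; they are not part of Jena or marklogic-jena.

```scala
import org.apache.jena.rdf.model.{Model, ModelFactory, Resource}
import scala.jdk.CollectionConverters._

object OrderedItems {
  // Hypothetical namespace; substitute your own vocabulary.
  val NS = "http://example.org/"

  /** Writes one named resource per item, carrying its position and its value. */
  def addOrderedItems(model: Model, subj: Resource, values: Seq[String]): Unit = {
    val hasItem  = model.createProperty(NS, "hasItem")
    val position = model.createProperty(NS, "position")
    val value    = model.createProperty(NS, "value")
    values.zipWithIndex.foreach { case (v, i) =>
      // A stable IRI instead of a blank node, so the item can be addressed later.
      val item = model.createResource(s"${subj.getURI}/item/${i + 1}")
      subj.addProperty(hasItem, item)
      item.addLiteral(position, (i + 1).toLong)
      item.addProperty(value, v)
    }
  }

  /** Reads the item values back, sorted by their sequence number. */
  def readOrderedItems(model: Model, subj: Resource): Seq[String] = {
    val hasItem  = model.createProperty(NS, "hasItem")
    val position = model.createProperty(NS, "position")
    val value    = model.createProperty(NS, "value")
    model.listObjectsOfProperty(subj, hasItem).asScala.toSeq
      .map(_.asResource)
      .sortBy(_.getProperty(position).getLong)
      .map(_.getProperty(value).getString)
  }
}
```

Because each item has its own IRI, updating the 2nd item only touches the triples of that one resource.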

The expensive operation in the sequence-number approach is inserting a new item (because the sequence number of every subsequent item has to be incremented). If insertions within a list are a frequent operation, a linked-list approach with relations between items might be better. As your example shows, RDF itself takes a linked-list approach for expressing an RDFList in triples, but with the disadvantage of transient resource identifiers for the items.

Traversing the items in code is less convenient than with RDFList, but it could be encapsulated within a function without incurring the need to convert to RDFList.

Could either of those models work for the data in this case?

AlexTo commented 5 years ago

Those are great approaches. I was actually thinking of doing one of the following, because I only need a "bag", not a "list":

  1. Have multiple triples, one triple per item, like this:

    <subj>    <hasItem>    <item1>
    <subj>    <hasItem>    <item2>

    Easy and straightforward. Not sure what the shortcomings are, though.

  2. Create an in-memory model. Use Jena RDFList to create the list in the in-memory model. Go through all the statements and replace each blank node with a named resource node. Finally, add the in-memory model to the MarkLogic Dataset graph.
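Option 2 could be sketched roughly as follows, assuming Scala 2.13 and Jena's `ResourceUtils.renameResource` helper. The `http://example.org/list-cell/` namespace for the minted IRIs is a placeholder, and the final step of adding the model to the MarkLogic dataset graph is left out. The sketch only renames blank nodes that appear as subjects, which covers every cell of an RDFList.

```scala
import java.util.UUID
import org.apache.jena.rdf.model.{Model, ModelFactory, RDFList, RDFNode}
import org.apache.jena.util.ResourceUtils
import scala.jdk.CollectionConverters._

object SkolemizeList {
  // Hypothetical namespace used to mint IRIs for former blank nodes.
  val CellNS = "http://example.org/list-cell/"

  /** Replaces every blank-node subject in `model` with a freshly minted IRI. */
  def nameBlankNodes(model: Model): Unit = {
    // Collect first: renaming mutates the model, so do not rename while iterating.
    val blanks = model.listSubjects().asScala.filter(_.isAnon).toList
    blanks.foreach(b => ResourceUtils.renameResource(b, CellNS + UUID.randomUUID()))
  }

  /** Builds the example list in memory and names its cells before any upload. */
  def buildExample(): Model = {
    val m       = ModelFactory.createDefaultModel()
    val subj    = m.createResource("http://example.org/subj1")
    val hasList = m.createProperty("http://example.org/", "has-list")
    val members = List[RDFNode](
      m.createResource("http://example.org/Alex"),
      m.createResource("http://example.org/Bob"))
    subj.addProperty(hasList, m.createList(members.asJava.iterator()))
    nameBlankNodes(m)
    m
  }
}
```

In the in-memory model, the renamed head still has rdf:first and rdf:rest triples, so `getObject.as(classOf[RDFList])` should still apply to it; whether the same holds after a round trip through MarkLogic would need to be checked.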

Thanks for your quick reply

ehennum commented 5 years ago

If you don't need ordering, I completely agree with that solution.

In fact, I would expect it's a better solution for the in-memory representation, too, because it more accurately reflects the semantics.

Of course, if the data has to be modelled as RDFList for some consumer, the transformation is possible.
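As a rough illustration of the unordered model and of the transformation mentioned above, assuming Scala 2.13: the object and method names here are hypothetical helpers, not marklogic-jena API. The RDFList is built in a separate, transient in-memory model, so its blank nodes never need to be persisted.

```scala
import org.apache.jena.rdf.model.{ModelFactory, Property, RDFList, RDFNode, Resource}
import scala.jdk.CollectionConverters._

object BagOfItems {
  /** Stores one `subj hasItem item` triple per member; no blank nodes are created. */
  def addItems(subj: Resource, hasItem: Property, items: Seq[RDFNode]): Unit =
    items.foreach(item => subj.addProperty(hasItem, item))

  /** Reads the members back; the order is unspecified because this is a bag. */
  def items(subj: Resource, hasItem: Property): Set[RDFNode] =
    subj.listProperties(hasItem).asScala.map(_.getObject).toSet

  /** Builds a transient RDFList in a fresh in-memory model for a consumer that requires one. */
  def toRdfList(members: Iterable[RDFNode]): RDFList =
    ModelFactory.createDefaultModel().createList(members.iterator.asJava)
}
```

Since the bag carries no order, the caller decides how to order the members before passing them to `toRdfList`.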