rdfhdt / hdt-java

HDT Java library and tools.
Other
94 stars 68 forks source link

Ability to search over triples with their IDs #141

Closed AlyHdr closed 2 years ago

AlyHdr commented 2 years ago

We introduce a new feature to be able to search over the triples and get an ID of the found list or one hit of triples, as well we provide a function find(long id) over the triples which basically takes an index or ID in the list of triples and looks up in the trees in a bottom up fashion to find the given tripleID.

This code doesn't change the original hdt-api but adds two extra functions:

For the search with ID method all the modifications are over the bitmap triples iterators of the different query patterns, all of the modifications doesn't cost any performance issue because we take into account the indexes used to find the matching triple patterns and we use the index of Z (the object in this case) as an identifier of the triple or the index in the list.

P.S: There is only one case when we use BitmapTriplesIteratorZFOQ to search over a fixed object we need to do a binary search over the extra object list in order to find the position of the given object z.

Regarding the find triple by ID, we go up in the tree using the indexes and rank1 operations over the bitmaps:

1) First we use the given ID or index of the triple and find the corresponding object ID from the sequence of objects. 2) Get the position of the predicate by doing a rank1 operation over the bitmapZ using the object's position. (basically count the ones up to the given position) 3) Find the predicate ID from the inferred predicate position using the sequence of predicates. 4) Use the position of the predicate to get the position of the subject exactly by doing the same rank1 but over bitmapY 5) Get the subject from the subjects sequence using the position got in (4)

Screenshot 2022-01-13 at 16 02 58

example: find(3) in the above figure. triples list: 1 4 5, pos = 1 2 2 1, pos = 2 2 3 3, pos = 3 2 3 4, pos = 4 2 5 2, pos = 5 3 1 1, pos = 6 4 1 1, pos = 7 solution: z = seqZ.get(3-1) = 3 ( id at position 2), posY = bitmapZ.rank1(3 -1) = 2, y = seqY.get(posY) = 3, posX = bitmapY.rank1(2 - 1) = 1, x = seqX.get(posX) = 2, then => x = 2, y =3, z =3

Triple (2 3 3) is at position 3 in the list.

All those operations must be done in constant time because we just use some binary operations to access and find the positions of the triple.

A test has been added to make sure all the functionalities added are working over different triple patterns.

AlyHdr commented 2 years ago

Hello, can we merge this to 3.0.0 as well? Because we still have some other modifications based on this PR.

mielvds commented 2 years ago

Hi @AlyHdr , LGTM, but your PR removed a lot of the javadoc, was this intentional?

AlyHdr commented 2 years ago

Hello @mielvds, there was a problem with the java doc generation when compiling with java 11 I think and my IDE just reformated stuff, so it's not intentional at all...

mielvds commented 2 years ago

There's a jena4 branch which lifts everything to Java 11. I'm planning to include this in 2.3, so I will rebase the 3.0.0 branch on that branch as well. You'll have to probably align your PR's

mielvds commented 2 years ago

Branch has been rebased and is now java 11. Can you check the docs? It's mostly return types that are removed.

D063520 commented 2 years ago

I just saw that there is a branch where Javier tried already to implement this functionality:

https://github.com/rdfhdt/hdt-java/tree/triplePosition

we maybe need to compare and see what was his proposed solution .....

D063520 commented 2 years ago

In fact the API that he proposes is much cleaner ... instead of having a separate iterator:

IteratorTripleString searchWithId(CharSequence subject, CharSequence predicate, CharSequence object)

he adds a new method

public long getNextTriplePosition()

this is very elegant. I know that there might be performance problems with this. Making this logic work automatically for every iterator might be problematic. I will check Javiers implementation. Maybe he found a way to avoid performance problems

AlyHdr commented 2 years ago

We already tried first to have a method over the TripleString and TripleID classes to get the position, but that implies that we have to always find the position(i.e performance problems as @D063520 said) because the API doesn't support any flags or something, so one either have to add a new search method as it's done in the PR or do it as Javier did it in order to stay reverse compatible with the original API.

D063520 commented 2 years ago

But for example this one: https://github.com/rdfhdt/hdt-java/commit/8086aee1d473365fbf07d83531ffe282d778bee2#diff-fe04f5a5ab3ac989b85ada9c61d8b6689a5ad05b9d74c2bdede5fd07c14bcf72R214 basically it calls the search only if the method is called. I'm not 100% sure but it could work ....

AlyHdr commented 2 years ago

Yes it would be cool if this can be implemented in all the other iterators without the need to do a search...

D063520 commented 2 years ago

We skip this pull request in favour of the pull request #150 since it offers a more elegant solution on the API level with the same functionality