rdfhdt / hdt-java

HDT Java library and tools.
Other
94 stars 68 forks source link

Add HDT Diff method #153

Closed ate47 closed 2 years ago

ate47 commented 2 years ago

This pull request ask to add the new methods, HDTDiff and HDTDiffBits and export the Bitmaps to the API.

We can describe the HDTDiff with HDTCat, If we say that the HDTCat of 2 HDT a and b is the HDT a U b, then the HDTDiff of 2 HDT can be described as the HDT a - b.

API Changes

HDTManager

In the HDTManager, we now have 2 methods to run HDTDiff:

public static HDT diffHDT(String hdtFileName1, String hdtFileName2, HDTOptions hdtFormat, ProgressListener listener) throws IOException;

Create the diff of the HDT in hdtFileName1 - the HDT in hdtFileName2 inside the first HDT, the new HDT can be tune with the hdtFormat parameter.

public static HDT diffHDTBit(String location, String hdtFileName, Bitmap deleteBitmap, HDTOptions hdtFormat, ProgressListener listener) throws IOException;

Create a new HDT with location location from the HDT in hdtFileName minus the triples with position (see #150) marked on the delete bitmap, the new HDT can be tune with the hdtFormat parameter.

To use this method, I've moved the Bitmap classes from the CORE to the API.

Bitmaps

We now have 3 new class:

The bitmaps are the same as the previous ones in the CORE, I've updated the Javadoc to have a description on each method. In the BitmapFactory, we keep the same methods, but with the addition of some method to control the created Bitmap:

public static Bitmap createEmptyBitmap(long size);

Create an empty unmodifiable bitmap, all the accesses are the same with a bitmap filled with 0.

public static ModifiableBitmap createRWBitmap(long size);

Create a ModifiableBitmap with only the Bitmap.access(long), Bitmap.countOnes(), Bitmap.countZeros(), ModifiableBitmap.set(long, boolean), Bitmap.getNumBits(), Bitmap.getSizeBytes() and ModifiableBitmap.append(boolean) operations, so no rank and select need to be compute, avoiding to have to build an index. Useful for the diffHDTBit that doesn't require an indexed Bitmap.

HDT Diff

HDT Diff of 2 HDTs

We build a delete bitmap for the first HDT to remove all the triples of HDT2 in HDT1, for example, if we have:

HDT1

ex:aaa ex:pred ex:aaa .
ex:aaa ex:pred ex:bbb .
ex:aaa ex:pred ex:ccc .

and

HDT2

ex:aaa ex:pred ex:aaa .
ex:ddd ex:pred ex:ccc .

We are creating the delete bitmap 100, because only the 1st triple should be removed from HDT1 ex:aaa ex:pred ex:aaa.

Then the HDTDiff is calling the HDTDiffBit with HDT1 and this new delete bitmap.

HDT Diff of a HDT and a delete bitmap

We create a new HDT from the HDT with all the triple marked to be deleted deleted. HDTDiffBit will also recompute the shared subjects/objects to remove from shared elements, elements that aren't shared anymore, the Dictionary is preserved.

For example, if we have the HDT

Dict:
  Shared: ex:aaa ex:ccc
  Subj: ex:eee
  Pred: ex:pred
  Obj:  ex:bbb ex:ddd ex:fff
Triples:
  ex:aaa ex:pred ex:aaa .
  ex:aaa ex:pred ex:bbb .
  ex:aaa ex:pred ex:ccc .
  ex:ccc ex:pred ex:ccc .
  ex:ccc ex:pred ex:ddd .
  ex:eee ex:pred ex:fff .

and the delete bitmap 0101010, we will delete the triples:

ex:aaa ex:pred ex:bbb .
ex:ccc ex:pred ex:ccc .

The result HDT would be:

Dict:
  Shared: ex:aaa
  Subj: ex:eee
  Pred: ex:pred
  Obj:  ex:ccc ex:ddd ex:fff
Triples:
  ex:aaa ex:pred ex:aaa .
  ex:aaa ex:pred ex:ccc .
  ex:ccc ex:pred ex:ddd .
  ex:eee ex:pred ex:fff .

Notice that ex:ccc isn't shared anymore.

Tests

To test this features, I've added 4 tests files:

For each test file, I'm testing the HDT method for the 4 dictionary types:

For HDTDiff test, we are also checking if the asked dictionary is the same as the result HDTs.

D063520 commented 2 years ago

Thank you for this pull request!