This pull request ask to add the new methods, HDTDiff and HDTDiffBits and export the Bitmaps to the API.
We can describe the HDTDiff with HDTCat, If we say that the HDTCat of 2 HDT a and b is the HDT a U b, then the HDTDiff of 2 HDT can be described as the HDT a - b.
API Changes
HDTManager
In the HDTManager, we now have 2 methods to run HDTDiff:
Create a new HDT with location location from the HDT in hdtFileName minus the triples with position (see #150) marked on the delete bitmap, the new HDT can be tune with the hdtFormat parameter.
To use this method, I've moved the Bitmap classes from the CORE to the API.
Bitmaps
We now have 3 new class:
org.rdfhdt.hdt.compact.bitmap.Bitmap (Previously in the CORE, moved to the API)
org.rdfhdt.hdt.compact.bitmap.ModifiableBitmap (Previously in the CORE, moved to the API)
org.rdfhdt.hdt.compact.bitmap.BitmapFactory (Previously in the CORE, moved and updated to the API)
The bitmaps are the same as the previous ones in the CORE, I've updated the Javadoc to have a description on each method. In the BitmapFactory, we keep the same methods, but with the addition of some method to control the created Bitmap:
public static Bitmap createEmptyBitmap(long size);
Create an empty unmodifiable bitmap, all the accesses are the same with a bitmap filled with 0.
public static ModifiableBitmap createRWBitmap(long size);
Create a ModifiableBitmap with only the Bitmap.access(long), Bitmap.countOnes(), Bitmap.countZeros(), ModifiableBitmap.set(long, boolean), Bitmap.getNumBits(), Bitmap.getSizeBytes() and ModifiableBitmap.append(boolean) operations, so no rank and select need to be compute, avoiding to have to build an index. Useful for the diffHDTBit that doesn't require an indexed Bitmap.
HDT Diff
HDT Diff of 2 HDTs
We build a delete bitmap for the first HDT to remove all the triples of HDT2 in HDT1, for example, if we have:
We are creating the delete bitmap 100, because only the 1st triple should be removed from HDT1 ex:aaa ex:pred ex:aaa.
Then the HDTDiff is calling the HDTDiffBit with HDT1 and this new delete bitmap.
HDT Diff of a HDT and a delete bitmap
We create a new HDT from the HDT with all the triple marked to be deleted deleted. HDTDiffBit will also recompute the shared subjects/objects to remove from shared elements, elements that aren't shared anymore, the Dictionary is preserved.
This pull request ask to add the new methods, HDTDiff and HDTDiffBits and export the Bitmaps to the API.
We can describe the HDTDiff with HDTCat, If we say that the HDTCat of 2 HDT a and b is the HDT a U b, then the HDTDiff of 2 HDT can be described as the HDT a - b.
API Changes
HDTManager
In the HDTManager, we now have 2 methods to run HDTDiff:
Create the diff of the HDT in hdtFileName1 - the HDT in hdtFileName2 inside the first HDT, the new HDT can be tune with the hdtFormat parameter.
Create a new HDT with location location from the HDT in hdtFileName minus the triples with position (see #150) marked on the delete bitmap, the new HDT can be tune with the hdtFormat parameter.
To use this method, I've moved the Bitmap classes from the CORE to the API.
Bitmaps
We now have 3 new class:
org.rdfhdt.hdt.compact.bitmap.Bitmap
(Previously in the CORE, moved to the API)org.rdfhdt.hdt.compact.bitmap.ModifiableBitmap
(Previously in the CORE, moved to the API)org.rdfhdt.hdt.compact.bitmap.BitmapFactory
(Previously in the CORE, moved and updated to the API)The bitmaps are the same as the previous ones in the CORE, I've updated the Javadoc to have a description on each method. In the BitmapFactory, we keep the same methods, but with the addition of some method to control the created Bitmap:
Create an empty unmodifiable bitmap, all the accesses are the same with a bitmap filled with 0.
Create a ModifiableBitmap with only the Bitmap.access(long), Bitmap.countOnes(), Bitmap.countZeros(), ModifiableBitmap.set(long, boolean), Bitmap.getNumBits(), Bitmap.getSizeBytes() and ModifiableBitmap.append(boolean) operations, so no rank and select need to be compute, avoiding to have to build an index. Useful for the diffHDTBit that doesn't require an indexed Bitmap.
HDT Diff
HDT Diff of 2 HDTs
We build a delete bitmap for the first HDT to remove all the triples of HDT2 in HDT1, for example, if we have:
HDT1
and
HDT2
We are creating the delete bitmap
100
, because only the 1st triple should be removed from HDT1ex:aaa ex:pred ex:aaa
.Then the HDTDiff is calling the HDTDiffBit with HDT1 and this new delete bitmap.
HDT Diff of a HDT and a delete bitmap
We create a new HDT from the HDT with all the triple marked to be deleted deleted. HDTDiffBit will also recompute the shared subjects/objects to remove from shared elements, elements that aren't shared anymore, the Dictionary is preserved.
For example, if we have the HDT
and the delete bitmap
0101010
, we will delete the triples:The result HDT would be:
Notice that ex:ccc isn't shared anymore.
Tests
To test this features, I've added 4 tests files:
org.rdfhdt.hdt.compact.bitmap.BitmapFactoryTest
, test be can use the BitmapFactory and the new EmptyBitmap class.org.rdfhdt.hdt.triples.impl.BitmapTriplesIteratorDiffTest
, test the Iterator to build the delete bitmap from 2 HDTs.org.rdfhdt.hdt.hdtDiff.HdtDiffTest
, test the HDTDiff of generated HDT.org.rdfhdt.hdt.hdtDiff.HdtDiffStaticTest
, test the HDTDiff of nt files.For each test file, I'm testing the HDT method for the 4 dictionary types:
<http://purl.org/HDT/hdt#dictionaryFour>
dictionaryMultiObj
dictionaryFourBig
<http://purl.org/HDT/hdt#dictionaryFourPsfc>
For HDTDiff test, we are also checking if the asked dictionary is the same as the result HDTs.