v4ss4llo opened this issue 6 years ago
I'd like to help improve the rdf2hdt tool; in particular, I'd like to improve it (or write another conversion tool) so that it doesn't require terabytes of RAM to convert a file. I've read the W3C submission, so I think I kind of get it, but the document on the W3C website is not really that great, and after opening a simple .hdt file with a hex editor it's still pretty hard to figure out all the components involved in the binary format. Could anybody please point me to a clear description of the binary format? Or help me understand it?
I wonder though if there is anybody out there reading these issues?
Yes, there are people reading these issues.
Have you read these two documents?
In the FourSectionDictionary section it says: "As stated, this dictionary is specified by denoting the format to <http://purl.org/HDT/hdt#dictionaryPlain> in the Control information."
Is this correct, or is it a typo? Shouldn't it be <http://purl.org/HDT/hdt#dictionaryFour> instead?
I also have a question about the "BitmapTriples" encoding, not the theory but the way it's encoded by rdf2hdt.
So, I've encoded a simple file with only one triple, <http://example.org/A> <http://schema.org/name> "Alice" .
There's the header and other stuff, then the last bits of the file are the "BitmapTriples".
00000740 24 48 44 54 04 3c 68 74 74 70 3a 2f 2f 70 75 72 |$HDT.<http://pur|
00000750 6c 2e 6f 72 67 2f 48 44 54 2f 68 64 74 23 74 72 |l.org/HDT/hdt#tr|
00000760 69 70 6c 65 73 42 69 74 6d 61 70 3e 00 6f 72 64 |iplesBitmap>.ord|
00000770 65 72 3d 31 3b 00 59 e9 01 81 9b 01 52 d0 16 a0 |er=1;.Y.....R...|
00000780 01 81 9b 01 52 d0 16 a0 01 01 81 f0 01 52 d0 16 |....R........R..|
00000790 a0 01 01 81 f0 01 52 d0 16 a0 |......R...|
The documentation says that here we have (after the control information)
Byte:<Format of bitmap (by default, and reserved, is "1")>
VByte:<Total number of bits stored>
<CRC8 of the previous fields>
<Bits (byte aligned, little endian)>
<CRC32 of the previous bits>
for the first two bitmaps, then we have
Byte:<Format of array: Three reserved implementations: Log64, uint32 and uint64, as [1,3] >
VByte:<Total number of entries stored>
<CRC8 of the previous fields>
<Entries, codified in the specified format: in Log64, logbits of the number of entries in uint64 (byte aligned, little endian), or uint32 or uint64 entries>
<CRC32 of the previous entries>
for the two arrays.
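For what it's worth, here is how I read the bitmap layout above, as a minimal Java sketch (my interpretation, not code from the hdt-cpp/hdt-java sources; I'm assuming "little endian" means LSB-first within each byte, and I'm only reading past the CRCs, not verifying them, since their exact parameters aren't given here):

import java.io.DataInputStream;
import java.io.IOException;

public class BitmapSectionSketch {

    // Reads one bitmap section laid out as: format byte, VByte bit count,
    // CRC8, byte-aligned bits, CRC32.
    public static boolean[] readBitmap(DataInputStream in) throws IOException {
        int format = in.readUnsignedByte();        // "1" is the reserved default format
        long numBits = readVByte(in);              // total number of bits stored
        int crc8 = in.readUnsignedByte();          // CRC8 of the fields above (not checked here)

        byte[] raw = new byte[(int) ((numBits + 7) / 8)];
        in.readFully(raw);                         // the bits, byte aligned

        boolean[] bits = new boolean[(int) numBits];
        for (int i = 0; i < numBits; i++) {
            // assumption: bit i sits in byte i/8 at position i%8, LSB first
            bits[i] = (raw[i / 8] & (1 << (i % 8))) != 0;
        }

        byte[] crc32 = new byte[4];                // CRC32 of the raw bytes (not checked here)
        in.readFully(crc32);
        return bits;
    }

    // VByte: 7 payload bits per byte, least significant group first,
    // a set most-significant bit marks the last byte.
    private static long readVByte(DataInputStream in) throws IOException {
        long value = 0;
        int shift = 0;
        int b;
        do {
            b = in.readUnsignedByte();
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) == 0);
        return value;
    }
}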
00000770 01 81 9b 01 52 d0 16 a0 | ....R...|
00000780 01 81 9b 01 52 d0 16 a0 01 01 81 f0 01 52 d0 16 |....R........R..|
00000790 a0 01 01 81 f0 01 52 d0 16 a0 |......R...|
Now, the two bitmaps are cool...
01 = format
81 = bits stored
9b = CRC8
01 = map
52 d0 16 a0 = CRC32
as expected.
The two arrays instead, 01 01 81 f0 01 52 d0 16 a0, start with a double 01, whereas what I was expecting was 01 81 9b 01 52 d0 16 a0. Following the documentation, this would read
01 = format
01 81 = entries stored (129??)
f0 = CRC8
01 = entries
52 d0 16 a0 = CRC32
This is what I don't understand... Why is it starting with a double 01? Shouldn't it be just a single 01? Is this a bug or am I missing something?
I am also looking at this, and if I find out whether this byte is redundant or not, I will reply back.
For the moment, one note about the VBYTE in the bitmap and array preambles. After reading the paper referenced in the documentation, which describes the VBYTE structure and the intuition behind it, I understood the following about its use in the HDT bitmap preambles:
The most significant bit of each VBYTE is a flag that tells whether another VBYTE follows. If this bit is set to 1, there are no more VBYTEs after it. The other 7 least significant bits represent the number of bits in the bitmap sequence. In your example, where VBYTE = 0x81 = 129 (decimal) = 1000 0001 (binary), that means there is no continuation and there is only one bit in the sequence.
Now, if the bitmap holds more than 127 bits, then more than one VBYTE is needed, since one is not enough to represent the size. For instance, if you have a bitmap of 132 bits, then there would be 2 VBYTEs representing this size:
0x04 and 0x81 (hex)
0000 0100 and 1000 0001 (binary)
You remove the most significant bits (the flags; notice that the second byte has flag 1), and since it is in byte-aligned little-endian form, you get:
000 0001 and 000 0100
= (decimal) 132
After the VBYTE(s) comes the 8-bit CRC8 and then a variable number of bytes holding the raw bit sequence. For 132 bits, it would take ceiling(132/8) = 17 bytes; in this representation the 4 leftover bits of the last byte are unused and set to 0.
Also, in the arrays, the VByte represents the number of entries (numbers) stored and not bits.
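Here is the same convention as a small Java sketch (my own reading of the description above, not code from the hdt-java sources), using the two sizes discussed here:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class VByteSketch {

    // Decodes one VByte: the low 7 bits of each byte carry the value, least
    // significant group first; a set most-significant bit marks the last byte.
    public static long decode(InputStream in) throws IOException {
        long value = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                throw new IOException("stream ended inside a VByte");
            }
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                return value;
            }
            shift += 7;
        }
    }

    public static void main(String[] args) throws IOException {
        // 0x81 -> 1 bit, as in the one-triple dump above
        System.out.println(decode(new ByteArrayInputStream(new byte[] { (byte) 0x81 })));
        // 0x04 0x81 -> 4 + 1*128 = 132 bits, as in the example above;
        // those 132 bits then occupy ceiling(132/8) = 17 bytes, with 4 padding bits
        System.out.println(decode(new ByteArrayInputStream(new byte[] { 0x04, (byte) 0x81 })));
    }
}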
Yes, this is also my understanding of the VByte. The bitmaps do check out though: 01 81 9b 01 52 d0 16 a0. What doesn't seem to check out (to me at least) are the arrays, 01 01 81 f0 01 52 d0 16 a0, in particular the double 01 at the beginning...
Following the documentation, I decode it like this:
01 Byte:<Format of array: Three reserved implementations: Log64, ...
01 81 VByte:<Total number of entries stored>
Since it's a VByte, I'm taking 2 bytes (because the first bit is set to 0). So, what does this translate to?
01 81 = 000 0001 000 0001 which is 129 in decimal?? 129 entries?? I only had one!
f0 <CRC8 of the previous fields>
... other fields
If you need the entire dump that I'm using for testing...
data.nt
with only one triple inside: <http://example.org/A> <http://schema.org/name> "Alice" .
$ ./rdf2hdt data.nt /dev/stdout | hexdump -C
There's a lot of headers, the bitmaps are at the end.
For testing purposes I've created another file with 2 triples instead,
<http://example.org/A> <http://schema.org/name> "Alice" .
<http://example.org/B> <http://schema.org/surname> "Bob" .
And the bitmaps/arrays are as follows:
00000780 (01 82 92 03 a5 | .....|
00000790 a0 2d 41) (01 82 92 03 a5 a0 2d 41) (01 02 82 c6 09 |.-A......-A.....|
000007a0 9d 88 cf 2a) (01 02 82 c6 09 9d 88 cf 2a) |...*........*|
It looks like it's writing the VByte twice, one time with the leading bit set to zero and then again with the leading bit set to one: 01 81 in the first example (with one triple) and 02 82 in the second example (with two triples).
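To make the comparison concrete, here is a tiny sketch (mine, not from the HDT sources) that decodes the start of those array sections both ways: exactly as the documented layout says, and skipping the puzzling second byte.

public class ArrayHeaderPuzzle {

    public static void main(String[] args) {
        int[] oneTriple  = { 0x01, 0x01, 0x81, 0xf0 };   // from the 1-triple dump
        int[] twoTriples = { 0x01, 0x02, 0x82, 0xc6 };   // from the 2-triple dump
        dump("one triple ", oneTriple);
        dump("two triples", twoTriples);
    }

    static void dump(String label, int[] b) {
        long documented = vbyte(b, 1);   // (a) documented layout: format byte, then VByte of entries
        long skipping   = vbyte(b, 2);   // (b) skip the extra second byte, then read the VByte
        System.out.printf("%s: documented reading -> %d entries, skipping second byte -> %d entries%n",
                label, documented, skipping);
        // prints 129 vs 1 for the first file and 258 vs 2 for the second
    }

    // VByte as described above: 7 payload bits per byte, MSB set on the last byte.
    static long vbyte(int[] bytes, int offset) {
        long value = 0;
        int shift = 0;
        for (int i = offset; i < bytes.length; i++) {
            value |= (long) (bytes[i] & 0x7F) << shift;
            if ((bytes[i] & 0x80) != 0) {
                break;
            }
            shift += 7;
        }
        return value;
    }
}

Reading (b) gives the expected 1 and 2 entries, which is what makes me think the extra byte is either duplicated or simply not described in the documentation.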
BTW, the documentation is really unclear, and most of the understanding must be obtained through intuition and guesswork, that is, by guessing what the author is trying to say instead of reading exactly what he means.
For example, "DictionarySectionPlainFrontCoding" is a big question mark regarding how the bits are stored, "BitmapTriples" is also very light on details, and there is no definition of what "Log64"/"logbits"/"LogArray Position"/"Buffer of Blocks" mean in this context, in particular how the bits are stored.
This documentation is not aimed at helping contributors and making their lives easier; it's more like a scratchpad useful only to the original author to recall key concepts (which he already knows, because he was the one to define them).
The .hdt.index.v1-1 files (i.e. the extra indexes apart from the .hdt file) are even more obscure and undocumented. The documentation says I should expect a ‘$HDT’ cookie followed by a control byte (Unknown, Global, Header, Dictionary, Triples, Index) as an integer [0-5]. What I find instead is...
some of these generated .hdt.index.v1-1 files only have 1 index with order=1, which doesn't seem to make sense because the documentation says 1 = SPO, but the .hdt file the index was generated from already has an SPO index (the default). And BTW, why was only 1 index generated instead of 6 (or more)?
other index files have completely weird cookies/control bytes, like $hdt, $HDt, $hDT, $hdT... what the heck is that?? Additionally, the documentation says that the control byte (after the cookie) should be an integer in [0-5], but next to these weird cookies I see all kinds of bytes, not only 0-5...
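For reference, this is the kind of check I'm doing, as a minimal sketch (names and structure are mine): read the cookie and the control byte at the start of the file and print them, per the quoted documentation. It doesn't explain the odd cookies, but it makes them easy to reproduce.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CookieCheck {

    // Control byte values as listed in the documentation quoted above.
    private static final String[] TYPES = {
            "Unknown", "Global", "Header", "Dictionary", "Triples", "Index"
    };

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            byte[] cookie = new byte[4];
            in.readFully(cookie);                       // expected: the ASCII bytes '$HDT'
            int type = in.readUnsignedByte();           // expected: an integer in [0-5]

            System.out.println("cookie       = " + new String(cookie, StandardCharsets.US_ASCII));
            System.out.println("control byte = " + type
                    + (type < TYPES.length ? " (" + TYPES[type] + ")" : " (out of range!)"));
        }
    }
}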
FWIW, I'm working on an HDT decoder for RDF4J (https://github.com/eclipse/rdf4j/issues/232), and adding some javadoc about the files created by HDT-It 1.x.
(I'm aware of the rdfhdt java version... But the best way to learn about and understand a specification is by implementing it ;-)
I'm facing the same issues: the software implementation differs slightly from (and is more advanced than) the latest published draft and the W3C submission.
The format itself is great, by the way, and has a lot of potential, though IMHO some clarifications would indeed be helpful to make it more widely accepted.
Hello, is there any cheatsheet, graph/diagram, or tutorial explaining the HDT binary format (or would it please be possible to make one)? I mean the one used in .hdt files. I'm not talking about the HDT papers; I'm simply talking about "a set of instructions" that explains how an HDT file is structured. What I have in mind is, give or take, something like this.
It would be useful to people adding more HDT implementations (read/write of HDT files) without having to go through the lengthy details of the papers that explain the ins and outs of the whole HDT system.
Thank you.