rdfhdt / hdt-cpp

HDT C++ Library and Tools
117 stars 65 forks source link

HDT format, simple diagram or cheatseet #137

Open v4ss4llo opened 6 years ago

v4ss4llo commented 6 years ago

Hello, is there any (or would it please be possible to make one) cheatsheet, graph/diagram, or tutorial explaining the HDT binary format? I mean the one used in .hdt files. Also I'm not talking about the HDT papers, I'm simply talking about "a set of instructions" that explains how a HDT file is structured. For example:

What I have in mind is give-or-take something like this.

It would be useful to people adding more HDT implementations (read/write of HDT files) without having to go through the lengthy details of the papers that explains the ins and outs of the whole HDT system.

Thank you.

v4ss4llo commented 6 years ago

I'd like to help improving the rdf2hdt tool, in particular I'd like to improve (or make another tool conversion tool) that doesn't require terabytes of RAM to convert a file. I've read the W3 submission so I think that I kind of get it, but the document on the W3 website is not really that great, and after opening a simple .hdt file with a hex editor it's still pretty hard to figure out all the components involved in the binary format. Could anybody please point me to a clear description of the binary format? Or help me understand it?

v4ss4llo commented 6 years ago

I wonder though if there is anybody out there reading these issues?

akuckartz commented 6 years ago

Yes, there are people reading these issues.

akuckartz commented 6 years ago

Have you read these two documents?

v4ss4llo commented 6 years ago

In the FourSectionDictionary section it's written As stated, this dictionary is specified by denoting the format to <http://purl.org/HDT/hdt#dictionaryPlain> in the Control information.

Is this correct or a typo? Shouldn't it be <http://purl.org/HDT/hdt#dictionaryFour> instead?

v4ss4llo commented 6 years ago

I've also a question about the "BitmapTriples" encoding, not the theory but the way it's encoded by rdf2hdt.

So, I've encoded a simple file with only one triple, <http://example.org/A> <http://schema.org/name> "Alice" .

There's the header and other stuff, then the last bits of the file are the "BitmapTriples".

00000740  24 48 44 54 04 3c 68 74  74 70 3a 2f 2f 70 75 72  |$HDT.<http://pur|
00000750  6c 2e 6f 72 67 2f 48 44  54 2f 68 64 74 23 74 72  |l.org/HDT/hdt#tr|
00000760  69 70 6c 65 73 42 69 74  6d 61 70 3e 00 6f 72 64  |iplesBitmap>.ord|
00000770  65 72 3d 31 3b 00 59 e9  01 81 9b 01 52 d0 16 a0  |er=1;.Y.....R...|
00000780  01 81 9b 01 52 d0 16 a0  01 01 81 f0 01 52 d0 16  |....R........R..|
00000790  a0 01 01 81 f0 01 52 d0  16 a0                    |......R...|

The documentation says that here we have (after the control information)

Byte:<Format of bitmap (by default, and reserved, is "1")>
VByte:<Total number of bits stored>
<CRC8 of the previous fields>
<Bits (byte aligned, little endian) 
<CRC32 of the previous bits>

for the first two bitmaps, then we have

Byte:<Format of array: Three reserved implementations: Log64, uint32 and uint64, as [1,3] >
VByte:<Total number of entries stored>
<CRC8 of the previous fields>
<Entries, codified in the specified format: In Log64, logbits of the number of entries in uint64 (byte aligned, little endian), or uint32 or uint64 entries. 
<CRC32 of the previous entries>

for the two arrays.

00000770                           01 81 9b 01 52 d0 16 a0  |        ....R...|
00000780  01 81 9b 01 52 d0 16 a0  01 01 81 f0 01 52 d0 16  |....R........R..|
00000790  a0 01 01 81 f0 01 52 d0  16 a0                    |......R...|

Now, the two bitmaps are cool... 01 = format 81 = bits stored 9b = CRC8 01 = map 52 d0 16 a0 = CRC32 as expected.

The two arrays instead... 01 01 81 f0 01 52 d0 16 a0 start with a double 01, whereas what I was expecting was 01 81 9b 01 52 d0 16 a0. Following the documentation, this would read 01 = format 01 81 = bit stored (129??) f0 = CRC8 01 = entries 52 d0 16 a0 = CRC32. This is what I don't understand... Why is it starting with a double 01? Shouldn't it be just a single 01? Is this a bug or am I missing something?

dinikolop commented 6 years ago

I am also looking at this, and if I find out whether it this byte redandunt or not, I will reply back.

For the moment, one note about VBYTE in bitmap arrays. After reading the reference in the documentation to a paper that describes the VBYTE structure and intuition, I understood the following things about the use of it in the HDT bitmap preambles:

The most significant bit of the VBYTE is a flag which tells whether there is no other VBYTE next to it or there is. If this bit is set to 1, then it means that there is no more VBYTEs next to it. The other 7 least significant bits represent the number of the bits that are in the bitmap sequence. In your example, where VBYTE = 0x81 = 129 (decimal) = 1000 0001 (binary) that means that there is no continuation and that there is only one bit in the sequence. Now, if the bitmap holds more than 127 bit, then there is need of more VBYTE since one is not enough for representing the size. For instance, if you have a bitmap of 132 bits then there would be 2 VBYTES representing this size: 0x04 and 0x81 (hex) 0000 0100 and 1000 0001 (binary) You remove the most significant bits (flags - notice that the second has flag 1) and since it is in byte alligned little endian form, then you get: 000 0001 and 000 0100 = (decimal) 132

After the VBYTE(s) is the 8-bit CRC8 and then a variable size of bytes which is the raw bit sequence. For 132 bits, it would take ceiling(132/8) = 17 bytes. In this representation the 3 least significant bits would be redandunt and set to 0.

Also, in the arrays, the VByte represents the number of entries (numbers) stored and not bits.

v4ss4llo commented 6 years ago

Yes this is also my understanding of the VByte. The bitmaps do check out though, 01 81 9b 01 52 d0 16 a0. What doesn't seem to check out (to me at least) are the arrays, 01 01 81 f0 01 52 d0 16 a0, in particular the double 01 at the beginning... Following the documentation, I decode it like this:

01       Byte:<Format of array: Three reserved implementations: Log64, ...
01 81    VByte:<Total number of entries stored>
         Since it's a VByte, I'm taking 2 bytes (because the first bit is set to 0). So, what does
         this translate to?
         01 81 = 000 0001 000 0001   which is 129 in decimal?? 129 entries?? I only had one!
f0       <CRC8 of the previous fields>
... other fields
v4ss4llo commented 6 years ago

If you need the entire dump that I'm using for testing...

There's a lot of headers, the bitmaps are at the end.

v4ss4llo commented 6 years ago

For testing purposes I've created another file with 2 triples instead,

<http://example.org/A> <http://schema.org/name> "Alice" .
<http://example.org/B> <http://schema.org/surname> "Bob" .

And the bitmaps/arrays are as follow

00000780                                    (01 82 92 03 a5     |           .....|
00000790  a0 2d 41) (01 82 92 03 a5  a0 2d 41) (01 02 82 c6 09  |.-A......-A.....|
000007a0  9d 88 cf 2a) (01 02 82 c6  09 9d 88 cf 2a)            |...*........*|

It looks like it's writing the VByte twice, one time with the leading bit set to zero, and then again with the leading bit set to one. 01 81 in the first example (with one triple) and 02 82 in the second example (with two triples).

v4ss4llo commented 6 years ago

BTW the documentation is really unclear and most of the understanding must be obtained through intuition and guesswork, that is by guessing what the author is trying to say instead of actually stating exactly what he means.

For example the "DictionarySectionPlainFrontCoding" is a big question mark about how the bits are stored, "BitmapTriples" is also very light on details, then there is no definition of what "Log64"/"logbits"/"LogArray Position"/"Buffer of Blocks" mean in this context and in particular how bits are stored, and another example is "" which fails to mention that the CRC32 is not over all the previous entries, but only since the last CRC8.

This documentation is not aimed at helping contributors and making their life easier, it's more like a scratchpad useful only to the original author to recall key concepts (that he already knows because he was the one to define them).

v4ss4llo commented 6 years ago

The .hdt.index.v1-1 (ie. the extra indexes a part from the .hdt file) is even more obscure and undocumented. The documentation says I should expect a ‘$HDT’ cookie followed by a control byte (Unknown, Global, Header, Dictionary, Triples, Index) as integer [0-5]. What I find instead is...

barthanssens commented 4 years ago

FWIW, I'm working on a HDT decoder for RDF4J (https://github.com/eclipse/rdf4j/issues/232), and adding some javadoc about the files created by HDT-It 1.x.

(I'm aware of the rdfhdt java version... But the best way to learn about and understand a specification is by implementing it ;-)

I'm facing the same issues: the software implementation is slightly different (and more advanced) from the latest published draft and the W3C submission.

The format itself is great, by the way, and has a lot of potential, though IMHO some clarifications would indeed be helpful to make it more widely accepted.