rdfhdt / hdt-cpp

HDT C++ Library and Tools

Data-type forwarding to TripleIterator #279

Open gpicciuca opened 5 months ago

gpicciuca commented 5 months ago

Hello, I'm using this library together with Rasqal as the SPARQL query engine. The problem is that Rasqal expects the triples to be in the correct datatype format (e.g. URI, double, boolean) to do its magic, while HDT provides only the raw strings.

The question is whether there is a way to retrieve the datatypes stored inside the HDT file and have them passed (perhaps separately) through the iterator returned by the HDT::Search method.

I've been investigating this on my own, and so far I have only found that CSD_PFC is where the strings are stored, including the datatypes I'm interested in, in textual form (e.g. "0"^^xsd:integer). But when I retrieve the results from the iterator, even though it accesses the same (?!) methods, the portion after ^^ is gone and only "0" is returned.

Having this additional information returned would simplify the integration with Rasqal in my case. Right now I'm doing some basic string manipulation to determine which datatype might be correct, but it feels error-prone and not 100% reliable; having the real type provided directly by the library would be safer.
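For readers following along, here is a minimal sketch of the kind of string handling mentioned above, assuming the term arrives as an N-Triples-style string such as "0"^^<http://www.w3.org/2001/XMLSchema#integer>. The names TermKind, ParsedTerm and parseTerm are hypothetical and are not part of hdt-cpp or Rasqal; escape sequences inside the literal are left untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical types for illustration only; not part of hdt-cpp or Rasqal.
enum class TermKind { Iri, BlankNode, Literal };

struct ParsedTerm {
    TermKind kind;
    std::string value;     // IRI, blank node label, or literal lexical form
    std::string datatype;  // datatype IRI without angle brackets; empty if absent
    std::string language;  // language tag; empty if absent
};

// Splits an N-Triples-style term string into its parts.
// Assumption: a literal starts with '"', and its closing quote is the last '"'
// in the string, because datatype IRIs and language tags cannot contain '"'.
inline ParsedTerm parseTerm(const std::string& s) {
    ParsedTerm t{TermKind::Iri, s, "", ""};
    if (s.empty() || s[0] != '"') {
        if (s.size() >= 2 && s[0] == '_' && s[1] == ':') {
            t.kind = TermKind::BlankNode;
        }
        return t;
    }
    t.kind = TermKind::Literal;
    const std::size_t close = s.rfind('"');
    if (close == 0) {
        // Malformed literal without a closing quote: keep the rest as the lexical form.
        t.value = s.substr(1);
        return t;
    }
    t.value = s.substr(1, close - 1);
    const std::string tail = s.substr(close + 1);
    if (tail.compare(0, 2, "^^") == 0) {
        std::string dt = tail.substr(2);
        // Strip the angle brackets around a full IRI, if present.
        if (dt.size() >= 2 && dt.front() == '<' && dt.back() == '>') {
            dt = dt.substr(1, dt.size() - 2);
        }
        t.datatype = dt;
    } else if (!tail.empty() && tail[0] == '@') {
        t.language = tail.substr(1);
    }
    return t;
}
```

This only helps when the ^^ suffix is actually present in the string handed back by the iterator. If it has already been dropped, the datatype cannot be recovered this way.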

donpellegrino commented 5 months ago

I don't have a direct answer to the question, "whether there is a way to retrieve the datatypes stored inside the HDT file and have them passed (perhaps separately) through the iterator returned by the HDT::Search method." I would have to do some research to figure that out.

Combining HDT storage with a SPARQL Query Engine is useful work and integrating Rasqal with HDT sounds like a good approach. For reference, I have a branch of a fork of Oxigraph available that uses the Oxigraph SPARQL Query Engine and the Rust HDT Library for reading the HDT files. Since that implementation is in Rust rather than C, there will be differences in approach, but the code might show one technique for how the datatypes are handled when going from the HDT contents to the SPARQL query processing.

gpicciuca commented 5 months ago

Thanks for your feedback @donpellegrino. I had a look at your Rust HDT Library and overall, you're using the same or a similar approach to what I'm doing at the moment: mainly string manipulation and regex. In my case, it's quite fast and so far has not given problems, but the context in which my implementation will be used requires us to avoid these kinds of magic tricks (automotive industry).

Meanwhile, I've been digging further into the matter with the C++ library and am still far from having a real solution, but I'm starting to understand what the actual problem is.

image

In the screenshot above, the element being extracted would be the Object, required for the Triple being queried by Rasqal.

CSD_PFC::extractInBlock is called with block = 0 and o = 14. That means that we're in the first available block and we have to move forward by 14 suffixes within this block. To move forward, we have the VByte::decode function, which decodes the first byte of the suffix, extracts the delta (length of the suffix), and returns the number of bytes to move forward, indicating where the suffix actually starts (if I didn't misunderstand this last part).
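The walk described above can be modelled with a small self-contained sketch of plain front coding. This is a simplified model under stated assumptions, not the library's actual code: it assumes that each entry after the first is stored as a VByte number followed by a NUL-terminated suffix, that the VByte number is the length of the prefix shared with the previous string (rather than the suffix length), and that VByte uses 7 payload bits per byte with the high bit marking the last byte. The function names are illustrative, and there is no bounds checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumed VByte layout: least significant 7-bit group first, high bit set on the last byte.
inline void vbyteEncode(std::vector<unsigned char>& out, std::uint64_t value) {
    while (value > 127) {
        out.push_back(static_cast<unsigned char>(value & 0x7F));
        value >>= 7;
    }
    out.push_back(static_cast<unsigned char>(value | 0x80));
}

// Decodes one VByte number into `value` and returns the number of bytes consumed.
inline std::size_t vbyteDecode(const unsigned char* buf, std::uint64_t& value) {
    value = 0;
    std::size_t i = 0;
    int shift = 0;
    while (!(buf[i] & 0x80)) {
        value |= static_cast<std::uint64_t>(buf[i] & 0x7F) << shift;
        shift += 7;
        ++i;
    }
    value |= static_cast<std::uint64_t>(buf[i] & 0x7F) << shift;
    return i + 1;
}

// Front-codes the given strings (which must not contain NUL) into one block.
inline std::vector<unsigned char> encodeBlock(const std::vector<std::string>& strings) {
    std::vector<unsigned char> out;
    for (std::size_t k = 0; k < strings.size(); ++k) {
        std::size_t shared = 0;
        if (k > 0) {
            const std::string& prev = strings[k - 1];
            const std::string& cur = strings[k];
            while (shared < prev.size() && shared < cur.size() && prev[shared] == cur[shared]) {
                ++shared;
            }
            vbyteEncode(out, shared);
        }
        out.insert(out.end(),
                   strings[k].begin() + static_cast<std::ptrdiff_t>(shared),
                   strings[k].end());
        out.push_back('\0');
    }
    return out;
}

// Rebuilds the o-th string of a block: read the first string in full, then for
// each following entry keep `shared` characters and append the stored suffix.
inline std::string extractInBlock(const std::vector<unsigned char>& block, std::size_t o) {
    const unsigned char* p = block.data();
    std::string out(reinterpret_cast<const char*>(p));
    p += out.size() + 1;
    for (std::size_t j = 0; j < o; ++j) {
        std::uint64_t shared = 0;
        p += vbyteDecode(p, shared);
        const std::size_t len = std::strlen(reinterpret_cast<const char*>(p));
        out.resize(static_cast<std::size_t>(shared));
        out.append(reinterpret_cast<const char*>(p), len);
        p += len + 1;
    }
    return out;
}
```

In this model, a block built from literals that carry a ^^ suffix gives the whole string back from extractInBlock, suffix included, because the suffix is simply part of the remaining bytes of each entry. If the real dictionary behaves the same way, the datatype would have to be missing from the stored strings themselves or removed at a later stage, but that is a guess that would need checking against the library source.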

This actually returns the correct value that we're looking for, but here's the catch: the data-type suffix I'm interested in comes right after the suffix where we stopped, and this seems to be true for every entry that has a common/shared prefix, as in this case.

In this particular case, we start with the "0" prefix, move forward by 14 suffixes, extract the length, which is 1, store it in tmpStr, and stop here. The result is the value "1", which is correct.

extractInBlock ID: 0 pos2 1 delta 3 actual_ptr 0x7ffff71b1010
VByte::decode -> 1 delta 3
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 4
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 4
VByte::decode -> 1 delta 3
VByte::decode -> 1 delta 2
VByte::decode -> 1 delta 1
pos 14 >> "1"

There is also another case where o = 0, meaning it won't even go into the for-loop, which makes sense because then it's the very first element and we don't need to look any further. But even here, we're losing out on the data-type suffix.

My guess is that this is an architectural issue as it depends on how the prefixes and suffixes are pooled and there's no real workaround for it.

gpicciuca commented 5 months ago

I tried to hack the code a bit, on the assumption that the datatypes are "always" right after the first common/shared prefix. So I just accessed that location directly with

image

and then append this suffix to the tmpStr variable at the end of extractInBlock only if it starts with ^^:

image

add_datatype defaults to false, and I set it to true only when I'm retrieving data through the Dictionary::tripleIDtoTripleString method; otherwise I ended up with a duplicated data-type string attached.
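Since the screenshots are not reproduced here, the following is only a rough sketch of the conditional append described above, not the code from the images. The helper name appendDatatypeSuffix and the candidate parameter (standing for whatever bytes were read from the guessed location) are hypothetical; only tmpStr and add_datatype come from the description.

```cpp
#include <string>

// Appends `candidate` to `tmpStr` only when the caller opted in via add_datatype
// and the candidate looks like a datatype suffix, i.e. starts with "^^".
// Returns true if something was appended.
inline bool appendDatatypeSuffix(std::string& tmpStr,
                                 const std::string& candidate,
                                 bool add_datatype = false) {
    if (!add_datatype) {
        return false;  // default path: leave the extracted string untouched
    }
    if (candidate.compare(0, 2, "^^") != 0) {
        return false;  // not a datatype suffix
    }
    tmpStr += candidate;
    return true;
}
```

A guard like this only decides whether to append. It does nothing to ensure that the candidate bytes really belong to the string being extracted.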

It works only partially. There are some results that get the correct data-type suffix, while others get nothing at all. So it's unreliable, too, as it depends on how the data is stored in the blocks.