pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

Add HDT backend #21

Closed pchampin closed 11 months ago

pchampin commented 4 years ago

An implementation of graph::Graph using an HDT file as its underlying data would be nice.

Note that this would be a read-only graph, as HDT does not support updates.

KonradHoeffner commented 1 year ago

I am also interested in this, but there seems to be no fully functional and performant HDT library for Rust yet. There is a prototype at https://github.com/timplication/hdt-rs/, which I am trying to improve, but I can't guarantee that I will succeed, as it is quite complex.

pchampin commented 1 year ago

Another option would be to make a wrapper around libhdt.

KonradHoeffner commented 1 year ago

I've now spent some time with hdt-rs, and the main problem seems to be the lack of the locate function, which translates URIs back into dictionary IDs. There is a function that materializes the complete knowledge base into triples, but when I tested it with a 1.7 GB HDT file converted from a 27 GB N-Triples file, it didn't fit into 32 GB RAM and caused Linux to kill the program. Using the Java code as a baseline (I'm not comfortable enough with C++), I've added the locate function in a fork at https://github.com/konradhoeffner/hdt-rs; however, it currently only works when the string is the first one in a block of the dictionary section. I will try to fix this and then write a graph::Graph implementation. The question is where this implementation should go: probably best directly into hdt-rs, but as a feature that is not selected by default, so that users of the library who don't use Sophia don't get an extra dependency. Or what do you think?

Wrapping libhdt would also be an option but I think someone else would need to do this as I have no experience wrapping C++ libraries in Rust.

pchampin commented 1 year ago

The question is where this implementation should go: probably best directly into hdt-rs, but as a feature that is not selected by default, so that users of the library who don't use Sophia don't get an extra dependency. Or what do you think?

I think it would be ok to put the Graph impl in a different crate... but if you choose to put it in hdt-rs, yes, it should probably be feature-gated.
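
For instance, the adapter module could be gated behind an opt-in Cargo feature, along these lines (the feature name "sophia" is hypothetical):

// in hdt-rs's lib.rs: only compile the Sophia adapter when the (hypothetical)
// "sophia" feature is enabled, so users who don't need it don't pull in the
// sophia dependency (declared as optional in Cargo.toml)
#[cfg(feature = "sophia")]
pub mod hdt_graph;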

KonradHoeffner commented 1 year ago

The tests are successful now, although they don't cover everything yet. I will try implementing a Sophia Graph and put it inside the hdt-rs fork just for testing right now; later I can move it into a separate crate.

KonradHoeffner commented 1 year ago

I started implementing the Sophia Graph trait, but I'm not sure how to convert the triples from the HDTReader, which are in the form (String, String, String), into a form that the Graph trait accepts. For example, a triple from the HDT reader could be ("<http://example.com/ExampleSubject>", "<http://example.com/ExamplePredicate>", "\"example literal\""). Here is what I got so far:


impl<'a, R: std::io::BufRead> HDTReader<'a, R> {
[...]
    pub fn triples(&mut self) -> impl Iterator<Item = (String, String, String)> {
    [...]
    }

}

[...]

struct HdtGraph<'a, R: std::io::BufRead> {
    reader: HDTReader<'a, R>,
}

impl<'a, R: BufRead> HdtGraph<'a, R> {
    fn read(r: R) -> Self {
        HdtGraph {
            reader: HDTReader::new(&mut r),
        }
    }
}                                            
impl<R: BufRead> Graph for HdtGraph<'_, R> { 
    type Triple = ByValue<[Term<String>; 3]>;
    type Error = Infallible; // infallible for now, figure out what to put here later

    fn triples(&self) -> GTripleSource<Self> { 
        Box::new(
            self.reader
                .triples()
                .map(|(s, p, o)| {
                    StreamedTriple::by_value([
                        Iri::new_unchecked(&s),
                        Iri::new_unchecked(&p),
                        Iri::new_unchecked(&o),
                    ])                          
                })
                .into_triple_source(),
        )
    }
}

However, this doesn't cover the possibility that the object is a literal, and it also causes the following compile error:

error[E0271]: expected `fn(sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>) -> Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>, Infallible>` to be a fn pointer that returns `Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::term::Term<String>; 3]>>, Infallible>`, but it returns `Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>, Infallible>`
  --> src/hdt_graph.rs:74:5
   |
74 |     Box::new(self.reader.triples().map(|(s,p,o)| StreamedTriple::by_value([Iri::new_unchecked(&s),Iri::new_unchecked(&p),Iri::new_unchecked(&o)])).into_triple_source())
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected enum `sophia::term::Term`, found struct `sophia::iri::Iri`
   |
   = note: expected enum `Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::term::Term<String>; 3]>>, _>`
              found enum `Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>, _>`
   = note: required for `std::iter::Map<std::iter::Map<impl Iterator<Item = (String, String, String)>, [closure@src/hdt_graph.rs:74:40: 74:49]>, fn(sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>) -> Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>, Infallible>>` to implement `Iterator`
   = note: required for the cast from `std::iter::Map<std::iter::Map<impl Iterator<Item = (String, String, String)>, [closure@src/hdt_graph.rs:74:40: 74:49]>, fn(sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>) -> Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::iri::Iri<'_>; 3]>>, Infallible>>` to the object type `dyn Iterator<Item = Result<sophia::triple::streaming_mode::StreamedTriple<'_, sophia::triple::streaming_mode::ByValue<[sophia::term::Term

According to the documentation, Iri implements TTerm, so I don't know why it doesn't accept it.

KonradHoeffner commented 1 year ago

I got it to run like this:

Box::new(
    hdt_reader
        .triples()
        .map(|(s, p, o)| {
            StreamedTriple::by_value([
                Term::<String>::from(s),
                Term::<String>::from(p),
                Term::<String>::from(o),
            ])
        })
        .into_triple_source(),
)

However, when printing the first item of the iterator, the term type always seems to be Literal, even though these terms should be IRIs:

Ok(StreamedTriple { _phantom: PhantomData, wrapped: [Literal(Literal { txt: "_:b1", kind: Dt(Iri { ns: "http://www.w3.org/2001/XMLSchema#", suffix: Some("string") }) }), Literal(Literal { txt: "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1", kind: Dt(Iri { ns: "http://www.w3.org/2001/XMLSchema#", suffix: Some("string") }) }), Literal(Literal { txt: "http://data.semanticweb.org/person/barry-norton", kind: Dt(Iri { ns: "http://www.w3.org/2001/XMLSchema#", suffix: Some("string") }) })] })

pchampin commented 1 year ago

FTR, this "StreamedTriple" thing is an ugly hack that my long-awaited "big refactoring" is meant to get rid of. Sorry you have to go through this...

What you need is a function that "understands" the HDT convention that:

  • anything between <> is an IRI,
  • anything between "" is a literal,
  • anything starting with _: is a blank node (I guess),

and calls the appropriate constructor of Term (new_iri and following).
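
For illustration, a minimal sketch of such a function might look like this (a sketch, assuming sophia 0.7's Term constructors new_iri_unchecked, new_bnode and new_literal_dt; real literal parsing would also handle @lang and ^^datatype suffixes):

use sophia::term::{BoxTerm, Term};

/// Build a term from an HDT dictionary string, following the convention above.
/// Simplified: literal handling ignores language tags and datatype suffixes.
fn auto_term(s: &str) -> BoxTerm {
    if s.starts_with('<') && s.ends_with('>') {
        // anything between <> is an IRI
        Term::new_iri_unchecked(s[1..s.len() - 1].to_owned())
    } else if let Some(id) = s.strip_prefix("_:") {
        // anything starting with _: is a blank node
        Term::new_bnode(id.to_owned()).unwrap()
    } else {
        // anything between "" is a literal; use xsd:string as default datatype
        let dt = Term::new_iri_unchecked("http://www.w3.org/2001/XMLSchema#string".to_owned());
        Term::new_literal_dt(s.trim_matches('"').to_owned(), dt).unwrap()
    }
}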

Note also that you may want to use BoxTerm, equivalent to Term<Box<str>>, instead of Term<String> (because you don't need the mutability of String in this context).

Note also that BoxTerm or Term<String> will allocate a copy of the str you get from the HDT, while ideally you could provide a term that references that str directly. This would probably be faster and leaner. Alas, this is super-hard to achieve with the StreamedTriple hack, and will be much easier in the next version. Hopefully, I will be able to push a pre-release by the end of October...

In the meantime, if you want to make progress, you can continue with BoxTerm -- just be aware that the performance will not be as good as it could be (and eventually will be).

KonradHoeffner commented 1 year ago

Version 0.0.7 of the HDT graph is now available at https://crates.io/crates/hdt and https://github.com/KonradHoeffner/hdt. Preliminary benchmarks comparing Sophia's FastGraph, LightGraph and the HDT graph are available at https://github.com/KonradHoeffner/sophia_benchmark/blob/master/benchmark_results.ipynb. However, the base memory consumption seems a bit high at the moment; I will investigate this later. I just forked https://github.com/pchampin/sophia_benchmark/ and switched out the query part, so the rest still has the old values. Unfortunately, the test data stops at around 10 million triples, which is when the RAM savings really start getting big. I have successfully loaded 135 million triples (30 GB N-Triples, ~1.5 GB HDT) using this library, at which point it consumes around 2.9 GB RAM.

pchampin commented 1 year ago

Thanks and kudos for this work.

Those results are quite counter-intuitive (and therefore very interesting)!

Re. loading, I am amazed by the fact that HDT is faster, and by that order of magnitude. Intuitively, creating the HDT structures seems like more work... Or am I misunderstanding what you are measuring here? Are you loading from an N-Triples file, or from a file that is already in HDT format (in which case, I would better understand the results)?

Re. querying, any idea why HDT is still slower?

KonradHoeffner commented 1 year ago

Thanks! Yes, the latter is true: I am converting the N-Triples files to HDT beforehand and only loading the existing HDT files, because the library cannot create all of the HDT structures on its own but relies on loading existing HDT files. Also, I think that in a real-world scenario one would prefer to use HDT files anyway, because they are much smaller, so I think this is a fair comparison, but I could also see arguments for the other case. If loading N-Triples into HDT is a feature you would like to use, I could try implementing that as well.

As for querying, I think the main problem is that the benchmark uses a ?PO pattern, which is one of the few that isn't fully optimized yet. The current implementation uses the property index to first get all triple IDs (triples of integers) for the given predicate, then filters them by the given object, and finally translates the triple IDs to strings. This is faster than the default implementation of Sophia's Graph trait, which I think would also use triples_with_p() and then filter, but it would filter after the translation to strings, because it cannot know that HDT uses integer triples. I could try parallelizing the filter operation as a quick optimization, but I will also investigate whether I can implement a fully optimized triples_with_po instead, as this is planned anyway.
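
In a sketch, with illustrative stand-ins for the hdt-rs internals (Dict, triple_ids_with_p and id_to_string are placeholders, not the real API), the current strategy is:

/// Placeholder for the HDT dictionary.
struct Dict;
impl Dict {
    fn id_to_string(&self, _id: usize) -> String {
        unimplemented!()
    }
}

/// Placeholder: all integer triples with the given predicate ID, via the index.
fn triple_ids_with_p(_pid: usize) -> impl Iterator<Item = (usize, usize, usize)> {
    std::iter::empty()
}

/// ?PO evaluation: filter on the object as an integer ID and only translate
/// the surviving triples to strings; the trait's default implementation would
/// translate first and then filter on strings, which is far more expensive.
fn triples_with_po(dict: &Dict, pid: usize, oid: usize) -> Vec<(String, String, String)> {
    triple_ids_with_p(pid)
        .filter(|&(_, _, o)| o == oid) // cheap integer comparison
        .map(|(s, p, o)| (dict.id_to_string(s), dict.id_to_string(p), dict.id_to_string(o)))
        .collect()
}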

I also adapted your existing benchmark code rather quickly, so I need to check more thoroughly whether I measured everything correctly.

P.S.: I updated the figure captions from "... NT file ..." to "... NT/HDT file".

KonradHoeffner commented 1 year ago

I don't know why the memory used while allocating the graph is so high for HDT, even with the 10k-triples file. The benchmark reports 219-351 MB, but when I use /usr/bin/time -v or heaptrack, it shows much less: heaptrack reports 1 MB of heap memory consumption, and the sophia_benchmark binary itself is 6.5 MB in size. Could this be related to multithreading? I'm not sure how exactly VM size is reported on Linux and will investigate. Strangely, this problem does not occur when not using HDT.

Update: Switched from VmSize to VmRSS, which is much closer to what heaptrack and /usr/bin/time -v are reporting.
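
For reference, the value now used can be read from /proc/self/status like this (Linux-only sketch):

use std::fs;

/// Resident set size in kB, parsed from /proc/self/status (Linux only).
fn vm_rss_kb() -> Option<u64> {
    fs::read_to_string("/proc/self/status")
        .ok()?
        .lines()
        .find(|l| l.starts_with("VmRSS:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}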

KonradHoeffner commented 1 year ago

Added Jena back in and updated to version 4.6.1. See https://github.com/KonradHoeffner/sophia_benchmark/blob/master/benchmark_results.ipynb.

KonradHoeffner commented 1 year ago

I started optimizing the triples_with_po method, but there are still some bugs in the new implementation; I will try to fix them this week. I am considering comparing some other triple patterns as well, as ?PO should be one of the worst ones for HDT, because the default data layout is S-P-O.

KonradHoeffner commented 1 year ago

P.S.: Loading could be made even faster by not creating the helper structures for object- and predicate-based access, and it would use less memory as well, but it would be even slower to query the triple patterns where those structures are necessary; probably similar to the FastGraph vs. LightGraph split.

KonradHoeffner commented 1 year ago

https://github.com/KonradHoeffner/hdt/commit/d596b83107fd7c05320d77c8ae6a5e8736772f56 optimizes triples_with_po using binary search; I will recreate the benchmarks the next time I am at that machine.
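
The core idea, independent of the hdt-rs data structures: instead of scanning the object IDs of a predicate linearly, two binary searches over the sorted run yield the range of matches (illustrative sketch, not the committed code):

use std::ops::Range;

/// Positions holding `oid` within a sorted run of object IDs.
fn object_range(sorted_oids: &[usize], oid: usize) -> Range<usize> {
    let start = sorted_oids.partition_point(|&x| x < oid);
    let end = sorted_oids.partition_point(|&x| x <= oid);
    start..end
}

fn main() {
    assert_eq!(object_range(&[1, 3, 3, 3, 7], 3), 1..4);
}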

KonradHoeffner commented 1 year ago

There seems to be a lot of optimization potential:

[Screenshot from 2022-12-22 06-47-36]

Only 18% of the time is used to get the actual triples of integer IDs, while 49% is taken to translate the IDs to strings and 28% to create Sophia terms out of the strings.

Given that all results of a ?PO query share the same predicate and object, reusing those terms should be a big performance boost. However, that would mean using references or smart pointers instead of strings, and those would need to be managed as well.

KonradHoeffner commented 1 year ago

After some optimization it looks much better now:

[Screenshot from 2022-12-22 13-09-46]

However, it is still nowhere near as fast as Sophia's FastGraph, and I don't think I can achieve that kind of speed.

pchampin commented 1 year ago

Thanks a lot for all this work!

pchampin commented 1 year ago

I agree that comparing NT-loading time with HDT-loading time also makes sense. Thanks for making this clearer in the figure's legend.

Regarding performance, it is very interesting to have "raw" hdt-rs in the figures as well. I'll try to find some time to review your code, especially to try to understand what creates the gap between hdt-rs and sophia_hdt in some cases...

KonradHoeffner commented 1 year ago

What you need is a function that "understands" the HDT convention that:

  • anything between <> is an IRI,
  • anything between "" is a literal,
  • anything starting with _: is a blank node (I guess),

and calls the appropriate constructor of Term (new_iri and following).

Thanks! This function now exists as auto_term in https://github.com/KonradHoeffner/hdt/blob/main/src/hdt_graph.rs; however, it is very slow because of the string matching, so I'm in the process of getting rid of it wherever possible, for example in the predicate case, where the term is always a URI.

KonradHoeffner commented 1 year ago

Note also that BoxTerm or Term<String> will allocate a copy of the str you get from the HDT, while ideally you could provide a term that references that str directly. This would probably be faster and leaner.

I didn't understand Rust well enough at the time, so I was a bit confused about why all those different datatypes were necessary, but I finally understand it now and have made the Hdt and HdtGraph structures generic as well, so one can now choose between Box, Rc and Arc depending on the use case. For example, with a *PO query, the P and O terms are only created once (I couldn't reuse the input terms directly, as I didn't manage to convert them to a format that the compiler accepted) and are then cloned for each triple in the result. However, this isn't done yet for queries with a given subject, because those are all handled by the same iterator.
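
The gist of that sharing, with a plain Rc<str> standing in for the generic term data (illustrative only):

use std::rc::Rc;

/// With Rc-backed term data, the object string is allocated once per query;
/// each result triple then only bumps a reference count instead of copying it.
fn share_object(object: &str, subject_ids: &[usize]) -> Vec<(usize, Rc<str>)> {
    let o: Rc<str> = Rc::from(object); // created once
    subject_ids.iter().map(|&sid| (sid, Rc::clone(&o))).collect()
}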

Referencing resources directly from within HDT with an &str isn't possible, however, because HDT is a compressed format and its dictionary contains the strings in front-coded form: first the block needs to be located, then the position within the block, and then the string needs to be merged from two byte-array slices and converted to Unicode. I think for this reason it cannot be as fast as Sophia's FastGraph, which, as far as I understand it, contains everything fully indexed, uncompressed and ready to access.
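
To illustrate why handing out an &str is impossible: in a front-coded block, each entry after the first stores only the length of the prefix shared with its predecessor plus the remaining suffix, so the full string only exists after decoding (simplified sketch; the real HDT dictionary adds block boundaries and escaping):

/// Decode the i-th string of a front-coded block. Entry 0 is stored in full;
/// each later entry is (shared-prefix length, suffix).
fn decode(block: &[(usize, &str)], i: usize) -> String {
    let mut s = String::new();
    for &(shared, suffix) in &block[..=i] {
        s.truncate(shared); // keep the prefix shared with the previous string
        s.push_str(suffix); // append this entry's own suffix
    }
    s
}

fn main() {
    // "http://example.org/a" and "http://example.org/b" share a 19-char prefix
    let block = [(0, "http://example.org/a"), (19, "b")];
    assert_eq!(decode(&block, 1), "http://example.org/b");
}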

KonradHoeffner commented 1 year ago

Regarding performance, it is very interesting to have "raw" hdt-rs in the figures as well. I'll try to find some time to review your code, especially to try to understand what creates the gap between hdt-rs and sophia_hdt in some cases...

Thanks! I think a good starting point would be the triples_with_po function in https://github.com/KonradHoeffner/hdt/blob/main/src/hdt_graph.rs that answers the first query of the benchmark:

    /// An iterator visiting all triples with the given predicate and object.
    fn triples_with_po<'s, TP: TTerm + ?Sized, TO: TTerm + ?Sized>(
        &'s self, p: &'s TP, o: &'s TO,
    ) -> GTripleSource<'s, Self> {
        let ps = term_string(p);
        let os = term_string(o);
        let pid = self.hdt.dict.string_to_id(&ps, &IdKind::Predicate);
        let oid = self.hdt.dict.string_to_id(&os, &IdKind::Object);
        // predicate can be neither a literal nor a blank node
        let p = Term::new_iri_unchecked(ps);
        let o = auto_term(&os).unwrap();
        Box::new(
            PredicateObjectIter::new(&self.hdt.triples, pid, oid)
                .map(move |sid| -> Result<_> {
                    Ok(StreamedTriple::by_value([
                        // subject is never a literal, so we can save a lot of CPU time here by not using auto_term
                        // TODO: could subject be a blank node?
                        Term::new_iri_unchecked(self.hdt.dict.id_to_string(sid, &IdKind::Subject).unwrap()),
                        p.clone(),
                        o.clone(),
                    ]))
                })
                .filter_map(|r| r.map_err(|e| error!("{e}")).ok())
                .into_triple_source(),
        )
    }

I see three reasons for the performance difference here:

  1. string matching in auto_term is very slow
  2. Term::new_iri_unchecked seems to use some regular expressions as well
  3. I can't get the compiler to accept the input P and O terms as part of the output, so I convert them to strings and back, which has more of a relative impact when the result set is small.

To measure the performance, I found perf quite useful, together with converting its output to the Firefox Profiler format.

I wonder if it makes sense to use a second thread here, where the first thread runs hdt.triples_with_po and pipes the results into a buffer, and the second thread reads from the buffer asynchronously and runs Term::new_iri_unchecked.
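
A sketch of that pipelining idea with a bounded channel (the string lookup is a placeholder, not hdt-rs API):

use std::sync::mpsc;
use std::thread;

fn main() {
    // bounded channel: the producer blocks once the buffer is full,
    // so the two stages overlap without unbounded memory use
    let (tx, rx) = mpsc::sync_channel::<String>(1024);

    let producer = thread::spawn(move || {
        for id in 0..10_000u32 {
            // stands in for translating a matching triple ID to a string
            let s = format!("http://example.org/{id}");
            if tx.send(s).is_err() {
                break; // consumer hung up
            }
        }
    });

    // consumer side: this is where Term::new_iri_unchecked would run,
    // overlapping term construction with the ID-to-string translation
    let count = rx.into_iter().count();
    assert_eq!(count, 10_000);
    producer.join().unwrap();
}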

KonradHoeffner commented 1 year ago

With the current version, I think the overhead is just the IriRef::new_unchecked(MownStr::from(x)) for the ?PO pattern.

pchampin commented 1 year ago

IriRef::new_unchecked only uses regular expressions in debug mode, not in --release mode.

Measuring performance in debug builds is not really relevant, because debug code is not optimized and may contain a lot of extra code (e.g. debug_assert) for, well... debugging purposes :)

KonradHoeffner commented 1 year ago

I think I always used release mode, but I will measure it again.

KonradHoeffner commented 1 year ago

You are right: I no longer find any regular expressions, and the overhead seems fine now!

Current Results

?PO/7 triple IDs (?, type, person)
                        time:   [118.53 ms 118.61 ms 118.93 ms]
?PO/8 str subjects (?, type, person)
                        time:   [313.49 ms 313.65 ms 314.29 ms]
?PO/9 str triples (?, type, person)
                        time:   [325.37 ms 325.42 ms 325.43 ms]
?PO/10 Sophia HDT graph triples (?, type, person)
                        time:   [356.40 ms 356.51 ms 356.96 ms]

profile of the benchmark iteration ?PO/10 Sophia HDT graph triples

group.bench_function("10 Sophia HDT graph triples (?, type, person)", |b| {
    b.iter(|| graph.triples_matching(Any, Some(&type_term), Some(&person_term)).count())
});

[Screenshot from 2023-02-22 11-08-00]

overhead only

[Screenshot from 2023-02-22 11-10-00]

KonradHoeffner commented 11 months ago

I have been using the Sophia adapter of the hdt library in production for some time, so I think the issue is resolved :-) For example, it powers LinkedSpending at http://linkedspending.aksw.org, which previously always ran out of memory but now needs less than 2 GB of RAM to host a ~1.4 GB HDT file that is ~30 GB in size as uncompressed N-Triples. If you have any further requests for the adapter, please let me know. Usage is documented in the README.md of https://github.com/konradhoeffner/hdt and on https://docs.rs/hdt/latest/hdt/. There are also a few benchmarks in the paper https://joss.theoj.org/papers/10.21105/joss.05114, based on the benchmark suite of your Sophia paper.

pchampin commented 11 months ago

I have been using the Sophia adapter of the hdt library in production for some time, so I think the issue is resolved :-)

Agreed, and thanks again for the awesome work you have put into this. I just added a mention of hdt to Sophia's README.