pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

less memory than LightGraph possible? #121

Closed KonradHoeffner closed 1 year ago

KonradHoeffner commented 2 years ago

When I load an N-Triples file that contains ~16 million triples and is ~181 MB ZIP-compressed (3.6 GB uncompressed), I need 4.9 GB of RAM with a FastGraph (according to Docker stats) and 2.6 GB of RAM with a LightGraph.
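
A minimal sketch of this kind of loading, assuming sophia 0.7's N-Triples parser and in-memory graphs (file name and error handling are simplified):

    use sophia::graph::{inmem::FastGraph, Graph};
    use sophia::parser::nt;
    use sophia::triple::stream::TripleSource;
    use std::{fs::File, io::BufReader};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Stream-parse the (uncompressed) N-Triples file into an in-memory graph;
        // LightGraph can be substituted for FastGraph to trade speed for memory.
        let file = BufReader::new(File::open("qbench2.nt")?);
        let graph: FastGraph = nt::parse_bufread(file).collect_triples()?;
        println!("loaded {} triples", graph.triples().count());
        Ok(())
    }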

Is there a way to reduce the memory consumption further, for example to less than 1 GB? The file contains very long URIs that occur multiple times, so I assumed this would use much less memory, as Sophia seems to represent triple subjects with integers. At least that's what it looks like to me as a non-expert in Rust from https://docs.rs/sophia/latest/sophia/graph/inmem/type.LightGraph.html:

pub type LightGraph = HashGraph<TermIndexMapU<u32, WeakHashSet<Weak<str>, RandomState>>>;

For example, these are the first lines of the file:

<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-cofog1> <https://openspending.org/618ac3ec98384f44a9ef142356ce476d/cofog1/09> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://www.w3.org/2000/01/rdf-schema#label> "618ac3ec98384f44a9ef142356ce476d observation 8b31551d0f8a7a86b3c885654e4b67e4a4c334af" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economicidlevel1> "4000" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-cofog2> <https://openspending.org/618ac3ec98384f44a9ef142356ce476d/cofog2/09-6> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economiclevel2> "Grants" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://purl.org/dc/terms/source> _:A5b0893dcX3aX14a2b3da39eX3aXX2dX5650 .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://dbpedia.org/ontology/currency> <http://dbpedia.org/resource/Armenian_dram> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/refYear> "2009"^^<http://www.w3.org/2001/XMLSchema#gYear> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-program> "" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-type> "non-personnel" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#Observation> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economicidlevel2> "4600" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-from> <https://openspending.org/618ac3ec98384f44a9ef142356ce476d/from/106004> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-cofog> "09.6.1" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-cofog3> <https://openspending.org/618ac3ec98384f44a9ef142356ce476d/cofog3/09-6-1> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-amount> "0.0" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://purl.org/linked-data/sdmx/2009/attribute#refArea> <http://linkedgeodata.org/triplify/node1504546320> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economiclevel3> "Capital grants to other level of public sector" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://purl.org/linked-data/cube#dataSet> <http://linkedspending.aksw.org/instance/618ac3ec98384f44a9ef142356ce476d> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economic> <https://openspending.org/618ac3ec98384f44a9ef142356ce476d/economic/4652> .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economiclevel1> "Running expenses" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-economicidlevel3> "4650" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/618ac3ec98384f44a9ef142356ce476d-datasetid> "25a9016d-efc0-487e-a354-3a85d6822004" .
<http://linkedspending.aksw.org/instance/observation-618ac3ec98384f44a9ef142356ce476d-8b31551d0f8a7a86b3c885654e4b67e4a4c334af> <http://linkedspending.aksw.org/ontology/refDate> "2009-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
KonradHoeffner commented 2 years ago

P.S.: The URIs also have long common substrings; is it possible to save memory by storing the URIs in some kind of trie?

pchampin commented 2 years ago

You are correct: the string for each term is only stored once, and the triples are stored as 3 integers.

P.S.: The URIs also have long common substrings; is it possible to save memory by storing the URIs in some kind of trie?

You could do that by ensuring that all IRIs are "normalized" with a cut at the last "gen-delim" character (usually / or #). That way, the common prefix will only be stored once. Example:

    let mut graph = LightGraph::new();
    nt::parse_str(example).for_each_triple(|t| {
        graph.insert(
            &RefTerm::from(t.s()).normalized(Normalization::LastGenDelim), 
            &RefTerm::from(t.p()).normalized(Normalization::LastGenDelim), 
            &RefTerm::from(t.o()).normalized(Normalization::LastGenDelim), 
        ).unwrap();
    })?;

I never tested whether this storage strategy was worth the trouble, so I am very interested in your feedback here.

Also, depending on how many distinct terms you have in your graph, it might be worth considering sophia::graph::inmem::small, which uses smaller integers.

KonradHoeffner commented 2 years ago

Unfortunately it only reduces memory usage from 2.568 GiB (2.569 on a second try) to 2.524 GiB (2.525 on a second try), i.e. by only about 44 MB, which could partly just be measurement error. sophia::graph::inmem::small would most probably not be enough for the 16 million triples.

KonradHoeffner commented 2 years ago

Is there a way to perform a custom normalization? I think in my case adding a - character as a delimiter could help. Alternatively, I think storing the term strings in a trie would help, but I need to investigate if I could do that with a sophia graph.

pchampin commented 2 years ago

Unfortunately it only reduces memory usage from 2.568 GiB (2.569 on a second try) to 2.524 GiB (2.525 on a second try), i.e. by only about 44 MB, which could partly just be measurement error.

Thanks for this feedback! This confirms that this idea of splitting the IRIs was premature optimization... It might be useful in specific cases (and with specific "normalization" functions, as you suggest) but probably should not be a feature of the default model. I will probably remove that from the next version.

sophia::graph::inmem::small would most probably not be enough for the 16 million triples.

Mind that the limitation is on the number of distinct terms, not the number of triples.

Is there a way to perform a custom normalization? I think in my case adding a - character as a delimiter could help.

Not with the normalized method, but nothing prevents you from making a function that takes a term and produces a new one using Term::new_iri_suffixed_unchecked.
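
For example, a rough sketch of such a helper (the function name split_at_last_dash is made up here, and the exact constructor signatures, including new_iri_unchecked used as the fallback, are assumptions that may differ between sophia versions):

    use sophia::term::Term;

    // Hypothetical helper (sketch): split an IRI at its last '-' so that the long
    // common prefix becomes the shared "namespace" part of a suffixed IRI term.
    // The *_unchecked constructors skip validation, so the input must already
    // be a valid IRI.
    fn split_at_last_dash(iri: &str) -> Term<String> {
        match iri.rfind('-') {
            Some(pos) => Term::new_iri_suffixed_unchecked(
                iri[..=pos].to_string(),
                iri[pos + 1..].to_string(),
            ),
            None => Term::new_iri_unchecked(iri.to_string()),
        }
    }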

Alternatively, I think storing the term strings in a trie would help, but I need to investigate if I could do that with a sophia graph.

This would be interesting, indeed. An implementation of TTerm should be possible; then an implementation of Graph using these terms internally would also be possible.

KonradHoeffner commented 2 years ago

Thanks for this feedback! This confirms that this idea of splitting the IRIs was premature optimization... It might be useful in specific cases (and with specific "normalization" functions, as you suggest) but probably should not be a feature of the default model. I will probably remove that from the next version.

For me this result was unexpected: the 16 million triples contain around 1 million unique subjects (cut -f1 -d " " qbench2.nt | sort | uniq | wc -l), and almost all of the subject occurrences (15.9 million) start with http://linkedspending.aksw.org/instance/, which is 41 characters long. Since that prefix is repeated 15 million times, I would expect sharing it to save 41 bytes minus the length of a pointer (8 bytes?) per occurrence, i.e. around 33 * 15 million bytes or 495 MB. Or am I calculating something totally wrong here? And I guess it would save something for the properties and objects as well.

Mind that the limitation is on the number of distinct terms, not the number of triples.

Good point, but I think my data contains far fewer than 246 (16M/65k) triples per subject on average, so I thought that would mean it doesn't fit. To make sure, I counted the number of distinct subjects: those alone are more than 1 million, so unfortunately 16 bits are not enough for that data.

Is there a way to perform a custom normalization? I think in my case adding a - character as a delimiter could help.

Not with the normalized method, but nothing prevents you from making a function that takes a term and produces a new one using Term::new_iri_suffixed_unchecked.

Thanks, I will try that!

Alternatively, I think storing the term strings in a trie would help, but I need to investigate if I could do that with a sophia graph.

This would be interesting, indeed. An implementation of TTerm should be possible; then an implementation of Graph using these terms internally would also be possible.

Thanks! As I am still learning Rust, I'm not sure if I am up to that task but I will try to find out!

pchampin commented 2 years ago

For me this result was unexpected: the 16 million triples contain around 1 million unique subjects (cut -f1 -d " " qbench2.nt | sort | uniq | wc -l), and almost all of the subject occurrences (15.9 million) start with http://linkedspending.aksw.org/instance/, which is 41 characters long. Since that prefix is repeated 15 million times, I would expect sharing it to save 41 bytes minus the length of a pointer (8 bytes?) per occurrence, i.e. around 33 * 15 million bytes or 495 MB. Or am I calculating something totally wrong here? And I guess it would save something for the properties and objects as well.

Remember that triples do not contain terms directly, but only indexes (u32 in the case of FastGraph and LightGraph). It does not matter in how many triples an IRI occurs: it will still be stored only once.

Normalizing IRIs in (namespace, suffix) form will only be beneficial if you have many terms with the same namespace. Even with a million of them, you save 44 MB, which is not much compared to the total 2.5 GB consumed by the graph.
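
A back-of-envelope sketch contrasting the two estimates, using only the approximate numbers quoted in this thread:

    fn main() {
        // Naive estimate: the shared prefix (41 bytes minus an 8-byte pointer)
        // saved once per subject *occurrence*, i.e. once per triple.
        let per_occurrence: u64 = 15_000_000 * (41 - 8);
        println!("~{} MB if saved per triple", per_occurrence / 1_000_000); // ~495 MB

        // Since every term string is stored only once, splitting IRIs into
        // (namespace, suffix) actually saves the prefix only once per *distinct*
        // subject: roughly 44 bytes for each of ~1 million subjects.
        let per_distinct_term: u64 = 1_000_000 * 44;
        println!("~{} MB if saved per distinct term", per_distinct_term / 1_000_000); // ~44 MB
    }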

KonradHoeffner commented 2 years ago

Maybe a representation based on the HDT (Header, Dictionary, Triples) format could save even more memory?

KonradHoeffner commented 2 years ago

After reading what the HDT people are doing, front coding seems to be a really good way to save space when storing a large number of strings. There is a crate called fcsd; here is what I got to compile successfully:

[...]
use fcsd::Set;

fn foo() {
    // Weak hash set of interned byte strings, mirroring what sophia's
    // TermIndexMapU uses internally (but with Vec<u8> instead of str).
    let mut whs: WeakHashSet<Weak<Vec<u8>>, RandomState> = WeakHashSet::new();

    // Generate ~10 million strings that share a long common prefix.
    let mut barvec: Vec<String> = Vec::new();
    for i in 1..10000000 {
        barvec.push("blablablablablabla".to_owned() + &i.to_string());
    }

    // Build a front-coded string dictionary from the keys.
    let bars: Set = Set::new(barvec).unwrap();

    // Iterate over the decoded keys and insert them (as byte strings)
    // into the weak hash set.
    for bar in bars.iter() {
        let x: Arc<Vec<u8>> = Arc::new(bar.1);
        whs.insert(x);
    }
}

However, it produces byte strings instead of the str type that sophia normally uses, so I haven't quite figured out whether it is possible to connect this to sophia and, if so, how (and whether that approach even works correctly, or whether it clones the strings somewhere).
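
One possible bridge, sketched here without having been tested against sophia's internals: convert each decoded key from Vec<u8> into an Arc<str>, accepting that the conversion itself allocates:

    use std::sync::Arc;

    // fcsd decodes keys as Vec<u8>, while sophia's shared term data is str-based.
    // This validates the bytes as UTF-8 and moves them into an Arc<str>; note that
    // the decoded key is already a fresh allocation, so this alone does not save
    // memory compared to storing the strings directly.
    fn key_to_arc_str(bytes: Vec<u8>) -> Arc<str> {
        let s = String::from_utf8(bytes).expect("fcsd key is not valid UTF-8");
        Arc::from(s)
    }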

KonradHoeffner commented 2 years ago

I am trying out terminus_store because it is based on HDT, but it uses 8.6 GB of RAM instead of Sophia's 2.5 GB.

KonradHoeffner commented 1 year ago

There is now Sophia Graph support behind a default feature in https://github.com/konradhoeffner/hdt, so I am closing this issue. However, that new version is not yet on crates.io, and so far only triples_with_s is optimized; the other queries may be slow. For example, when loading 27 GB of N-Triples as an HDT file, it takes less than 5 GB of RAM, and triples_with_s takes less than a millisecond, but triples_with_o takes around 15 seconds.
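
For reference, the per-pattern methods mentioned above belong to sophia's Graph trait, so the calling code looks the same regardless of the backing store. A sketch using the in-memory LightGraph (function name and IRI handling are illustrative only):

    use sophia::graph::{inmem::LightGraph, Graph};
    use sophia::term::RefTerm;

    // Count the triples having a given subject. triples_with_p / triples_with_o
    // work analogously; how fast they are depends on the Graph implementation.
    fn count_with_subject(graph: &LightGraph, iri: &str) -> usize {
        let s = RefTerm::new_iri(iri).expect("invalid IRI");
        graph.triples_with_s(&s).count()
    }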