Closed KonradHoeffner closed 1 year ago
You are correct: the string for each term is only stored once, and the triples are stored as 3 integers.
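The scheme described there can be sketched in a few lines of plain Rust (a toy model for illustration, not sophia's actual implementation): every distinct term string is interned once, and each triple is just three integer indexes into the term table.

```rust
use std::collections::HashMap;

/// Toy term-indexed triple store: each distinct term string is stored
/// once, and a triple is three u32 indexes into the term table.
/// (Illustrative only; sophia's real structures also keep extra indexes.)
#[derive(Default)]
struct IndexedGraph {
    index: HashMap<String, u32>,
    terms: Vec<String>,
    triples: Vec<[u32; 3]>,
}

impl IndexedGraph {
    fn intern(&mut self, term: &str) -> u32 {
        if let Some(&i) = self.index.get(term) {
            return i; // already stored once; reuse its index
        }
        let i = self.terms.len() as u32;
        self.terms.push(term.to_string());
        self.index.insert(term.to_string(), i);
        i
    }

    fn insert(&mut self, s: &str, p: &str, o: &str) {
        let t = [self.intern(s), self.intern(p), self.intern(o)];
        self.triples.push(t);
    }
}

fn main() {
    let mut g = IndexedGraph::default();
    g.insert("ex:alice", "ex:knows", "ex:bob");
    g.insert("ex:bob", "ex:knows", "ex:alice");
    // 3 distinct terms, 2 triples of three u32 indexes each
    println!("{} terms, {} triples", g.terms.len(), g.triples.len());
}
```

This is why an IRI's length is paid only once no matter how many triples mention it.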
> P.S.: The URIs also have long common substrings, is it possible to save memory by saving the URIs in some kind of trie?
You could do that by ensuring that all IRIs are "normalized" with a cut at the last "gen-delim" character (usually / or #). That way, the common prefix will only be stored once. Example:
```rust
let mut graph = LightGraph::new();
nt::parse_str(example).for_each_triple(|t| {
    graph.insert(
        &RefTerm::from(t.s()).normalized(Normalization::LastGenDelim),
        &RefTerm::from(t.p()).normalized(Normalization::LastGenDelim),
        &RefTerm::from(t.o()).normalized(Normalization::LastGenDelim),
    ).unwrap();
})?;
```
I never tested that this storage strategy was worth the trouble, so I am very interested in your feedback here.
Also, depending on how many distinct terms you have in your graph, it might be worth considering `sophia::graph::inmem::small`, which uses smaller integers.
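As a rough back-of-envelope sketch of what smaller indexes buy (assuming three `u32` indexes per triple in the default graphs, and `u16` in the small variant; the latter is my assumption, and the real graphs keep additional lookup indexes on top of this):

```rust
fn main() {
    let triples: u64 = 16_000_000;
    let per_triple_u32: u64 = 3 * 4; // three u32 indexes per triple (FastGraph/LightGraph)
    let per_triple_u16: u64 = 3 * 2; // assuming inmem::small uses u16 indexes
    let saved = triples * (per_triple_u32 - per_triple_u16);
    println!("~{} MB saved on the triple index arrays alone", saved / 1_000_000);
}
```

So on the triple arrays alone the small variant would save on the order of 100 MB for 16 million triples, provided the distinct terms actually fit in 16 bits.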
Unfortunately it only reduces memory usage from 2.568 GiB (2.569 on the second try) to 2.524 GiB (2.525 on the second try), i.e. by about 44 MB, which could almost be a measuring error. `sophia::graph::inmem::small` would most probably not be enough for the 16 million triples.
Is there a way to perform a custom normalization? I think in my case adding a `-` character as a delimiter could help.
Alternatively, I think storing the term strings in a trie would help, but I need to investigate if I could do that with a sophia graph.
> Unfortunately it only reduces memory usage from 2.568 GiB (2.569 on the second try) to 2.524 GiB (2.525 on the second try), i.e. by about 44 MB, which could almost be a measuring error.
Thanks for this feedback! This confirms that this idea of splitting the IRIs was premature optimization... It might be useful in specific cases (and with specific "normalization" functions, as you suggest) but probably should not be a feature of the default model. I will probably remove that from the next version.
> `sophia::graph::inmem::small` would most probably not be enough for the 16 million triples.
Mind that the limitation is on the number of distinct terms, not the number of triples.
> Is there a way to perform a custom normalization? I think in my case adding a `-` character as a delimiter could help.
Not with the `normalized` method, but nothing prevents you from making a function that takes a term and produces a new one using `Term::new_iri_suffixed_unchecked`.
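For illustration, the splitting logic such a custom function would need can be written without sophia at all (a hypothetical helper; the function name and delimiter choice are mine, and the sophia API for building the suffixed term is not shown):

```rust
/// Split an IRI into (namespace, suffix) at the last occurrence of any
/// of the given delimiter characters. A custom normalization for sophia
/// terms would compute this split and then build a suffixed IRI term
/// from the two parts. Assumes ASCII delimiters.
fn split_at_last_delim<'a>(iri: &'a str, delims: &[char]) -> (&'a str, &'a str) {
    match iri.rfind(|c| delims.contains(&c)) {
        Some(i) => iri.split_at(i + 1), // the delimiter stays in the namespace part
        None => (iri, ""),
    }
}

fn main() {
    let iri = "http://linkedspending.aksw.org/instance/observation-2012-xyz";
    // Standard gen-delims only: the cut happens after the last '/'.
    let (ns, sfx) = split_at_last_delim(iri, &['/', '#']);
    println!("{ns} | {sfx}");
    // Adding '-' as a delimiter moves the cut further right, so a longer
    // prefix can be shared between terms.
    let (ns2, sfx2) = split_at_last_delim(iri, &['/', '#', '-']);
    println!("{ns2} | {sfx2}");
}
```

The second split shares the longer `...observation-2012-` prefix, which is exactly the effect the `-` delimiter is meant to achieve.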
> Alternatively, I think storing the term strings in a trie would help, but I need to investigate if I could do that with a sophia graph.
This would be interesting, indeed. An implementation of `TTerm` should be possible, and then an implementation of `Graph` using these terms internally would also be possible.
> Thanks for this feedback! This confirms that this idea of splitting the IRIs was premature optimization... It might be useful in specific cases (and with specific "normalization" functions, as you suggest) but probably should not be a feature of the default model. I will probably remove that from the next version.
For me this result was unexpected, as the 16 million triples contain around 1 million unique subjects (`cut -f1 -d " " qbench2.nt | sort | uniq | wc -l`), and almost all of them (15.9 million occurrences) start with `http://linkedspending.aksw.org/instance/`, which contains 41 characters. So if that prefix is saved 15 million times, each occurrence should save 41 bytes minus the length of a pointer (8 bytes?), and I would expect it to save around 33 × 15 million bytes, or 495 MB. Am I calculating something totally wrong here? And I guess it would save something for the properties and objects as well.
> Mind that the limitation is on the number of distinct terms, not the number of triples.
Good point, but I think my data contains far fewer than 246 (16 M / 65 k) triples per subject on average, so I suspected it wouldn't fit. To make sure, I counted the distinct subjects: those alone number more than 1 million, so unfortunately 16 bits are not enough for that data.
> Is there a way to perform a custom normalization? I think in my case adding a `-` character as a delimiter could help.

> Not with the `normalized` method, but nothing prevents you from making a function that takes a term and produces a new one using `Term::new_iri_suffixed_unchecked`.
Thanks, I will try that!
> Alternatively, I think storing the term strings in a trie would help, but I need to investigate if I could do that with a sophia graph.

> This would be interesting, indeed. An implementation of `TTerm` should be possible, and then an implementation of `Graph` using these terms internally would also be possible.
Thanks! As I am still learning Rust, I'm not sure if I am up to that task but I will try to find out!
> For me this result was unexpected, as the 16 million triples contain around 1 million unique subjects (`cut -f1 -d " " qbench2.nt | sort | uniq | wc -l`), and almost all of them (15.9 million occurrences) start with `http://linkedspending.aksw.org/instance/`, which contains 41 characters. So if that prefix is saved 15 million times, each occurrence should save 41 bytes minus the length of a pointer (8 bytes?), and I would expect it to save around 33 × 15 million bytes, or 495 MB. Am I calculating something totally wrong here? And I guess it would save something for the properties and objects as well.
Remember that triples do not contain terms directly, but only indexes (`u32` in the case of `FastGraph` and `LightGraph`). It does not matter in how many triples an IRI occurs; it will still be stored only once.

Normalizing IRIs in (namespace, suffix) form will only be beneficial if you have many terms with the same namespace. Even with a million of them, you save 44 MB, which is not much compared to the total 2.5 GB consumed by the graph.
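The arithmetic behind that estimate can be sketched as follows (the exact per-term bookkeeping in sophia differs, so treat the constants as assumptions; the point is that savings scale with distinct terms, not with triples):

```rust
fn main() {
    let distinct_terms: u64 = 1_000_000; // ~1 million distinct subjects, not 16 million triples
    let prefix_len: u64 = 41;            // "http://linkedspending.aksw.org/instance/"
    let reference_cost: u64 = 8;         // rough per-term cost of referencing the shared namespace
    let saved = distinct_terms * (prefix_len - reference_cost);
    println!("~{} MB", saved / 1_000_000); // tens of MB, against a 2.5 GB total
}
```

The 495 MB figure came from multiplying by triple occurrences; since each term string is stored once, only the ~1 million distinct subjects contribute, which lands in the tens of megabytes.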
Maybe a representation based on the HDT (Header, Dictionary, Triples) format could save even more memory?
After reading what the HDT people are doing, front coding seems to be a really good way to save space when storing a large number of strings. There is a crate called fcsd; here is what I got to compile successfully:
```rust
// [...]
use fcsd::Set;

fn foo() {
    // Weak set for deduplicating the decoded byte strings.
    let mut whs: WeakHashSet<Weak<Vec<u8>>, RandomState> = WeakHashSet::new();
    let mut barvec: Vec<String> = Vec::new();
    for i in 1..10000000 {
        barvec.push("blablablablablabla".to_owned() + &i.to_string());
    }
    // Build the front-coded dictionary from the strings.
    let bars: Set = Set::new(barvec).unwrap();
    for bar in bars.iter() {
        let x: Arc<Vec<u8>> = Arc::new(bar.1);
        whs.insert(x);
    }
}
```
However, it generates byte strings instead of the `str` type that sophia normally uses, so I haven't quite figured out whether it is possible to connect that to sophia and, if so, how (and whether that approach even works correctly, or clones the strings somewhere).
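Independent of fcsd's API, the front-coding idea itself can be sketched in a few lines (a toy illustration; fcsd's real format adds bucketing and byte-level packing on top of this):

```rust
/// Toy front coding: each string in a sorted list is stored as the length
/// of the prefix it shares with the previous string, plus its remaining
/// suffix. Assumes ASCII input so byte and char boundaries coincide.
fn front_encode(sorted: &[&str]) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    let mut prev = "";
    for s in sorted {
        let lcp = prev.bytes().zip(s.bytes()).take_while(|(a, b)| a == b).count();
        out.push((lcp, s[lcp..].to_string()));
        prev = s;
    }
    out
}

fn main() {
    let iris = [
        "http://linkedspending.aksw.org/instance/2011",
        "http://linkedspending.aksw.org/instance/2012",
        "http://linkedspending.aksw.org/ontology/Observation",
    ];
    // The long shared prefix is materialized only for the first entry;
    // later entries keep only their differing tail.
    for (lcp, suffix) in front_encode(&iris) {
        println!("({lcp}, {suffix:?})");
    }
}
```

For IRIs with long common prefixes like the ones above, most entries shrink to a counter plus a short tail, which is the effect the HDT dictionary exploits.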
I am trying out `terminus_store` because it is based on HDT, but it uses 8.6 GB of RAM instead of Sophia's 2.5 GB.
There is now Sophia graph support behind a default feature in https://github.com/konradhoeffner/hdt, so I am closing this issue. However, that new version is not yet on crates.io, and so far only `triples_with_s` is optimized; the other queries may be slow. For example, when loading 27 GB of N-Triples as an HDT file, it takes less than 5 GB of RAM, and `triples_with_s` takes less than a millisecond but `triples_with_o` takes around 15 seconds.
When I load an N-Triples file that contains ~16 million triples and that is ~181 MB ZIP-compressed and 3.6 GB uncompressed, I need 4.9 GB of RAM with a `FastGraph` (according to Docker stats) and 2.6 GB of RAM with a `LightGraph`.

Is there a way to reduce the memory consumption further, for example to less than 1 GB? The file contains very long URIs that occur multiple times, so I assumed this would use much less memory, as Sophia seems to represent triple subjects with integers; at least that is what it looks like in https://docs.rs/sophia/latest/sophia/graph/inmem/type.LightGraph.html for me as a non-expert in Rust.

For example, these are the first lines of the file: