pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data
Other
210 stars 23 forks source link

How to replace Arc and Rc Terms in 0.8.0-alpha? #128

Closed KonradHoeffner closed 9 months ago

KonradHoeffner commented 1 year ago

According to https://github.com/pchampin/sophia_rs/blob/main/book/src/90_changes_since_07.md, terms with Arc and Rc are not provided anymore. However I am worrying about a drastic increase in memory consumption when switching from Rc and Arc to Box, for example for triples_with_po, where I create triples as follows:

[...]
// inside a loop
    [Term::new_iri_unchecked(self.hdt.dict.id_to_string(sid, &IdKind::Subject).unwrap()),
    p.clone(),
    o.clone()]

For example when I have 3 million triples, when p and o are Terms of Rc or Arc, the clone is cheap. However when they are based on Box, if I understand it correctly, the Strings are copied 2 million times, which will make the code much slower and use much more memory. Or am I making an incorrect assumption here? I can't use &str because in HDT the strings are compressed so there is nowhere to reference them from.

P.S.: Is this what MownStr is for? Should I use the &str Variant of Mownstr for p and o and the Box variant for s?

pchampin commented 1 year ago

Arc<str> and Rc<str> not only provide you cheap clone -- they also provide you shared ownership (and that is actually their main purpose). In my experience, this shared ownership is generally not required in Graph implementations. Most of the time, terms are stored in one place (like the Dictionnary of HDT), and then borrowed from that place. Simple references are also cheap to clone (even cheaper than Rc<str> and Arc<str>).

So yes, your best options is probably to use the &str variant of MownStr -- or &str directly, even.

pchampin commented 1 year ago

Thinking more about this, I realize that my advice above does not work for you. In HDT, your dictionnary contains the text in compressed form, so the terms you "export" can not borrow from it directly. They need to own their copy of the (uncompressed) data. So the &str variant of MownStr is not an option for you. And the box variant is, indeed, costly to clone.

So I stand corrected, this is a valid scenario where Rc<str> or Arc<str> would be useful.

Note that nothing prevents you from including your own implementation of Term based on Rc<str> or Arc<str> -- this is what traits are for... This should not be too difficult as long as you don't need to support quoted triples (which , luckily, you don't, for HDT).

But that makes me reconsider my decision. I might reintroduce the crate sophia_term, with updated versions of RcTerm and ArcTerm...

KonradHoeffner commented 1 year ago

Actually I think you were right in the first place for ?PO patterns, as I'm now using MownStr::from_str(&str) for the predicate and object and clone that on each iteration, which if I understand it correctly, is cheap to clone (so basically just return the input), and MownStr::from(String) for the each subject, which is always different.

It could theoretically be useful to have Rc and Arc for ??? and maybe S??, ?P? and ??O patterns, but I don't deduplicate those at the moment anyways because this wouldn't work well with iterators, as the iterator does the translation from ID to dictionary item.

Theoretically I could build some kind of non-iterator based way, which always returns a vector, and which has some kind of internal caching mechanism, which would be complicated and probably only be worth it for some very specific applications, so I don't plan on doing that right now.

pchampin commented 1 year ago

Actually I think you were right in the first place for ?PO patterns, as I'm now using MownStr::from_str(&str) for the predicate and object and clone that on each iteration, which if I understand it correctly, is cheap to clone (so basically just return the input), and MownStr::from(String) for the each subject, which is always different.

yes, MownStr are cheap to clone when they borrow data (as opposed to owning it). So you whould use a borrowing MownStr everytime you can.

It could theoretically be useful to have Rc and Arc for ??? and maybe S??, ?P? and ??O patterns,

Indeed

but I don't deduplicate those at the moment anyways because this wouldn't work well with iterators, as the iterator does the translation from ID to dictionary item.

See https://github.com/KonradHoeffner/hdt/pull/28/commits/fd42614be41c0dadaffa7c3ab03907a6e8c45078 , where I propose a solution that works well enough with iterators.

KonradHoeffner commented 1 year ago

Oh I didn't notice the pull request, thanks! I just tried another way of "caching" the last value with the ??O pattern (in hdt.rs first, but should be applicable for hdt_graph.rs as well), which speeds it up by ~ 21% but I think my approach is not optimal because there are lot of clones from an owned MownStr:

 (None, None, Some(o)) => {
                let mut last_t = TripleId::new(0, 0, 0); 
                let mut last_s = MownStr::from("");
                let mut last_p = MownStr::from("");
                Box::new(ObjectIter::new(&self.triples, o.1).map(move |t| {                            
                    let s = if last_t.subject_id == t.subject_id {
                        last_s.clone()
                    } else {
                        last_s = MownStr::from(self.dict.id_to_string(t.subject_id, &IdKind::Subject).unwrap());
                        last_s.clone()
                    };                                                                                 
                    let p = if last_t.predicate_id == t.predicate_id {
                        last_p.clone()
                    } else {
                        last_p =
                            MownStr::from(self.dict.id_to_string(t.predicate_id, &IdKind::Predicate).unwrap());
                        last_p.clone()
                    };
                    last_t = t;
                    (s, p, o.0.clone())
                }))
            }

I thought cloning an owned MownStr would transform it into a borrowed MownStr that borrows the data from the first one but I realize now that that's not how it works and I will check out your solution. By the way, is there any way to clone an owned MownStr into a borrowed MownStr? I guess this doesn't work because of lifetimes and that is exactly what Rc/Arc is for?

KonradHoeffner commented 1 year ago

I'm beginning to wonder whether triples_with_pattern is just the wrong abstraction for HDT because MownStr seems great for constants while Arc seems good for variables. Maybe a different return type would be better , for example a trait for "something that dereferences to a str"?

damooo commented 1 year ago

Also ArcTerm was more useful when we have to stream quads like Stream<Item = ArcTerm>. Then we need not to allocate box for every statement with same subject (common case for resource descriptions.) or same predicate (Common in listings).

KonradHoeffner commented 1 year ago

What do you think about using the Borrow trait as Borrow<str>, would that solve our problems? Or an enum that can be either an &str or an Arc/Rc?

pchampin commented 9 months ago

FTR, with 0.8.0-alpha.2, sophia_term is back:

https://docs.rs/sophia_term/0.8.0-alpha.2/sophia_term/index.html

It is very different from what it used to be in 0.7, but it does provide an Arc<str> based implementation of Term, called ArcTerm, as well as RcTerm.