pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data
Other
210 stars 23 forks source link

Handling intermediate Strings #60

Closed MattesWhite closed 4 years ago

MattesWhite commented 4 years ago

In some situations intermediate Strings are allocated and then transformed into TermData, e.g. normalization and resolving of IRIs.

Issue

I'm currently working on metis for an example for #55. I'd like to use CowTerm for my parser. This means that if an absolute IRI is parsed I have Cow::Borrowed and I would like it to remain Cow::Borrowed after resolving against the base IRI. On the other hand when a relative IRI is parsed the Cow::Borrowed should be turned into Cow::Owned after resolution. This means that I have to know if a &str comes from the original to track lifetimes or if I get a newly allocated String.

Discussion of Solutions

  1. Change signature of Resolve<Iri> (and of normalization) to:
    impl<'a, 'td, TD, TD2> Resolve<Iri<TD>, Iri<TD2>> for IriParsed<'a>
    where
        TD: 'td + TermData,
        TD2: TermData + From<&'td str> + From<String>,
    { ... }

    This, however, prevents passing in a TermFactory. So maybe it would be nice to have another Resolve.

  2. Add resolve_with():
    trait ResolveWith<S, STD, T = S, TTD = STD> {
        fn resolve_with<'td, B, O>(&self, other: &'td S, borrowed: B, owned: O) -> T 
        where
            STD: 'td + TermData,
            B: FnMut(&'td str) -> TTD,
            O: FnMut(String) -> TTD;

    This adds complexity and is maybe to much for the single use case of Cow as this is not that important for other TermData. Besides a default implementation:

    /// pseudo Rust
    impl<'td, S, STD, T, TTD> Resolve<S, T> for ResolveWith<S, STD, T, TTD>
    where
        STD: 'td,
        TTD: From<&'td str> + From<String>,
    {
        fn resolve(&self, other: &S) -> T {
            self.resolve_with(other, Into::into, Into::into)
        }
    }

    Should be possible.

What do you think? Is this case important enough to add such complexity?

pchampin commented 4 years ago

Actually, this discussion about Term<Cow<str>> has been on my mind lately, and I just pushed some work (39733c4a) in that spirit.

I think this addresses most of the points you raise above.

MattesWhite commented 4 years ago

Nice work I'll try integrate it. Maybe MownStr is worth its own crate ...?

MattesWhite commented 4 years ago

While refactoring #61 for MownStr I discovered that normalization has an equivalent issue as resolve. Should there be a similar two-way-implementation with one extra for MownStr? I think we already had a similar discussion during #49 where I switched to only Cow<str> witch you denied because it was no longer possible to use TermFactory.

pchampin commented 4 years ago

Yep, this also came to my mind. I think it would make sense to have

    pub fn normalized(&self, policy: Normalization) -> MownTerm

next to clone_normalized_with.

MattesWhite commented 4 years ago

Do you still want to have a factory in the mown-version? I think the _with(factory) methods are mostly used with TermFactory and this is another approach to avoid copies than MownStr. In addition, we would actually require two factories for MownStr: fn(&'a str) -> MownStr<'a> and fn(String) -> MownStr<'_>.

On further thinking, it may even make sense to add those two functions to TermFactory and instead of passing a closure to _with(factory) methods pass an impl TermFactory. By passing an owned String to a TermFactory it could make use of the existing allocated data to create a new Rc, Arc or whatever, saving another copy with zero costs. I think it should also be possible to then implement TermFactory for (fn(&str) -> TD, fn(String) -> TD).

Note: Requiring that fn(String) -> TD is possible would mean that &str could not be used as TD in such a context.

pchampin commented 4 years ago

Do you still want to have a factory in the mown-version?

No, of course not! I fixed it in my comment. Thanks.

it may even make sense (..) instead of passing a closure to _with(factory) methods pass an impl TermFactory

Well, not currently, because the methods in TermFactory rely on the _with(factory) methods...

By passing an owned String to a TermFactory it could make use of the existing allocated data to create a new Rc, Arc.

I thought so at first, but actually no: Rc<str>( and Arc<str>) reallocate the str data (together with its ref-counting data) (see this example).

Using Rc<String> instead of Rc<str> would make this kind of optimization possible, but might introduce additional hidden cost (fragmenting the memory more, longer deallocation...). Furthermore, the primary goal of TermFactory is to avoid allocations, and reuse an existing TermData instead, so I'm not sure about the ROI if such optimization...

pchampin commented 4 years ago

Actually, I'm currently trying some refactoring on TermFactory, which might address your concerns. Will probably push later this afternoon...

pchampin commented 4 years ago

Done, the commit referenced above, as well as aff86514, make it easier to convert terms while limiting the number of (re-)allocations. This relies largely on combining MownStr/MownTerm with the map methods you introduced in Term, which were definitely a super good idea :)

Does that fit your needs?