pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

Turtle parsing does not split namespace and suffix #106

Closed hoijui closed 7 months ago

hoijui commented 2 years ago

(sophia 0.7.0)

Using this code:

extern crate sophia;

use sophia::graph::{*, inmem::FastGraph};
use sophia::parser::turtle;
use sophia::term::TTerm;
use sophia::triple::stream::TripleSource;
use sophia::triple::Triple;

fn main() -> Result<(), Box<dyn std::error::Error>> {

    let example = r#"
        @prefix : <http://example.org/>.
        @prefix foaf: <http://xmlns.com/foaf/0.1/>.

        :alice foaf:name "Alice";
            foaf:mbox <mailto:alice@work.example> .

        :bob foaf:name "Bob".
    "#;
    let graph: FastGraph = turtle::parse_str(example).collect_triples()?;

    for triple_res in graph.triples() {
        let triple = triple_res?;
        let subj = triple.s();
        println!("{:?}", subj.value_raw());
    }

    Ok(())
}

I get this output (ns, suffix):

RawValue("http://example.org/alice", None)
RawValue("http://example.org/bob", None)
RawValue("http://example.org/alice", None)

while I would expect this:

RawValue("http://example.org/", Some("alice"))
RawValue("http://example.org/", Some("bob"))
RawValue("http://example.org/", Some("alice"))

... thanks for sophia btw.! :-) Thanks to you (helping me with an issue I had as a complete beginner in a discussion a while back), I kept on using rust the last few months, and now like it a lot! :-)

pchampin commented 2 years ago

I understand why you would expect this result, and I take that as a sign that the API should be better documented. But actually, the result you get is correct:

The TTerm trait is meant to represent terms in the abstract syntax of RDF. So for an IRI, the fact that it was parsed from a full IRI, a prefixed name, or something else (e.g. the a keyword in Turtle) is not relevant at this stage. This is reflected by the fact that term equality, for IRIs, does not consider how the raw value is actually split: an IRI encoded as RawValue("http://example.org/alice", None) is equal to another IRI encoded as RawValue("http://example.org/", Some("alice")).
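To illustrate split-independent equality, here is a minimal sketch. The `SplitIri` type is hypothetical and is not Sophia's RawValue or any Sophia type; it only shows how two differently split encodings of the same IRI can compare equal without concatenating the strings.

```rust
/// Hypothetical stand-in for an IRI stored as a namespace plus an optional
/// suffix. This is NOT a Sophia type, only an illustration of the idea.
#[derive(Debug, Clone)]
pub struct SplitIri {
    pub ns: String,
    pub suffix: Option<String>,
}

impl SplitIri {
    pub fn new(ns: &str, suffix: Option<&str>) -> Self {
        SplitIri {
            ns: ns.to_string(),
            suffix: suffix.map(str::to_string),
        }
    }

    /// Iterate over the bytes of the full IRI without allocating a new string.
    fn bytes(&self) -> impl Iterator<Item = u8> + '_ {
        let suffix = self.suffix.as_deref().unwrap_or("");
        self.ns.bytes().chain(suffix.bytes())
    }
}

/// Two IRIs are equal when their full text is equal, wherever the split falls.
impl PartialEq for SplitIri {
    fn eq(&self, other: &Self) -> bool {
        self.bytes().eq(other.bytes())
    }
}

impl Eq for SplitIri {}
```

With this definition, `SplitIri::new("http://example.org/alice", None)` and `SplitIri::new("http://example.org/", Some("alice"))` compare equal, which mirrors the behaviour described above.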

So why are they encoded internally as a pair of strings, you may ask? This is to allow some implementations to reduce their memory footprint by taking advantage of the fact that many IRIs share a common prefix. This could indeed be leveraged by a Turtle parser when parsing PNAMEs, but it so happens that the Rio parser, which Sophia uses, does not do that at the moment.
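As a rough illustration of the memory argument, the sketch below shares one allocation per namespace among all IRIs that use it. `NamespacePool` is a hypothetical name chosen for this example; it is an assumption about how such sharing could be done, not a description of how any Sophia implementation does it.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical pool that hands out one shared allocation per namespace.
pub struct NamespacePool {
    seen: HashSet<Rc<str>>,
}

impl NamespacePool {
    pub fn new() -> Self {
        NamespacePool { seen: HashSet::new() }
    }

    /// Return a shared handle to `ns`, allocating only on first sight.
    pub fn intern(&mut self, ns: &str) -> Rc<str> {
        if let Some(existing) = self.seen.get(ns) {
            return Rc::clone(existing);
        }
        let fresh: Rc<str> = Rc::from(ns);
        self.seen.insert(Rc::clone(&fresh));
        fresh
    }

    /// Number of distinct namespaces stored so far.
    pub fn len(&self) -> usize {
        self.seen.len()
    }
}
```

Each term would then hold a cheap `Rc<str>` for its namespace plus its own short suffix, so a graph with thousands of IRIs under `http://example.org/` stores that prefix once.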

I hope this makes sense. That being said, again, I perfectly understand where your initial expectation comes from. I'll leave this issue open until I improve the documentation in order to avoid this misinterpretation.

hoijui commented 2 years ago

All I read out of this is... it currently is not split, and this is not a bug. ... or do you also say it should not be split (maybe for performance reasons)?

Mmm... so if I went over to rio and changed their implementation to parse IRIs split, then they would also be split here (as I expected)? Is there a good reason not to split them?

Or in other words .. how would you suggest for me to solve my issue (as I need them split)? The way I do it now, is to use a helper function/macro, which uses RawValue as is if the suffix is Some, and else splits at the last '/' or '#', each time I access the value. I guess I could transform the graph once to do that, or... change the rio implementation (if the author there would think it is a good idea, which I imagine, is not the case).
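For reference, the fallback described above (split after the last '/' or '#') can be sketched as a standalone function. The name `split_iri` is made up for this example and the function works on plain strings, not on Sophia types.

```rust
/// Split an IRI just after its last '/' or '#'.
/// Returns the whole IRI and `None` when there is no delimiter or when
/// nothing follows it.
pub fn split_iri(iri: &str) -> (&str, Option<&str>) {
    match iri.rfind(|c: char| c == '/' || c == '#') {
        // Both delimiters are one byte long, so `pos + 1` is a char boundary.
        Some(pos) if pos + 1 < iri.len() => {
            let (ns, suffix) = iri.split_at(pos + 1);
            (ns, Some(suffix))
        }
        _ => (iri, None),
    }
}
```

Note that an IRI such as mailto:alice@work.example contains neither delimiter and therefore stays unsplit under this rule.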

pchampin commented 2 years ago

Let me rephrase, as I was apparently not clear enough: implementors are free to split IRIs or not, depending on their own preferences. Split IRIs tend to consume less memory (the namespace part can be shared), but sometimes it is faster to generate non-split IRIs (this is more or less why Rio does not split them). But this is implementation-specific -- the contract for parsers does not impose any "splitting policy" on implementations.

I need them split

Do you mean "I need them split exactly as they were in the original Turtle file"? If this is the case, then Sophia will not help you, because it makes a clear separation between concrete syntax and abstract syntax. In other words, an implementation of Graph is not required to keep any information from the concrete syntax from which it was parsed (e.g. which syntax was used, which prefixes were declared, or how a given IRI was spelled in the original file).

On the other hand, if you need a "sensible" split, which might be different from the split in the original Turtle file, then you can use the TTerm implementation from sophia_term, which has a method normalize that can be used to force a specific internal structure on IRIs. So for each term t that gets out of your parser, running something like:

    let t2 = RefTerm::from(&t);
    let t3 = t2.normalize(Normalization::LastGenDelim);
    let split = t3.value_raw();

will give you a raw value with a namespace and a suffix.

hoijui commented 2 years ago

If I read the rio code correctly, then it does not support namespace+suffix at all, only a single IRI string: https://docs.rs/rio_api/0.6.1/rio_api/model/struct.NamedNode.html

So it could not possibly supply it split. ... or does it have special sophia targeted extensions?

Phew... thanks for all your help! :-) But I must say... even after having used many other crates, and generally having more Rust experience, sophia is still... cumbersome. So many types that, for a newcomer, seem to represent (almost) the same thing and contain the same data, but cannot be easily converted to each other in an obvious way (probably for good reasons). For example, RawValue and SimpleIri: I need the latter, but get the former from the code you showed me just now. I can deconstruct one and recreate the other... but it was surely not meant that way... do I really need a SimpleIri? ...

That book you mentioned is sorely needed, I think! :-) more examples!

pchampin commented 2 years ago

About Rio: indeed, currently, Rio cannot easily expose the "original" split. A change in the code would be needed, which I considered contributing at some point... but again, I do not consider it a strong requirement for Sophia, because of the separation of concerns (concrete syntax vs. abstract syntax).

About the complexity of Sophia: I understand that Sophia can be overwhelming, because it aims not only to provide a library to deal with RDF. Its more ambitious goal is to provide a common API to allow multiple RDF libraries to interoperate -- as well as building blocks implementing this API. Different implementations of the same trait (e.g. TTerm or Graph) provide different trade-offs (between speed, memory footprint...). You are totally right: the book should become a priority to help newcomers find their way among this mess...

About RawValue and SimpleIri: the reason why it is not trivial to build the latter from the former is that SimpleIri has a more restrictive contract than RawValue. Namely, SimpleIri must contain a valid IRI, while RawValue is just a pair of arbitrary strings. If you know that the RawValue contains a valid IRI (because you checked the kind of the term you got it from), you can still use SimpleIri::new_unchecked. The explicit use of this method is required, as it documents this assumption that you make.
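The checked/unchecked constructor pattern mentioned here can be sketched in isolation. `CheckedIri` and `looks_like_iri` are hypothetical names, and the validity test is a deliberately simplistic assumption (a scheme followed by ':' and no whitespace or angle brackets); it is much weaker than real IRI validation and is not what SimpleIri does internally.

```rust
/// Hypothetical newtype whose contract is "contains a valid IRI".
/// This is NOT Sophia's SimpleIri.
#[derive(Debug, PartialEq)]
pub struct CheckedIri(String);

impl CheckedIri {
    /// Validating constructor: rejects strings that fail the (toy) check.
    pub fn new(value: &str) -> Result<Self, String> {
        if looks_like_iri(value) {
            Ok(CheckedIri(value.to_string()))
        } else {
            Err(format!("not a valid IRI: {:?}", value))
        }
    }

    /// Skips validation. The caller asserts that `value` is a valid IRI,
    /// and the explicit name documents that assumption at the call site.
    pub fn new_unchecked(value: &str) -> Self {
        CheckedIri(value.to_string())
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Toy check, assumed for illustration only: "scheme:rest" with an
/// alphabetic-leading scheme and no whitespace, '<', '>' or '"'.
fn looks_like_iri(value: &str) -> bool {
    let scheme = match value.split_once(':') {
        Some((scheme, _rest)) => scheme,
        None => return false,
    };
    let mut chars = scheme.chars();
    let scheme_ok = chars.next().map_or(false, |c| c.is_ascii_alphabetic())
        && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'));
    scheme_ok && !value.chars().any(|c| c.is_whitespace() || matches!(c, '<' | '>' | '"'))
}
```

The point of the design is that arbitrary strings must go through `new`, while code that already knows the string came from an IRI term can opt out of the check explicitly.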

Final remark: I am still not sure why you need the IRIs to be split in a particular way. Again, this splitting is an implementation detail that users of the API should not be concerned with. As indicated by the documentation, TTerm::value_raw should only be used in performance-critical code where the allocation performed (sometimes) by TTerm::value must absolutely be avoided.

In any case, thanks a lot for your feedback. It is important for me to know how people use Sophia, to improve it accordingly.

pchampin commented 7 months ago

Closing this issue: the refactoring on the Term trait in v0.8 makes a lot of the discussion moot.