tgbugs / pyontutils

python utilities for working with ontologies
MIT License

Merging some pythonutils parts up to RDFlib #92

Open nicholascar opened 3 years ago

nicholascar commented 3 years ago

Hi @tgbugs,

In the README for ttlser you state "ttlser cannot produce deterministic results without the changes added in https://github.com/RDFLib/rdflib/pull/649". Well, 649 has been merged for some time now.

Are you interested in getting this family of serializers into RDFlib core? If they are stable and have value over the default Turtle serializer, surely it would be good to have them as serializer options in core?

Cheers, Nick

tgbugs commented 3 years ago

@nicholascar Yes, I would love to get some of these into core.

The DeterministicTurtleSerializer has a lot to offer over the default serializer. It is by far the one of most use to the community and is currently implemented as CustomTurtleSerializer https://github.com/tgbugs/pyontutils/blob/f57b7698e9ad83871401ac2c5a8dffbc583220a2/ttlser/ttlser/serializers.py#L155. The behavior is documented here https://github.com/tgbugs/pyontutils/blob/master/ttlser/docs/ttlser.md.

There is at least one known error that must be fixed before we merge. Some malformed nt graphs can induce an infinite loop here https://github.com/tgbugs/pyontutils/blob/f57b7698e9ad83871401ac2c5a8dffbc583220a2/ttlser/ttlser/serializers.py#L105. I haven't had a chance to fix that; when I do, I will also do a bit of shuffling so that it is easier to move the implementation upstream. I'm guessing they could go in rdflib/plugins/serializers/dturtle.py if we don't want to put them directly in rdflib/plugins/serializers/turtle.py, or something like that? The only reason I hesitate is the helper functions and classes that are used to implement the deterministic sort.

There are some other considerations to address, such as the default value for predicateOrder.
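The idea behind a configurable predicateOrder can be sketched in plain Python. This is an illustrative toy, not ttlser's actual implementation, and the predicate list below is made up: triples sort by subject first, then by the position of their predicate in the configured list, with unlisted predicates falling back to plain lexical order.

```python
# Hypothetical default ordering; ttlser's real predicateOrder differs.
PREDICATE_ORDER = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#label",
]

def sort_key(triple):
    s, p, o = triple
    try:
        rank = PREDICATE_ORDER.index(p)
    except ValueError:
        rank = len(PREDICATE_ORDER)  # unlisted predicates sort after listed ones
    return (s, rank, p, o)

triples = [
    ("ex:a", "http://www.w3.org/2000/01/rdf-schema#label", '"A"'),
    ("ex:a", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "owl:Class"),
    ("ex:a", "ex:other", "ex:x"),
]
# rdf:type comes first, then rdfs:label, then everything else lexically
ordered = sorted(triples, key=sort_key)
```

Because the key is a total order over the triples, the output is the same regardless of input order, which is the property the deterministic serializer relies on.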

The other one that would probably be of use is the HttpTurtleSerializer https://github.com/tgbugs/pyontutils/blob/f57b7698e9ad83871401ac2c5a8dffbc583220a2/ttlser/ttlser/serializers.py#L808. It makes it clear why newline and space are abstracted.

CompactTurtleSerializer probably should not go in. It is a completely non-standard implementation of a bad compression algorithm, built on the serializer, that I created to produce a realistic worst possible case while testing the performance of the (then new) trie based namespace system. Using a sane serialization with gzip is the right thing to do.
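As a small illustration of that point, plain readable Turtle plus gzip already compresses repetitive serializations well, with no custom compact syntax needed (the data below is made up):

```python
import gzip

# A deliberately repetitive, human-readable Turtle document.
ttl = (b"@prefix ex: <http://example.org/> .\n"
       + b"ex:s ex:p ex:o .\n" * 1000)

# Standard gzip shrinks it dramatically and remains losslessly reversible.
compressed = gzip.compress(ttl)
assert len(compressed) < len(ttl)
assert gzip.decompress(compressed) == ttl
```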

The SubClassOfTurtleSerializer is broken and not implemented correctly, so it is not ready to go in, even though it might be of use to people who would like a bit more semantic ordering than the purely syntactic ordering currently implemented by the DeterministicTurtleSerializer.

nicholascar commented 3 years ago

@tgbugs Great. Well, we (my company) have an interest here beyond just "make RDFlib better": we too use lots of version controlled turtle files as feedstock for RDF graphs and would love to see this deterministic serialiser implemented to help with that. In about 2013, I created a reasonably deterministic turtle serialiser on top of RDFlib that did some out-of-band (i.e. outside the graph) counting to order BNs etc., but it wasn't heavily tested and I've lost it! I'm happy to help with your implementation, which looks miles ahead of where I got to!

Q1: can we shift from the older @prefix style to PREFIX or perhaps provide newer style as the default (PREFIX) with a param to apply older style instead?

Q2: any idea how far off an N3 serialiser this is? I just wonder two things:

  1. in RDFlib we might need to replace both the Turtle & N3 serialiser at once, since they are dependent
  2. I would like to get into using N3 but will have to learn more about what N3 can do that Turtle can't

Q3: as above but for Trig - how close are these serializers to being able to handle Trig?

tgbugs commented 3 years ago

The more people that can make use of it the better! The only way I was able to get it to work at all was to try to come up with the most pathological test cases I could imagine, and I'm sure there are still some lurking out there.

Q1: I'm fairly certain that @prefix has to be lower case in turtle https://www.w3.org/TeamSubmission/turtle/#sec-grammar-grammar. It is certainly possible to add that as a toggle, but I don't think that would go in a pure ttl serializer. Can you point me to a document on this?

Q2: No idea. https://www.w3.org/TeamSubmission/n3/#subsets for reference. If I had to guess, literal subjects, rdf paths, rules, and formulae are all not supported. No idea what it would take to get there, since I haven't interacted with n3 much at all. Where I have interacted with it is in https://github.com/RDFLib/rdflib/blob/master/test/n3/example-lots_of_graphs.n3, and that definitely won't serialize. Minimally it would require an additional rule for ranking whole serialized graphs as well as expressions.

Q2.1: The deterministic serializer isn't something that can replace the current default serializer; it comes with a significant performance penalty as well as a significant increase in memory usage. In theory it is possible to write a ttl file to disk that is larger than memory using the standard serializer; as implemented, the current deterministic serializer cannot do that.

Q2.2: Likewise. Named graphs mainly, I think, but I try to stay away from anonymous named graphs, so not much perspective there.

Q3: Off the top of my head, no idea here either. The named graphs will definitely give it trouble. https://www.w3.org/TR/trig/#grammar-ebnf

nicholascar commented 3 years ago

OK, so we add a new serialization option, format="dturtle" or similar for deterministic turtle? Certainly easy to just add another option.

See the ticket Issue #1207 that my engineer @jamiefeiss is working on. I think this would go nicely with the deterministic serializer. You can think of RDF (dturtle) files in Git, with a Graph then being made from them for use. I haven't worked out all the little bits yet but, at the very least, the new file Store would, at start up, look for any changes in Git to see whether it needed to re-deserialize the RDF files or could just rely on a cached pickled version.

tgbugs commented 3 years ago

The options would seem to be dttl, dettl, or dturtle. I don't really have a preference, since I will continue to use nifttl, which is configured with predicates that are specific to the needs of our ontology. Maybe both dttl and dturtle? There will need to be a bit of additional guidance for getting good deterministic results, or we will need to come up with a set of reasonable defaults so that folks don't have to figure out how to subclass and register a new serializer (which can be a big hurdle).

nicholascar commented 3 years ago

Yes, I think both dttl & dturtle since there is already ttl == turtle.

We might bypass the infinite loop issue you know about with an input requirement check? Since this will be a secondary and optional Turtle serialiser, it can have some conditions of use (i.e. not handle absolutely all edge-case RDF)?
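One shape such an input requirement check could take is rejecting graphs whose blank nodes link into a cycle before the serializer ever starts sorting. A minimal sketch, assuming a simplified string representation of triples where blank nodes are spelled `_:name` (this is not ttlser's code, and the actual malformed-input condition may differ):

```python
def bnode_cycle(triples):
    """Return True if blank-node-to-blank-node links contain a cycle."""
    # Build a directed graph of bnode -> bnode edges only.
    edges = {}
    for s, p, o in triples:
        if s.startswith("_:") and o.startswith("_:"):
            edges.setdefault(s, set()).add(o)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {n: WHITE for n in edges}

    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            c = color.get(m, WHITE)
            if c == GRAY:          # back edge: we returned to the current path
                return True
            if c == WHITE:
                color[m] = WHITE   # ensure key exists before recursing
                if visit(m):
                    return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(color))

# A two-node bnode cycle would be rejected up front instead of looping forever.
cyclic = [("_:a", "ex:p", "_:b"), ("_:b", "ex:p", "_:a")]
acyclic = [("_:a", "ex:p", "_:b"), ("ex:s", "ex:p", "_:a")]
```

A serializer could raise a clear error when `bnode_cycle(...)` is true, documenting that as a stated condition of use.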

nicholascar commented 3 years ago

I'm fairly certain that @prefix has to be lower case ... Can you point me to a document on this?

See https://www.w3.org/TR/turtle/#h3_sec-iri

Currently RDFlib 5.0.0 supports both in parsing but serializes to @prefix only.

Trig...

I think serialising to Trig will be much simpler than catering for N3. I think the serializer just has to be context aware, i.e. if being called on a ConjunctiveGraph or Dataset as opposed to a Graph, then it needs to Turtle each graph, wrap each in <GRAPH-URI> { ... }, and aggregate all PREFIX statements at the top.
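That wrapping step can be sketched at the string level (purely illustrative; a real implementation would operate on parsed graphs, and the function name and inputs here are made up):

```python
def to_trig(prefixes, graphs):
    """prefixes: dict prefix -> namespace IRI;
    graphs: dict graph IRI -> already-serialized Turtle body (no prefixes)."""
    # Aggregate all PREFIX declarations once, at the top.
    lines = [f"PREFIX {p}: <{ns}>" for p, ns in sorted(prefixes.items())]
    lines.append("")
    # Wrap each graph's Turtle body in <GRAPH-URI> { ... }.
    for iri in sorted(graphs):
        lines.append(f"<{iri}> {{")
        lines.append(graphs[iri].rstrip())
        lines.append("}")
    return "\n".join(lines) + "\n"

trig = to_trig({"ex": "http://example.org/"},
               {"http://example.org/g1": "ex:s ex:p ex:o ."})
```

Sorting the graph IRIs keeps the named-graph blocks themselves in a deterministic order, matching the spirit of the serializer being discussed.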

nicholascar commented 3 years ago

@tgbugs how does the deterministic serializer relate/not relate to RDF normalization algorithms like https://json-ld.github.io/normalization/spec/? I always assumed that if a graph could be converted into normalized form then deterministic serialization, in Turtle or JSON-LD or RDF/XML etc., would be "easy", i.e. the serialiser could just "start at the start and go to the end".

tgbugs commented 3 years ago

@nicholascar The deterministic ttl serializer was written before I was thinking about normalization or graph identity at all. I also didn't manage to become aware of the normalization algorithm until fairly recently, at which point I had already implemented https://github.com/tgbugs/pyontutils/blob/master/pyontutils/identity_bnode.py, which provides mostly stable identities for any bnode or graph. When I was first implementing ibnode I tried to reuse the deterministic serialization code; however, it was very difficult to adapt, so I wrote the identity from scratch. I haven't had an opportunity to compare ibnode and the deterministic serializer against the rdf normalization spec, as I only became aware of it when I started to dig into json-ld in the last year. It is likely that the ibnode implementation is quite similar in some respects, but I can't say more than that without comparing the algorithms more closely.

What I can say is that just having some normalized form is no guarantee that you can serialize the graph in the way you want. The deterministic serializer works with any deterministic total ordering on the triples in the graph, which is part of why I was working on the subClassOf serializer.

The issue with trying to go from the normalized form of a graph is that it provides only one total ordering that may or may not be the one that you want. The normalization approach only deals with the normalized form of the graph and as far as I can tell doesn't provide any guidance about what to do about the compacted form. The deterministic serializer was intentionally designed to handle the compacted form since that is what humans end up seeing in version control systems and having the compact identifiers subject to expanded ordering is extremely confusing.

In terms of starting at the start and going to the end: yes, if you already have everything in order. But if it is not in the order you want, you will have to reorder it. In my original use case I was accounting for the fact that any change to the prefixes could (and often does) result in a reordering. If you don't care what total order your triples are in, you could store them sorted by the normalization, and then serialization would incidentally be deterministic. The issue is what happens if someone changes something in the implementation that has a side effect of changing the serialization ordering.
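A tiny example of why the compacted versus expanded distinction above matters: the same subjects can sort in opposite orders depending on which form the serializer compares, so a human reading the compacted file sees an apparently arbitrary order if the sort ran on expanded IRIs. The namespaces here are deliberately contrived:

```python
# Prefix labels chosen so CURIE order and IRI order disagree.
prefixes = {"aaa": "http://zzz.example/", "zzz": "http://aaa.example/"}

def expand(curie):
    prefix, local = curie.split(":")
    return prefixes[prefix] + local

curies = ["zzz:x", "aaa:x"]
by_curie = sorted(curies)            # compacted form: 'aaa:x' first
by_iri = sorted(curies, key=expand)  # expanded form: 'zzz:x' first
```

Any edit to the prefix block can therefore reorder the whole file when sorting on expanded IRIs, which is the behavior the deterministic serializer was designed to avoid by ordering on the compacted form.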

tgbugs commented 2 years ago

@nicholascar I finally got around to fixing the issues with the deterministic serializer and cut a release https://github.com/tgbugs/pyontutils/releases/tag/ttlser-1.1.4.

I think we can revisit the integration question now.

nicholascar commented 2 years ago

@tgbugs: @aucampia is dead keen to improve, perhaps even replace, the serializers, so please communicate with him on this one. Also, I have a student working on Turtle-star processing right now (work in https://github.com/RDFLib/rdflib-rdfstar/) where he's using a Lark-based parser for Turtle-star which then converts Turtle-star to Turtle. It's probably sensible to move to a Lark-based parser for all Turtle as well as Turtle-star.