phillord / horned-owl

GNU Lesser General Public License v3.0

RDF Parser considered harmful (or at least the iteration part is really ugly) #98

Open phillord opened 1 month ago

phillord commented 1 month ago

The RDF parser makes multiple passes through both the bnode and simple triples, and each pass repeats essentially the same code -- something like this:

for (k, v) in std::mem::take(&mut self.bnode) {
    match v.as_slice() {
        [[_, Term::RDF(VRDF::First), val],
         [_, Term::RDF(VRDF::Rest), Term::BNode(bnode_id)],
         // Some sequences have a Type List, some do not
         ..
        ] => {
            let some_seq = self.bnode_seq.remove(bnode_id);
            if let Some(mut seq) = some_seq {
                seq.push(val.clone());
                self.bnode_seq.insert(k.clone(), seq);
                extended = true;
            } else {
                self.bnode.insert(k, v);
            }
        }
        _ => {
            self.bnode.insert(k, v);
        }
    };
}

I think we can replace this whole thing with a partition_map (or a map followed by a partition, which would avoid bringing in itertools as a dependency).

Partition map looks like this:

partition_map(Iterator<A>, fn -> Either<L, R>) -> (Iterator<L>, Iterator<R>)

L would be the parsed thing (Atom, Axiom, whatever); R would be Result&lt;A&gt;.

L would then be added to the target collection; R would be added back to whichever field of self it came from in the first place.

We would then have:

let (axiom, triples) = triples.partition_map(
    |t| match &t {
        [blah, blah, blah] => {
            // some_ok! would need to return Left or Right(Err), which should be okay
            some_ok!(Left(Axiom(blah, blah, blah)))
        }
        _ => {
            // Can we take the t here?
            Ok(Right(&t))
        }
    }
);

self.axioms.extend(axiom);
// We would have to clone the triple here because we only have a reference
self.triples.extend(triples.map(|e| e?.clone()));

// or, if we want to capture errors rather than crap out, which is something
// we can't do at the moment:
let (triples, errors) = self.triples.partition(is_okay);
self.triples.extend(triples);
self.errors.extend(errors);

which is rather neater than what we have at the moment

filippodebortoli commented 1 month ago

To get myself acquainted with the rdf::reader module, I am currently refactoring a few bits that could be replaced by e.g. deriving traits or using methods already implemented in vocab (see #97 and #99).

I agree with your observation. I am also wondering whether the code could be refactored to exploit the fact that the triple parser is streaming, by keeping a HashMap of triple sets indexed by subject and deserializing whenever a set contains, e.g., all the triples necessary to instantiate a declaration.

phillord commented 1 month ago

https://github.com/phillord/horned-owl/tree/feature/rdf-with-partition-cleanup

I've started looking at this. There are definitely some cleanups that I can do. The idea of partition_map may or may not work; so far it makes some things simpler, but a lot of the loops modify other data structures by side effect. I think I will rewrite a few more functions so I'm clear whether it makes sense or not.

The problem with streaming is that there are many places where you have to parse the whole stream first in order to interpret later triples. This is, for example, why I parse all declarations (in the entire import closure) first: many of the axiom patterns depend on knowing, for example, the kind of a property. So I suspect that streaming RDF is not going to bring a lot of benefits.