rdfjs / N3.js

Lightning fast, spec-compatible, streaming RDF for JavaScript
http://rdf.js.org/N3.js/

Datafactory errors not propagated properly by StreamParser #308

Open LaurensRietveld opened 1 year ago

LaurensRietveld commented 1 year ago

If a custom datafactory throws an error then the StreamParser does not gracefully emit this error.

A unit test to reproduce (I couldn't figure out how to style my code to satisfy the linter, so I'm copy/pasting it here ;) ):

    const { Readable } = require('stream');
    const { DataFactory, StreamParser } = require('n3');

    (done) => {
      // Copy the factory so the shared DataFactory is not mutated for other tests
      const factory = { ...DataFactory };
      factory.namedNode = () => { throw new Error('Some error'); };
      Readable.from(['', '<a:a> <b:b> <c:c> .'])
        .pipe(new StreamParser({ factory }))
        .on('end', () => {
          // Reaching 'end' means the factory error was swallowed
          done(new Error('expected an error to be emitted'));
        })
        .on('error', (error) => {
          error.should.be.an.instanceof(Error);
          error.message.should.equal('Some error');
          done();
        });
    }

Use case: custom validation that happens in a datafactory.

RubenVerborgh commented 1 year ago

Thanks, @LaurensRietveld. I think you must be the most prolific N3.js bug finder by now.

RubenVerborgh commented 1 year ago

I worry that the performance cost of wrapping error handling around factories is too high. I think we should be able to trust factories, especially given that N3.js will have pre-validated the URL. In what cases are factories erroring?

LaurensRietveld commented 1 year ago

The validation we'd like to perform includes validation of literals (based on their datatype) and validation of IRIs. Applying such validation in the datafactory instead of somewhere else (e.g., in a stream after parsing) lets us use the RDF-JS ecosystem of tools by just using a different datafactory, instead of having to wrap all other RDF-JS tools with a custom pre/post-processing step (kind of defeating the purpose of RDF-JS ;) )
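
For illustration, a minimal sketch of such a validating factory, assuming N3's exported DataFactory; the WHATWG URL check here is only a rough stand-in for full RFC 3987 validation:

    const { DataFactory } = require('n3');

    const validatingFactory = {
      ...DataFactory,
      namedNode(iri) {
        // Rough stand-in for RFC 3987 validation: require a parseable absolute IRI
        try { new URL(iri); }
        catch { throw new Error(`Invalid IRI: ${iri}`); }
        return DataFactory.namedNode(iri);
      },
    };

Any RDF-JS tool that accepts a factory then applies the same checks; how an error thrown here surfaces from the StreamParser is exactly what this issue is about.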

jeswr commented 1 year ago

To quickly jump in with my 2 cents: I think there is definitely value in having validation, but it should be opt-in, given that there are cases in which it is safe to assume the data has already come from a validated source.

I also think there is value in having the API for toggling this validation behavior propagated to the RDFJS spec level at some stage.

LaurensRietveld commented 1 year ago

If we expect a performance decrease, then I'm happy to put this behind a toggle, obviously 👍

RubenVerborgh commented 1 year ago

instead of having to wrap all other RDF-JS tools with a custom pre/post-processing step (kind of defeating the purpose of RDF-JS ;)

Well well, let's not exaggerate 🙂

So N3.js offers a parser. The job of the parser is to transform a syntactically valid serialization format into an in-memory representation.

RDF/JS is very useful as a standard model for data, and for component interfaces. And this thread only proves that usefulness. You can perfectly implement validation within the RDF/JS model without having to bend the interfaces.

The purpose of the RDF/JS DataFactory interface is to create RDF model terms for interoperable use by other components, such as validators. So if one wants to write an RDF/JS-compatible validator, they would write it such that it accepts RDF/JS and then performs its task. That way, it can accept input from any RDF/JS-compatible component.

The validation we'd like to perform includes validation of literals (based on their datatype)

That's possible. The N3.js parser has streaming output, in an RDF/JS compatible format. Or more specifically, it will output whatever tokens the factory generates. But let's put that aside for a second.

The logical way to do validation is not to interfere with the parser (whose only job is transforming Turtle into a valid in-memory RDF model, which all of its output is). Instead, chain an RDF/JS-compatible parser into an RDF/JS-compatible validator, which checks whether the model-valid RDF nodes (which the parser is guaranteed to output) also meet non-RDF validation constraints (such as whether literals have the right type).

Now if you must do this in the factory, recall that the factory is in full control of whatever it outputs. So nothing is stopping you from, for instance, outputting { valid: false, reason: 'mismatched datatype' } as a token, which you can then immediately catch afterwards.
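
A sketch of that marker-token approach, assuming N3's exported DataFactory and StreamParser; the marker shape and the xsd:integer check are just illustrative choices:

    const { DataFactory, StreamParser } = require('n3');

    const XSD_INTEGER = 'http://www.w3.org/2001/XMLSchema#integer';

    const markingFactory = {
      ...DataFactory,
      literal(value, languageOrDatatype) {
        const term = DataFactory.literal(value, languageOrDatatype);
        // Instead of throwing, return a marker token that flows through the stream
        if (term.datatype.value === XSD_INTEGER && !/^[+-]?\d+$/.test(value))
          return { termType: 'Literal', valid: false, reason: 'mismatched datatype', term };
        return term;
      },
    };

    const parser = new StreamParser({ factory: markingFactory });
    parser.on('data', (quad) => {
      // Catch marker tokens immediately after the parser
      if (quad.object.valid === false)
        console.error('Invalid literal:', quad.object.reason);
    });
    parser.end('<a:a> <b:b> "abc"^^<http://www.w3.org/2001/XMLSchema#integer> .');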

But I see no architectural or performance need to catch non-model errors in the parser. Attaching the validator as an immediate next step in the stream will not delay the validation in any way, given that N3.js has a fully streaming output.

validation of IRIs

N3.js will only instantiate valid IRIs (as this is an RDF model requirement). But if this validation includes more specific constraints, then indeed that goes beyond the parser's job of generating a valid model.

If we expect a performance decrease, then I'm happy to put this behind a toggle, obviously 👍

The thing is, it would require different code paths. Currently, we make assumptions (for performance reasons) about where errors can occur and where they cannot; and where they cannot, we avoid an expensive try/catch construct that would otherwise have to run hundreds of thousands of times per second.

Simply put: you either wrap the whole thing in a try/catch block or you don't; this is not switchable except through separate code paths.
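
One way to picture those separate code paths (an illustration of the argument only, not N3.js internals):

    // Hypothetical hot path without guards: assumes the factory never throws
    function emitTermsUnchecked(iris, factory, emit) {
      for (const iri of iris)
        emit(factory.namedNode(iri));
    }

    // Hypothetical guarded variant: a try/catch runs for every single term
    function emitTermsGuarded(iris, factory, emit, onError) {
      for (const iri of iris) {
        try { emit(factory.namedNode(iri)); }
        catch (error) { onError(error); }
      }
    }

A runtime toggle can only choose between two such variants; it cannot remove the guard from a single shared loop.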

Which I could always do of course, but I just don't see a convincing case yet: having streaming non-model validation after the streaming model parser is the right separation of concerns, does not sacrifice performance, and allows us to make stronger assumptions about what can fail and what cannot.

LaurensRietveld commented 1 year ago

Now if you must do this in the factory, recall that the factory is in full control of whatever it outputs. So nothing is stopping you from, for instance, outputting { valid: false, reason: 'mismatched datatype' } as a token, which you can then immediately catch afterwards.

We'll probably settle on an option similar to this one, where we'll wrap the N3 parser interface and expose a stream with proper error handling.

N3.js will only instantiate valid IRIs (as this is an RDF model requirement). But if this validation includes more specific constraints, then indeed that goes beyond the parser's job of generating a valid model.

Nice! I did not know that validating IRIs was in scope for N3 (as we've had issues with IRIs that were invalid according to RFC 3987).

So N3.js offers a parser. The job of the parser is to transform a syntactically valid serialization format into an in-memory representation.

In general, I'd expect a parser (or any other tool that creates RDF-JS terms for me) to return errors when the input does not match the spec. I agree that's difficult in this case without increasing the scope of N3, considering we have several specifications to take into account: the Turtle family, the IRI spec, XSD datatypes, and possibly other datatypes. That being said, it's a different matter to prohibit these additional checks completely, especially considering the RDF-JS spec offers a nice way (the datafactory) to include such optional validation steps in RDF-JS libraries like N3.

RubenVerborgh commented 1 year ago

We'll probably settle on an option similar to this one, where we'll wrap the N3 parser interface and expose a stream with proper error handling.

Great! Do let me know if there are any issues; I definitely want to support your use case (even if in a slightly different way).

I did not know that validating IRIs was in scope for N3 (as we've had issues with IRIs that were invalid according to rfc 3987).

Do let me know; the Turtle grammar is strict as to what URLs are allowed, and N3.js follows that.

In general, I'd expect a parser (or any other tool that creates RDF-JS terms for me) to return errors when the input does not match the spec.

Agreed; we might just differ in what we consider "the spec". For me, that's RDF 1.1 and Turtle.

That being said, it's a different matter to prohibit these additional checks completely.

Well, we don't—I just think that this one particular way of validating clashes with performance goals, and that there is a better way of implementing it that offers better performance and other benefits.

Especially considering the RDF-JS spec offers a nice way (the datafactory) to include such optional validation steps in RDF-JS libraries like N3

And that's where we differ as well. RDF/JS indeed has a DataFactory interface, but it does not impose an interface for parsers etc. to accept it as a constructor argument. In other words: N3.js is voluntarily offering DataFactory as an option. Everything still works without accepting a factory: the parser is an RDF/JS Source implementation that you can use at will.

where we'll wrap the N3 parser interface and expose a stream with proper error handling.

No need to wrap; just chain a validating stream after it. It's literally .pipe(new StreamParser()).pipe(validator), or if you want class ValidatingParser { constructor() { return new StreamParser().pipe(validator); }}. Performance will be significantly better than if intrusive error handling is used.
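
Spelled out minimally (a sketch; the xsd:integer check stands in for whatever constraints apply):

    const { Readable, Transform } = require('stream');
    const { StreamParser } = require('n3');

    // A validating stream chained after the parser: forwards valid quads,
    // emits an error for literals that violate a constraint
    const validator = new Transform({
      objectMode: true,
      transform(quad, encoding, callback) {
        const o = quad.object;
        if (o.termType === 'Literal' &&
            o.datatype.value === 'http://www.w3.org/2001/XMLSchema#integer' &&
            !/^[+-]?\d+$/.test(o.value))
          return callback(new Error(`Invalid xsd:integer literal: "${o.value}"`));
        callback(null, quad);
      },
    });

    Readable.from(['<a:a> <b:b> "abc"^^<http://www.w3.org/2001/XMLSchema#integer> .'])
      .pipe(new StreamParser())
      .pipe(validator)
      .on('data', (quad) => console.log('valid quad', quad))
      .on('error', (error) => console.error(error.message));

Note that errors emitted by the validator surface on the validator stream; the parser's own parse errors still fire on the parser, as usual with Node streams.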

So far, I have not heard drawbacks to implementing validation after the parser—in fact, I think there are only advantages.