Store Token position in the produces quads

rdfjs / N3.js

Lightning fast, spec-compatible, streaming RDF for JavaScript

http://rdf.js.org/N3.js/


Store Token position in the produces quads #377

Open BenjaminHofstetter opened 8 months ago

BenjaminHofstetter commented 8 months ago

Why I need this: after parsing a Turtle file, I lose all information about the source file. For better tooling support, I propose implementing some kind of "source maps" to trace back from quads to positions in the Turtle file.

For instance, in tools like https://shacl-playground.zazuko.com/, when encountering errors in SHACL validation reports, locating the error-causing triple requires human intervention. With source map information, editors could pinpoint the exact location in the Turtle file, aiding in error resolution. Implementing source maps would bridge the gap between parsed files and their source, enhancing tooling support. The tokenizer already generates tokens with line, start, and end information, laying the groundwork for this feature.
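
To illustrate how an editor might use such a position record, here is a minimal sketch. The helper `pointAt` is hypothetical and not part of N3.js, and the sketch assumes `line` is 1-based while `start` and `end` are 0-based column offsets within that line, which may differ from what the tokenizer actually reports.

```javascript
// Hypothetical helper (not N3.js API): render an editor-style pointer for a position record.
// Assumption: `line` is 1-based; `start`/`end` are 0-based columns within that line, end exclusive.
function pointAt(source, pos) {
  const lineText = source.split('\n')[pos.line - 1];
  const gutter = `${pos.line}: `;
  const underline = ' '.repeat(pos.start) + '^'.repeat(Math.max(1, pos.end - pos.start));
  return `${gutter}${lineText}\n${' '.repeat(gutter.length)}${underline}`;
}

const turtle = [
  '@prefix ex: <http://example.org/> .',
  'ex:alice ex:age "forty" .',
].join('\n');

// Position of the offending literal, as a position-aware parser might report it.
const report = pointAt(turtle, { line: 2, start: 16, end: 23 });
console.log(report);
```

With positions like these attached to quads, a validation report could point at the literal itself rather than only naming the focus node.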

RubenVerborgh commented 8 months ago

This would be possible indeed, if the parser emits the context from the tokenizer in the quads.

We have no plans to take this up, but a pull request that puts this functionality behind a flag would be welcome, provided it has no performance impact when switched off.

faubulous commented 4 months ago

This is exactly what I need too. I am currently developing an RDF editing extension for Visual Studio Code named Mentor. For this use case I frequently need to resolve URIs and blank nodes to parsed Tokens, and this feature would be extremely helpful.

I found a workaround for URIs which requires parsing the document again after loading and interpreting the Triples, but that only works for URIs and not for blank nodes. This currently blocks me from implementing SHACL support where blank node definitions of (property) shapes are quite common.
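
A workaround of this kind could look roughly like the sketch below. It is not the actual Mentor code: `indexIris` is a hypothetical helper that uses a naive regular-expression scan rather than N3.js's Lexer, it ignores prefixed names, and it would also match `<...>` inside string literals or comments.

```javascript
// Hypothetical workaround sketch: index every <...> IRI reference in a Turtle string.
// Naive regex scan, not the N3.js Lexer; prefixed names are not resolved.
function indexIris(source) {
  const index = new Map(); // IRI -> array of { line, start, end }
  source.split('\n').forEach((text, i) => {
    for (const match of text.matchAll(/<([^<>\s]*)>/g)) {
      // 1-based line, 0-based columns, end exclusive (a convention chosen for this sketch).
      const pos = { line: i + 1, start: match.index, end: match.index + match[0].length };
      if (!index.has(match[1])) index.set(match[1], []);
      index.get(match[1]).push(pos);
    }
  });
  return index;
}

const doc = [
  '<http://example.org/a> <http://example.org/p> <http://example.org/b> .',
  '_:b0 <http://example.org/p> <http://example.org/a> .',
].join('\n');

const iriIndex = indexIris(doc);
console.log(iriIndex.get('http://example.org/a'));
```

The lookup works because an IRI is its own stable key. A blank node has no such key: the label on a parsed term may not match the text in the document, and anonymous `[ ... ]` nodes have no label in the text at all, so the position has to be captured during parsing.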

Any idea how such source maps could be implemented?

jeswr commented 4 months ago

> Any idea how such source maps could be implemented?

Luckily, tokens emitted by the Lexer already contain information about their line and position. In the Parser you could add this information as a property of Terms every time a new _subject, _predicate, _object or _graph is assigned. For instance, the code here would become

this._subject = this._blankNode();
// Only attach the position when the (proposed) option is enabled
if (this._recordPosition) {
  this._subject[POS] = { line: token.line, start: token.start };
}
this._saveContext('blank', this._graph,
                  this._subject, null, null);

I would recommend making POS a Symbol that is exported by N3.js; however, it could also just be a property name like _internal_position.
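
As a self-contained illustration of the Symbol approach, here is a minimal sketch. All names in it (`POS`, `MiniParser`, `recordPosition`, `blankNodeFor`) are hypothetical and stand in for the real parser internals; none of them is existing N3.js API.

```javascript
// Hypothetical names throughout; this mimics the idea, not the N3.js Parser.
const POS = Symbol('position');

class MiniParser {
  constructor({ recordPosition = false } = {}) {
    this._recordPosition = recordPosition;
    this._blankNodeCount = 0;
  }

  // Create a blank node term for a token, tagging it with its position only if enabled.
  blankNodeFor(token) {
    const term = { termType: 'BlankNode', value: `b${this._blankNodeCount++}` };
    if (this._recordPosition) {
      term[POS] = { line: token.line, start: token.start, end: token.end };
    }
    return term;
  }
}

const token = { type: '[', line: 3, start: 12, end: 13 };
const withPos = new MiniParser({ recordPosition: true }).blankNodeFor(token);
const withoutPos = new MiniParser().blankNodeFor(token);

console.log(withPos[POS]);            // the recorded { line, start, end }
console.log(POS in withoutPos);       // false: nothing attached when the flag is off
console.log(JSON.stringify(withPos)); // symbol-keyed properties are skipped by JSON.stringify
```

A Symbol key cannot collide with string-named properties on the term and stays out of `Object.keys` and `JSON.stringify`, so consumers that are unaware of positions are unaffected. One open question with this design is what to do if the same term object is reused for several occurrences, since a single property can hold only one position.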

The caveat of this approach is that it might cause a non-negligible performance hit even when the feature is disabled, but I suspect this is something you can performance-test and optimise once the feature is implemented.

BenjaminHofstetter commented 4 months ago

I did a PoC some time ago. I added it as a use case in the RDF-star working group. Maybe in the future we can use RDF-star to define such source maps "externally" from the source Turtle. https://github.com/w3c/rdf-star/issues/285#issuecomment-2003235647

My PoC uses the N3 parser and exposes the tokens in the quads (not RDF-star).

faubulous commented 4 months ago

@BenjaminHofstetter Did you create a patch for N3 and publish the code of the PoC somewhere?

TallTed commented 4 months ago

Perhaps change the issue title from "Store Token position in the produces quads" to "Store original positions of Tokens in quads produced by conversion from Turtle"?

(At least, change produces to produced.)