ruby-rdf / rdf-turtle

Turtle reader/writer for Ruby
http://rubygems.org/gems/rdf-turtle
The Unlicense

Report input position when a parse error is found #5

Closed by ijdickinson 10 years ago

ijdickinson commented 10 years ago

As far as I can tell, there's no way to get the line number and character position when the Turtle parser detects an error. This would be helpful in the case that the text being parsed is user-generated, and we want to give feedback to the user as to the location of their error.

A specific use case would be:

require 'rdf'
require 'rdf/turtle'

r = RDF::Repository.new
begin
  # "foo" is not valid Turtle, so validation should fail
  r << RDF::Turtle::Reader.new( "foo", validate: true )
rescue RDF::ReaderError => e
  # here I would like to get the line number from e
end

If I can find the time, I'll investigate and generate a PR ... but I am quite busy at the moment!

gkellogg commented 10 years ago

Use the validate: true option to the parser and it will raise an error if there is any invalid input; the error message includes each error found. This includes the line number, but not the character position.

It should, but doesn't now, report errors after parsing, even without the validate option.

ijdickinson commented 10 years ago

Thanks Greg. I'm already using validate: true, but I can't see which method on the exception object I should call to get the line number.

gkellogg commented 10 years ago

It's in EBNF::LL1::Parser::Error#lineno, which is rescued on line 317 of reader.rb, but doesn't seem to be passed along; I'll look into that later. It should also be in the @error_log instance variable within the Reader.

Looking at that, at the end of parsing, if validation is set, it raises an error using the concatenated @error_log as a message, including line number and production where the problems are found.

Definitely, error reporting could be improved.

gkellogg commented 10 years ago

This should be better in commit 2e7930b303699fae225b0a2929f25a9a763e8d36 (on develop). The EBNF parser doesn't yet report character position, but it does report line number and produces more informative output for errors and warnings. Character position will require more work in the EBNF Lexer. Errors and warnings are written to $stderr regardless of the validate option.

ijdickinson commented 10 years ago

Sorry, Greg, I'm being dense about this. I've pointed my Gemfile at the GitHub release, but I'm still not seeing where my exception handler gets the line number. Here's the Rails controller method I have at the moment:

def create
  rdf = rdf_content_from_request # returns a String holding some Turtle

  store = user_rdf_store
  store.clear!

  begin
    store << RDF::Turtle::Reader.new( rdf, validate: true )
    result = {size: store.size}
  rescue RDF::ReaderError => e
    result = {
      size: 0,
      error: e.message,
      line_number: 0  # << Here I am unclear
    }
  end

  render json: result
end

I've tried inspecting e with pry, but I can't see where it gives me the error line, and it's not part of the error message as far as I can see. For example, if I send

<http://example.com/book/once_upon_a_time> a <http://example.com/book/Book>

I get back:

Error: Unexpected end of input, skipped to :eof, production = "."

which is quite correct, but doesn't tell the user which line to attend to. I'm sure I'm missing something obvious, but I can't see it at the moment!

gkellogg commented 10 years ago

Line number information isn't in RDF::ReaderError; only one error is generated, carrying all the collected error information. If validate: true is set, the message includes the serialized errors after processing (and error recovery) is complete. Raising earlier would make error recovery impossible, as the exception would end processing. If validate is false, errors are printed to $stderr, which could be captured.

Errors should also be notified through the callback with the :trace argument, and level set to 0. However, this may require setting the debug:1 option.
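
For the non-validating case, here is a minimal sketch of capturing what is written to $stderr. It assumes the library writes through the $stderr global, as described above. The helper name capture_stderr is hypothetical, and the block passed to it is only a stand-in for the real parse call.

```ruby
require 'stringio'

# Run the given block with $stderr temporarily redirected to a buffer and
# return everything that was written to it. `capture_stderr` is a
# hypothetical helper name, not part of any RDF gem.
def capture_stderr
  original = $stderr
  $stderr = StringIO.new
  yield
  $stderr.string
ensure
  # Always restore the real stream, even if the block raises.
  $stderr = original
end

# Stand-in for a non-validating parse. In the controller this block would
# wrap something like `store << RDF::Turtle::Reader.new(rdf)`.
output = capture_stderr { $stderr.puts "Error: Unexpected end of input" }
```

Note that this only sees output written via $stderr (including Kernel#warn); anything written directly to the STDERR constant would bypass the redirect.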

gkellogg commented 10 years ago

Also, make sure you're using the "develop" branch.

gkellogg commented 10 years ago

Sorry, the callback is used in RDF::Turtle::Reader#each_statement as part of the parser loop. Perhaps you could suggest a way you'd like to see this exposed further out.

Alternatively, we could change the behavior of the validate option to not collect errors, but just raise an error on the first problem found. The idea had been to allow many errors to be collected and returned at the end of processing. That is consistent with wanting to use error recovery to process as much data as possible and return a single error (when validating) at the end with the collected error messages. You could then parse this message to get information on each individual error.

The example you provide can't really work, because the errors are collected until the end of processing.

What would you like to see happen?
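
To make the trade-off between the two behaviors concrete, here is a toy sketch in plain Ruby. It is not the EBNF or rdf-turtle implementation: LineError, check_lines and the "every line must end with a period" rule are all hypothetical stand-ins, chosen only to show how fail-fast keeps a structured line number while collect-and-report folds everything into one message.

```ruby
# Hypothetical stand-in for a parse error that carries a line number.
class LineError < StandardError
  attr_reader :lineno

  def initialize(message, lineno:)
    super(message)
    @lineno = lineno
  end
end

# Toy "parser": a non-blank line is valid if it ends with a period.
def check_lines(text, fail_fast:)
  errors = []
  text.each_line.with_index(1) do |line, lineno|
    stripped = line.strip
    next if stripped.empty? || stripped.end_with?(".")

    err = LineError.new("line #{lineno}: missing '.'", lineno: lineno)
    raise err if fail_fast # stop at the first problem found
    errors << err          # otherwise recover and keep going
  end
  return if errors.empty?

  # Collected mode: a single error whose message joins the individual ones.
  # Per-error line numbers are then only recoverable by parsing the message.
  raise StandardError, errors.map(&:message).join("\n")
end
```

With fail_fast: true the caller can read e.lineno directly; with fail_fast: false it sees every problem, but only as text.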

gkellogg commented 10 years ago

On further thought, it seems reasonable that the first exception is returned when validating. On the develop branches of rdf, ebnf, rdf-turtle and rdf-trig, I changed this so that RDF::ReaderError has :lineno and :token attributes, and used that for the N-Triples, N-Quads and FreebaseReader implementations. In EBNF, if validating, the first error raised terminates parsing, and the lineno and token information is added to the RDF::ReaderError. Otherwise, if not validating, we go into error recovery mode, continue to collect errors, and report on them at the end.

Let me know if this works better for you, and I'll release updates to those gems (and SPARQL).
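
As a sketch of how the controller's rescue clause might use the new attributes: the lineno and token readers are the ones described above, while error_payload and FakeReaderError are hypothetical names introduced here so the example runs without the gems installed. In the real controller the rescue would be on RDF::ReaderError.

```ruby
# Build the JSON payload for a failed parse. `error_payload` is a
# hypothetical helper; it falls back to nil when the installed version of
# the error class does not expose the attribute.
def error_payload(e)
  {
    size: 0,
    error: e.message,
    line_number: e.respond_to?(:lineno) ? e.lineno : nil,
    token: e.respond_to?(:token) ? e.token : nil
  }
end

# Stand-in for RDF::ReaderError, for illustration only.
class FakeReaderError < StandardError
  attr_reader :lineno, :token

  def initialize(message, lineno: nil, token: nil)
    super(message)
    @lineno = lineno
    @token = token
  end
end

begin
  # Stand-in for `store << RDF::Turtle::Reader.new(rdf, validate: true)`.
  raise FakeReaderError.new("Unexpected end of input", lineno: 1)
rescue FakeReaderError => e
  result = error_payload(e)
end
```

The respond_to? guard is a design choice so the handler keeps working against a released gem that predates the attributes; it can be dropped once the updated gems are in use.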

ijdickinson commented 10 years ago

Thanks Greg, that's working great.