Parse errors are not user friendly...

dimitri commented 10 years ago

As seen in https://github.com/dimitri/pgloader/issues/56 we have problems with users trying to decipher error messages. Is it possible to do something about it? Do you have plans to improve error reporting?

scymtym commented 10 years ago

Hi,

note that the esrap repository from which quicklisp fetches is scymtym/esrap.

Regarding your question: I have been working on parse error reports like

At

  t a { a b:double a b { a@string a b } { a b > } 
                         ^ (Line 1, Column 30, Position 30)

Could not parse NAME. Expected:

     any character satisfying ALPHANUMERICP
  or the character { (LEFT_CURLY_BRACKET)

Reached via

     STATEMENT
  -> (+ (OR SCOPE DECLARATION USE))
  -> (OR SCOPE DECLARATION USE)
  -> USE
  -> NAME
  -> (+ (ALPHANUMERICP CHARACTER))
  -> (ALPHANUMERICP CHARACTER)

Do you think that would help?

@nikodemus: would do you think about this?

dimitri commented 10 years ago

It would help tremendously, yes!

scymtym commented 10 years ago

OK, the code will need some work before it can go into master, but I will try to get it done as soon as possible.

scymtym commented 10 years ago

@dimitri, I pushed a draft of the improved parse error reporting to scymtym/esrap:wip-better-errors. In addition to the error position, parse errors report

the most specific rule that failed to apply
a list of characters, strings, etc. expected at the failure position
a derivation explaining how the failing rule was reached

I tested the code by hacking an Emacs completion backend that displays the expected strings or characters: screenshot

Could you test whether it would improve the situation for you? I would be interested to hear whether it works for you and whether error messages are too verbose/too terse/just right.

After some cleanup, the code could go into esrap master, I think.

PS: The completion hack above uncovered the following potential bug in the pgloader parser: the unix: scheme visible in the screenshot leads to the following error:

:UNKNOWN fell through ECASE expression.
Wanted one of (:MYSQL :POSTGRESQL).
   [Condition of type SB-KERNEL:CASE-FAILURE]

Restarts:
 0: [*ABORT] Return to SLIME's top level.
 1: [ABORT] Abort thread (#<THREAD "worker" RUNNING {128E6B01}>)

Backtrace:
  0: (SB-KERNEL:CASE-FAILURE ECASE :UNKNOWN (:MYSQL :POSTGRESQL))
  1: ((LAMBDA (PGLOADER.PARSER::URI #:START398 #:END399) :IN "/home/jmoringe/.local/share/common-lisp/quicklisp/dists/quicklisp/software/pgloader-3.0.99/src/parser.lisp") ((:TYPE :UNKNOWN) (:USER "a" :PASS..
      Locals:
        PGLOADER.PARSER::DBNAME = "bla"
        PGLOADER.PARSER::HOST = (:HOST "c")
        PGLOADER.PARSER::PASSWORD = "b"
        PGLOADER.PARSER::PORT = NIL
        PGLOADER.PARSER::TABLE-NAME = NIL
        TYPE#1 = :UNKNOWN
        PGLOADER.PARSER::URI = ((:TYPE :UNKNOWN) (:USER "a" :PASSWORD "b") (:HOST (:HOST "c") :PORT NIL) (:DBNAME "bla") NIL)
        PGLOADER.PARSER::USER = "a"
        PGLOADER.PARSER::USER#1 = "a"
        #:WHOLE400 = (:TYPE :UNKNOWN :USER "a" :PASSWORD "b" ...)
  2: ((LAMBDA NIL :IN ESRAP::COMPILE-RULE))
  3: (ESRAP::RESULT-PRODUCTION #S(ESRAP::RESULT :%PRODUCTION #<CLOSURE (LAMBDA NIL :IN ESRAP::COMPILE-RULE) {12997B55}> :POSITION 46))
  4: ((LAMBDA NIL :IN ESRAP::COMPILE-SEQUENCE))

The case failure is in line 342 of parser.lisp (Quicklisp version of pgloader).

dimitri commented 10 years ago

Hi,

Jan Moringen notifications@github.com writes:

@dimitri, I pushed a draft of the improved parse error reporting to scymtym/esrap:wip-better-errors. In addition to the error position, parse errors report

Thanks a lot for your work on that, it's going to be awesome!

I did git clone your project and branch and am trying it by injecting random errors in my test load files and seeing what happens with the following command:

(pgloader::with-monitor () (pgloader.parser::parse-commands-from-file "/Users/dim/dev/pgloader/test/csv.load"))

As far as the error message goes, it's already way better, but as far as understanding what's wrong in the source file goes, I think we need more improvements.

Could you please try removing a comma separator in the file (I tested just after the truncate keyword), or maybe removing the d letter from the “enclosed” keyword, etc, and see what happens?

It seems that all I can get from those tests is the following error message:

At end of input

  ^ (Line 51, Column 1, Position 1400)

Could not parse DOUBLED-COLON. Expected:

  the string "::"

Reached via

     DOUBLED-COLON
  -> (AND "::")
  -> "::"

Could not parse DOUBLED-AT-SIGN. Expected:

  the string "@@"

Reached via

     DOUBLED-AT-SIGN
  -> (AND "@@")
  -> "@@"

Could not parse DSN-USER-PASSWORD. Expected:

  the character @ (COMMERCIAL_AT)

Reached via

     DSN-USER-PASSWORD
  -> (AND USERNAME (? (AND ":" (? PASSWORD))) "@")
  -> "@"

It might that my esrap grammar is just confused, tho.

After some cleanup, the code could go into esrap master, I think.

Yes please! ;-)

PS: The completion hack above uncovered the following potential bug in the pgloader parser: the unix: scheme visible in the screenshot leads to the following error:

I will see about that, thanks for reporting!

:UNKNOWN fell through ECASE expression.

My understanding is that using ecase prevents the parser to do proper backtracking here, and so we won't get to your nicer error messages in case of typo, right?

Regards,

dim

scymtym commented 10 years ago

Could you please try removing a comma separator in the file (I tested just after the truncate keyword), or maybe removing the d letter from the “enclosed” keyword, etc, and see what happens? [...] It might that my esrap grammar is just confused, tho.

This is indeed caused by a problem in the grammar. I debugged this by doing (esrap:trace-rule 'commands :recursive t). Among many other things, the output contained

7: PGLOADER.PARSER::USERNAME 440-1367 -> "/pgloader?csv (a, b, d, c)

     WITH truncate,
          skip header = 1,
          fields optionally enclosed by '\"',
          fields escaped by double-quote,
          fields terminated by ','

      SET client_encoding to 'latin1',
          work_mem to '12MB'
          standard_conforming_strings to 'on'

   BEFORE LOAD DO
    $$ drop table if exists csv; $$,
    $$ create table csv (
        a bigint,
        b bigint,
        c char(2),
        d text
       );
    $$;

Stupid useless header with a © sign
\"2.6.190.56\",\"2.6.190.63\",\"33996344\",\"33996351\",\"GB\",\"United Kingdom\"
\"3.0.0.0\",\"4.17.135.31\",\"50331648\",\"68257567\",\"US\",\"United States\"
\"4.17.135.32\",\"4.17.135.63\",\"68257568\",\"68257599\",\"CA\",\"Canada\"
\"4.17.135.64\",\"4.17.142.255\",\"68257600\",\"68259583\",\"US\",\"United States\"
\"4.17.143.0\",\"4.17.143.15\",\"68259584\",\"68259599\",\"CA\",\"Canada\"
\"4.17.143.16\",\"4.18.32.71\",\"68259600\",\"68296775\",\"US\",\"United States\"
"

i.e. the username rule consumed all remaining input (that's why your report mentions failing at "end of input"). The username rule is defined as

(defrule username (+ (or (not (or ":" "@" )) doubled-at-sign doubled-colon))
  (:text t))

meaning that it can consume anything as long as it is neither ":" nor "@". Restricting username (e.g. disallowing whitespace or whatever is appropriate) should solve the problem.

As far as the error message goes, it's already way better, but as far as understanding what's wrong in the source file goes, I think we need more improvements.

Fixing the above problem should make the error messages much more useful. One problem will probably remain, though: in typical grammars, whitespace and comments are permitted almost everywhere. However, it is not helpful if virtually every error messages contains "you know, you could just insert whitespace or a comment at the problematic position". So I suppose we need a mechanism for excluding certain rules from error messages.

My understanding is that using ecase prevents the parser to do proper backtracking here, and so we won't get to your nicer error messages in case of typo, right?

Kind of. "Normally" the parser first applies rules without calling result producing code (:lambda, :destructure, etc.). In case the parse succeeds, result producing code is called to, well, produce parse results. This implies that it is not possible to influence the parse from within :lambda, :destructure, etc. Signaling an error in one of these will simply signal an error in the result production phase, but only after a successful parse.

The parse vs. result production phase separation is not absolute, though. "Semantic predicates" can be used to steer the parse based on intermediate results. A simple example would be

(defrule integer ...) ; returns parsed integer
(defrule odd-integer (oddp integer))

When odd-integer is tried, oddp is called with the production of the integer rule and odd-integer succeeds if oddp returns true.

dimitri commented 10 years ago

I did as you propose in https://github.com/dimitri/pgloader/commit/0f3103da2d138261913e2ae4e7829e83e51e1177 and I really like the error messages I get now. It's a huge improvement, thanks a lot for it!

Please consider merging that into main esrap already, even if some improvements can be think of for later revisions, such as avoiding mentioning some parts of the syntax tree such as the comments.

dimitri commented 10 years ago

I also fixed (I think) the dsn prefix parsing bits thanks to the patch linked here.

scymtym commented 10 years ago

I will try to clean up and merge the code as quickly as I can. No promises though.

dimitri commented 10 years ago

Thanks for your efforts and interest (and code) ;-)

scymtym commented 10 years ago

@dimitri Just to keep you informed: most of the improved error reporting code is approaching acceptable quality with tests and all. However, there is one design decision (scraping the required result information from the cache, "post-mortem" vs. storing more information in results while parsing) which needs at least benchmarking and potentially more thought.

Sorry for dragging this out like this but I would hate to sacrifice (much) performance for the improved error reporting. I think that is how @nikodemus would feel about this as well.

dimitri commented 10 years ago

Thanks a lot for the update, really. I understand that some design decisions are more tricky than others and time is needed. I'm still waiting for debian packaging work to release pgloader 3.1.0, I hope that with some luck we will be able to benefit from the new error reporting as soon as 3.1.0, let's see how it goes ;-)

fade commented 9 years ago

The better errors branch has already helped me dig myself out of a PEBCAK problem with pgloader, so thanks for the work on this. IMO, it's very valuable.

scymtym commented 8 years ago

A much improved version of this code is in https://github.com/scymtym/esrap master (which is the maintained upstream repository)

nikodemus / esrap

Parse errors are not user friendly... #26

Regards,