igordejanovic / parglare

A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/
MIT License
136 stars 32 forks source link

Problems with the use of EMPTY #112

Closed johnw3d closed 4 years ago

johnw3d commented 4 years ago

Description

I'm developing a parser for Korean based on morphological transformations of Korean text by a CNN-based analyzer. The grammar is substantially ambiguous and I've been finding that the GLR parser is working well, giving the alternate parsings I was hoping for in the case of these ambiguities.

However, as I fill out the grammar, some uses of EMPTY rules (explicitly or via '?' or '*') are causing complete parsing failures I wasn't expecting. I can apparently avoid these by recoding the EMPTY use with multiple alternate rules permuting all the optionals, but this results in a very unwieldy grammar and I'm wondering if I can get help understanding the failures with EMPTY.

The GLRParser is being used like this:

    self.grammar = Grammar.from_file(os.path.join(os.path.dirname(__file__), "./korean.pg"))
    self.parser = GLRParser(self.grammar, debug=True, build_tree=True)

What I Did

Below are working and failing versions of a segment of the full grammar. The string I am parsing is as follows (a sequence of morphemes & part-of-speech tags):

 자전거:NNG; 를:JKO; 있:VV; 어요:SEF; .:SF;

The failing version of the grammar yields this error, showing that it got all the way to the period sentence-final morpheme.

*** LEAVING ERROR REPORTING MODE.
    Tokens expected: verbToNounModifyingForm, nominalizingSuffix, clauseConnector, adnominalSuffix
    Tokens found: [<sentenceEnd(.:SF;)>]
    Error:  Error at 1:30:"; 어요:SEF;  **> .:SF; " => Expected: adnominalSuffix or clauseConnector or nominalizingSuf
                fix or verbToNounModifyingForm but found <sentenceEnd(.:SF;)>
Error at 1:30:"; 어요:SEF;  **> .:SF; " => Expected: adnominalSuffix or clauseConnector or nominalizingSuffix or verbToNounModifyingForm but found <sentenceEnd(.:SF;)>

Here's a working grammar segment; the // <-- comments point at the rule I'm concerned with, singleNounPhrase. In this version, the optional "determiner" terminal is made optional by using two alternates of the singleNounPhrase rule:

sentence:               interjection* sentence1
                    |   sentenceJoiningAdverb? sentence1;

sentence1:              subordinateClause* clause sentenceEnd;

subordinateClause:      clause clauseConnector punctuation*;

clause:                 phrase* verbPhrase
                    |   phrase* complement? copulaPhrase;

phrase:                 topic
                    |   subject
                    |   object
                    |   adjectivalPhrase
                    |   adverbialPhrase
                    |   nounPhrase;

topic:                  nounPhrase topicMarker;
subject:                nounPhrase subjectMarker;
object:                 nounPhrase objectMarker;
complement:             nounPhrase complementMarker?;

nounPhrase:             singleNounPhrase;

singleNounPhrase:       determiner noun+            // <--  
                    |   noun+;                  // <--

noun:                   simpleNoun
                    |   nominalForm
                    |   nominalizedVerb
                    |   verbModifiedToNoun;

nominalizedVerb:        clause nominalizingSuffix;
verbModifiedToNoun:     clause verbToNounModifyingForm;

adjectivalPhrase:       adjective+ nounPhrase;

adjective:              clause adnominalSuffix
                    |   possessive;

possessive:             simpleNoun+ possessiveMarker;
copulaPhrase:           adverb* copula verbSuffix* predicateEndingSuffix?;
adverbialPhrase:        nounPhrase adverbialParticle auxiliaryParticle*
                    |   verb adverbialParticle auxiliaryParticle*;

verbPhrase:             verb verbSuffix* predicateEndingSuffix?;
verb:                   simpleVerb;
interjection:           interjectionTerminal punctuation*;

terminals
    sentenceEnd:            /[^:]+:(SF);/;
    interjectionTerminal:   /[^:]+:(IC);/;
    punctuation:            /[^:]+:(SP|SS|SE|SO|SW|SWK);/;
    clauseConnector:        /[^:]+:(EC|CCF|CCMOD|CCNOM);/;
    topicMarker:            /[^:]+:(TOP);/;
    objectMarker:           /[^:]+:(JKO);/;
    subjectMarker:          /[^:]+:(JKS);/;
    complementMarker:       /[^:]+:(JKC);/;
    conjunction:            /[^:]+:(JC|CON);/;
    determiner:             /[^:]+:(MM);/;
    auxiliaryParticle:      /[^:]+:(JX);/;
    possessiveMarker:       /[^:]+:(JKG);/;
    nounModifyingSuffix:    /[^:]+:(XSN|JKV);/;      // # eg, 님, 들, 아/야 (vocative), todo: these should all have particle definitions
    nominalizingSuffix:     /[^:]+:(ETN);/;
    adnominalSuffix:        /[^:]+:(ETM);/;
    verbSuffix:             /[^:]+:(EP|TNS);/;
    predicateEndingSuffix:  /[^:]+:(SEF|EF);/;
    negative:               /[^:]+:(NEG);/;
    verbCombiner:           /고:(EC|CCF);/;
    honorificMarker:        /(으시|시):EP;/;
    verbModifier:           /[^:]+:(VMOD);/;
    verbNominal:            /[^:]+:(VNOM);/;
    adverbialParticle:      /[^:]+:(JKB);/;
    quotationSuffix:        /[^:]+:(QOT);/;
    shortQuotationSuffix:   /[^:]+:(SQOT);/;
    sentenceJoiningAdverb:  /[^:]+:MAJ;/;
    simpleNoun:             /[^:]+:(NNG|NNP|NNB|NR|SL|NP|SN);/;
    adverb:                 /[^:]+:(MAG);/;
    simpleVerb:             /[^:]+:(VV|VVD|VHV);/;
    descriptiveVerb:        /[^:]+:(VA|VCP|VCN|VAD|VHA);/;
    auxiliaryVerbConnector: /[^:]+:(EC);/;
    auxiliaryVerbForm:      /[^:]+:(EC);/;
    copula:                 /(되:VV)|([^:]+:(VCP|VCN));/;
    number:                 /[^:]+:(SN|NR);/;
    counter:                /[^:]+:(NNB|NNG);/;
    nominalForm:            /[^:]+:(NNOM);/;
    verbToNounModifyingForm: /[^:]+:(NMOD);/;
    nominalVerbForm:        /[^:]+:(VNOM);/;

Here's the failing version, using '?' for the optional determiner. All other rules are identical (the terminals are the same as above, but not repeated here):

sentence:               interjection* sentence1
                    |   sentenceJoiningAdverb? sentence1;

sentence1:              subordinateClause* clause sentenceEnd;

subordinateClause:      clause clauseConnector punctuation*;

clause:                 phrase* verbPhrase
                    |   phrase* complement? copulaPhrase;

phrase:                 topic
                    |   subject
                    |   object
                    |   adjectivalPhrase
                    |   adverbialPhrase
                    |   nounPhrase;

topic:                  nounPhrase topicMarker;
subject:                nounPhrase subjectMarker;
object:                 nounPhrase objectMarker;
complement:             nounPhrase complementMarker?;

nounPhrase:             singleNounPhrase;

singleNounPhrase:       determiner? noun+;   // <----
noun:                   simpleNoun
                    |   nominalForm
                    |   nominalizedVerb
                    |   verbModifiedToNoun;

nominalizedVerb:        clause nominalizingSuffix;
verbModifiedToNoun:     clause verbToNounModifyingForm;

adjectivalPhrase:       adjective+ nounPhrase;       
adjective:              clause adnominalSuffix
                    |   possessive;

possessive:             simpleNoun+ possessiveMarker;
copulaPhrase:           adverb* copula verbSuffix* predicateEndingSuffix?;
adverbialPhrase:        nounPhrase adverbialParticle auxiliaryParticle*
                    |   verb adverbialParticle auxiliaryParticle*;

verbPhrase:             verb verbSuffix* predicateEndingSuffix?;
verb:                   simpleVerb;
interjection:           interjectionTerminal punctuation*;
igordejanovic commented 4 years ago

Thanks for the detailed report. After a quick investigation it seems to be a bug. Will investigate further.

johnw3d commented 4 years ago

Hi Igor. In case it's useful, here are much simpler grammars & test sentence that show the difference. Let me know if there's anything else I can do to help chase this down.

Test string: 자전거:NNG; 있:VV; 어요:SEF; .:SF;

The above sentence is simpleNoun simpleVerb predicateEndingSuffix sentenceEnd

Parsing succeeds with:

sentence1:              subordinateClause* clause sentenceEnd;

subordinateClause:      clause clauseConnector;

clause:                 singleNounPhrase* verbPhrase;

// ----------
singleNounPhrase:       determiner simpleNoun  |  simpleNoun;    
// ----------

verbPhrase:             simpleVerb verbSuffix* predicateEndingSuffix?;

terminals
    sentenceEnd:            /[^:]+:(SF);/;
    clauseConnector:        /[^:]+:(EC|CCF|CCMOD|CCNOM);/;
    determiner:             /[^:]+:(MM);/;
    simpleNoun:             /[^:]+:(NNG|NNP|NNB|NR|SL|NP|SN);/;
    simpleVerb:             /[^:]+:(VV|VVD|VHV);/;
    verbSuffix:             /[^:]+:(EP|TNS);/;
    predicateEndingSuffix:  /[^:]+:(SEF|EF);/;

Parsing fails with:

sentence1:              subordinateClause* clause sentenceEnd;

subordinateClause:      clause clauseConnector;

clause:                 singleNounPhrase* verbPhrase;

// ----------
singleNounPhrase:       determiner_opt simpleNoun;  
determiner_opt:         determiner | EMPTY;
// ----------

verbPhrase:             simpleVerb verbSuffix* predicateEndingSuffix?
terminals
    sentenceEnd:            /[^:]+:(SF);/;
    clauseConnector:        /[^:]+:(EC|CCF|CCMOD|CCNOM);/;
    determiner:             /[^:]+:(MM);/;
    simpleNoun:             /[^:]+:(NNG|NNP|NNB|NR|SL|NP|SN);/;
    simpleVerb:             /[^:]+:(VV|VVD|VHV);/;
    verbSuffix:             /[^:]+:(EP|TNS);/;
    predicateEndingSuffix:  /[^:]+:(SEF|EF);/;
igordejanovic commented 4 years ago

Hi John. I'm in the process of a greater long due rework and cleanup that will solve this issue as some other reported lately and make code simpler and easier to maintain along the way. I made a good progress so far and expect to finish during the next week so will let you know if you have time to tests with your grammars/inputs. This shorter grammar will be an excellent regression test for the issue. Thanks.

igordejanovic commented 4 years ago

I've just finished the rework. It is on master branch. It should now correctly handle all context-free grammars. You can see the regression test for your issue

Thanks for the contribution. I'm closing this issue. Feel free to open if you notice anything wrong.

johnw3d commented 4 years ago

Hi Igor. The regression tests do indeed work with an install from the latest master, but a test I tried with the pglrcommand of essentially the same test, results in a runtime exception. I get a similar exception on some larger grammars run from within python scripts in the manner of your regression test.

The command I used was:

pglr trace good.pg -i '자전거:NNG; 있:VV; 어요:SEF; .:SF;'

where good.pg has the good version of the short test grammar from my recent post in this ticket, and the exception traceback I get is:

*** PARSING STARTED

Traceback (most recent call last):
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/bin/pglr", line 11, in <module>
    load_entry_point('parglare', 'console_scripts', 'pglr')()
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/jwainwright/Downloads/parglare/parglare/cli.py", line 83, in trace
    parser.parse(input)
  File "/Users/jwainwright/Downloads/parglare/parglare/glr.py", line 120, in parse
    str(start_head.context.state.state_id))
AttributeError: 'GSSNode' object has no attribute 'context'

This happens with both the good and bad short test grammars. I will try other combinations, but thought I should let you know about this soonest.

igordejanovic commented 4 years ago

Ah, yes, thanks. It is graphical tracing that need to be updated to work correctly with the new changes. I just did a quick fix on the master so it doesn't break but it still is not working fully correct, i.e. the merged nodes are not shown as such on the graph. Graphical tracing is meant more for educational purposes, as for ambiguous grammars and a slightly bigger inputs the graph becomes to large to handle. I'm thinking about introducing some other means for debugging using pglr command like printing and dumping trees to files (what is now done from code with tree_str() call on parse tree node).

igordejanovic commented 4 years ago

But if you are using it just to verify that something parses and not using the produced .dot graph then it should work fine now.

johnw3d commented 4 years ago

Thanks again, Igor, that repaired the pglr exceptions, but I am still getting some odd parsing results. In particular, here's the tail-end of a parsing run reporting a parsing fail, but the upcoming terminal is displayed in the error message as one of the expected terminals (sentenceEnd in this case.)

...
*** LEAVING ERROR REPORTING MODE.
    Tokens expected: sentenceEnd, nominalizingSuffix, verbToNounModifyingForm, clauseConnector, STOP, adnominalSuffi
                         x
    Tokens found: [<sentenceEnd(.:SF;)>]
    Error:  Error at 1:99:" ㅂ니다:SEF;  **> .:SF; " => Expected: STOP or adnominalSuffi
                x or clauseConnector or nominalizingSuffix or sentenceEnd or verbToN
                ounModifyingForm but found <sentenceEnd(.:SF;)>
Generated file parglare_trace.dot.
You can use dot viewer or generate pdf with the following command:
dot -Tpdf parglare_trace.dot -O parglare_trace.dot.pdf
Traceback (most recent call last):
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/bin/pglr", line 11, in <module>
    load_entry_point('parglare', 'console_scripts', 'pglr')()
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/jwainwright/Projects/mirinae/nlp-back-end/env/lib/python3.6/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/jwainwright/parglare/parglare/cli.py", line 83, in trace
    parser.parse(input)
  File "/Users/jwainwright/parglare/parglare/glr.py", line 193, in parse
    raise self.errors[-1]
parglare.exceptions.ParseError: Error at 1:99:" ㅂ니다:SEF;  **> .:SF; " => Expected: STOP or adnominalSuffix or clauseConnector or nominalizingSuffix or sentenceEnd or verbToNounModifyingForm but found <sentenceEnd(.:SF;)>

I'll gather more evidence and try to get a minimal example to post. If not, I'll send the full grammar and failing sentence. I am getting good parses on most sentences.

johnw3d commented 4 years ago

Here's the troublesome sentence and the grammar file is attached.

'공부하:VHV; 시:EP; 었:EP; 다고 하여서:CCF; 다:MAG; 이해하:VHV; ㄹ 수 있:VMOD; 는:ETM; 것:NNB; 은:TOP; 아니:VCN; ㅂ니다:SEF; .:SF; '

korean.pg.zip

igordejanovic commented 4 years ago

It could be an error in reporting. Do you know if the sentence is valid according to the grammar?

johnw3d commented 4 years ago

I believe the sentence is valid, although I have never had a good parse of that sentence with the grammars I have been building; it is a relatively complex sentence depending on nested left-recursion in at least a couple of places. Let me see if I can find a simpler example that I am sure is valid.

johnw3d commented 4 years ago

I've found some simpler examples of the same form that fail to parse, so perhaps the grammar is not correct for that form. Let me do some more experiments, and I'll report back...

igordejanovic commented 4 years ago

Ah, I think I found an issue in error reporting. Will provide the fix shortly but the message is now:

parglare.exceptions.ParseError: Error at 1:99:" ㅂ니다:SEF;  **> .:SF; " => Expected: adnominalSuffix or nominalizingSuffix or verbToNounModifyingForm but found <sentenceEnd(.:SF;)>
igordejanovic commented 4 years ago

Do you have a shorter sentence that fails similarly? I would like to extend the regression test and would like to have something that runs faster.

igordejanovic commented 4 years ago

The fix is on the master (without test)

johnw3d commented 4 years ago

OK, I am suspecting the grammar isn't covering that sentence form, so an error report of the kind you just listed would be expected. Here's a slightly-shorter sentence that has the same reporting error:

공부하:VHV; 는:ETM; 것:NNB; 은:TOP; 아니:VCN; ㅂ니다:SEF; .:SF;

johnw3d commented 4 years ago

And I found the fault in the grammar (I'm still developing it; Korean has a crazy syntax!) and the failing sentence now parses correctly. I will grab the latest master when you get that reporting fix in. Thanks again for such a great tool!

igordejanovic commented 4 years ago

Thanks. will make a regression test with the sentence you provided. The fix is already on the master branch.