masashi-y / depccg

A* CCG Parser with a Supertag and Dependency Factored Model
MIT License

AssertionError for N-best parsing and XML output #7

Closed (pasmargo closed this issue 6 years ago)

pasmargo commented 6 years ago

When I request N-best parsing and XML output, I get an assertion error:

For 1-best parsing:

echo "this|this|DT|O is|be|VBZ|O a|a|DT|O test|test|NN|O sentence|sentence|NN|O .|.|.|O" | python ../depccg/src/run.py ${depccg_dir}/models/tri_headfirst en --input-format POSandNERtagged --format xml --nbest 1
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="candc.xml"?>
<candc>
<ccg>
<rule type="rp" cat="S[dcl]">
<rule type="ba" cat="S[dcl]">
<lf start="0" span="1" word="this" lemma="this" pos="DT" chunk="XX" entity="O" cat="NP" />
<rule type="fa" cat="S[dcl]\NP">
<lf start="1" span="1" word="is" lemma="be" pos="VBZ" chunk="XX" entity="O" cat="(S[dcl]\NP)/NP" />
<rule type="fa" cat="NP">
<lf start="2" span="1" word="a" lemma="a" pos="DT" chunk="XX" entity="O" cat="NP[nb]/N" />
<rule type="fa" cat="N">
<lf start="3" span="1" word="test" lemma="test" pos="NN" chunk="XX" entity="O" cat="N/N" />
<lf start="4" span="1" word="sentence" lemma="sentence" pos="NN" chunk="XX" entity="O" cat="N" />
</rule>
</rule>
</rule>
</rule>
<lf start="5" span="1" word="." lemma="." pos="." chunk="XX" entity="O" cat="." />
</rule>

</ccg>
</candc>

For 2-best parsing:

<?xml-stylesheet type="text/xsl" href="candc.xml"?>
<candc>
Traceback (most recent call last):
  File "../depccg/src/run.py", line 81, in <module>
    to_xml(res, tagged_doc)
  File "../depccg/src/run.py", line 38, in to_xml
    assert len(tree) == len(tagged)
AssertionError

masashi-y commented 6 years ago

Hi! Since the C&C parser does not support N-best parsing, I do not know what the output XML should look like when N-best output is combined with the C&C XML format. Do you have any suggestions?

hiroshinoji commented 6 years ago

Sorry to interrupt.

@pasmargo I suspect you are now incorporating depccg into ccg2lambda. Sorry for my long silence after saying I would incorporate various CCG parsers into Jigg. Actually, the latest Jigg already supports them, but without k-best mode, which I have not tried on depccg yet. It also does not support k-best mode for easyccg, for technical reasons (though I think it is possible).

The output format may change after the discussion in this and other threads. Once the format is decided, I would like to implement k-best mode in Jigg for depccg, as well as easyccg, soon.

pasmargo commented 6 years ago

> Do you have any suggestion?

Unfortunately, I don't know of a good solution. One possibility is to annotate the <ccg> node with the sentence ID, e.g. <ccg ID=1>, where the ID would be equivalent to EasyCCG's ID.
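
A consumer could use such an ID to regroup the N-best parses of each sentence. The following is a minimal sketch, assuming each <ccg> node carries a quoted `ID` attribute (XML requires attribute values to be quoted, so `ID="1"` rather than `ID=1`). The function name `group_parses_by_sentence` and the sample document are hypothetical and are not part of depccg.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: two parses for sentence 1, one parse for sentence 2.
SAMPLE = """<candc>
<ccg ID="1"><rule type="ba" cat="S[dcl]" /></ccg>
<ccg ID="1"><rule type="fa" cat="NP" /></ccg>
<ccg ID="2"><rule type="ba" cat="S[dcl]" /></ccg>
</candc>"""


def group_parses_by_sentence(xml_text):
    """Map each sentence ID to the list of <ccg> elements that share it."""
    groups = defaultdict(list)
    for ccg in ET.fromstring(xml_text).iter("ccg"):
        groups[ccg.get("ID")].append(ccg)
    return dict(groups)


groups = group_parses_by_sentence(SAMPLE)
for sentence_id, parses in groups.items():
    # Print each sentence ID with the number of parses found for it.
    print(sentence_id, len(parses))
```

Because parses are grouped purely by the attribute value, the order of <ccg> nodes in the file does not matter to the consumer.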

@hiroshinoji Thank you very much for your consideration! I think Jigg is a great solution for the integration of several CCG parsers and I am looking forward to seeing it implemented with n-best support. At the moment I am only drafting some temporary solutions until Jigg is ready. This is not so urgent for me at the moment so please do not work too hard on it! Thank you!!

masashi-y commented 6 years ago

OK, I have adopted <ccg sentence=1 id=1>, where the first attribute is the sentence ID and the second is the index of the parse among the N-best parses. I am willing to change it if there is a standard format, but I am closing this issue since it no longer causes the assertion error.
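
For readers who want to produce or test against this layout, here is a minimal sketch of emitting one <ccg> node per parse with `sentence` and `id` attributes. It is not depccg's actual implementation: the function name `nbest_to_candc_xml` is hypothetical, each parse is reduced to its root category as a stand-in for a full tree, and both the 1-based numbering and the reading of `id` as the parse's position in the N-best list are assumptions based only on the example above.

```python
import xml.etree.ElementTree as ET


def nbest_to_candc_xml(nbest_parses):
    """Build a <candc> element holding one <ccg> node per parse.

    nbest_parses: one list per sentence, each containing the root
    categories of that sentence's parses (a stand-in for real trees).
    """
    root = ET.Element("candc")
    for sentence_idx, parses in enumerate(nbest_parses, start=1):
        for rank, root_cat in enumerate(parses, start=1):
            # Assumption: both counters start at 1, as in <ccg sentence=1 id=1>.
            ccg = ET.SubElement(
                root, "ccg", {"sentence": str(sentence_idx), "id": str(rank)}
            )
            ET.SubElement(ccg, "rule", {"type": "rp", "cat": root_cat})
    return root


# Two parses for the first sentence, one for the second.
doc = nbest_to_candc_xml([["S[dcl]", "NP"], ["S[dcl]"]])
print(ET.tostring(doc, encoding="unicode"))
```

Serializing through ElementTree also quotes the attribute values, which keeps the output well-formed XML.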