propbank / propbank-release

The official released annotations, both in .prop pointer format and as CoNLL files. Does not contain the source texts.
Creative Commons Attribution Share Alike 4.0 International

missing 43 sentences from EWT #7

Open arademaker opened 4 years ago

arademaker commented 4 years ago

The LDC catalog entry (https://catalog.ldc.upenn.edu/LDC2012T13) says that EWT has 16,624 sentences, but counting the lines in the .tree files gives 16,622:

% wc -l `find . -name '*.tree'` | tail
     ...
      10 ./reviews/penntree/278775.xml.tree
       5 ./reviews/penntree/389136.xml.tree
       3 ./reviews/penntree/374604.xml.tree
       2 ./reviews/penntree/137883.xml.tree
       8 ./reviews/penntree/382073.xml.tree
       2 ./reviews/penntree/022273.xml.tree
       3 ./reviews/penntree/211933.xml.tree
       2 ./reviews/penntree/332068.xml.tree
       1 ./reviews/penntree/289763.xml.tree
   16622 total

This number matches the number of sentences in the https://github.com/universaldependencies/UD_English-EWT treebank:

ud-english-ewt % grep sent_id *.conllu | wc -l
   16622
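The same count, plus the list of IDs that differ between two sources, can be sketched in Python. This is only a minimal sketch: it assumes every sentence in a .conllu file carries a `# sent_id = ...` comment line, and that the IDs from the other source have already been mapped into the same ID scheme. The helper names (`sent_ids`, `collect_sent_ids`, `missing_ids`) are hypothetical, not part of any release tooling.

```python
import re
from pathlib import Path

# Matches the CoNLL-U sentence-level comment "# sent_id = <id>".
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def sent_ids(conllu_text):
    """Return the sent_id values found in a CoNLL-U string, in file order."""
    ids = []
    for line in conllu_text.splitlines():
        match = SENT_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def collect_sent_ids(directory, pattern="*.conllu"):
    """Collect sent_ids from every file in `directory` matching `pattern`."""
    ids = []
    for path in sorted(Path(directory).glob(pattern)):
        ids.extend(sent_ids(path.read_text(encoding="utf-8")))
    return ids


def missing_ids(reference_ids, released_ids):
    """IDs present in the reference list but absent from the release."""
    released = set(released_ids)
    return [i for i in reference_ids if i not in released]
```

Calling `len(collect_sent_ids("."))` in the treebank directory should correspond to the `grep sent_id *.conllu | wc -l` count, provided no other line in the files mentions `sent_id`.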

But this propbank-release contains only 16579 sentences. We are missing the following 43 sentences:

timjogorman commented 4 years ago

Thanks for noting this! While it would be good for us to include these, this does not mean that there is missing SRL data. I'll need to look into it more, but I'm pretty sure that each of these sentences is from a document that had zero predicates to annotate, and our pipeline simply did not prepare documents with zero annotations. I think that's an error in our pipeline: dropping them has no effect on standard SRL training (where you have gold predicate identification), but it would be more accurate to have these documents included. I'll look into adding them in.

arademaker commented 4 years ago

The strange thing here is that there are other sentences without predicate and argument annotations that are still in the corpus.

manning commented 4 years ago

Random additional comment: There are a number of sentence-division errors in the LDC EWT data. Up until now, we have kept the sentence divisions consistent between LDC EWT and UD EWT, but I intend to fix the erroneous sentence divisions some day and give the results back to LDC....

arademaker commented 4 years ago

Can I help somehow? I would really like to see the data made more consistent across LDC EWT, UD EWT, and PropBank. Do you have a list of the sentence-division errors?

arademaker commented 3 years ago

Hi @manning, I have just noticed that LDC EWT does not define a dev/test/train split. So maybe the split used in UD EWT was based on the sets defined here in this repository?

arademaker commented 3 years ago

> The strange thing here is that there are other sentences without predicate and argument annotations that are still in the corpus.

I missed one important detail in @timjogorman's explanation above, which actually repeats what he said in https://github.com/propbank/propbank-release/issues/2#issuecomment-339093947. Only files in which no sentence has an annotated predicate are omitted. So my comment above can be ignored: we do have files in which some sentences lack SRL annotation.
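To make the file-level versus sentence-level distinction concrete, here is a small Python sketch. It is not the actual release pipeline: it only assumes a "skip files with no annotations" rule as described above, and the input mapping (file name to a list of per-sentence predicate counts) and both function names are hypothetical.

```python
def dropped_documents(predicates_per_doc):
    """Files that a 'skip if no annotations' rule would omit entirely.

    `predicates_per_doc` maps a file name to a list with one predicate
    count per sentence (a hypothetical input format for illustration).
    """
    return sorted(
        doc for doc, counts in predicates_per_doc.items() if sum(counts) == 0
    )


def kept_sentences_without_predicates(predicates_per_doc):
    """Per kept file, how many of its sentences have no predicate."""
    dropped = set(dropped_documents(predicates_per_doc))
    return {
        doc: sum(1 for c in counts if c == 0)
        for doc, counts in predicates_per_doc.items()
        if doc not in dropped and any(c == 0 for c in counts)
    }


# Toy data, invented for illustration only.
toy = {
    "doc_a": [2, 0, 1],  # kept: one sentence without predicates survives
    "doc_b": [0, 0],     # dropped: no predicate in any sentence
    "doc_c": [3],        # kept
}
print(dropped_documents(toy))
print(kept_sentences_without_predicates(toy))
```

Under this rule, a sentence without predicates is lost only when every other sentence in its file also lacks predicates, which would explain why some unannotated sentences remain in the corpus while others are missing.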