Open arademaker opened 4 years ago
Thanks for noting this! While it would be good for us to include these, this does not mean that there is missing SRL data -- while I'll need to look into it more, I'm pretty sure that each of these sentences is from a document that had zero predicates to annotate, and our pipeline ended up simply not preparing documents with zero annotations. I think that's an error in our pipeline -- while dropping them would have no effect on standard SRL training (where you have gold predicate identification) it would be more accurate to have these documents included. I'll look into adding them in.
The strange thing here is that there are other sentences without predicate and arguments annotation but still in the corpus.
Random additional comment: there are a number of places in the LDC EWT data where sentences were not split correctly. Up until now, we have kept the sentence divisions consistent between LDC EWT and UD EWT, but I intend to fix the erroneous sentence divisions some day and give the results back to LDC.
Can I help somehow? I would really like to see the data made more consistent across LDC EWT, UD EWT, and PropBank. Do you have a list of the sentence division errors?
I missed one important detail in @timjogorman's explanation above. He actually repeated what he said in https://github.com/propbank/propbank-release/issues/2#issuecomment-339093947. Only files in which no sentence has an annotated predicate are omitted. So my comment above can be ignored: we do have files in which some sentences lack SRL annotation.
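To make the file-level rule concrete, here is a minimal sketch in Python. It is not the actual propbank-release pipeline: the `Corpus` structure, the function names, and the toy document names are all hypothetical, and each file is reduced to a list of per-sentence predicate counts.

```python
from typing import Dict, List

# Hypothetical structure: file name -> number of annotated predicates per sentence.
Corpus = Dict[str, List[int]]


def released_files(corpus: Corpus) -> Corpus:
    """Keep a file if at least one of its sentences has an annotated predicate."""
    return {
        name: counts
        for name, counts in corpus.items()
        if any(count > 0 for count in counts)
    }


def sentence_total(corpus: Corpus) -> int:
    """Total number of sentences across all files."""
    return sum(len(counts) for counts in corpus.values())


# Toy data, invented for illustration only.
toy = {
    "doc_a": [2, 0, 1],  # mixed file: the unannotated sentence is still released
    "doc_b": [0, 0],     # no predicates anywhere: the whole file is dropped
    "doc_c": [3],
}

kept = released_files(toy)
missing = sentence_total(toy) - sentence_total(kept)
print(sorted(kept), missing)
```

Under this rule, a sentence without predicates survives whenever another sentence in the same file is annotated, which is why unannotated sentences still show up in the corpus.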
https://catalog.ldc.upenn.edu/LDC2012T13 says that EWT has 16,624 sentences. They actually have:
This number matches the number of sentences in the https://github.com/universaldependencies/UD_English-EWT treebank:
But this propbank-release contains only 16,579 sentences. We are missing the following 43 sentences:
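A comparison of this kind can be reproduced by collecting sentence identifiers from each resource and taking the set difference. The sketch below assumes both sides expose CoNLL-U style `# sent_id = ...` comment lines, as the UD treebank does. Whether the PropBank release can be keyed the same way is an assumption, and the sample strings and IDs are invented.

```python
import re
from typing import Set

# CoNLL-U files mark each sentence with a "# sent_id = <id>" comment line.
SENT_ID = re.compile(r"^# sent_id = (\S+)[ \t]*$", re.MULTILINE)


def sentence_ids(text: str) -> Set[str]:
    """Return the set of sentence IDs found in CoNLL-U style text."""
    return set(SENT_ID.findall(text))


# Invented samples; real input would be the contents of the treebank files.
ud_sample = (
    "# sent_id = doc_a-0001\n1\tHi\n\n"
    "# sent_id = doc_b-0001\n1\tOk\n\n"
    "# sent_id = doc_b-0002\n1\tYes\n"
)
release_sample = "# sent_id = doc_a-0001\n1\tHi\n"

ud_ids = sentence_ids(ud_sample)
release_ids = sentence_ids(release_sample)
missing_ids = sorted(ud_ids - release_ids)

print(len(ud_ids), len(release_ids), missing_ids)
```

Comparing by ID rather than by raw counts shows which sentences are absent, not just how many.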