missing 43 sentences from EWT

arademaker commented 4 years ago

https://catalog.ldc.upenn.edu/LDC2012T13 says that EWT has 16,624 sentences. The distributed .tree files actually contain 16,622 lines (one tree per line):

% wc -l `find . -name '*.tree'` | tail
     ...
      10 ./reviews/penntree/278775.xml.tree
       5 ./reviews/penntree/389136.xml.tree
       3 ./reviews/penntree/374604.xml.tree
       2 ./reviews/penntree/137883.xml.tree
       8 ./reviews/penntree/382073.xml.tree
       2 ./reviews/penntree/022273.xml.tree
       3 ./reviews/penntree/211933.xml.tree
       2 ./reviews/penntree/332068.xml.tree
       1 ./reviews/penntree/289763.xml.tree
   16622 total

This number matches the number of sentences in the https://github.com/universaldependencies/UD_English-EWT treebank:

ud-english-ewt % grep sent_id *.conllu | wc -l
   16622

But this propbank-release contains only 16579 sentences. We are missing the following 43 sentences:

reviews-052884-0001 : unique gifts and cards
reviews-172245-0001 : Great store great products
reviews-258042-0001 : Lovley food and fab chips
reviews-018268-0001 : best square slice around.
reviews-253807-0001 : Cheapest drinks in Keene!
reviews-105719-0001 : Over priced for Mexican food
reviews-190389-0001 : very miss informed people!!
reviews-035932-0001 : Simple, Quick take away.
reviews-173758-0001 : best place for snowboard eva.
reviews-211844-0001 : Favorite DD spot in the area!
reviews-189171-0001 : A most outstanding, professional firm.
reviews-228154-0001 : Good food and coffee with a nice atmosphere
reviews-208180-0001 : Good quality Indian food in a pleasant environment
reviews-242303-0001 : Awesome bacon egg and cheese sandwich for breakfast.
reviews-317480-0001 : Great atmosphere, great food.
reviews-317480-0002 : Definitely a must.
reviews-107292-0001 : awesome bagels
reviews-107292-0002 : long lines on the weekends but worth it
reviews-330275-0001 : Some of the nicest people and very good work standards
reviews-235462-0001 : Hobbs on Mass.
reviews-235462-0002 : Absolutely my favorite store in Lawrence, KS
reviews-341435-0001 : Nice and quiet place with cosy living room just outside the city.
reviews-203196-0001 : VINGAS
reviews-203196-0002 : VISAKHA INDUSTRIAL GASES PVT. LTD., location at google maps.
reviews-008635-0001 : Good food and very friendly staff.
reviews-008635-0002 : Very good with my 5 year old daughter.
reviews-008635-0003 : Interesting good value wine list to.
reviews-008635-0004 : Beer a bit expensive.
answers-20090203211448AAoG2yX_ans-0001 : Green Tea Or White Tea?
answers-20090203211448AAoG2yX_ans-0002 : Green
answers-20090203211448AAoG2yX_ans-0003 : Green Tea.
answers-20090203211448AAoG2yX_ans-0004 : Green tea
reviews-327867-0001 : Good clean store nice car wash
reviews-081116-0001 : Best fried shrimp in the state!
reviews-314938-0001 : The best pilates on the Gold Coast!
reviews-184290-0001 : wow wow wow.
reviews-184290-0002 : the bast cab in minneapolis
reviews-388121-0001 : Too many kids, too many knifings, too many taserings.
reviews-058878-0001 : Nice little locally owned greek bar and grill.
reviews-058878-0002 : Good food.
reviews-058878-0003 : Great wings!
reviews-046500-0001 : Mens and Boys Barbers, on the number 9 Bus route.
reviews-046500-0002 : Ladies room, Open Sundays
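For anyone who wants to reproduce a list like the one above, here is a minimal Python sketch of the comparison. It assumes the UD side is read from the `# sent_id = ...` comment lines of the CoNLL-U files, and that the sentence IDs present in this release have already been collected into a list by some other means (how to derive them from the release files is not shown here and is an assumption). The function names are hypothetical.

```python
import re

# CoNLL-U stores the sentence ID in a comment line: "# sent_id = <id>"
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def sent_ids_from_conllu(text):
    """Return the sent_id values found in a CoNLL-U string, in file order."""
    ids = []
    for line in text.splitlines():
        match = SENT_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def missing_ids(reference_ids, release_ids):
    """Return the reference IDs (in order) that do not appear in the release."""
    release = set(release_ids)
    return [sid for sid in reference_ids if sid not in release]


# Tiny illustrative sample (comment lines only; token lines are omitted).
sample = """# sent_id = reviews-052884-0001
# text = unique gifts and cards

# sent_id = reviews-058878-0002
# text = Good food.
"""

ud_ids = sent_ids_from_conllu(sample)
# Hypothetical release content: only the second sentence is present.
print(missing_ids(ud_ids, ["reviews-058878-0002"]))
```

On the real data, the text of each `.conllu` file would be passed to `sent_ids_from_conllu` and the results concatenated before calling `missing_ids`.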

timjogorman commented 4 years ago

Thanks for noting this! It would be good for us to include these, but their absence does not mean that any SRL data is missing. I'll need to look into it more, but I'm pretty sure that each of these sentences comes from a document that had zero predicates to annotate, and our pipeline simply did not prepare documents with zero annotations. I think that's an error in our pipeline: dropping them has no effect on standard SRL training (where you have gold predicate identification), but it would be more accurate to include these documents. I'll look into adding them.

arademaker commented 4 years ago

The strange thing here is that other sentences without predicate and argument annotations are still in the corpus.

manning commented 4 years ago

Random additional comment: There are a number of failures to divide sentences in the LDC EWT data. Up until now, we have kept the sentence divisions consistent between LDC EWT and UD EWT, but I intend to fix the erroneous sentence divisions some day and give the results back to LDC...

arademaker commented 4 years ago

Can I help somehow? I really would like to see the data more consistent between the LDC EWT, UD EWT, and Propbank. Do you have the list of errors in the division of sentences?

arademaker commented 3 years ago

Hi @manning, I have just noticed that LDC EWT does not define a train/dev/test split. So maybe the split used in UD EWT was based on the sets defined here in this repository?

arademaker commented 3 years ago

> The strange thing here is that there are other sentences without predicate and arguments annotation but still in the corpus.

I missed one important detail in @timjogorman's explanation above. He actually repeated what he said in https://github.com/propbank/propbank-release/issues/2#issuecomment-339093947. Only files in which no sentence has an annotated predicate are omitted. So my comment above can be ignored: we do have files in which some sentences lack SRL annotation.
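The whole-file explanation can be checked mechanically. The Python sketch below assumes that a document ID is the sentence ID with its trailing `-NNNN` counter removed (which matches the IDs listed in this issue), and it reports any document that is only partially missing. An empty result would be consistent with whole files being dropped. The function names are hypothetical.

```python
from collections import defaultdict


def doc_id(sent_id):
    """Drop the trailing sentence counter, e.g. 'reviews-008635-0003' -> 'reviews-008635'."""
    return sent_id.rsplit("-", 1)[0]


def partially_missing_docs(all_ids, missing):
    """Return documents with some, but not all, of their sentences missing."""
    missing = set(missing)
    by_doc = defaultdict(list)
    for sid in all_ids:
        by_doc[doc_id(sid)].append(sid)
    return sorted(
        doc
        for doc, sids in by_doc.items()
        if any(s in missing for s in sids) and not all(s in missing for s in sids)
    )


# Illustrative input: IDs from this issue plus one made-up document ("example-000001").
all_ids = [
    "reviews-058878-0001", "reviews-058878-0002", "reviews-058878-0003",
    "example-000001-0001", "example-000001-0002",
]
missing = ["reviews-058878-0001", "reviews-058878-0002", "reviews-058878-0003"]
print(partially_missing_docs(all_ids, missing))
```

Here `all_ids` would be the full UD EWT list and `missing` the 43 IDs reported above.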

propbank / propbank-release

missing 43 sentences from EWT #7