Open arademaker opened 3 years ago
Among those files, 83 don't matter: according to CoNLL-2009-ST/LDC2012T04/docs/English-README.TXT, they were not used for SRL in LDC2012T04.
But what about the other 500 files?
[1] "wsj/00/wsj_0006.conllu" "wsj/00/wsj_0013.conllu" "wsj/00/wsj_0015.conllu"
[4] "wsj/00/wsj_0018.conllu" "wsj/00/wsj_0026.conllu" "wsj/00/wsj_0027.conllu"
[7] "wsj/00/wsj_0032.conllu" "wsj/00/wsj_0048.conllu" "wsj/00/wsj_0057.conllu"
[10] "wsj/00/wsj_0063.conllu" "wsj/00/wsj_0067.conllu" "wsj/00/wsj_0068.conllu"
[13] "wsj/00/wsj_0073.conllu" "wsj/00/wsj_0080.conllu" "wsj/01/wsj_0106.conllu"
[16] "wsj/01/wsj_0109.conllu" "wsj/01/wsj_0118.conllu" "wsj/01/wsj_0122.conllu"
[19] "wsj/01/wsj_0124.conllu" "wsj/01/wsj_0125.conllu" "wsj/01/wsj_0127.conllu"
[22] "wsj/01/wsj_0132.conllu" "wsj/01/wsj_0135.conllu" "wsj/01/wsj_0136.conllu"
[25] "wsj/01/wsj_0137.conllu" "wsj/01/wsj_0144.conllu" "wsj/01/wsj_0149.conllu"
[28] "wsj/01/wsj_0150.conllu" "wsj/01/wsj_0151.conllu" "wsj/01/wsj_0152.conllu"
[31] "wsj/01/wsj_0153.conllu" "wsj/01/wsj_0157.conllu" "wsj/01/wsj_0158.conllu"
[34] "wsj/01/wsj_0159.conllu" "wsj/01/wsj_0160.conllu" "wsj/01/wsj_0161.conllu"
[37] "wsj/01/wsj_0162.conllu" "wsj/01/wsj_0165.conllu" "wsj/01/wsj_0166.conllu"
[40] "wsj/01/wsj_0167.conllu" "wsj/01/wsj_0168.conllu" "wsj/01/wsj_0169.conllu"
[43] "wsj/01/wsj_0171.conllu" "wsj/01/wsj_0172.conllu" "wsj/01/wsj_0173.conllu"
[46] "wsj/01/wsj_0175.conllu" "wsj/01/wsj_0176.conllu" "wsj/01/wsj_0177.conllu"
[49] "wsj/01/wsj_0178.conllu" "wsj/01/wsj_0179.conllu" "wsj/01/wsj_0184.conllu"
[52] "wsj/01/wsj_0187.conllu" "wsj/01/wsj_0188.conllu" "wsj/01/wsj_0189.conllu"
[55] "wsj/01/wsj_0194.conllu"
..
[520] "wsj/22/wsj_2203.conllu" "wsj/22/wsj_2209.conllu" "wsj/22/wsj_2210.conllu"
[523] "wsj/22/wsj_2211.conllu" "wsj/22/wsj_2212.conllu" "wsj/22/wsj_2213.conllu"
[526] "wsj/22/wsj_2216.conllu" "wsj/22/wsj_2218.conllu" "wsj/22/wsj_2219.conllu"
[529] "wsj/22/wsj_2235.conllu" "wsj/22/wsj_2236.conllu" "wsj/22/wsj_2244.conllu"
[532] "wsj/22/wsj_2245.conllu" "wsj/22/wsj_2246.conllu" "wsj/22/wsj_2248.conllu"
[535] "wsj/22/wsj_2249.conllu" "wsj/22/wsj_2251.conllu" "wsj/22/wsj_2256.conllu"
[538] "wsj/22/wsj_2259.conllu" "wsj/22/wsj_2262.conllu" "wsj/22/wsj_2266.conllu"
[541] "wsj/22/wsj_2267.conllu" "wsj/22/wsj_2268.conllu" "wsj/22/wsj_2271.conllu"
[544] "wsj/22/wsj_2272.conllu" "wsj/22/wsj_2273.conllu" "wsj/22/wsj_2277.conllu"
[547] "wsj/22/wsj_2282.conllu"
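For anyone who wants to reproduce a list like the one above, here is a minimal Python sketch of the comparison, not the script that produced the original output. It assumes both corpora are unpacked locally and that documents can be matched by section directory plus base file name (e.g. `00/wsj_0006`), ignoring extensions, since each release uses its own file suffixes. The function names and the `wsj_*` glob pattern are hypothetical choices for illustration.

```python
from pathlib import Path


def relative_stems(root, pattern="wsj_*"):
    """Collect 'section/basename' keys (e.g. '00/wsj_0006') under root.

    Extensions are dropped so that several annotation files for the
    same document collapse into a single key.
    """
    stems = set()
    for path in Path(root).rglob(pattern):
        if path.is_file():
            # wsj_0006.conllu -> wsj_0006 (everything after the first dot is dropped)
            name = path.name.split(".", 1)[0]
            stems.add(f"{path.parent.name}/{name}")
    return stems


def missing_documents(full_root, subset_root):
    """Return sorted keys present under full_root but absent under subset_root."""
    return sorted(relative_stems(full_root) - relative_stems(subset_root))
```

Calling `missing_documents` with the local LDC95T7 `combined/wsj/` directory as the first argument and the OntoNotes `nw/wsj/` directory as the second would list the documents that exist only in the former. Matching on document keys rather than raw file counts matters if one release stores more than one file per document.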
This is what I've heard referred to as the "financial subset" of WSJ. A bunch of files were judged to be extremely repetitive, formulaic financial text (largely lists of stocks going up and down), and those putting together OntoNotes decided to remove them from the release. So as far as I've been told, these files were omitted from OntoNotes intentionally, even though some of them already had SRL annotations.
Updating these wasn't really in scope when we did the unification, so I don't think any unified form exists (it would require updating the old SRL data to match OntoNotes tokenization, plus some manual judgments for the small set of frames that changed since PB1).
Most importantly, these files were not included in the treebank revision that was aimed at improving synchronization with PropBank. So at this point the original TB/PB annotation on those files is pretty archaic. I was personally very unhappy with that decision because those files also had WSD and I believe coref as well, so it seemed wasteful but it wasn’t up to me.
Martha
copied from https://github.com/UniversalDependencies/UD_English-EWT/issues/204
I would like to clarify the relation between https://catalog.ldc.upenn.edu/LDC2013T19 (OntoNotes 5.0) and https://catalog.ldc.upenn.edu/LDC2012T04 (2009 CoNLL Shared Task). The documentation of LDC2012T04 says that the texts came from https://catalog.ldc.upenn.edu/LDC95T7. I didn't know that LDC95T7 is part of LDC2013T19 (OntoNotes). I just found
ontonotes-release 5.0/data/files/data/english/annotations/nw/wsj/
with the same WSJ subdirectories as LDC95T7/combined/wsj/
So LDC95T7 is part of LDC2013T19! My bad, I didn't know that. I also didn't know about https://catalog.ldc.upenn.edu/LDC2015T13, a revised version of https://catalog.ldc.upenn.edu/LDC95T7; sadly, the LDC95T7 page makes no reference to this new version.
Since I am projecting the SRL onto UD annotations obtained from LDC95T7, and I have already done that for EWT and OntoNotes, I may already have the data I was trying to obtain from LDC2012T04. I suspect that the SRL annotations from LDC2012T04 also differ from the OntoNotes annotations here, the latter probably being the more recent. Am I right? Maybe @MarthaSPalmer has something to add here and can confirm my understanding!
But in OntoNotes, we have 1728 files inside
data/files/data/english/annotations/nw/wsj/
subdirectories, and in LDC95T7 we have 2312 files. So 584 files are missing from OntoNotes. Does anyone know anything about this? These are the ones missing from OntoNotes: