Open arademaker opened 4 years ago
I think this has to be regarded as a text processing error in the preparation of propbank materials….
The word has a U+00AD soft hyphen character in it. This is a valid Unicode character. It's not an encoding error, and it is most definitely not a space character.
I think the only two good choices are to either preserve the original as a single token, or to decide that you don't want to deal with soft hyphen characters and to delete it, leaving one token: basically. This is just a processing mistake.
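To make the point about U+00AD concrete, here is a minimal Python sketch. The helper name `strip_soft_hyphens` is made up for illustration and is not part of any PropBank or UD tooling. It only shows that the soft hyphen is a format character, not whitespace, and that deleting it yields one token.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(token: str) -> str:
    """Hypothetical helper: delete soft hyphens, keeping the token whole."""
    return token.replace(SOFT_HYPHEN, "")


raw = "basic\u00adally"

# The soft hyphen is a format character (category Cf), not a space.
print(unicodedata.name(SOFT_HYPHEN), unicodedata.category(SOFT_HYPHEN))
print(SOFT_HYPHEN.isspace())

# Whitespace splitting therefore keeps the word as a single token.
print(raw.split())

# Option 2 from the comment above: drop the character entirely.
print(strip_soft_hyphens(raw))
```

Either choice (keeping `raw` untouched or using the stripped form) gives a single token; the two-token split only appears if something treats U+00AD as a separator.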
Regardless of the decision in https://github.com/UniversalDependencies/UD_English-EWT/issues/83, the data here must be compatible with it. So basic<U+00AD>ally or basically needs to be a single token.
The original https://catalog.ldc.upenn.edu/LDC2012T13 data contains (ADVP (GW basic) (RB ally)). But I am assuming that fixing the LDC data is hard.
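As a rough illustration of what a fix on the PropBank side could look like, the sketch below reads the leaves of that bracketed fragment and glues each GW piece onto the token that follows it. This is only an assumption about how one might do it: the function names are hypothetical, the merged token simply inherits the tag of the following leaf, and the regular expression handles only simple `(TAG word)` leaves, not the full treebank format.

```python
import re

# Matches a leaf such as "(RB ally)": a tag and a word with no nested brackets.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def leaves(bracketed: str) -> list[tuple[str, str]]:
    """Return (tag, word) pairs for every leaf in a bracketed fragment."""
    return LEAF.findall(bracketed)


def merge_gw(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Assumption: a GW leaf is a word fragment belonging to the next leaf."""
    merged = []
    pending = ""
    for tag, word in pairs:
        if tag == "GW":
            pending += word  # hold the fragment until the next real token
        else:
            merged.append((tag, pending + word))
            pending = ""
    if pending:  # a trailing GW with nothing after it is kept as-is
        merged.append(("GW", pending))
    return merged


fragment = "(ADVP (GW basic) (RB ally))"
print(leaves(fragment))
print(merge_gw(leaves(fragment)))
```

Any real fix would also have to shift the token indices that the stand-off annotations point to, which this sketch does not attempt.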
Can we fix the 20111107175720AAlb2TB_ans.xml.gold_skel?
I would absolutely love for the "stand-off" PropBank EWT data to be switched over to point to English UD -- removing the reliance on LDC2012T13 would let us fix all of these issues easily (and people have easier access to the data). As long as PropBank is based on LDC2012T13, it's a pain to do any of these fixes (and any LDC update could take years).
Oh, yes, please! That would be terrific. I don’t have a real PB master like Tim working at CU anymore, but I have a student who just graduated with an MS in Computational Linguistics and has some experience with moving PB mappings from Treebank parsers to UD. He would need supervision, but if it would be helpful, I am happy to volunteer him to help with this.
On Nov 11, 2020, at 3:54 PM, timjogorman <notifications@github.com> wrote:
I would absolutely love for the "stand-off" PropBank EWT data to be switched over to point to English UD -- removing the reliance on LDC2012T13 would let us fix all of these issues easily (and people have easier access to the data). As long as PropBank is based on LDC2012T13, it's a pain to do any of these fixes (and any LDC update could take years).
See https://github.com/UniversalDependencies/UD_English-EWT/issues/83
The UD treebank preserved the word basically as a single token, regardless of the encoding error. But the PropBank data broke it into two tokens: