propbank / propbank-release

The official released annotations, both in .prop pointer format and as conll files. Does not contain the source texts.
Creative Commons Attribution Share Alike 4.0 International

encoding error #8

Open arademaker opened 4 years ago

arademaker commented 4 years ago

See https://github.com/UniversalDependencies/UD_English-EWT/issues/83

UD treebank preserved in a single token the word basically regardless of the encoding error. But the Propbank data broke it into two tokens:

google/ewt/answers/00/20111107175720AAlb2TB_ans.xml  14   16        basic    GW            (S(ADVP*         -            -        *   (ARGM-ADV*             *             *
google/ewt/answers/00/20111107175720AAlb2TB_ans.xml  14   17         ally    RB                   *)        -            -        *            *)            *             *

manning commented 4 years ago

I think this has to be regarded as a text processing error in the preparation of the PropBank materials.

The word has a U+00AD soft hyphen character in it. This is a valid Unicode character. It's not an encoding error, and it is most definitely not a space character.

I think the only two good choices are either to preserve the original as a single token, or to decide that you don't want to deal with soft hyphen characters and delete it, leaving the single token basically. This is just a processing mistake.
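To illustrate the two options above, here is a minimal Python sketch. The helper name `strip_soft_hyphens` is hypothetical and is not part of any PropBank or UD tooling; it only shows that U+00AD is a format character rather than whitespace, so plain whitespace tokenization keeps the word whole, and that deleting the character yields the ordinary spelling.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(text: str) -> str:
    """Hypothetical helper: delete every soft hyphen from the text."""
    return text.replace(SOFT_HYPHEN, "")


word = "basic" + SOFT_HYPHEN + "ally"

# U+00AD is in Unicode general category Cf (format), not a space separator.
print(unicodedata.category(SOFT_HYPHEN))

# Option 1: preserve the original; whitespace splitting keeps one token.
print(len(word.split()))

# Option 2: drop the soft hyphen, leaving the single token "basically".
print(strip_soft_hyphens(word))
```

Either option keeps the word as one token; which one is appropriate depends on whether downstream consumers expect the original characters to be preserved.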

arademaker commented 3 years ago

Regardless of what is decided in https://github.com/UniversalDependencies/UD_English-EWT/issues/83, the data here must be compatible with it. So basic<U+00AD>ally (or basically) needs to be a single token.

The original https://catalog.ldc.upenn.edu/LDC2012T13 data contains (ADVP (GW basic) (RB ally)). But I am assuming that fixing the LDC data is hard.

Can we fix the 20111107175720AAlb2TB_ans.xml.gold_skel?
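As a rough picture of what such a fix would involve at the token level, the sketch below merges a `GW` fragment into the token that follows it. Everything here is an assumption made for illustration: the function name `merge_gw_tokens` is hypothetical, tokens are simplified to `(form, pos)` pairs, and the parse-bit and argument columns of the real conll files (which would also need to be merged and renumbered) are ignored.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, POS tag); a simplification of a conll row


def merge_gw_tokens(tokens: List[Token]) -> List[Token]:
    """Hypothetical sketch: glue each GW fragment onto the next token.

    Assumes the GW fragment precedes the token it belongs to, as in
    (ADVP (GW basic) (RB ally)). The merged token keeps the POS tag of
    the non-GW part.
    """
    merged: List[Token] = []
    pending = ""  # text of GW fragments waiting for the token they attach to
    for form, pos in tokens:
        if pos == "GW":
            pending += form
            continue
        merged.append((pending + form, pos))
        pending = ""
    if pending:
        # A trailing GW with nothing to attach to is kept unchanged.
        merged.append((pending, "GW"))
    return merged


print(merge_gw_tokens([("basic", "GW"), ("ally", "RB")]))
```

This covers only the word and POS columns; a real repair would also have to collapse the two rows' bracket and argument annotations into one row and shift the token indices of everything after it in the sentence.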

timjogorman commented 3 years ago

I would absolutely love for the "stand-off" PropBank EWT data to be switched over to point to English UD -- removing the reliance on LDC2012T13 would let us fix all of these issues easily (and people have easier access to the data). As long as PropBank is based on LDC2012T13, it's a pain to do any of these fixes (and any LDC update could take years).

MarthaSPalmer commented 3 years ago

Oh, yes, please! That would be terrific. I don’t have a real PB master like Tim working at CU anymore, but I have a student who just graduated with an MS in Computational Linguistics and has some experience with moving PB mappings from Treebank parsers to UD. He would need supervision, but if it would be helpful, I am happy to volunteer him to help with this.