propbank / propbank-release

The official released annotations, both in .prop pointer format and as conll files. Does not contain the source texts
Creative Commons Attribution Share Alike 4.0 International

Ontonotes raw content #15

Open arademaker opened 2 years ago

arademaker commented 2 years ago

The EWT dataset does contain the raw content before tokenization. I suppose that allows the UD version to obtain the source string to add to the sentence metadata (see https://github.com/UniversalDependencies/UD_English-EWT/issues/252).

What about OntoNotes? The .onf files are the only ones with the plain sentence, but even these sentences seem to be tokenized. Do we have the actual source content of the OntoNotes sentences somewhere?

MarthaSPalmer commented 2 years ago

All of the ON files came from LDC originally. They might have them.

Martha

arademaker commented 2 years ago

Thank you @MarthaSPalmer, but no, they don't have the raw. See ./wb/sel/16/sel_1677.onf:

Plain sentence:
---------------
    Your ' answer ' did n't address the specific question because you never did return one to walmart remember, you refused
    to shop there and pay for my return privledges ?

Treebanked sentence:
--------------------
    Your ' answer ' did n't address the specific question because you never did return one to walmart *PRO* remember , you
    refused *PRO*-1 to shop there and pay for my return privledges ?
MarthaSPalmer commented 2 years ago

I'm sorry. We only worked with what we got from LDC.

Martha

V3RGANz commented 2 years ago

@arademaker your concrete example can be found in the metadata/context directory. Try grep -r /path/to/dir -e 'text' with a substring of the example you are looking for.
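For anyone who prefers to run the same search from Python, here is a minimal sketch of a recursive substring search. The function name files_containing is made up for illustration, and reading the files as UTF-8 is an assumption about the release, not something confirmed in this thread.

```python
import os
from typing import Iterator


def files_containing(root: str, needle: str) -> Iterator[str]:
    """Yield paths under root whose contents include needle (roughly grep -rl)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Assumption: files are UTF-8; undecodable bytes are replaced.
                with open(path, encoding="utf-8", errors="replace") as fh:
                    if needle in fh.read():
                        yield path
            except OSError:
                continue  # unreadable file: skip it
```

Calling list(files_containing("/path/to/dir", "return privledges")) would then list the candidate context files, with /path/to/dir standing in for wherever the metadata/context directory lives locally.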

arademaker commented 2 years ago

The LDC distribution contains only the wb subfolder in the ontonotes-release-5.0/data/files/data/english/metadata/context folder. Moreover, the <text> tag contains the raw text, but not the raw sentences... recovering the sentence split would be extra hard work.

V3RGANz commented 2 years ago

@arademaker if the OntoNotes sentences are only tokenised (without any token editing), it should be easy to find a sentence with a regular expression

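example in Python:

import re
from typing import List, Optional

def find_sentence(no_trace_tokens: List[str], raw_text: str) -> Optional[re.Match]:
    # Cheap pre-check: every token must occur somewhere in the candidate text.
    if not all(t in raw_text for t in no_trace_tokens):
        return None
    # Escape each token (e.g. '?') and allow optional whitespace between tokens.
    sentence_regex = r'\s*'.join(re.escape(t) for t in no_trace_tokens)
    return re.search(sentence_regex, raw_text)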

Yes, there is some work to be done, but I can only propose a solution to your problem. About wb: does pre-tokenisation occur in any other corpus?
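
As a rough illustration, the regular-expression idea can be tried on the treebanked sentence from sel_1677.onf quoted earlier in this thread. This is only a sketch under two assumptions: trace tokens are taken to be exactly those starting with '*' (as *PRO* and *PRO*-1 do here), and the raw_text string is a hypothetical detokenized passage written for this example, not text copied from the LDC context files. The helper names strip_traces and find_sentence_span are also invented for illustration.

```python
import re
from typing import List, Optional, Tuple


def strip_traces(tokens: List[str]) -> List[str]:
    # Assumption: trace tokens start with '*', e.g. *PRO* and *PRO*-1.
    return [t for t in tokens if not t.startswith("*")]


def find_sentence_span(tokens: List[str], raw_text: str) -> Optional[Tuple[int, int]]:
    # Escape each token and allow optional whitespace between neighbours.
    pattern = r"\s*".join(re.escape(t) for t in tokens)
    match = re.search(pattern, raw_text)
    return match.span() if match else None


# Treebanked sentence as quoted from ./wb/sel/16/sel_1677.onf in this thread.
treebanked = (
    "Your ' answer ' did n't address the specific question because you never did "
    "return one to walmart *PRO* remember , you refused *PRO*-1 to shop there and "
    "pay for my return privledges ?"
)

# Hypothetical raw text: a guess at how the untokenized source might look.
raw_text = (
    "Some earlier text. Your 'answer' didn't address the specific question "
    "because you never did return one to walmart remember, you refused to shop "
    "there and pay for my return privledges? Some later text."
)

tokens = strip_traces(treebanked.split())
span = find_sentence_span(tokens, raw_text)
recovered = raw_text[span[0]:span[1]] if span else None
print(recovered)
```

Because the separator is \s*, the pattern also matches where the tokenizer split text that had no space in the source (did n't versus didn't). It will fail as soon as a token was edited rather than just split, which is the caveat mentioned above.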