Open arademaker opened 2 years ago
All of the ON files came from LDC originally. They might have them.
Martha
On Oct 4, 2021, at 7:57 AM, Alexandre Rademaker @.***> wrote:
The EWT dataset does contain the raw content before tokenization. I suppose that allows the UD version to obtain the source string to add to the sentences metadata (see https://github.com/UniversalDependencies/UD_English-EWT/issues/252).
What about OntoNotes? The .onf files are the only ones with the plain sentence, but even these sentences seem tokenized. Do we have the actual source content of OntoNotes sentences somewhere?
Thank you @MarthaSPalmer, but no, they don't have the raw. See ./wb/sel/16/sel_1677.onf:
Plain sentence:
---------------
Your ' answer ' did n't address the specific question because you never did return one to walmart remember, you refused
to shop there and pay for my return privledges ?
Treebanked sentence:
--------------------
Your ' answer ' did n't address the specific question because you never did return one to walmart *PRO* remember , you
refused *PRO*-1 to shop there and pay for my return privledges ?
I'm sorry. We only worked with what we got from LDC.
Martha
@arademaker your concrete example can be found in the metadata/context directory. Try grep -r /path/to/dir -e 'text' with some substrings of the example you are looking for.
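If grep is not at hand, the same search can be sketched in Python. This is only an illustration: the function name find_files_containing is made up for this example, and reading the files as UTF-8 while ignoring decoding errors is an assumption, since the thread does not say how the context files are encoded.

```python
from pathlib import Path
from typing import Iterator, List


def find_files_containing(root: str, substrings: List[str]) -> Iterator[Path]:
    """Yield every file under root whose text contains all of the substrings."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: UTF-8 with undecodable bytes ignored; adjust as needed.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if all(s in text for s in substrings):
            yield path
```

Calling list(find_files_containing("/path/to/dir", ["walmart", "refused"])) would then list the candidate files, where /path/to/dir is the same placeholder as in the grep command.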
The LDC distribution contains only the wb subfolder in the ontonotes-release-5.0/data/files/data/english/metadata/context folder. Moreover, the <text> tag contains the raw text, but not the raw sentences... recovering the sentence split would be extra hard work.
@arademaker if OntoNotes sentences are only tokenised (without token editing), it should be easy to find a sentence with a regular expression.
pseudocode example:
    import re
    from typing import List

    no_trace_tokens: List[str]  # sentence tokens with traces such as *PRO* removed
    raw_text: str  # content of the <text> tag
    if all(t in raw_text for t in no_trace_tokens):  # we found a text candidate, then search for the sentence
        # escape each token so that punctuation such as '?' is matched literally
        sentence_regex = r'\s*'.join(re.escape(t) for t in no_trace_tokens)
        match = re.search(sentence_regex, raw_text)
Yes, there is some work to be done, but I can only propose a solution for your problem.
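A slightly fuller, runnable sketch of the same idea follows. The helper names drop_traces and find_sentence are invented for this example, and the sample raw text is made up rather than taken from the corpus. It also assumes that every trace token starts with an asterisk (as *PRO* and *PRO*-1 do in the .onf excerpt above); other treebank conventions are not handled.

```python
import re
from typing import List, Optional


def drop_traces(tokens: List[str]) -> List[str]:
    """Remove trace tokens; assumes they all start with '*' (e.g. *PRO*-1)."""
    return [t for t in tokens if not t.startswith("*")]


def find_sentence(tokens: List[str], raw_text: str) -> Optional[str]:
    """Return the span of raw_text matching the tokens, or None if absent."""
    # Escape tokens so '?' or '.' match literally; allow any whitespace
    # (including none) between consecutive tokens.
    pattern = r"\s*".join(re.escape(t) for t in tokens)
    match = re.search(pattern, raw_text)
    return match.group(0) if match else None


# Hypothetical raw text and treebanked tokens, for illustration only.
raw_text = "Well. Your 'answer' didn't help, you refused to shop there? Bye."
treebanked = "Your ' answer ' did n't help , you refused *PRO*-1 to shop there ?"
sentence = find_sentence(drop_traces(treebanked.split()), raw_text)
print(sentence)
```

Because \s* also matches the empty string, the pattern is permissive: it would accept two words run together in the raw text. That seems acceptable for locating a sentence, but it will fail whenever tokenisation changed the characters themselves, which is the "token editing" case mentioned above.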
About wb: does pre-tokenisation occur in some other corpus?