Open timotheecour opened 4 months ago
If you're accessing data from the Token objects, then yes, seeing "POSTCLITIC" (or sometimes "PRECLITIC") at the word attribute of Token is intentional by design and not a bug. The package has a strong emphasis on aligning the cleaned-up utterance with the available %mor tier, so when we get, say, 5 words from the utterance but 6 elements from %mor (which is the situation with pre-clitics like French l'article or post-clitics like English gotta), then there'd be one Token object without a word form, in which case I've decided to put in something like "POSTCLITIC" in its word attribute. I can update the documentation to mention that "PRECLITIC" and "POSTCLITIC" are the only possibilities in Token's word attribute that are not from the data.
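For anyone who just needs the words back in the meantime, here is a minimal sketch of dropping the two placeholders when rejoining tokens. The Token namedtuple is only a stand-in for pylangacq's Token (just the word attribute is modeled), and real_words is a hypothetical helper name, not something the package provides.

```python
from collections import namedtuple

# Stand-in for pylangacq's Token; only the `word` attribute is modeled here.
Token = namedtuple("Token", ["word"])

# The two placeholder strings described above as not coming from the data.
PLACEHOLDERS = {"PRECLITIC", "POSTCLITIC"}


def real_words(tokens):
    """Return the word forms of tokens, skipping clitic placeholders."""
    return [t.word for t in tokens if t.word not in PLACEHOLDERS]


tokens = [Token(w) for w in "okay gotta POSTCLITIC put these back .".split()]
print(" ".join(real_words(tokens)))
# -> okay gotta put these back .
```

One caveat with this approach: a transcript that genuinely contains the word "POSTCLITIC" or "PRECLITIC" would lose it too, since the placeholder is indistinguishable from a real word by value alone.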
Also, it sounds like you're interested in getting the transcription data that's cleaned up and without the CHAT annotations? The way the package does the utterance cleaning is to use the currently private _clean_utterance function. I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance (which wouldn't have "PRECLITIC" or "POSTCLITIC"), so that users wouldn't have to access Token objects' word attribute to join back an utterance on their own.
> I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance
that would be great
actually, calling _clean_utterance doesn't make any difference, looks like it's already called by reader.utterances()
e.g. if I run:
for a in reader.utterances():
    transcript = " ".join([b.word for b in a.tokens])
    assert _clean_utterance(transcript) == transcript
the assertion passes, and the POSTCLITIC is still there (e.g. we don't POSTCLITIC need any soap .), and the other CHA artifacts are still there, e.g.
∇oh I don't know , I think I'll go ▔home tomorrow▔∇
Ideally, I should be able to have an API to get raw transcript, eg:
oh I don't know , I think I'll go home tomorrow
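A rough sketch of what that first API could do for the example above, assuming it only needs to delete marker characters. strip_ca_symbols is a hypothetical name, and the symbol list contains just the two markers visible in this example; the real set of CHAT symbols is larger.

```python
# Only the two markers seen in the example above (an assumption, not a full list).
CA_SYMBOLS = "∇▔"


def strip_ca_symbols(text, symbols=CA_SYMBOLS):
    """Delete every occurrence of the given marker characters from text."""
    return text.translate({ord(ch): None for ch in symbols})


line = "∇oh I don't know , I think I'll go ▔home tomorrow▔∇"
print(strip_ca_symbols(line))
# -> oh I don't know , I think I'll go home tomorrow
```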
and (optionally) an API to get the raw transcript with just the rich annotations like breaths, laughs, etc. (I realize that might be hard to define, but something like "all the audible sounds"):
I mean: , but like <I was like one and a half centimeters> [% laughing fast]
=>
I mean, but like I was like one and a half centimeters [laughing fast]
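For illustration, a regex-based sketch that produces exactly this transformation. It is not how pylangacq cleans utterances; keep_comments is a hypothetical name, and the rules (drop the <...> scope brackets, turn [% ...] into [...], drop a lengthening colon after a word, attach punctuation) are assumptions based on this one example.

```python
import re


def keep_comments(utterance):
    """Simplify a CHAT utterance but keep [% ...] comments as [...]."""
    text = re.sub(r"[<>]", "", utterance)              # drop scope brackets
    text = re.sub(r"\[%\s*([^\]]*)\]", r"[\1]", text)  # [% x] -> [x]
    text = re.sub(r"(?<=\w):", "", text)               # drop colon right after a word
    text = re.sub(r"\s+([,.?!])", r"\1", text)         # attach punctuation to the word
    return re.sub(r"\s+", " ", text).strip()           # normalize whitespace


raw = "I mean: , but like <I was like one and a half centimeters> [% laughing fast]"
print(keep_comments(raw))
# -> I mean, but like I was like one and a half centimeters [laughing fast]
```

The colon rule is deliberately naive: it would also strip the colon from something like a clock time, so a real implementation would need to be more careful.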
How do I get these back? reader.utterances() strips out these [% laughing fast] and similar.
Describe the bug
POSTCLITIC gets output as a word. I wonder what else gets similarly generated; it makes it harder to use this data for transcription purposes.
Relevant CHILDES or TalkBank data
https://sla.talkbank.org/TBB/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha
To reproduce
this shows a >0 number
Expected behavior
No POSTCLITIC should be output.
Note
Zooming in on where this occurs:
%mor: co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back .
=>
"okay gotta POSTCLITIC put these back ."
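To make the mismatch concrete, here is a rough sketch of how aligning words against %mor elements could end up producing that string. It is a guess at the mechanism, not pylangacq's actual code: align_words_with_mor is a hypothetical name, the word list is assumed to be the six transcribed items, and where a PRECLITIC placeholder would be placed relative to its host is also an assumption. It relies on "~" separating a post-clitic and "$" separating a pre-clitic in %mor.

```python
def align_words_with_mor(words, mor_line):
    """Pair each %mor element with a word, padding clitics with placeholders."""
    pairs = []
    word_iter = iter(words)
    for item in mor_line.split():
        # "$" ends a pre-clitic; "~" starts a post-clitic.
        *preclitics, rest = item.split("$")
        host, *postclitics = rest.split("~")
        pairs.extend(("PRECLITIC", m) for m in preclitics)
        pairs.append((next(word_iter), host))  # the actual transcribed word
        pairs.extend(("POSTCLITIC", m) for m in postclitics)
    return pairs


words = "okay gotta put these back .".split()
mor = "co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back ."
pairs = align_words_with_mor(words, mor)
print(" ".join(word for word, _ in pairs))
# -> okay gotta POSTCLITIC put these back .
```

Here "gotta" is one word but mod|got~inf|to is two %mor elements, so the second element (inf|to) has no word of its own and gets the placeholder.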
Note 2
In https://github.com/jacksonllee/pylangacq/issues/23#issuecomment-2027100363 @jacksonllee mentions:
which makes me wonder, is this even intentional? How can a caller distinguish which tokens are actual words? Should this code (to get the transcript) be replaced by something else?