POSTCLITIC gets generated in the list of words even though not present in source transcript

timotheecour commented 4 months ago

Describe the bug: POSTCLITIC is output as a word even though it does not appear in the source transcript. I wonder what else is generated this way; it makes the data harder to use for transcription purposes.

Relevant CHILDES or TalkBank data: https://sla.talkbank.org/TBB/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha

To reproduce:

def bug_D20240709T163017_POSTCLITIC():
    import pylangacq
    filename = "/data/timothee/talkbank/media/homebank/Public/VanDam-5minute/ML77/ML77_020400a.cha"
    reader = pylangacq.read_chat(filename)
    # Count tokens whose word attribute is the literal string "POSTCLITIC".
    count = 0
    for a in reader.utterances():
        for b in a.tokens:
            count += b.word == "POSTCLITIC"
    print(count)

bug_D20240709T163017_POSTCLITIC()

This prints a number greater than 0.

Expected behavior: no POSTCLITIC should be output.

Note: zooming in on where this occurs:

%gra:   1|2|SUBJ 2|0|ROOT 3|4|NEG 4|2|COMP 5|2|PUNCT
*CHI:   wow . ^U17651_20428^U
%mor:   co|wow .
%gra:   1|0|INCROOT 2|1|PUNCT
*MOT:   (..) okay gotta put these back . ^U20428_26799^U
%mor:   co|okay mod|got~inf|to v|put&ZERO pro:dem|these v|back .
%gra:   1|4|COM 2|4|AUX 3|4|INF 4|0|ROOT 5|6|SUBJ 6|4|COMP 7|4|PUNCT
*CHI:   &+ba back . ^U26799_29336^U
%mor:   adv|back .

=>

"okay gotta POSTCLITIC put these back ."

Note 2: in https://github.com/jacksonllee/pylangacq/issues/23#issuecomment-2027100363, @jacksonllee mentions:

Token(word='POSTCLITIC', pos='pro:obj', mor='me', gra=Gra(dep=2, head=1, rel='OBJ')),

This makes me wonder: is this even intentional? How can a caller distinguish actual words from these placeholders? Should this code (used to get the transcript) be replaced by something else?

transcript = " ".join([b.word for b in a.tokens])

jacksonllee commented 4 months ago

If you're accessing data from the Token objects, then yes, seeing "POSTCLITIC" (or sometimes "PRECLITIC") in the word attribute of a Token is intentional and not a bug. The package has a strong emphasis on aligning the cleaned-up utterance with the available %mor tier, so when we get, say, 5 words from the utterance but 6 elements from %mor (which is the situation with pre-clitics like French l'article or post-clitics like English gotta), there is one Token object without a word form. In that case I've decided to put something like "POSTCLITIC" in its word attribute. I can update the documentation to mention that "PRECLITIC" and "POSTCLITIC" are the only possible values of Token's word attribute that do not come from the data.

Also, it sounds like you're interested in getting the transcription data that's cleaned up and without the CHAT annotations? The way the package does the utterance cleaning is to use the currently private _clean_utterance function. I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance (which wouldn't have "PRECLITIC" or "POSTCLITIC"), so that users wouldn't have to access Token objects' word attribute to join back an utterance on their own.
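As a possible workaround when joining words from Token objects, the two placeholder strings named above can be filtered out. The sketch below is a minimal illustration, not part of pylangacq: the Token class is a hypothetical stand-in that only mimics the word, pos, and mor attributes, and surface_words is a hypothetical helper name. With real data, the assumption is that an utterance's tokens list would be passed in instead.

```python
from typing import Iterable, List, NamedTuple

# The two placeholder strings that, per the comment above, do not come from the data.
PLACEHOLDERS = frozenset({"PRECLITIC", "POSTCLITIC"})


class Token(NamedTuple):
    """Hypothetical stand-in for a pylangacq Token; only `word` matters here."""
    word: str
    pos: str = ""
    mor: str = ""


def surface_words(tokens: Iterable) -> List[str]:
    """Return the word forms, skipping clitic placeholders."""
    return [t.word for t in tokens if t.word not in PLACEHOLDERS]


# Tokens mirroring the *MOT utterance from the report.
tokens = [
    Token("okay", "co", "okay"),
    Token("gotta", "mod", "got"),
    Token("POSTCLITIC", "inf", "to"),
    Token("put", "v", "put"),
    Token("these", "pro:dem", "these"),
    Token("back", "v", "back"),
    Token("."),
]
print(" ".join(surface_words(tokens)))  # okay gotta put these back .
```

One caveat with this approach: a transcript that genuinely contains the uppercase word POSTCLITIC or PRECLITIC would lose it, since the filter cannot tell a placeholder from real data.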

timotheecour commented 3 months ago

> I think I can consider adding a new attribute to the Utterance object to hold the cleaned-up utterance

that would be great

timotheecour commented 3 months ago

Actually, calling _clean_utterance doesn't make any difference; it looks like reader.utterances() already calls it. E.g., if I run:

for a in reader.utterances():
    transcript = " ".join([b.word for b in a.tokens])
    assert _clean_utterance(transcript) == transcript

then the POSTCLITIC is still there (e.g. "we don't POSTCLITIC need any soap ."), and the other CHA artifacts are still there too, e.g. "∇oh I don't know , I think I'll go ▔home tomorrow▔∇". Ideally, there would be an API to get the raw transcript, e.g.: "oh I don't know , I think I'll go home tomorrow"

and (optionally) an API to get the raw transcript with just the rich annotations such as breaths, laughs, etc. (I realize that might be hard to define, but something like "all the audible sounds"):

I mean: , but like <I was like one and a half centimeters> [% laughing fast] => I mean, but like I was like one and a half centimeters [laughing fast]

How do I get these back? reader.utterances() strips out [% laughing fast] and similar annotations.
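To make the two requested outputs concrete, here is a rough sketch of a post-processing helper that works on a raw utterance string rather than on tokens. It is not a pylangacq API: plain_text and keep_comments are hypothetical names, the symbol list covers only the two markers seen in the example above, and it assumes the unprocessed utterance line is available to the caller. The real CHAT annotation inventory is much larger, so this is a heuristic and not a parser.

```python
import re

# Only the two prosody markers from the example above; an assumption, not the full CHAT set.
_SYMBOLS = "∇▔"


def plain_text(utterance: str, keep_comments: bool = False) -> str:
    """Strip a few CHAT markers from a raw utterance string (heuristic sketch)."""
    text = utterance
    if keep_comments:
        # Rewrite "[% laughing fast]" as "[laughing fast]".
        text = re.sub(r"\[%\s*([^\]]*)\]", r"[\1]", text)
    else:
        # Drop "[% ...]" comments entirely.
        text = re.sub(r"\[%[^\]]*\]", "", text)
    # Remove scoping angle brackets but keep the words inside them.
    text = text.replace("<", "").replace(">", "")
    # Remove the prosody markers.
    text = text.translate({ord(c): None for c in _SYMBOLS})
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())


raw = "I mean: , but like <I was like one and a half centimeters> [% laughing fast]"
print(plain_text("∇oh I don't know , I think I'll go ▔home tomorrow▔∇"))
print(plain_text(raw, keep_comments=True))
```

The keep_comments flag corresponds to the optional second output requested above. Note that this sketch leaves the colon in "I mean:" untouched, since deciding which punctuation is annotation and which is transcript needs real knowledge of the CHAT format.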

jacksonllee / pylangacq

POSTCLITIC gets generated in the list of words even though not present in source transcript #26