Utterance stacker: Quick improvements

kayaulai commented 1 year ago

Currently, lines with no actual words are all put in one gigantic Utterance. I suggest simply removing this Utterance.
Very large gapUnits, say above 5, should be disallowed. Currently, the Utterance stacker produces some ridiculous gapUnits, which are difficult to detect without looking at the gapUnits values themselves (because a naive annotator might just assume those lines got assigned to different utterances).
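
A minimal sketch of how such oversized gaps could be surfaced for review. It assumes, hypothetically, that each unit is available as a dict with a `gapUnits` key; the function name and the threshold constant are illustrative and not part of Rezonator.

```python
MAX_GAP_UNITS = 5  # "say above 5" from the comment above; adjust as needed


def flag_large_gaps(units, max_gap=MAX_GAP_UNITS):
    """Return (index, gapUnits) pairs for units whose gap exceeds max_gap.

    `units` is assumed to be a list of dicts, each carrying a precomputed
    "gapUnits" value. This representation is hypothetical.
    """
    return [
        (index, unit["gapUnits"])
        for index, unit in enumerate(units)
        if unit["gapUnits"] > max_gap
    ]
```

Running this over a stacked file would give an annotator a short list of positions to inspect instead of scanning every gapUnits value by hand.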

JWD: See detailed comments below.

johnwdubois commented 1 year ago

My suggestion (see #1446):

Keep the current utterance concatenation rule (which concatenates successive units by the same speaker into one utterance), with the following exceptions:
- follow the concatenation rule as long as all units in a sequence are verbal (unitType = verbal), but NOT when they are nonverbal (see below)
- concatenate only when gapUnits < 6 (otherwise, start a new utterance)
Classify units as {verbal, laugh, pause, vocalism, annotation, other}.
- If a unit contains at least one word (kind = word), then unitType = verbal
- Else, if it contains a laugh, then unitType = laugh
- Else, if it contains a pause or in-breath (or both), then unitType = pause
- Else, if it contains a vocalism, then unitType = vocalism
- Else, if it contains ONLY annotation (e.g. transcriber's comments, glosses, etc.), then unitType = annotation
- Else, unitType = other
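
The cascade above could be sketched roughly as follows. This assumes each unit is reduced to the collection of `kind` labels of its tokens; apart from `word`, the label strings used here (`laugh`, `pause`, `inbreath`, `vocalism`, `annotation`) are hypothetical names chosen for illustration.

```python
from typing import Iterable

# Hypothetical labels for pause-like tokens (pause or in-breath).
PAUSE_KINDS = {"pause", "inbreath"}


def classify_unit(token_kinds: Iterable[str]) -> str:
    """Return the unitType for a unit, checking categories in priority order."""
    kinds = set(token_kinds)
    if "word" in kinds:
        return "verbal"
    if "laugh" in kinds:
        return "laugh"
    if kinds & PAUSE_KINDS:
        return "pause"
    if "vocalism" in kinds:
        return "vocalism"
    # ONLY annotation: non-empty, and nothing other than annotation tokens.
    if kinds and kinds <= {"annotation"}:
        return "annotation"
    return "other"
```

Because the checks run in order, a unit containing both a laugh and a pause comes out as `laugh`, and a single word anywhere in the unit is enough to make it `verbal`.
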
Assign utteranceType based on the unitType:
- if all units in an utterance are verbal (unitType = verbal), then utteranceType = verbal
If a unit is nonverbal (unitType != verbal), then
- if the next unit by the same participant has the same unitType, and gapUnits = 0, then extend the utterance to include it, and assign utteranceType to be the same as its component unitType value(s)
- if the next unit has a different unitType, end the utterance, and assign utteranceType to be the same as its component unitType value(s) (see #1446)
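
One possible reading of these rules, as a sketch only. It assumes each unit already carries a `unit_type` and a precomputed `gap_units` value relative to the same participant's previous unit; the `Unit` and `Utterance` classes and the function names are hypothetical, not Rezonator's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_VERBAL_GAP = 5  # verbal units are concatenated only when gapUnits < 6


@dataclass
class Unit:
    participant: str
    unit_type: str   # verbal, laugh, pause, vocalism, annotation, other
    gap_units: int   # assumed to be computed before stacking


@dataclass
class Utterance:
    participant: str
    utterance_type: str
    units: List[Unit] = field(default_factory=list)


def can_extend(utterance: Utterance, unit: Unit) -> bool:
    """Decide whether `unit` may join the participant's open utterance."""
    if unit.unit_type != utterance.utterance_type:
        return False  # a change of type always ends the utterance
    if unit.unit_type == "verbal":
        return unit.gap_units <= MAX_VERBAL_GAP
    return unit.gap_units == 0  # nonverbal units must be strictly adjacent


def stack(units: List[Unit]) -> List[Utterance]:
    """Group units into utterances, tracking one open utterance per participant."""
    utterances: List[Utterance] = []
    open_utterance: Dict[str, Utterance] = {}
    for unit in units:
        current = open_utterance.get(unit.participant)
        if current is None or not can_extend(current, unit):
            current = Utterance(unit.participant, unit.unit_type)
            utterances.append(current)
            open_utterance[unit.participant] = current
        current.units.append(unit)
    return utterances
```

Since every utterance built this way contains units of a single type, utteranceType is simply that shared unitType.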

kayaulai commented 1 year ago

I'm uncertain about using kind = word, because I fear that will make the stacker too SBC-specific.

johnwdubois commented 1 year ago

Point taken. Still, reference to "kind = word" is just one way to describe the algorithm/pseudocode. The same effect can be gotten by writing a little routine that does the same thing (presumably with a higher error rate), but all you really need is to recognize one word per IU to get the main benefit (see #1446).
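
As an illustration of what such a routine might look like, here is a rough heuristic that does not consult `kind` at all. It assumes, purely for the sake of the sketch, that non-word material is written inside parentheses (single or double) and that any remaining token containing a letter counts as a word; other corpora would need different patterns, and the function names are hypothetical.

```python
import re

# Assumed convention: parenthesized spans such as (H), (0.8) or ((COMMENT))
# are not words. This is an assumption, not a general rule.
PARENTHESIZED = re.compile(r"\(\(.*?\)\)|\([^()]*\)")


def has_word(unit_text: str) -> bool:
    """Return True if the unit appears to contain at least one word."""
    stripped = PARENTHESIZED.sub(" ", unit_text)
    # A token counts as word-like if it contains any alphabetic character.
    return any(
        any(ch.isalpha() for ch in token)
        for token in stripped.split()
    )
```

The routine stops at the first word-like token, which matches the point that one recognized word per IU is enough to mark the unit as verbal.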

johnwdubois / rezonator

Utterance stacker: Quick improvements #1431