timmahrt / praatIO

A python library for working with praat, textgrids, time aligned audio transcripts, and audio files. It is primarily used for extracting features from and making manipulations on audio files given hierarchical time-aligned transcriptions (utterance > word > syllable > phone, etc).
MIT License

Why filter out empty labels from Intervals? #19

Closed macriluke closed 4 years ago

macriluke commented 4 years ago
        if tierType == INTERVAL_TIER:
            while True:
                try:
                    timeStart, timeStartI = _fetchRow(tierData,
                                                      "xmin = ", labelI)
                    timeEnd, timeEndI = _fetchRow(tierData,
                                                  "xmax = ", timeStartI)
                    label, labelI = _fetchRow(tierData, "text =", timeEndI)
                except (ValueError, IndexError):
                    break

                label = label.strip()
                if label == "":
                    continue
                tierEntryList.append((timeStart, timeEnd, label))
            tier = IntervalTier(tierName, tierEntryList, tierStart, tierEnd)

Why wouldn't I want the intervals exactly as they appear in the file?

timmahrt commented 4 years ago

Empty labels can represent pauses or areas with no data. In my own use cases, I rarely want or need intervals with empty labels ("blanked intervals").

For example, if I want to get the duration of every word in a tier and the tier includes intervals with blank labels, I have to guard the calculation like so:

durations = []
for start, stop, label in tier.entryList:
    # Skip blanked intervals so that only labeled words are measured
    if label != '':
        durations.append(stop - start)

Then let's say I'm manipulating the textgrid and I don't remove blanked intervals beforehand, and that I only want to keep intervals that are vowels. A blanked entry is not a vowel, but we want to keep it, so we need a check like the one above: if isVowel(label) or label == ''. And if an interval is deleted (e.g. the label is a consonant), then we need to merge the blanked intervals before and after the deleted interval into one larger interval with a blank label (but only if they are blank! We have to check that too). It's more annoying [to me] to manipulate textgrids if we have to consider blanked intervals.
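The difference in bookkeeping can be sketched in plain Python. This is only an illustration under stated assumptions: entries are (start, stop, label) tuples, `VOWELS`, `is_vowel`, `keep_vowels`, and `keep_vowels_with_blanks` are hypothetical names rather than praatIO functions, and a deleted interval is assumed to become a blank that absorbs any blank neighbours.

```python
VOWELS = {"a", "e", "i", "o", "u"}  # hypothetical label set, for illustration only


def is_vowel(label):
    return label in VOWELS


def keep_vowels(entries):
    """Blanked intervals already removed: a plain filter is enough."""
    return [entry for entry in entries if is_vowel(entry[2])]


def keep_vowels_with_blanks(entries):
    """Blanked intervals present: non-vowels become blanks, and runs of
    adjacent blanks are merged into one larger blank interval."""
    result = []
    for start, stop, label in entries:
        if is_vowel(label):
            result.append((start, stop, label))
        elif result and result[-1][2] == "":
            # The previous interval is blank, so extend it over this one
            prev_start = result[-1][0]
            result[-1] = (prev_start, stop, "")
        else:
            result.append((start, stop, ""))
    return result


raw_phones = [(0.0, 0.2, ""), (0.2, 0.3, "k"), (0.3, 0.5, ""),
              (0.5, 0.7, "a"), (0.7, 0.8, "t"), (0.8, 1.0, "o")]
filtered_phones = [entry for entry in raw_phones if entry[2] != ""]

print(keep_vowels(filtered_phones))
print(keep_vowels_with_blanks(raw_phones))
```

The second function has to track the previous interval and its label, which is the extra checking described above.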

Perhaps the only time I want the blanked intervals is when I'm calculating pause/silence durations. With blanked entries removed, pauses can still be calculated as tier.entryList[i+1][0] - tier.entryList[i][1] for each pair of adjacent entries.
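A minimal sketch of that pause calculation, assuming the entries are (start, stop, label) tuples with blanked intervals already removed. The helper name `pause_durations` is hypothetical and not part of praatIO.

```python
def pause_durations(entries):
    """Return the positive gaps between each entry and the one after it."""
    ordered = sorted(entries)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    # Adjacent entries that touch produce a gap of 0, which is not a pause
    return [gap for gap in gaps if gap > 0]


words = [(0.5, 1.0, "hello"), (1.0, 1.5, "there"), (2.25, 3.0, "world")]
print(pause_durations(words))
```

This only covers gaps between entries. Silence before the first entry or after the last one would need the tier's start and end times as well.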

On save, the blanked intervals get reinserted, because textgrids do need them to render correctly in Praat (https://github.com/timmahrt/praatIO/blob/master/praatio/tgio.py#L1430).
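The idea behind reinserting blanks on save can be sketched as follows. This is an illustration of the general technique, not the code at the link above. The function name `fill_gaps` and the (start, stop, label) tuple layout are assumptions.

```python
def fill_gaps(entries, tier_start, tier_end):
    """Insert blank intervals so the tier is covered from start to end."""
    filled = []
    cursor = tier_start
    for start, stop, label in sorted(entries):
        if start > cursor:
            # Uncovered stretch before this entry: add a blank interval
            filled.append((cursor, start, ""))
        filled.append((start, stop, label))
        cursor = stop
    if cursor < tier_end:
        # Uncovered stretch after the last entry
        filled.append((cursor, tier_end, ""))
    return filled


words = [(0.5, 1.0, "hello"), (1.0, 1.5, "there"), (2.25, 3.0, "world")]
for entry in fill_gaps(words, 0.0, 3.5):
    print(entry)
```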


Are blanked intervals something that you want in your textgrids while manipulating them? What is your use case?

macriluke commented 4 years ago

I'm working with forced alignment of a transcribed audio dataset. I have to adjust the audio and textgrids to include a silence at the start and end.

I think this preprocessing behavior isn't immediately intuitive. I suggest an optional 'raw' flag argument that would allow reading in everything in the file. Or perhaps an optional processor function could be supplied to handle this sort of preprocessing.

timmahrt commented 4 years ago

I agree the existing behaviour is not intuitive. For the short term, I like the idea of a raw flag argument.

I'll try to make the change and publish it later today. Thanks!

timmahrt commented 4 years ago

I've added an optional argument to openTextgrid:

openTextgrid(fnFullPath, readRaw=False)

If readRaw=True, you should get a TextGrid object containing points and intervals with empty labels (''). For backwards compatibility, however, the default value is False.
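The effect of the flag can be sketched in plain Python, based on the parsing loop quoted at the top of this thread. This is an illustration of the described behaviour, not the library's actual implementation. The function `select_entries` and its `read_raw` parameter are hypothetical stand-ins, and stripping labels in both modes is an assumption.

```python
def select_entries(raw_entries, read_raw=False):
    """Drop entries with empty labels unless read_raw is True."""
    selected = []
    for start, stop, label in raw_entries:
        label = label.strip()
        if label == "" and not read_raw:
            continue  # default behaviour: skip blanked intervals
        selected.append((start, stop, label))
    return selected


raw = [(0.0, 0.5, ""), (0.5, 1.0, "hello"), (1.0, 1.5, " "), (1.5, 2.0, "world")]
print(select_entries(raw))                 # labeled entries only
print(select_entries(raw, read_raw=True))  # everything in the file
```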

I bundled these changes into release 4.1.0. Please let me know if you have any feedback.

macriluke commented 4 years ago

Pulled it this morning, couldn't be happier! Thanks for the quick help on this.