timmahrt / praatIO

A Python library for working with Praat, TextGrids, time-aligned audio transcripts, and audio files. It is primarily used for extracting features from and making manipulations on audio files given hierarchical time-aligned transcriptions (utterance > word > syllable > phone, etc).
MIT License

openTextgrid() cannot correctly parse the file if there are '\n's within the label text of interval tiers #24

Closed GalaxieT closed 2 years ago

GalaxieT commented 3 years ago

Files like the following:

item []:
    item [1]:
        class = "IntervalTier"
        name = "Tokens"
        xmin = 0.0
        xmax = 16.6671875
        intervals: size = 22
        intervals [1]:
            xmin = 0.0
            xmax = 0.32
            text = "#"
        intervals [2]:
            xmin = 0.32
            xmax = 1.165
            text = "zao
chen
liu
wan
er
ne"

Only the "zao part is recognized. According to the Praat manual, string values are delimited by double quotes, not by newlines. (A double quote inside the text is written as two double quotes in the file: " → "".)
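To make the quoting rule concrete, here is a minimal sketch of reading one quoted value that may span several lines. It is not praatIO code: the names `PRAAT_STRING` and `read_praat_string` are made up for this illustration, and it assumes the only escape in use is the doubled double quote described above.

```python
import re

# A quoted value: an opening quote, then any run of non-quote characters
# (newlines included) or doubled quotes, then the closing quote.
PRAAT_STRING = re.compile(r'"((?:[^"]|"")*)"')


def read_praat_string(data, start=0):
    """Return (decoded_text, index_just_after_the_closing_quote).

    Hypothetical helper for illustration only.
    """
    match = PRAAT_STRING.search(data, start)
    if match is None:
        raise ValueError("no quoted string found")
    # Undo the escaping: "" in the file stands for a single " in the label.
    return match.group(1).replace('""', '"'), match.end()


# A label containing newlines is read as a single value.
sample = 'text = "zao\nchen\nliu"\n'
label, end = read_praat_string(sample)

# A label containing escaped quotes is decoded back to single quotes.
quoted = 'text = "say ""hi"""\n'
quoted_label, _ = read_praat_string(quoted)
```

The key point is that the scan stops at the first quote that is not part of a doubled pair, so newlines inside the label never end the value.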

It is not hard to fix, but I'm unfamiliar with git/GitHub, so I'm pasting the changed code below (in place of the original _fetchRow in tgio):

def _fetchRow_for_text(dataStr, searchStr, index):
    startIndex = dataStr.index(searchStr, index) + len(searchStr)
    first_quote_index = dataStr.index("\"", startIndex)

    looking = True
    next_quote_index = dataStr.index("\"", first_quote_index+1)
    while looking:
        try:
            neighbor_letter = dataStr[next_quote_index+1]
            if neighbor_letter == "\"":
                # Skip both characters of the escaped pair ("") before searching again
                next_quote_index = dataStr.index("\"", next_quote_index+2)
            else:
                looking = False
        except IndexError:
            looking = False
    final_quote_index = next_quote_index

    word = dataStr[first_quote_index+1:final_quote_index]
    word = word.replace("\"\"", "\"")

    return word, final_quote_index + 1

I suppose the same problem may exist in other places, such as reading and writing the short TextGrid format.
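For the writing side, a rough sketch of what a safe round trip could look like, assuming the short format stores the label as a bare quoted string without the `text =` key. The function names are hypothetical and this is not how praatIO itself writes files; indentation of the long-format row is omitted.

```python
def escape_label(label):
    # Double every quote so the label cannot end the value early.
    return label.replace('"', '""')


def long_text_row(label):
    # Long format: the value follows a "text =" key.
    return 'text = "{}"\n'.format(escape_label(label))


def short_text_row(label):
    # Short format (assumed here): the quoted value stands on its own.
    return '"{}"\n'.format(escape_label(label))


def decode_row(row):
    # Works for a single row: take everything between the first and
    # last quote, then undo the escaping.
    first = row.index('"')
    last = row.rindex('"')
    return row[first + 1:last].replace('""', '"')


label = 'he said "zao"\nchen'
long_row = long_text_row(label)
short_row = short_text_row(label)
```

Both rows decode back to the original label, newline and quotes included, which is the property a reader and writer pair needs to preserve.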

GalaxieT commented 3 years ago

The code above does not seem to be compatible. I made some adjustments.

def _fetchRow(dataStr, searchStr, index):
    # Find the value that follows searchStr, starting the search at index
    startIndex = dataStr.index(searchStr, index) + len(searchStr)
    endline_index = dataStr.index("\n", startIndex)
    if "\"" in dataStr[startIndex:endline_index]:
        # Quoted value: it may span several lines, so scan for the closing quote
        first_quote_index = dataStr.index("\"", startIndex)

        looking = True
        next_quote_index = dataStr.index("\"", first_quote_index+1)
        while looking:
            try:
                neighbor_letter = dataStr[next_quote_index+1]
                if neighbor_letter == "\"":
                    # A doubled quote is an escaped quote; skip both characters
                    next_quote_index = dataStr.index("\"", next_quote_index+2)
                else:
                    looking = False
            except IndexError:
                # The quote is the last character of the data
                looking = False
        final_quote_index = next_quote_index

        word = dataStr[first_quote_index+1:final_quote_index]
        word = word.replace("\"\"", "\"")

        # Resume parsing on the line after the closing quote
        endIndex = dataStr.index("\n", final_quote_index)

        return word, endIndex + 1
    else:
        # Unquoted value (e.g. a number): it ends at the newline
        endIndex = endline_index
        word = dataStr[startIndex:endIndex]
        word = word.strip()
        return word, endIndex + 1

timmahrt commented 3 years ago

Hi! Sorry for the bug. I should be able to take a look at this tomorrow. Thanks!

timmahrt commented 3 years ago

I've just released version 4.3.0. This should have robust support for newlines and quotes. If you have problems with any textgrids please let me know.

timmahrt commented 2 years ago

I think this issue has been resolved. Is it ok to close?

GalaxieT commented 2 years ago

Sure.