timmahrt / praatIO

A python library for working with praat, textgrids, time aligned audio transcripts, and audio files. It is primarily used for extracting features from and making manipulations on audio files given hierarchical time-aligned transcriptions (utterance > word > syllable > phone, etc).
MIT License

Issues parsing TextGrids from ELAN #30

Status: Closed (mmcauliffe closed 2 years ago)

mmcauliffe commented 3 years ago

I've had a couple of users reporting issues with loading TextGrids exported from ELAN. The issue seems to be that the "item [1]" lines are formatted without a space ("item[1]"), so the parsing in https://github.com/timmahrt/praatIO/blob/master/praatio/tgio.py#L1896 fails. I think a reasonable fix would be something like re.split(r'item ?\[', data, flags=re.MULTILINE)[1:].
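As a rough illustration of the proposed fix, here is a minimal sketch comparing the plain string split with the regex split. The two sample strings are hypothetical, heavily trimmed tier sections (real TextGrids carry a file header and many more fields), and `split_tiers` is a name made up for this example, not a praatio function.

```python
import re

# Hypothetical, trimmed tier sections. Only the "item" spacing differs.
PRAAT_STYLE = 'item [1]:\n class = "IntervalTier"\nitem [2]:\n class = "TextTier"\n'
ELAN_STYLE = 'item[1]:\n class = "IntervalTier"\nitem[2]:\n class = "TextTier"\n'


def split_tiers(data):
    """Split on 'item [' or 'item[' and drop whatever precedes the first match."""
    return re.split(r"item ?\[", data, flags=re.MULTILINE)[1:]


for label, sample in (("praat", PRAAT_STYLE), ("elan", ELAN_STYLE)):
    plain = sample.split("item [")[1:]   # exact-match split, space required
    tolerant = split_tiers(sample)       # space made optional by " ?"
    print(label, "plain:", len(plain), "tolerant:", len(tolerant))
```

With the exact-match split, the ELAN-style sample yields no tiers at all, while the regex version finds both. The `re.MULTILINE` flag only changes how `^` and `$` behave, so it has no effect on this particular pattern.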

It looks like you're working on a 5.0, so I don't know whether that would be the place to fix it or whether it would be better for me to submit a PR against the main branch.

timmahrt commented 3 years ago

I'm ok with a hotfix for the current version of praatio. If you want to make a PR that would be great, otherwise I can draft one sometime this week. Can you send me one of the affected textgrids?

Will there be problems later on in praatio? After line 1896 there is the code:

        if 'class = "IntervalTier"' in tierTxt:
            tierType = INTERVAL_TIER
            searchWord = "intervals ["
        else:
            tierType = POINT_TIER
            searchWord = "points ["

and then later:

        tierName, tierNameI = _fetchTextRow(header, 0, "name = ")

I wonder if the ELAN-created textgrids will have similar issues elsewhere.

mmcauliffe commented 3 years ago

Here are the two versions: elan_export_version.txt praat_export_version.txt

There are a couple of other differences like lack of colons after each interval, but maybe not a huge issue. I'll add a test for an ELAN file as part of the PR.
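For anyone curious what tolerating those differences could look like, here is a small sketch of matching interval header lines with an optional space and an optional trailing colon. The sample text is hypothetical (the ELAN-style variant is just the Praat-style text with the colons stripped), and `interval_indices` is an illustrative helper, not part of praatio. Whether praatio actually needs this is not established in this thread.

```python
import re

# "intervals [1]:", "intervals [1]" and "intervals[1]" all match.
INTERVAL_HEADER = re.compile(r"intervals ?\[(\d+)\]:?")

# Hypothetical, trimmed tier body in the usual Praat long format.
PRAAT_STYLE = """\
intervals: size = 2
intervals [1]:
    xmin = 0
    xmax = 0.5
    text = "hello"
intervals [2]:
    xmin = 0.5
    xmax = 1
    text = "world"
"""

# Assumed ELAN-like variant: identical, but without the colon after "]".
ELAN_STYLE = PRAAT_STYLE.replace("]:", "]")


def interval_indices(tier_text):
    """Return the interval numbers found in the tier text, in order."""
    return [int(match.group(1)) for match in INTERVAL_HEADER.finditer(tier_text)]


print(interval_indices(PRAAT_STYLE))
print(interval_indices(ELAN_STYLE))
```

The "intervals: size = 2" line is not picked up because the pattern requires an opening bracket right after the optional space.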

timmahrt commented 3 years ago

When I merged the PR this automatically closed. Sorry!

Thanks again for the PR. I've released praatio v4.4.0 with the changes from your PR. If you have any other questions or comments, please let me know.

timmahrt commented 2 years ago

@mmcauliffe Hello, I'm preparing to release praatio 5.0. It is not backwards compatible. Any libraries using it will need to change a few things.

It seems that you can pin your praatio version like so in setup.py:

    install_requires=["praatio ~= 4.1"],

I just did this for one of my libraries that depends on praatio https://github.com/timmahrt/ProMo/blob/master/setup.py
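To make the effect of that pin concrete: under PEP 440, `~= 4.1` is the "compatible release" operator and is equivalent to `>= 4.1, == 4.*`, so 4.4.0 is accepted and 5.0 is not. The sketch below approximates that rule for plain numeric versions only (no pre-release, post-release, or dev tags). The function names are made up for illustration; real tools such as pip do this with their own version parsing.

```python
def _parse(version):
    """Turn '4.4.0' into (4, 4, 0). Only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def _pad(parts, length):
    """Right-pad with zeros so that 4.1 and 4.1.0 compare as equal."""
    return parts + (0,) * (length - len(parts))


def matches_compatible_release(candidate, base):
    """Approximate 'candidate ~= base' for simple numeric versions."""
    base_parts = _parse(base)
    if len(base_parts) < 2:
        raise ValueError("'~=' needs at least two version components")
    width = max(len(base_parts), len(_parse(candidate)))
    cand_parts = _pad(_parse(candidate), width)
    prefix = base_parts[:-1]  # "4.1" -> (4,): everything but the last component
    same_series = cand_parts[:len(prefix)] == prefix
    new_enough = cand_parts >= _pad(base_parts, width)
    return same_series and new_enough


for version in ("4.0.9", "4.1", "4.4.0", "5.0.0"):
    print(version, matches_compatible_release(version, "4.1"))
```

Note that the pin gets tighter with more components: `~= 4.1.0` would only admit 4.1.x releases, which is why the two-component form is the one that allows later 4.x versions.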

If it helps, I can open two PRs in the Montreal Forced Aligner--one to pin the version number and one to make the necessary changes needed to upgrade to version 5.

Do you have any thoughts?

mmcauliffe commented 2 years ago

I'll pin the version for the next release I'll be doing shortly, and then I'll work on upgrading for 5.0 for the following one. I don't imagine the changes will be too much, since I'm just relying on a very small subset of the praatio functionality.

timmahrt commented 2 years ago

Sounds good.

Yes, the changes aren't backwards compatible, but they shouldn't be too heavy: the names of the libraries to import have changed, and the arguments for saving and opening textgrids have also changed. I'll include some documentation with the new release when it goes out.

mmcauliffe commented 2 years ago

Closing this out since it should be all resolved and everything has been migrated over to 5.0 in MFA.