timmahrt / praatIO

A python library for working with praat, textgrids, time aligned audio transcripts, and audio files. It is primarily used for extracting features from and making manipulations on audio files given hierarchical time-aligned transcriptions (utterance > word > syllable > phone, etc).
MIT License
299 stars 32 forks source link

TextGrid.tierDict can be modified, corrupting the TextGrid #39

Closed scottmk closed 8 months ago

scottmk commented 1 year ago

As far as I can tell from your examples, the canonical way to access a TextGridTier from a TextGrid is to access the internal tierDict (Example).

However, because tierDict is mutable, one can end up adding a new TextGridTier to this tierDict, e.g.,

new_textgrid_tier = new_textgrid.tierDict.setdefault(
    "new_tier",
    IntervalTier(
        "new_tier", [], textgrid_obj.minTimestamp, textgrid_obj.maxTimestamp
    )
)

If this happens, the TextGrid will essentially be in an Illegal State, because TextGrid.tierNameList will be missing the new tier. This obviously breaks functions like TextGridTier#replaceTier, among other things.

I think there are at least two possible solutions:

  1. Add a getTier method and rename tierDict to _tierDict to indicate that it is a protected member that shouldn't be altered
  2. Change tierDict to an OrderedDict and remove tierNameList entirely. You can then just use tierDict.keys() in place of tierNameList

I think it'd probably be best to implement both solutions, particularly because it's dangerous to have two parallel data structures that you need to keep in sync. However, simply implementing number 1 alone should solve the problem quickly and easily.

I'm also happy to submit a PR for this if you'd like.

Thanks again for all your work!

timmahrt commented 1 year ago

I think full immutability would be pretty cool but I'm not sure how easy it would be to do. I might look into that. If done, then a full set of setter and getter methods will be necessary.

Regardless, I like the idea of moving to OrderedDict just for the semantics.

It might be a while before I can focus my attention on this, but I will sleep on it!

scottmk commented 1 year ago

I think full immutability would be pretty cool but I'm not sure how easy it would be to do.

In hindsight, not very easy at all in Python. You basically have to make a new data structure. I think just marking it as protected and adding a getter makes more sense. But I think just removing the list and relying on the keyset of the dict instead will protect from this problem. There will probably be some annoying finagling to get it to work tho, haha.

Regardless, I like the idea of moving to OrderedDict just for the semantics.

Totally! Actually, as of Python 3.6, all dicts preserve the order of keys, but that doesn't help if you want to be compatible with older versions.

One downside of OrderedDict is it always appends new entries to the end, and if you update an entry, it moves it to the end. So if you want to preserve the order of entries as you update them, you have to do an annoying O(n) solution using OrderedDict#move_to_end and/or OrderedDict#popitem. Same for arbitrary insertion, which is something you currently support. I'll see if I can think of more elegant solutions.

It might be a while before I can focus my attention on this, but I will sleep on it!

Of course, makes sense! I just wanted to post issues as I found them, so I didn't forget!

If you're open to it, I might send pull requests later down the road. I'm working on a thesis right now, so it may be some time.

timmahrt commented 1 year ago

Sorry for taking so long.

I've got a PR up. I decided that for now I'd like to keep the same interface but I resolved the issue where tierNameList can diverge from tierDict.

I did this by making tierDict an OrderedDict and making tierNameList a property.

One downside of OrderedDict is...if you update an entry, it moves it to the end.

I tried to replicate this in a test but couldn't. Was the issue maybe resolved in Python 3.6?

https://docs.python.org/3/library/collections.html#collections.OrderedDict: Changed in version 3.6: With the acceptance of [PEP 468](https://peps.python.org/pep-0468/), order is retained for keyword arguments passed to the [OrderedDict](https://docs.python.org/3/library/collections.html#collections.OrderedDict) constructor and its update() method.

Please let me know if you have any feedback.

scottmk commented 1 year ago

No worries! Life happens, I know how it goes.

I tried to replicate this in a test but couldn't. Was the issue maybe resolved in Python 3.6?

Fascinating. I didn't know they had fixed this behavior. That's really nice!

Thanks for making the fixes. I'll take a look at the PR as soon as I can, but if I take too long, obviously don't wait on me :)

timmahrt commented 1 year ago

Add a getTier method and rename tierDict to _tierDict to indicate that it is a protected member that shouldn't be altered

I was a bit on the fence at first but actually I think its a nice change so I added it to my PR as well.

timmahrt commented 1 year ago

I merged my PR in prep for a release. But if you have any comments or concerns, please share! I can open a follow-up PR.

timmahrt commented 1 year ago

I created a follow-up PR that also makes tier.entries (formally tier.entryList) read only. Both changes are now released!

If you have any feedback, please share!

scottmk commented 1 year ago

Thanks for this! I'll check it out

timmahrt commented 8 months ago

I'll close this for now but if there are any related issues, feel free to reopen!