`intersection`: issue on consecutive duplicate words

scottmk commented 1 year ago

Today I encountered an issue with the behavior of intersection.

Say I have a WORD tier that looks like this:

UPON | A | A | TIME

And I have a PHONE tier that looks like this:

AH0 | P | AA1 | N | AH0 | EY1 | T | AY1 | M

Assuming these are time-aligned correctly, when I call intersection, I get a list that looks something like this:

['UPON-AH0', 'UPON-P', 'UPON-AA1', 'UPON-N', 'A-AH0', 'A-EY1', 'TIME-T', 'TIME-AY1', 'TIME-M']

Because I have two intervals in the WORD tier which have the same label, from this intersection I can't really tell if I have two distinct words "A" that have the respective transcriptions "AH0" and "EY1", or if I have one distinct word "A" transcribed as "AH0 EY1".

There is obviously no single right way to solve this, but since we know the word entries are distinct, I would suggest that the label instead be the WORD label plus a tuple of all the PHONE labels that coincide with it. Something like this:

['UPON-(AH0, P, AA1, N)', 'A-(AH0)', 'A-(EY1)', 'TIME-(T, AY1, M)']

This would also mean that the interval boundaries would be the boundaries of the left-hand side tier. So my example would be for

word_tier.intersection(phone_tier)

If you instead did

phone_tier.intersection(word_tier)

you would get

['AH0-UPON', 'P-UPON', 'AA1-UPON', 'N-UPON', 'AH0-A', 'EY1-A', 'T-TIME', 'AY1-TIME', 'M-TIME']
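
The proposal for `word_tier.intersection(phone_tier)` could be sketched roughly as follows. This is plain Python with invented timestamps, and `merge_labels` here is a hypothetical helper, not a praatIO function. It keeps the boundaries of the left-hand intervals and, as an assumption of this sketch, drops left-hand intervals that overlap nothing.

```python
# Hypothetical timestamps (seconds); only the labels come from the example above.
words = [(0.0, 0.4, "UPON"), (0.4, 0.5, "A"), (0.5, 0.6, "A"), (0.6, 1.0, "TIME")]
phones = [
    (0.0, 0.1, "AH0"), (0.1, 0.2, "P"), (0.2, 0.3, "AA1"), (0.3, 0.4, "N"),
    (0.4, 0.5, "AH0"), (0.5, 0.6, "EY1"),
    (0.6, 0.7, "T"), (0.7, 0.9, "AY1"), (0.9, 1.0, "M"),
]


def merge_labels(left, right):
    """Return one entry per left-hand interval, labelled with every overlapping right-hand label."""
    merged = []
    for l_start, l_end, l_label in left:
        overlapping = [
            r_label
            for r_start, r_end, r_label in right
            if max(l_start, r_start) < min(l_end, r_end)
        ]
        if overlapping:  # assumption: left intervals with no overlap are skipped
            merged.append((l_start, l_end, f"{l_label}-({', '.join(overlapping)})"))
    return merged


merged_words = merge_labels(words, phones)
print([label for _, _, label in merged_words])
```

Because each output entry reuses a left-hand interval's start and end, the two "A" words stay separate entries with their own boundaries.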

What do you think?

scottmk commented 1 year ago

Alternatively, it could behave as it currently does, and instead you could add the entryList entry index from each tier to the intersection entries.
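
A rough sketch of this alternative, again in plain Python rather than praatIO code: the `label[index]` format and the `indexed_intersection` helper are made up for illustration, and the timestamps are invented.

```python
# Only the two ambiguous "A" words and their phones, with hypothetical timestamps.
words = [(0.4, 0.5, "A"), (0.5, 0.6, "A")]
phones = [(0.4, 0.5, "AH0"), (0.5, 0.6, "EY1")]


def indexed_intersection(left, right):
    """Pairwise intersection whose labels carry each entry's position in its tier."""
    result = []
    for i, (l_start, l_end, l_label) in enumerate(left):
        for j, (r_start, r_end, r_label) in enumerate(right):
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:
                result.append((start, end, f"{l_label}[{i}]-{r_label}[{j}]"))
    return result


indexed_labels = [label for _, _, label in indexed_intersection(words, phones)]
print(indexed_labels)
```

With the indices attached, the two "A" entries become distinguishable even though their text labels are identical.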

timmahrt commented 1 year ago

I think your first solution sounds reasonable. I'll take a look and see what I can do.

timmahrt commented 1 year ago

When I went to implement the changes, I realized that my original intention with intersection() wouldn't be compatible with the changes you suggested.

For example, what would happen for a phone list where only some of the phones are listed for each word, e.g. [(0, 1, "hello")] and [(0.1, 0.2, 'e'), (0.7, 1.0, 'o')]? What would the expected output be? Under the existing intersection method, two intervals would be output, [(0.1, 0.2, "hello(e)"), (0.7, 1.0, "hello(o)")], but I think one could argue that in some cases only one interval is wanted, (0, 1, "hello(e,o)"), which is more in line with your use case.
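
The two outcomes for the "hello" example can be compared side by side in a small plain-Python sketch. Both helpers are hypothetical stand-ins written for this illustration. They are not the actual `intersection()` or `mergeLabels()` code, and the label format simply copies the one used in the example.

```python
words = [(0, 1, "hello")]
phones = [(0.1, 0.2, "e"), (0.7, 1.0, "o")]


def per_overlap(left, right):
    """One output interval per overlapping pair, clipped to the shared region."""
    out = []
    for l_start, l_end, l_label in left:
        for r_start, r_end, r_label in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:
                out.append((start, end, f"{l_label}({r_label})"))
    return out


def per_left_interval(left, right):
    """One output interval per left-hand entry, keeping its original boundaries."""
    out = []
    for l_start, l_end, l_label in left:
        hits = [
            r_label
            for r_start, r_end, r_label in right
            if max(l_start, r_start) < min(l_end, r_end)
        ]
        if hits:
            out.append((l_start, l_end, f"{l_label}({','.join(hits)})"))
    return out


clipped = per_overlap(words, phones)
merged = per_left_interval(words, phones)
print(clipped)
print(merged)
```

The first helper preserves where each phone actually occurs, while the second preserves the word's own span, which is why a single method cannot easily serve both needs without a switch.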

I wondered how I could accommodate these two scenarios--parameterize intersection()?

I decided that a simpler solution was to create a different method, mergeLabels(). I implemented that in https://github.com/timmahrt/praatIO/pull/47 and also added some documentation to the existing intersection().

What do you think? Does mergeLabels() work for your use case?

timmahrt commented 1 year ago

Here is the method signature: https://github.com/timmahrt/praatIO/pull/47/files#diff-35a03755d23b8e11ea1a0d22db05fa23181cc9dfc8a6675bb72e8781ca4b269eR572

Here is an example usage from the tests, using the example you provided: https://github.com/timmahrt/praatIO/pull/47/files#diff-821de34f450931440c2ec4dcdea75ca2127eea10060b8674de5f20eeae4a303dR1225

timmahrt commented 1 year ago

I merged my PR and built a release. I've been sitting on a lot of code since November, which I really shouldn't have done.

Reviews on the merged PR are still welcome--I can make a follow-up PR. :bow:

scottmk commented 1 year ago

Thanks for this! I think creating the new method is a great compromise and this helps my use case a lot.

I'll take a look at the PR and see if I have any comments to make.

Thanks for the quick response!

timmahrt / praatIO

`intersection`: issue on consecutive duplicate words #45