lingdb / Sound-Comparisons

Exploring phonetic diversity across language families —
http://www.soundcomparisons.com

Kindly share and test the Praat auto-create-tiers-and-speech-segments.script #442

Closed Linguista closed 7 years ago

Linguista commented 7 years ago

The new and vastly improved version of auto-create-tiers-and-speech-segments.script is now ready and documented. It can be downloaded here, and the wiki is here.

It would be good to share this message and the script with whoever will be needing it.

LauraWae commented 7 years ago

Hi Scott, Thank you so much. I have just played around with it. To me it seems very clear and easy to deal with. For now, I have one small point to improve: the abbreviation for Brazil in our naming system is "Br", not "Bra". Sorry if we gave or confirmed erroneous information about that.

Linguista commented 7 years ago

Thanks for the feedback, Laura. I just uploaded a corrected version of the script which has "Br_" as the abbreviation for Brazilian recordings.

LauraWae commented 7 years ago

Hi, A second one: The function "doubled words" as "Type of elicitation" is not very helpful because "extractWAVfiles" does not accept doubled glosses. I have tried it with a doubled-glossed textgrid, and it does not extract any soundfile.

However, in general, we also want only one sound per word uploaded. We only want two where there are significant differences in the spelling or in morphology (which is what the tiers "AltLex", "AltPron", "Dff Mme Str?" Str" and "Dff Meaning?" are for).

My proposal for dealing with doubled-elicitation types is this: Tell the script to skip each second word. Then, the editor has to take care of putting the words he wants in the right order. This could be a little less manual selection work than deleting every second elicitation we don't want.

LauraWae commented 7 years ago

A third thing that comes to mind is the question of what to do with introductory speech at the beginning of each recording. We tend to preserve that information even on the cleaned up and pre-edited sound files for the extracting, for documentary reasons. Is there a way the script could recognize a speech like "My name is X, I live in Y and I speak my language Z"? The script should then start only when the speaker starts his elicitation of the word list. It would probably have something to do with the Min speech interval duration.

LauraWae commented 7 years ago

As we are expecting about 25 completely new languages from Malakula within the next month, I thought of another feature that would broaden the script's functionality:

It could be nice to be able to apply this script to several sound and textgrid files at a time, like for example 25 cleaned up soundfiles with their textgrids which are all in one folder.

Linguista commented 7 years ago

My proposal for dealing with doubled-elicitation types is this: Tell the script to skip each second word. Then, the editor has to take care of putting the words he wants in the right order. This could be a little less manual selection work than deleting every second elicitation we don't want.

That's indeed a possible solution. It introduces a certain possibility of human error when copying and pasting, but nothing too serious, I should think.

I've just added two new options to the "Insert glosses" function: Doubled words (suppress first word) and Doubled words (suppress second word). The original option is now called Doubled words (all).
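As a rough illustration of how the three options might behave, here is a minimal Python sketch. It is not the Praat script's actual code: the function name, the mode strings and the assumption that every word is spoken exactly twice in consecutive speech intervals are all hypothetical.

```python
from typing import List, Optional


def assign_doubled_glosses(
    glosses: List[str], n_intervals: int, mode: str = "all"
) -> List[Optional[str]]:
    """Return one label per speech interval (None = interval left unlabelled).

    Assumes, hypothetically, that each gloss is elicited twice in a row.
    mode: "all", "suppress_first" or "suppress_second".
    """
    if mode not in {"all", "suppress_first", "suppress_second"}:
        raise ValueError(f"unknown mode: {mode}")

    labels: List[Optional[str]] = []
    for i in range(n_intervals):
        # Intervals 0 and 1 belong to word 0, intervals 2 and 3 to word 1, etc.
        word_idx, repetition = divmod(i, 2)
        if word_idx >= len(glosses):
            labels.append(None)  # more intervals than glosses
            continue
        suppressed = (mode == "suppress_first" and repetition == 0) or (
            mode == "suppress_second" and repetition == 1
        )
        labels.append(None if suppressed else glosses[word_idx])
    return labels
```

With "suppress_first" or "suppress_second", only one interval per word carries a gloss, so an extraction step that rejects repeated glosses would see each gloss once.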

I'll upload this as soon as other changes are done.

Linguista commented 7 years ago

We tend to preserve that information even on the cleaned up and pre-edited sound files for the extracting, for documentary reasons. Is there a way the script could recognize a speech like "My name is X, I live in Y and I speak my language Z"? The script should then start only when the speaker starts his elicitation of the word list.

Short of adding speech recognition and content analysis, I can't think of any way to accomplish this automatically -- to the script, speech is speech, whatever it's about.

But this is trivial to deal with manually -- just delete the "s" label from the interval(s) in which the speaker provides information about themselves and their language and then (re-)run the "Insert glosses" function.

Linguista commented 7 years ago

I've uploaded the newest version of the script (v3.2) and updated the wiki.

Linguista commented 7 years ago

It could be nice to be able to apply this script to several sound and textgrid files at a time, like for example 25 cleaned up soundfiles with their textgrids which are all in one folder.

Being that it's virtually impossible for everything to be segmented and labeled perfectly without at least some manual intervention, you're still going to have to deal with each recording individually, so recursive processing would save very little time, indeed.

Plus, if this were implemented you'd end up with 25 different Sound objects and 25 different TextGrid objects, all with nearly identical file names, loaded in Praat all at once. That's a recipe for confusion! (For example, it would make it all too easy to get confused about what TextGrid goes with what Sound object).

In short, the benefits would be almost nonexistent, and there would be certain drawbacks, so I don't think this is a feature that should be implemented.
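For anyone who wants to check outside Praat which TextGrid goes with which sound file in a folder, a minimal Python sketch follows. It assumes that a recording and its TextGrid share the same file name apart from the extension (.wav and .TextGrid); that naming convention and the function name are assumptions, not something the script requires.

```python
from pathlib import Path
from typing import List, Tuple


def pair_recordings(folder: Path) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Pair each .wav file with the .TextGrid file that has the same stem.

    Returns (pairs, unmatched), where unmatched lists files without a partner.
    """
    sounds = {p.stem: p for p in folder.iterdir() if p.suffix.lower() == ".wav"}
    grids = {p.stem: p for p in folder.iterdir() if p.suffix.lower() == ".textgrid"}

    # Stems present on both sides form a pair.
    pairs = [(sounds[s], grids[s]) for s in sorted(sounds.keys() & grids.keys())]

    # Anything present on only one side is reported for manual checking.
    unmatched = sorted(
        [sounds[s] for s in sounds.keys() - grids.keys()]
        + [grids[s] for s in grids.keys() - sounds.keys()]
    )
    return pairs, unmatched
```

This only lists the pairs; each recording would still need to be opened and checked individually, as described above.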

LauraWae commented 7 years ago

Being that it's virtually impossible for everything to be segmented and labeled perfectly without at least some manual intervention, you're still going to have to deal with each recording individually, so recursive processing would save very little time, indeed.

Yes, I can see that.

Plus, if this were implemented you'd end up with 25 different Sound objects and 25 different TextGrid objects, all with nearly identical file names, loaded in Praat all at once. That's a recipe for confusion! (For example, it would make it all too easy to get confused about what TextGrid goes with what Sound object).

Actually, the file names are not that similar (if you have a look at the Malakula file names for example). However, I also get your point that it makes sense to take care of each sound file individually. In any case, thanks for considering.

AvivaShimelman commented 7 years ago

Hi, Scott,

I'm currently in Seattle, Washington (GMT-8). I've noticed that a lot of your posts come in at what would be the end of the workday here. Do you stay up really late? Get up really early? Are you not on Central Europe time? Do you have a delay programmed on your email?

:)

A.

On 2/18/17, Scott Sadowsky notifications@github.com wrote:

Closed #442.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/lingdb/Sound-Comparisons/issues/442#event-968194164

-- Aviva Shimelman, PhD

Linguista commented 7 years ago

No fancy technology involved -- I'm back in Chile (GMT -3:00) now, plus I'm a natural born night owl ;-)

LauraWae commented 7 years ago

Hi Scott,

I have a follow up question that came up when starting on the Malakula files with the script.

The word list order of Aviva's recordings and the word list order of the index are not the same, so the automatic fill-in function is not working as it should. Attached you will find Aviva's wordlist order. Can you create a second kind of index that the script could use to fill in the words properly? Thanks.

And, because people liked it: If you feel like you need a joke, read this out loud: "Thanks, Scott, we have a new script." ;)

All the best to Chile,

Laura

Tryon 215 list with Bislama equivalents.xlsx

Linguista commented 7 years ago

Hi Laura,

We can certainly make it possible to use this other Malakula wordlist; we just have to decide how.

Can the sound files used with this new word list be systematically distinguished from the sound files used with the other word list? If so, the selection can be automated. If not, users will have to manually select the second wordlist.

On a different matter, the file you sent me is nothing like the other index files. The only field it seems to share with the other files is English word, which I'm assuming maps to gloss.

Other index files:

gloss  proto  ixelicitation  ixmorph
one    wan    1              0
two    tu     2              0
three  tri    3              0

New Malakula file:

Paul’s number  Tryon word number  English word  Bislama word
·              1                  Hand          han
·              2                  Left          lef

Now, all I really need is the gloss field, as this script does nothing with any other ones, but I wanted to make sure that this is indeed the correct list, and that the field English word is what you're using as gloss (as Bislama is a lingua franca in that part of the world, I can't be 100% sure).

Finally, I'm going to need to make a plain text version of this file for the script to use.
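To make the point about the gloss field concrete, here is a small Python sketch that pulls a single column out of an index file. It assumes the plain text version is tab-separated with a header row, which is a guess about the file layout. The function name and the gloss_column parameter are hypothetical.

```python
import csv
import io
from typing import List


def read_glosses(index_text: str, gloss_column: str = "gloss") -> List[str]:
    """Return the non-empty values of one column from tab-separated text."""
    reader = csv.DictReader(io.StringIO(index_text), delimiter="\t")
    if reader.fieldnames is None or gloss_column not in reader.fieldnames:
        raise ValueError(f"column not found: {gloss_column}")

    glosses = []
    for row in reader:
        # Short rows yield None for missing fields, so fall back to "".
        value = (row.get(gloss_column) or "").strip()
        if value:
            glosses.append(value)
    return glosses
```

For the new Malakula file, the same function could be called with gloss_column="English word", provided that field really is what serves as the gloss.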

And, because people liked it: If you feel like you need a joke, read this out loud: "Thanks, Scott, we have a new script." ;)

Is this in any way related to the fact that Bavarians always greet me, even when I'm nowhere nearby ("Grüß Scott")?

Linguista commented 7 years ago

There are two other possible issues with the new word list.

  1. Four of the words have initial caps ("Hand", "Left", "Right", "I"), whereas only "I" has them in the main Malakula list. I suspect that the post-processing script that is run later in the workflow will choke on the first three of these words, as they're lower case in the other index file. So is there any problem if I write the first three with initial lowercase letters?

  2. The new Malakula list doesn't follow some of the conventions of all the other index files that I've seen. Just taking into account the English word field, these include:
     a. Separating words with spaces rather than with the standard underscores (e.g. to laugh).
     b. Using slashes, which are carefully avoided in other lists (e.g. person/human).
     c. Using commas, which are also avoided (e.g. to stab, pirce [sic]).
     d. Giving verbs in the to x form, rather than the bare x form.
     e. Actually, many entries have entirely different glosses from the first Malakula list. The glosses are much shorter in the other lists, whereas here they seem to correspond to the long form in the translation templates (e.g. to live, be live vs. live).
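The formal points above (capitalisation, spaces, slashes, commas, the to x form) could be checked mechanically before the list is used. The following Python sketch is only an illustration: the function name and messages are made up, and the rule that "I" is the only gloss allowed an initial capital is taken from the observation in point 1.

```python
import re
from typing import List


def check_gloss(gloss: str) -> List[str]:
    """Return a list of convention problems found in one gloss (empty = OK)."""
    problems = []
    if " " in gloss:
        problems.append("contains spaces (underscores expected)")
    if "/" in gloss:
        problems.append("contains a slash")
    if "," in gloss:
        problems.append("contains a comma")
    if re.match(r"to[ _]", gloss):
        problems.append("verb given in 'to x' form")
    # "I" is the only gloss written with an initial capital in the main list.
    if gloss != "I" and gloss[:1].isupper():
        problems.append("initial capital")
    return problems
```

Running check_gloss over every entry of the English word field would give a list of entries to review by hand. It cannot detect the last point, entries whose gloss differs entirely from the first Malakula list.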

I'd suggest that all of you guys ( @LauraWae , @AvivaShimelman and @PaulHeggarty ) take a good look at the second Malakula word list before I proceed.

LauraWae commented 7 years ago

Hi Scott, That's my part, I will do it. Thanks.

LauraWae commented 7 years ago

Hi again, Scott.

Please find attached the corrected word list.

This word order has always been used by Aviva in her recordings, so there is no need to include an option to select or any other mechanism. (The order of the index file follows the index number of each word.)

Malakula_wordorder_equals_elicationorder.txt

Linguista commented 7 years ago

Hi Laura,

This word order has always been used by Aviva in her recordings, so there is no need to include an option to select or any other mechanism. (The order of the index file follows up the index number of each word.)

I'm a bit confused now. Do you still need to use both Malakula wordlists, and therefore need a way to select one or the other?

Or does this wordlist replace the other? Because in this case, you should be able to just delete the old one, give this one the same name that the old one had, and add a first line as a header (anything will do; the script discards it), and then start using this new list without me doing anything at all.
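Those manual steps (reuse the old file name, add a throwaway first line) could also be done with a few lines of Python. This is just a sketch: the function name, the default header text and the UTF-8 encoding are assumptions, and the paths are whatever the old and new wordlist files are actually called.

```python
from pathlib import Path


def replace_wordlist(new_list: Path, old_list: Path, header: str = "gloss") -> None:
    """Overwrite old_list with the contents of new_list, preceded by a header line.

    The header text is arbitrary, since the script is said to discard the first line.
    """
    body = new_list.read_text(encoding="utf-8")
    old_list.write_text(header + "\n" + body, encoding="utf-8")
```

Keeping a backup copy of the old wordlist before overwriting it would be a sensible precaution.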

LauraWae commented 7 years ago

Hi Scott, It actually does replace the other. Thanks for the hints for how to create the new list.