timmahrt / ProMo

Prosody Morph: A Python library for manipulating pitch and duration in an algorithmic way, for resynthesizing speech.
MIT License

issue running examples #2

Closed yyf closed 7 years ago

yyf commented 7 years ago

Hi, thanks for the great library. I'm having some issues running the examples:

For pitch_morph_example.py

```
Traceback (most recent call last):
  File "pitch_morph_example.py", line 63, in <module>
    praatEXE=praatEXE)
  File "build/bdist.macosx-10.12-x86_64/egg/promo/f0_morph.py", line 84, in f0Morph
promo.f0_morph.MissingPitchDataException:

No data points available in a region for morphing. Two data points are needed in each region to do the morph. Regions with fewer than two samples are skipped, which should be fine for some cases (e.g. unvoiced segments). If you need more data points, see promo.morph_utils.interpolation
```

Maybe I missed some steps?

timmahrt commented 7 years ago

There was a bug in one of the dependencies, praatio, which I patched last night. When did you install promo/praatio?

Try doing a fresh reinstall: `pip install promo --upgrade`

Or at least reinstall praatio: https://github.com/timmahrt/praatIO

If you still get the error, let me know. Thanks!

timmahrt commented 7 years ago

For clarification, it was the exact same error (same file and same line): https://travis-ci.org/timmahrt/ProMo/jobs/230355605

yyf commented 7 years ago

Thanks @timmahrt, upgrading both resolved the issue.

Still trying to get myself familiarized with praat. Wondering what steps are needed to create the TextGrid file for an arbitrary wav file, so as to put it in the files folder?

Can you provide an example using praatIO to create textgrids using data from other sources?

Thanks

timmahrt commented 7 years ago

To create an empty textgrid for an arbitrary audio file, the only piece of information you need is the duration of the audio file. The textgrid must also have at least one tier. A tier is either a point tier or an interval tier. For example, if you wanted to mark all of the places where there was audio clipping, you'd use a point tier; if you wanted to mark all of the words in a recording, you'd use an interval tier.

```python
import os

from praatio import tgio
from praatio import audioio

wavFN = "Full/path/to/file/myAudio.wav"
tgFN = os.path.splitext(wavFN)[0] + ".TextGrid"

duration = audioio.WavQueryObj(wavFN).getDuration()
# tier name, list of intervals, tier start time, tier end time
tier = tgio.IntervalTier("words", [], 0, duration)

tg = tgio.TextGrid()
tg.addTier(tier)
tg.save(tgFN)
```
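
If you wanted a point tier instead (e.g. for the clipping example above), the construction is analogous. A sketch, assuming tgio.PointTier takes the same name/entry-list/start/end arguments as IntervalTier, with (time, label) entries:

```python
# tier name, list of (time, label) points, tier start time, tier end time
clipTier = tgio.PointTier("clipping", [], 0, duration)
```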

Later on, you can access the list of intervals or points using `tier.entryList`.

If you do alter it, it's generally best to work on a fresh copy of the list and create a new version of the tier and textgrid, like so:

```python
# Replacing the labels with '-'
newEntryList = [(start, stop, "-") for start, stop, label in tier.entryList]
tg.replaceTier(tier.name, newEntryList)
```

Hmmm... I'm going to write up a formal tutorial for the library.

Let me know if you have any more questions!

yyf commented 7 years ago

Thanks, this is helpful and a formal tutorial will be great too.

Tried to run the pitch morph example using two audio files with their associated TextGrid files, but ran into a KeyError:

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in <module>()
     15 # We'll use textgrids for this purpose.
     16 tierName = "PhonAlign"
---> 17 fromPitch = f0_morph.getPitchForIntervals(fromPitch, fromTGFN, tierName)
     18 toPitch = f0_morph.getPitchForIntervals(toPitch, toTGFN, tierName)

/usr/local/lib/python2.7/site-packages/promo-1.2.5-py2.7.egg/promo/f0_morph.pyc in getPitchForIntervals(data, tgFN, tierName)
     37     '''
     38     tg = tgio.openTextGrid(tgFN)
---> 39     data = tg.tierDict[tierName].getValuesInIntervals(data)
     40     data = [dataList for _, dataList in data]
     41
KeyError: 'PhonAlign'
```

Do I need to specify PhonAlign as a key when writing the TextGrid file?

timmahrt commented 7 years ago

You are using your own wav and textgrid files?

With morphing, an important idea is that you can choose the regions to morph. At a base level, no textgrid is necessary. You can just morph the pitch contour of one file to that of another (I'll come back to this in a bit).

However, if we want to morph regions, we need to have the same number of regions in the target and source files. In pitch_morph_example.py, I use textgrids for this purpose. You can call your tiers whatever you want. In the example files, the target tier name is "PhonAlign".

For your data, you should use whatever tiers make sense. "Word"? "Utterances"? Etc. "PhonAlign" is not a magic or reserved word; it's just what I picked in the example file.

Let's say in your source and target textgrids there is a tier called "word". In that case, you should put `tierName = "word"`. The two tiers should have the same number of labeled segments. They can be labeled anything (except empty strings or pure whitespace), and the labeled intervals do not have to have the same duration; they just have to have the same number.

Let's say your tiers have three labeled intervals each. The command f0_morph.getPitchForIntervals() should return a list with three sublists. Each sublist contains the f0 data for that segment. The f0 data are the raw pitch values recorded at regular intervals.

```python
fromPitch = f0_morph.getPitchForIntervals(fromPitch, fromTGFN, tierName)
# fromPitch --> [[110, 111, 109], [170, 165, 160], [98, 100, 105, 110, 115]]
```

Ok, so if you just want to morph one utterance to another without bothering with individual segments, you don't even need textgrids. You can just do this: `fromPitch = [fromPitch, ]`

f0Morph expects a list of lists. audioToPI returns a list. So if you just want to morph across a whole utterance, the above trick will do what you need. For individual sentences or segments shorter than that, this may work ok. For longer segments, the results will be garbage.
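
Putting the pieces together as a rough sketch (the paths, pitch range, and the 'to' variable names are placeholders; audioToPI() takes the directory and file name as separate arguments):

```python
from praatio import pitch_and_intensity

root = "/full/path/to/files"             # placeholder paths
fromWavFN, toWavFN = "from.wav", "to.wav"
fromPitchFN, toPitchFN = "from.txt", "to.txt"
praatEXE = "/Applications/Praat.app/Contents/MacOS/Praat"
minPitch, maxPitch = 50, 350

# audioToPI() returns a flat list of pitch samples for the whole file
fromPitch = pitch_and_intensity.audioToPI(root, fromWavFN, root, fromPitchFN,
                                          praatEXE, minPitch, maxPitch)
toPitch = pitch_and_intensity.audioToPI(root, toWavFN, root, toPitchFN,
                                        praatEXE, minPitch, maxPitch)

# f0Morph() expects a list of sublists (one per region), so treat each
# whole utterance as a single region
fromPitch = [fromPitch, ]
toPitch = [toPitch, ]
```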

timmahrt commented 7 years ago

To be short and explicit: The error you received is saying that there is no tier in your textgrid called 'PhonAlign'. You should change tierName to match one of the interval tiers in your textgrid. That tier must have at least one labeled interval. The number of intervals in that tier must match between the two textgrids.

timmahrt commented 7 years ago

Progress is coming along on the tutorial for praatio. I hope it will be useful (to the community at large). I'll post here once I upload something.

yyf commented 7 years ago

Thanks for the detailed explanation, and yes, I'm trying to use my own wav files and their associated textgrid files. Still trying to figure out how to set tierName properly. How should I examine the interval tiers in my textgrids? I assume I just open the textgrid file in the praat app? Do I have to manually label the intervals or are they supposed to be there already in the file generated by pitch_morph_example.py?

It would be helpful to have a simple morphing example and more advanced ones, where the simple one is just morphing one file to another without textgrids/segments, i.e. fromPitch = [fromPitch, ]

timmahrt commented 7 years ago

> Still trying to figure out how to set tierName properly.

I'm not sure I understand. Can you explain the nature of your data? You have a collection of audio files, I'm assuming. Do you have existing transcripts? Are the transcripts in .TextGrid format?

What is the task that you would like to do? For example, along the lines of: "I have some sentence-long recordings and I would like to morph the pitch between different files, at the word level, but I haven't transcribed my data yet".

> Do I have to manually label the intervals or are they supposed to be there already in the file generated by pitch_morph_example.py?

Yes, unfortunately the intervals will have to be created by some other system.

You can always manually create the intervals in praat. If you have never used praat to annotate audio files, this is a good tutorial that covers the basics: https://youtu.be/64cclyKVJZ4?t=100

I recommend opening up the examples provided in promo using praat. And then open up your own audio files in praat. I think it might make it clearer what the textgrid is.

If you have a lot of data, this might not be practical or possible. There are ways to automatically annotate your data. Depending on your data and the task you want to do, this could be easy or it could be difficult. For example, if you have clean recordings of English sentences where the speaker was reading from a script, you can use a forced aligner like SPPAS or EasyAlign (a plugin for praat), which will automatically transcribe your data with high accuracy for free. http://www.sppas.org/ http://latlcui.unige.ch/phonetique/easyalign.php

> It would be helpful to have a simple morphing example and more advanced ones, where the simple one is just morphing one file to another without textgrids/segments, i.e. fromPitch = [fromPitch, ]

Here is an example that does not use textgrids. This will be added to a promo tutorial (which I'll work on after I finish the praatio tutorial): https://www.dropbox.com/s/9bext4torjziexc/morph_examples_no_textgrids.py?dl=0

yyf commented 7 years ago

Noticed the textgrid file for my own wav file doesn't have the info that's in your mary1.TextGrid file. Wondering what's the process to properly generate a textgrid file? In the standalone praat app first?

timmahrt commented 7 years ago

If you want to create a TextGrid file manually in praat, this video shows how https://youtu.be/64cclyKVJZ4?t=100

Earlier I gave an example of how to programmatically generate TextGrid files from audio. Did you have problems running this code or did you have questions about it?

```python
import os

from praatio import tgio
from praatio import audioio

wavFN = "Full/path/to/file/myAudio.wav"
tgFN = os.path.splitext(wavFN)[0] + ".TextGrid"

duration = audioio.WavQueryObj(wavFN).getDuration()
# tier name, list of intervals, tier start time, tier end time
tier = tgio.IntervalTier("words", [], 0, duration)

tg = tgio.TextGrid()
tg.addTier(tier)
tg.save(tgFN)
```

yyf commented 7 years ago

Sorry for the confusion. I was able to programmatically generate the textgrid file. Diving into the video tutorials now, and I'll see if I can get the example running using my own wav files.

A flow chart that illustrates how to use ProMo with other systems, i.e. annotation in praat, could be helpful, too.

BTW, the example without textgrids works. Thanks.

timmahrt commented 7 years ago

How goes transcribing your textgrids and using praat?

I've released a new version of praatio and ProMo. I updated lots of documentation and tried to streamline the interface. It's hopefully easier to use now.

```
pip install praatio --upgrade
pip install promo --upgrade
```

I've finished the first praatio tutorial: https://nbviewer.jupyter.org/github/timmahrt/praatIO/blob/master/tutorials/tutorial1_intro_to_praatio.ipynb or find it in the /tutorials/ folder of praatio: https://github.com/timmahrt/praatIO

If you go through it, I'd appreciate any feedback you have.

I'll need to step away from this for a while. Maybe I can work on the promo tutorial over the weekend.

yyf commented 7 years ago

Still a work in progress, but I went through the praatIO tutorial. It's super informative, thanks for writing it up. It will be interesting to see a tutorial on ProMo too. In ProMo, is there any fundamental limitation on speech resynthesis in terms of perceptual quality?

timmahrt commented 7 years ago

There are roughly three limitations (that I can think of at the moment).

1) The more you manipulate the pitch contour, the more distorted the signal becomes. If you take a word that has a sharp fall and you turn it into a sharp rise, you can expect distortion. It may or may not affect what you are trying to do.

2) There can be correlations between segmental and prosodic phenomena. For example, in English there is a phenomenon called focus, which is used, among other things, to introduce new information ("Who ate the cheese? [Tom] ate the cheese." 'Tom' is focused; 'cheese' would be given, or 'unfocused'). Words with focus receive a pitch accent and greater articulation than words without focus. In the example I gave, 'Tom' will be produced with greater articulation and 'cheese' with less.

Let's say you reversed the contour. You map the pitch of "Tom ate the [cheese]" onto "[Tom] ate the cheese". It might sound ok. Or it might not, because the pitch contour mismatches the focus information in the consonants and vowels.

If you can carefully control how the sentences are produced, it's possible to get around this issue. And it might not be a problem at all, but it has been a problem before in my data.

3) Are you familiar with voicing? https://en.wikipedia.org/wiki/Voice_(phonetics)

Pitch is conveyed through F0, which only exists for voiced segments. Vowels are voiced, but many consonants are not. If your pitch manipulations are fine-grained and you have lots of voiceless consonants in your utterances, there may be no audible difference in the resynthesized recordings.

yyf commented 7 years ago

Thanks @timmahrt

  1. Wondering if there are any quantitative or statistical metrics for these limitations, for example the maximum pitch change for a given duration while maintaining the identity of the voice?
  2. To articulate focus, curious what are some common control parameters other than intensity, pitch, and duration? In general and in Praat. This might be a bit off the original topic, if you don't mind.
  3. In terms of the degree of voicing, what might be the closest to 'voice onset time' in Praat?

I guess there is not an automatic way to separate voiced and voiceless segments yet. This will still have to be done manually at the stage of annotation in the TextGrid, correct? Really appreciate your detailed explanations and tutorials.

timmahrt commented 7 years ago

  1. Identity is rather subjective and includes voice qualities other than just pitch. You can make someone's voice sound deeper or higher, but how much manipulation is necessary to 'trick' someone into thinking the speaker is someone else depends on the speaker and the particular listener, I guess.

I will cover this point a bit in my ProMo tutorial, with some examples.

If you're trying to change the speaker's identity, you might have fun playing with the changeGender function in praat. Select an audio file in praat. Then press Convert >> Change Gender.

or in praatio:

```python
from praatio import praat_scripts

praat_scripts.changeGender()
```

  2. If you wanted to control for the non-prosodic aspects of focus, you would have to use splicing. Splicing involves inserting a speech segment into a place where it wasn't said. So we could record the utterances 'Bob' and 'Father'; then, using splicing, we could create new words like 'Fob' or 'Bother'. If you've transcribed each phone, the splicing process can be done automatically using praatio. Splicing works best if the speaker is the same in both recording samples and if the replaced material was said in the same context as the new material, thanks to coarticulation effects.

Splicing is a general-use technique. If you are working on very specific sounds, you might be able to apply a sound-specific solution. For example, there has been a lot of work done on the manipulation of voice onset time.

Just this week I was working with a focused production of the word 'him' that I needed unfocused. Manipulating pitch was not enough, but I found that removing about half of the 'h' sound led to a more natural unfocused production of 'him' (I didn't even need to worry about the 'i' or 'm'). I determined this ahead of time by getting the duration of 'h' when 'him' is focused and when 'him' is unfocused, and found them to be very different (~0.1 seconds long compared to ~0.03 seconds long in my small lab-produced dataset). Unvoiced fricatives can generally be chopped up without much care because they're just noise. More care is needed with other sounds.
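
Not how I did it, but here is a rough pure-stdlib sketch of that kind of cut (the file names and times are hypothetical; it removes a span of audio using frame-aligned byte offsets):

```python
import wave

def spliceOut(wavFN, outputFN, startT, endT):
    # Remove the audio between startT and endT (in seconds) and write
    # what remains to outputFN
    wav = wave.open(wavFN, "rb")
    params = wav.getparams()
    frameRate = wav.getframerate()
    bytesPerFrame = wav.getsampwidth() * wav.getnchannels()
    frames = wav.readframes(wav.getnframes())
    wav.close()

    startI = int(startT * frameRate) * bytesPerFrame
    endI = int(endT * frameRate) * bytesPerFrame

    outputWav = wave.open(outputFN, "wb")
    outputWav.setparams(params)  # header is patched with the new length on close
    outputWav.writeframes(frames[:startI] + frames[endI:])
    outputWav.close()

# e.g. an 'h' annotated from 0.40s to 0.50s; cut out its first half
spliceOut("him_focused.wav", "him_shortened.wav", 0.40, 0.45)
```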

  3. Voice onset time is the time when voicing begins after a burst. For voiced stops like 'b' it can be negative (voicing begins before the burst; sometimes in American English 'bye' will be emphasized by starting the voicing earlier, and the end result comes out like 'mbye'). Otherwise, VOT will be positive but very small, and for unvoiced stops like 'p' it will be a larger, positive value.

What are you trying to do with VOT? I have a colleague who works on manipulating VOT if you have questions. From her I get the impression that it is not easy to get good quality results.

I haven't used it before, but there is a tool for automatically measuring VOT: https://github.com/mlml/autovot

> I guess there is not an automatic way to separate voiced and voiceless segments yet.

This is actually trivial to do in praat. Select an audio file in praat and click "view". In the window that pops up, select the far right option "Pulses >> Show pulses". These pulses are "glottal pulses"--each is one movement of the vocal folds.

What do you want to do with that information?

> This will still have to be done manually at the stage of annotation in the TextGrid, correct?

What are you trying to annotate? If you want to annotate sub-phonemic information (like VOT), then yes, you'll likely have to do that by hand. If you want to annotate the words in an utterance, you will have to do that manually, unless you want to see if some speech recognition tools work for you. If you want to annotate the phones in an utterance, I recommend you use a forced aligner like SPPAS or EasyAlign (I linked to these earlier).

> Really appreciate your detailed explanations and tutorials.

My pleasure! I've put quite a bit of work into the ProMo tutorial but it still needs some work. Maybe I can finish it this weekend.

timmahrt commented 7 years ago

I've got the pitch manipulation tutorial online. Two parts are up with another two planned. https://nbviewer.jupyter.org/github/timmahrt/ProMo/blob/master/tutorials/tutorial1_1_intro_to_promo.ipynb

If you have a chance to go through it and have any feedback, please let me know!

yyf commented 7 years ago

Thanks for the tutorials, they are helpful.

Ran into this issue: `AttributeError: 'module' object has no attribute 'audioToPI'` in a virtual environment (flask) when I called the following:

```python
fromPitch = pitch_and_intensity.audioToPI(root, fromWavFN, root, fromPitchFN,
                                          praatEXE, minPitch, maxPitch,
                                          forceRegenerate=False)
```

Double-checked that it works fine in my iPython environment and my regular python environment. Any idea why the function is not there under a virtual environment?

Thanks.

timmahrt commented 7 years ago

Glad to hear the tutorials are helpful. I'm pretty busy at the moment so it will likely be a few months before I can add more, but I do have plans for them eventually.

I believe you just need to update the praatio library in your virtual environment and change the instances of audioToPI() to extractPI().

If that doesn't work, let me know. Thanks!

yyf commented 7 years ago

Got a different error after changing it to extractPI(). Any suggestions? Maybe it's related to my use of os.path.abspath in a virtualenv?

File "/Users/...app.py", line 121, in up
    fromPitch = pitch_and_intensity.extractPI(root, fromWavFN, root, fromPitchFN, praatEXE, minPitch, maxPitch, forceRegenerate=False)

File "/Users/...venv/lib/python2.7/site-packages/praatio/pitch_and_intensity.py", line 284, in extractPI
    pitchQuadInterp=pitchQuadInterp)

File "/Users/...venv/lib/python2.7/site-packages/praatio/pitch_and_intensity.py", line 97, in _extractPIFile
    utils.makeDir(outputPath)

File "/Users/...venv/lib/python2.7/site-packages/praatio/utilities/utils.py", line 146, in makeDir
    os.mkdir(path)

OSError: [Errno 2] No such file or directory: ''
timmahrt commented 7 years ago

Sorry for the problems. The function extractPI() takes different arguments than the old audioToPI(). Here is the new argument list:

```python
extractPI(inputFN, outputFN, praatEXE, minPitch, maxPitch,
          sampleStep=0.01, silenceThreshold=0.03, forceRegenerate=True,
          tgFN=None, tierName=None, tmpOutputPath=None, undefinedValue=None,
          medianFilterWindowSize=0, pitchQuadInterp=False)
```

The older function took the file path and the file name as separate arguments, while all of my other functions take the full path to a file as an argument. I made this change so that this function is more consistent with the rest of my code.
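
With the new signature, a call with full paths might look like this (the paths are placeholders; 50/350 are just example pitch bounds):

```python
import os

from praatio import pitch_and_intensity

praatEXE = "/Applications/Praat.app/Contents/MacOS/Praat"
wavFN = "/full/path/to/myAudio.wav"            # input audio, full path
pitchFN = os.path.splitext(wavFN)[0] + ".txt"  # pitch data output, full path

fromPitch = pitch_and_intensity.extractPI(wavFN, pitchFN, praatEXE,
                                          minPitch=50, maxPitch=350,
                                          forceRegenerate=False)
```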

My last message stated that no further changes would be needed but that wasn't true. Sorry for the error.

yyf commented 7 years ago

Thanks, that solved the argument issue, but it led to a Praat execution failure in the same virtualenv. I'm still checking whether all my paths/arguments are correct.

```
Traceback (most recent call last):
  ...
  File "...//app.py", line 143, in up
    pitchQuadInterp=False)
  File "...//venv/lib/python2.7/site-packages/praatio/pitch_and_intensity.py", line 284, in extractPI
    pitchQuadInterp=pitchQuadInterp)
  File "...//venv/lib/python2.7/site-packages/praatio/pitch_and_intensity.py", line 120, in _extractPIFile
    utils.runPraatScript(praatEXE, scriptFN, argList)
  File "...//venv/lib/python2.7/site-packages/praatio/utilities/utils.py", line 208, in runPraatScript
    raise PraatExecutionFailed(cmdList)
PraatExecutionFailed:
Praat Execution Failed.  Please check the following:
- Praat exists in the location specified
- Praat script can execute ok outside of praat
- script arguments are correct

If you can't locate the problem, I recommend using absolute paths rather than relative paths and using paths without spaces in any folder or file names

Here is the command that python attempted to run:
/Applications/Praat.app/Contents/MacOS/Praat --run ...//venv/lib/python2.7/site-packages/praatio/praatScripts/get_pitch_and_intensity.praat .../myFolder/17-07-05_21-29-08.wav .../myFolder/17-07-05_21-29-08.txt 0.01 50 350 0.03 -1 -1 0 0
```

timmahrt commented 7 years ago

Are you using full paths? That output looks strange. You should be able to copy and paste the output (the bit under "Here is the command that python attempted to run") into a command window and have it run independently of python. If it runs ok in the command window, then python should be able to run it too.

And if it can't run ok in the command window, then python won't be able to run it.

Does that help? Tim


yyf commented 7 years ago

It was indeed a full path issue. extractPI() is working in my virtualenv : ]

Exploring F0morph() now. Wondering what's the recommended range of file-length difference (in milliseconds or samples) for F0morph() to work nicely, i.e. the duration difference between one wav file and the other? Is there a need to preprocess the files so they roughly align, within some percentage, in terms of silence and voiced sections?

Thanks

timmahrt commented 7 years ago

F0morph() does not require the two files to be the same length. F0morph() uses proportional time for the target pitch contours (it will map the start of contour A to the start of contour B, and the end of contour A to the end of contour B, regardless of the times at which those starts and ends occur).
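
The mapping idea in a few lines (an illustrative sketch, not ProMo's actual code):

```python
def proportionalTimes(fromTimes, toStart, toEnd):
    # Rescale one contour's sample times onto another contour's span:
    # the first sample maps to toStart, the last to toEnd, and everything
    # in between keeps its relative position
    fromStart, fromEnd = fromTimes[0], fromTimes[-1]
    scale = (toEnd - toStart) / float(fromEnd - fromStart)
    return [toStart + (t - fromStart) * scale for t in fromTimes]

print(proportionalTimes([0.0, 0.5, 1.0], 0.0, 2.0))  # [0.0, 1.0, 2.0]
```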

The answer to your question depends on A) the language and B) the kinds of recordings you are morphing.

Japanese, French, English, and Chinese all use word-level intonation very differently.

If you are working with recordings that are very similar, you might not need to change anything, even if the durations are different, e.g. "John kicked the ball to Mike" and "Bob lobbed the can at Todd".

You can probably morph between those with no problem. But for a structurally different sentence like "Tom praised Fred for winning", or even worse "For winning, Tom praised Fred", the output won't make sense.

In Chinese, words are differentiated by the shape of the f0 contour that falls over the word. It probably doesn't make sense to map the pitch between different sentences.

What language are you trying to work with and what kind of data do you have?


timmahrt commented 7 years ago

If it's more convenient, I've set up a gitter page which has public and private messaging: https://gitter.im/pythonProMo/Lobby

Also, I don't think I answered your question: "Is there a need to preprocess the files so they roughly align, within some percentage, in terms of silence and voiced sections?"

Absolutely not. However, silences and unvoiced sections do pose a problem: the pitch tracker will have no data for those regions. To get around this issue, the function praatio.pitch_and_intensity.extractPitch() has an optional argument 'pitchQuadInterp'. If true, the pitch contour will be interpolated.

This is good for very short silences and for unvoiced regions. It is probably not appropriate for long silences, for example if someone is reading sentences and pausing after each one. In cases like that, you would need to preprocess the speech into chunks.
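
For example (placeholder paths again; pitchQuadInterp is the same flag shown in the extractPI() signature earlier in this thread):

```python
from praatio import pitch_and_intensity

praatEXE = "/Applications/Praat.app/Contents/MacOS/Praat"
wavFN = "/full/path/to/myAudio.wav"
pitchFN = "/full/path/to/myAudio.txt"

# interpolate the contour over short unvoiced gaps during extraction
pitchData = pitch_and_intensity.extractPI(wavFN, pitchFN, praatEXE,
                                          minPitch=50, maxPitch=350,
                                          pitchQuadInterp=True)
```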

yyf commented 7 years ago

Thanks for setting gitter up, I was thinking about the same thing.

Also, I'm working primarily with English, as normal wav files. Gonna try out the interp option.

Closing the issue now.