Separation of non-verbal vocalizations

weili-git commented 1 year ago

Thank you so much for providing this dataset. I am trying to use it for speech synthesis, but the non-verbal sounds strongly affect the results. Is there a preprocessed version of EmoV-DB in which the laughs and yawns have been removed? If not, could you show me in detail how to remove them using the gentle toolkit?

noetits commented 1 year ago

Hello,

Although I put some information about gentle back then, today I think I would proceed with MFA (montreal-forced-aligner).

Installation
Phone alignment of a dataset with their acoustic and g2p models

For this, you would have to

make a restructured copy of the dataset so that you have 1 big folder with pairs of wav files and txt files. Each txt file would therefore contain 1 sentence.
run MFA on it; it will generate TextGrid files with the phone alignments
read these TextGrid files in Python to get e.g. a pandas DataFrame of phones. Below is a function I use for that, based on the textgrid package
You could then discard non-verbal expressions by extracting only the audio from the start of the first phoneme to the end of the last phoneme.

If you do that, it would be nice if you could give your full script here (transformation of dataset + alignment), so that I can add it (or do a pull request with this additional file so that I can just accept it).
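The restructuring step could be sketched roughly as follows. This is a minimal sketch using only the standard library; `flatten_corpus` and the `transcript_for` callback are hypothetical names (not part of the repository), and how a wav file maps to its sentence is left to the caller.

```python
import shutil
from pathlib import Path


def flatten_corpus(src_root, dst_dir, transcript_for):
    """Copy every wav under src_root into dst_dir and write a matching txt file.

    transcript_for is a caller-supplied function (hypothetical) that maps a
    wav Path to its sentence. Keep dst_dir outside src_root so that a second
    run does not pick up the copies.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for wav in sorted(Path(src_root).rglob("*.wav")):
        # Prefix with the parent folder name so that files with the same
        # name in different folders cannot overwrite each other.
        stem = f"{wav.parent.name}_{wav.stem}"
        shutil.copy2(wav, dst / f"{stem}.wav")
        (dst / f"{stem}.txt").write_text(transcript_for(wav) + "\n", encoding="utf-8")
        copied += 1
    return copied
```

The function returns the number of wav/txt pairs written, which is a quick way to compare against the expected size of the dataset.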

import pandas as pd
import textgrid

def get_all_phone_with_timings(f='data/librispeech_alignments/dev-clean/8842/304647/8842-304647-0013.TextGrid'):
    """Get all phones of a sentence from the phone tier tg[1], drop silence and empty intervals, and return a DataFrame.
    """
    tg = textgrid.TextGrid.fromFile(f)
    # keep phones, dropping "sil", "sp", "spn" and empty marks
    phones = [[el.minTime, el.maxTime, el.mark] for el in tg[1] if el.mark not in ['sil', 'sp', '', 'spn']]
    # passing the column names here also works when no phone is left
    phones = pd.DataFrame(phones, columns=["start", "end", "phone"])
    return phones
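
The last step (keeping only the audio between the first and the last phone) could then look like this. It is a minimal sketch, assuming uncompressed PCM wav files and using only the standard-library `wave` module; `speech_span` and `trim_wav` are hypothetical helper names. It takes plain `(start, end, phone)` rows in seconds, e.g. `phones.values.tolist()` from the DataFrame above.

```python
import wave


def speech_span(phones):
    """Return (start, end) in seconds, from the first phone to the last one."""
    if not phones:
        raise ValueError("no phones left after filtering")
    return min(p[0] for p in phones), max(p[1] for p in phones)


def trim_wav(src, dst, start, end):
    """Write the part of src between start and end (in seconds) to dst."""
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        first = int(start * rate)
        count = max(0, int(end * rate) - first)
        # Clamp the start position so it never points past the end of the file.
        reader.setpos(min(first, reader.getnframes()))
        frames = reader.readframes(count)
    with wave.open(str(dst), "wb") as writer:
        writer.setparams(params)  # the frame count in the header is fixed on close
        writer.writeframes(frames)
```

Note that this only removes what lies before the first phone and after the last one; a laugh in the middle of a sentence stays in the audio.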

weili-git commented 1 year ago

Thank you so much for your quick reply. I tried to generate the TextGrid files using MFA but got some errors. Here is my script.

import os
import shutil
import requests
import tarfile

class Emov:
    def __init__(self):
        pass

    def prepare_mfa(self, clean=False):
        def remove_punct(string):
            # replace every punctuation character with a space, then lowercase
            punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
            for x in punctuations:
                string = string.replace(x, " ")

            return string.lower()
        # create the textfile with the same name of wavfile

        # 1. read transcripts
        with open("EMOV-DB/cmuarctic.data", "r") as rf:
            lines = rf.readlines()

        label_to_transcript = {}

        for line in lines:
            line = line.split('"')
            sent = line[1]
            label = line[0].rstrip().split('_')[-1]
            if label[0] == "b":
                continue
            label = label[1:]
            sent = remove_punct(sent) # remove punct
            sent = sent.replace("1908", "nineteen o eight")
            sent = sent.replace("18", "eighteen")
            sent = sent.replace("16", "sixteen")
            sent = sent.replace("nightglow", "night glow")
            sent = sent.replace("mr ", "mister ")
            sent = sent.replace("mrs ", "missus ")
            sent = sent.replace("  ", " ")
            label_to_transcript[label] = sent

        # 2. scan wavfiles and create textfiles
        for speaker in range(1, 5):
            speaker_path = os.path.join("EMOV-DB", str(speaker))
            for audio in os.listdir(speaker_path):
                if audio[-4:] == ".wav":
                    textfile = audio[:-4] + ".lab"
                    label = audio.split('_')[-1].split('.')[0]
                    transcript = label_to_transcript[label]
                    if clean:
                        os.remove(os.path.join(speaker_path, textfile))
                    else:
                        with open(os.path.join(speaker_path, textfile), 'w') as wf:
                            wf.write(transcript)

    def download(self):
        download_links = [
            "https://www.openslr.org/resources/115/bea_Amused.tar.gz",
            "https://www.openslr.org/resources/115/bea_Angry.tar.gz",
            "https://www.openslr.org/resources/115/bea_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/bea_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/bea_Sleepy.tar.gz",

            "https://www.openslr.org/resources/115/jenie_Amused.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Angry.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Sleepy.tar.gz",

            "https://www.openslr.org/resources/115/josh_Amused.tar.gz",
            "https://www.openslr.org/resources/115/josh_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/josh_Sleepy.tar.gz",

            "https://www.openslr.org/resources/115/sam_Amused.tar.gz",
            "https://www.openslr.org/resources/115/sam_Angry.tar.gz",
            "https://www.openslr.org/resources/115/sam_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/sam_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/sam_Sleepy.tar.gz",

            "http://www.festvox.org/cmu_arctic/cmuarctic.data"
        ]

        target_directories = [

            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",

            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",

            "EMOV-DB/3",
            "EMOV-DB/3",
            "EMOV-DB/3",

            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",

            "EMOV-DB"
        ]

        for directory in target_directories:
            os.makedirs(directory, exist_ok=True)

        for link, target_directory in zip(download_links, target_directories):
            filename = os.path.basename(link)
            file_path = os.path.join(target_directory, filename)

            response = requests.get(link, stream=True)
            if response.status_code == 200:
                with open(file_path, 'wb') as file:
                    for chunk in response.iter_content(1024):
                        file.write(chunk)
                print(f"download succeeded: {filename}")

                if filename[-5:]!=".data":
                    with tarfile.open(file_path, 'r:gz') as tar:
                        tar.extractall(path=target_directory)
                    os.remove(file_path)
            else:
                print(f"download failed:{filename}")

dataset = Emov()
# dataset.download()
dataset.prepare_mfa()

# mfa validate /home/weili/data/EMOV-DB english_us_arpa english_us_arpa

# mfa g2p /home/weili/Documents/MFA/EMOV-DB/oovs_found_english_us_arpa.txt english_us_arpa /home/weili/data/EMOV/g2pped_oovs.txt --dictionary_path english_us_arpa

# mfa model add_words english_us_arpa /home/weili/data/EMOV/g2pped_oovs.txt

# mfa align /home/weili/data/EMOV-DB english_us_arpa english_us_arpa /home/weili/data/EMOV

I followed the guidance to add the OOVs to the dictionary. But when I ran the "mfa align" command, it raised an IndexError like this:

...
Collecting phone and word alignments from alignment lattices...  
...
Job 3 encountered an error:
Traceback (most recent call last):

  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/abc.py", line 92, in run
    yield from self._run()

  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2389, in _run
    ) = self.cleanup_intervals(utterance, intervals)

  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2018, in cleanup_intervals
    cur_word = word_pronunciations[words_index]

IndexError: list index out of range

Here is a similar error.

I have no idea how to solve this since I am not very familiar with MFA. I know this problem is related to MFA rather than to the speech dataset itself, but I would appreciate it if you could tell me where the problem is. Thank you so much!
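
One thing that can be checked independently of MFA is whether every wav file has a usable transcript next to it. The following is only a sanity-check sketch, not a known fix for the IndexError above; `check_pairs` is a hypothetical helper and assumes the `.wav`/`.lab` naming produced by `prepare_mfa`.

```python
from pathlib import Path


def check_pairs(corpus_dir):
    """Return (wavs without a .lab file, wavs whose .lab is empty or contains digits)."""
    missing, suspicious = [], []
    for wav in sorted(Path(corpus_dir).rglob("*.wav")):
        lab = wav.with_suffix(".lab")
        if not lab.exists():
            missing.append(wav)
            continue
        text = lab.read_text(encoding="utf-8").strip()
        # Digits left in a transcript usually mean a number was not spelled out.
        if not text or any(ch.isdigit() for ch in text):
            suspicious.append(wav)
    return missing, suspicious
```

Both lists being empty does not prove that alignment will succeed, but a non-empty list points at files worth inspecting first.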

weili-git commented 1 year ago

I tried to solve this problem by adding the "--clean" flag after the validate command.

weili-git commented 1 year ago

get_emov.py.txt is my script to download and process the dataset. It seems that most of the wav files can be converted. Thank you very much!

noetits commented 1 year ago

Thanks a lot. I added your class in a file here. To use it, I think the sequence of commands would be the following; I will add these to the README so that people can easily extract MFA alignments:

In a Python terminal:

from emov_mfa_alignment import Emov
dataset = Emov()
dataset.download()
dataset.prepare_mfa()

Then in a shell terminal:

mfa align EMOV-DB/ english_us_arpa english_us_arpa EMOV_mfa_textgrids

Then your "convert" function is the one that removes the non-verbal vocalizations occurring before/after the whole sentence:

from emov_mfa_alignment import Emov
dataset = Emov()
dataset.convert()

If we wanted to be a bit more perfectionist, we could at least parametrize the output path and show progress through the different processing steps (with the tqdm library from PyPI). But it is already very nice to have a working script for this :)
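
For what it is worth, both ideas could be combined in a small wrapper like the one below. This is a sketch under assumptions: `process_corpus` and `process_one` are hypothetical names, and the actual `convert` method in the repository may be organised differently. tqdm is treated as optional so that the sketch still runs without it.

```python
from pathlib import Path

try:
    from tqdm import tqdm  # optional dependency: pip install tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        # Fallback so the sketch still runs without tqdm installed.
        return iterable


def process_corpus(input_dir, output_dir, process_one, pattern="*.wav"):
    """Apply process_one(src, dst) to every matching file, mirroring the folder layout."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    files = sorted(input_dir.rglob(pattern))
    for src in tqdm(files, desc="converting", unit="file"):
        dst = output_dir / src.relative_to(input_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        process_one(src, dst)
    return len(files)
```

Passing the per-file step in as a function keeps the output path and the progress display in one place, so the same wrapper could serve the different processing steps.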

numediart / EmoV-DB

Separation of non-verbal vocalizations #5