numediart / EmoV-DB

The Emotional Voices Database: Towards Controlling the Emotional Expressiveness in Voice Generation Systems

Separation of non-verbal vocalizations #5

Closed weili-git closed 1 year ago

weili-git commented 1 year ago

Thank you so much for providing this dataset. I am trying to use it for speech synthesis, but the non-verbal sounds noticeably degrade the results. Is there a preprocessed version of EmoV-DB in which the laughs and yawns have been removed? Or could you explain in detail how to remove them using the gentle toolkit?

noetits commented 1 year ago

Hello,

Although I put some information about gentle back then, today I think I would proceed with MFA (montreal-forced-aligner).

For this, you would have to

If you do that, it would be nice if you could give your full script here (transformation of dataset + alignment), so that I can add it (or do a pull request with this additional file so that I can just accept it).


import pandas as pd
import textgrid

def get_all_phone_with_timings(f='data/librispeech_alignments/dev-clean/8842/304647/8842-304647-0013.TextGrid'):
    """Get all phones of a sentence from the phone tier (tg[1]), drop silence and empty intervals, and return them as a DataFrame.
    """
    tg = textgrid.TextGrid.fromFile(f)
    # keep the phones and drop "sil", "sp", "spn" and empty marks
    phones = [[el.minTime, el.maxTime, el.mark] for el in tg[1] if el.mark not in ['sil', 'sp', '', 'spn']]
    # passing the column names here also works when no phone is left
    phones = pd.DataFrame(phones, columns=["start", "end", "phone"])
    return phones
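
Once the phones are in that shape, finding where the spoken sentence starts and ends is a small step. Below is a minimal sketch of that step, not a tested recipe for EmoV-DB. It assumes the laughs and yawns are absent from the transcript and therefore land in intervals marked as silence, which I have not verified on this dataset. The helper name `speech_span` and the sample rows are made up for illustration.

```python
# Marks treated as non-speech, the same set filtered in get_all_phone_with_timings
NON_SPEECH = {"", "sil", "sp", "spn"}

def speech_span(intervals):
    """Return (start, end) in seconds from the first to the last speech phone.

    `intervals` is an iterable of (start, end, mark) tuples.
    Returns None when no speech phone is present.
    """
    speech = [(start, end) for start, end, mark in intervals if mark not in NON_SPEECH]
    if not speech:
        return None
    return min(s for s, _ in speech), max(e for _, e in speech)

# Hypothetical alignment: a laugh (aligned as silence) precedes the sentence
rows = [
    (0.00, 1.20, ""),
    (1.20, 1.31, "HH"),
    (1.31, 1.45, "AY1"),
    (1.45, 2.10, ""),
]
span = speech_span(rows)
print(span)
```

Everything outside the returned span is then a candidate for removal. Vocalizations in the middle of a sentence are not handled by this approach.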
weili-git commented 1 year ago

Thank you so much for your quick reply. I tried to generate the textgrid files using MFA but got some errors. Here is my script code.

import os
import shutil
import requests
import tarfile

class Emov:
    def __init__(self):
        pass

    def prepare_mfa(self, clean=False):
        def remove_punct(string):
            # raw string so that the backslash is taken literally
            punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
            string = string.lower()
            # replace every punctuation character with a space
            for x in punctuations:
                string = string.replace(x, " ")
            return string
        # create the textfile with the same name of wavfile

        # 1. read transcripts
        with open("EMOV-DB/cmuarctic.data", "r") as rf:
            lines = rf.readlines()

        label_to_transcript = {}

        for line in lines:
            line = line.split('"')
            sent = line[1]
            label = line[0].rstrip().split('_')[-1]
            if label[0] == "b":
                continue
            label = label[1:]
            sent = remove_punct(sent) # remove punct
            sent = sent.replace("1908", "nineteen o eight")
            sent = sent.replace("18", "eighteen")
            sent = sent.replace("16", "sixteen")
            sent = sent.replace("nightglow", "night glow")
            sent = sent.replace("mr ", "mister ")
            sent = sent.replace("mrs ", "missus ")
            sent = " ".join(sent.split())  # collapse repeated whitespace
            label_to_transcript[label] = sent

        # 2. scan wavfiles and create textfiles
        for speaker in range(1, 5):
            speaker_path = os.path.join("EMOV-DB", str(speaker))
            for audio in os.listdir(speaker_path):
                if audio[-4:] == ".wav":
                    textfile = os.path.join(speaker_path, audio[:-4] + ".lab")
                    if clean:
                        # delete a previously generated .lab file, if there is one
                        if os.path.exists(textfile):
                            os.remove(textfile)
                        continue
                    label = audio.split('_')[-1].split('.')[0]
                    transcript = label_to_transcript[label]
                    with open(textfile, 'w') as wf:
                        wf.write(transcript)

    def download(self):
        download_links = [
            "https://www.openslr.org/resources/115/bea_Amused.tar.gz",
            "https://www.openslr.org/resources/115/bea_Angry.tar.gz",
            "https://www.openslr.org/resources/115/bea_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/bea_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/bea_Sleepy.tar.gz",

            "https://www.openslr.org/resources/115/jenie_Amused.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Angry.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Sleepy.tar.gz",

            "https://www.openslr.org/resources/115/josh_Amused.tar.gz",
            "https://www.openslr.org/resources/115/josh_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/josh_Sleepy.tar.gz",

            "https://www.openslr.org/resources/115/sam_Amused.tar.gz",
            "https://www.openslr.org/resources/115/sam_Angry.tar.gz",
            "https://www.openslr.org/resources/115/sam_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/sam_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/sam_Sleepy.tar.gz",

            "http://www.festvox.org/cmu_arctic/cmuarctic.data"
        ]

        target_directories = [

            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",

            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",

            "EMOV-DB/3",
            "EMOV-DB/3",
            "EMOV-DB/3",

            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",

            "EMOV-DB"
        ]

        for directory in target_directories:
            os.makedirs(directory, exist_ok=True)

        for link, target_directory in zip(download_links, target_directories):
            filename = os.path.basename(link)
            file_path = os.path.join(target_directory, filename)

            response = requests.get(link, stream=True)
            if response.status_code == 200:
                with open(file_path, 'wb') as file:
                    for chunk in response.iter_content(1024):
                        file.write(chunk)
                print(f"download succeeded:{filename}")

                if filename[-5:]!=".data":
                    with tarfile.open(file_path, 'r:gz') as tar:
                        tar.extractall(path=target_directory)
                    os.remove(file_path)
            else:
                print(f"download failed:{filename}")

dataset = Emov()
# dataset.download()
dataset.prepare_mfa()

# mfa validate /home/weili/data/EMOV-DB english_us_arpa english_us_arpa

# mfa g2p /home/weili/Documents/MFA/EMOV-DB/oovs_found_english_us_arpa.txt english_us_arpa /home/weili/data/EMOV/g2pped_oovs.txt --dictionary_path english_us_arpa

# mfa model add_words english_us_arpa /home/weili/data/EMOV/g2pped_oovs.txt

# mfa align /home/weili/data/EMOV-DB english_us_arpa english_us_arpa /home/weili/data/EMOV

I followed the guidance to add the OOVs to the dictionary, but when I ran the command "mfa align xx", it failed with the following IndexError:

...
Collecting phone and word alignments from alignment lattices...  
...
Job 3 encountered an error:
Traceback (most recent call last):

  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/abc.py", line 92, in run
    yield from self._run()

  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2389, in _run
    ) = self.cleanup_intervals(utterance, intervals)

  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2018, in cleanup_intervals
    cur_word = word_pronunciations[words_index]

IndexError: list index out of range

Here is a similar error.

I have no idea how to solve this since I am not very familiar with MFA. I know this problem is related to MFA rather than to the speech dataset itself. I would appreciate it if you could tell me where the problem is. Thank you so much!

weili-git commented 1 year ago

I tried to solve this problem by adding the "--clean" flag after the validate command.
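
For anyone who wants to script this, here is a rough sketch that builds the validate and align commands from this thread as argument lists, with "--clean" appended to the validate call as described above. The function name `mfa_commands` and the output directory are placeholders, and my understanding that the flag clears files left over from earlier runs is an assumption rather than something confirmed here.

```python
import shlex

def mfa_commands(corpus_dir, output_dir, model="english_us_arpa", clean=True):
    """Build the validate and align commands used in this thread as argument lists.

    The same name is passed for the dictionary and the acoustic model,
    as in the commands above.
    """
    validate = ["mfa", "validate", corpus_dir, model, model]
    if clean:
        validate.append("--clean")
    align = ["mfa", "align", corpus_dir, model, model, output_dir]
    return [validate, align]

# Print the commands instead of running them (placeholder paths)
for cmd in mfa_commands("EMOV-DB", "EMOV_mfa_textgrids"):
    print(shlex.join(cmd))
```

To actually run them, each list can be passed to `subprocess.run(cmd, check=True)` inside the environment where MFA is installed.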

weili-git commented 1 year ago

get_emov.py.txt: this is my script to download and process the dataset. It seems that most of the WAV files can be converted. Thank you very much!

noetits commented 1 year ago

Thanks a lot. I added your class in a file here. To use it, I think the sequence of commands would be the following; I will add these to the README so that people can easily extract MFA alignments:

In a python terminal:

from emov_mfa_alignment import Emov
dataset = Emov()
dataset.download()
dataset.prepare_mfa()

Then in a shell terminal:

mfa align EMOV-DB/ english_us_arpa english_us_arpa EMOV_mfa_textgrids

Then your "convert" function is the one that removes the non-verbal vocalizations occurring before or after the whole sentence:

from emov_mfa_alignment import Emov
dataset = Emov()
dataset.convert()
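
The body of `convert` is not shown in this thread, so the following is only an illustration of the trimming idea and not the actual implementation. It assumes the recordings are uncompressed PCM WAV files readable by Python's standard `wave` module and that the start and end times come from the aligned TextGrid. The name `trim_wav` is hypothetical.

```python
import wave

def trim_wav(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) to dst_path.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # clamp the requested span to the file boundaries
        first = max(0, int(round(start_s * rate)))
        last = min(src.getnframes(), int(round(end_s * rate)))
        n_frames = max(0, last - first)
        src.setpos(min(first, src.getnframes()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        # same channels, sample width and rate; the frame count is fixed up on close
        dst.setparams(params)
        dst.writeframes(frames)
    return n_frames

# Example call with placeholder paths and times:
# trim_wav("in.wav", "out.wav", 1.20, 1.45)
```

Writing to a separate destination path keeps the original recordings untouched, which makes it easy to listen to both versions and check the cut points.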

If we wanted to be a bit more perfectionist, we could at least parametrize the output path and show progress at the different processing steps (with the tqdm library from PyPI). But it is already very nice to have a working script for this :)