Hello,
Although I posted some information about Gentle back then, today I would proceed with MFA (Montreal Forced Aligner) instead.
For this, you would have to
If you do that, it would be nice if you could post your full script here (dataset transformation + alignment) so that I can add it, or open a pull request with this additional file so that I can simply accept it.
import pandas as pd
import textgrid


def get_all_phone_with_timings(f='data/librispeech_alignments/dev-clean/8842/304647/8842-304647-0013.TextGrid'):
    """Get all phonemes of a sentence (located in tg[1]), filter out silence
    and empty intervals, then convert to a DataFrame.
    """
    tg = textgrid.TextGrid.fromFile(f)
    # get phones and drop "sil", "sp", "spn" and empty strings
    phones = [[el.minTime, el.maxTime, el.mark] for el in tg[1] if el.mark not in ['sil', 'sp', '', 'spn']]
    # passing columns here also works when no phones are left
    phones = pd.DataFrame(phones, columns=["start", "end", "phone"])
    return phones
Thank you so much for your quick reply. I tried to generate the TextGrid files using MFA but got some errors. Here is my script.
import os
import shutil
import requests
import tarfile


class Emov:
    def __init__(self):
        pass

    def prepare_mfa(self, clean=False):
        def remove_punct(string):
            # raw string so that the backslash is kept literally
            punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
            for x in string.lower():
                if x in punctuations:
                    string = string.replace(x, " ")
            return string.lower()

        # create a text file with the same name as each wav file
        # 1. read transcripts
        with open("EMOV-DB/cmuarctic.data", "r") as rf:
            lines = rf.readlines()
        label_to_transcript = {}
        for line in lines:
            line = line.split('"')
            sent = line[1]
            label = line[0].rstrip().split('_')[-1]
            if label[0] == "b":
                continue
            label = label[1:]
            sent = remove_punct(sent)  # remove punctuation
            # spell out numbers and abbreviations so they are in the dictionary
            sent = sent.replace("1908", "nineteen o eight")
            sent = sent.replace("18", "eighteen")
            sent = sent.replace("16", "sixteen")
            sent = sent.replace("nightglow", "night glow")
            sent = sent.replace("mr ", "mister ")
            sent = sent.replace("mrs ", "missus ")  # was "misters", which is a different word
            sent = " ".join(sent.split())  # collapse repeated spaces
            label_to_transcript[label] = sent
        # 2. scan wav files and create text files
        for speaker in range(1, 5):
            speaker_path = os.path.join("EMOV-DB", str(speaker))
            for audio in os.listdir(speaker_path):
                if audio[-4:] == ".wav":
                    textfile = audio[:-4] + ".lab"
                    label = audio.split('_')[-1].split('.')[0]
                    transcript = label_to_transcript[label]
                    if clean:
                        os.remove(os.path.join(speaker_path, textfile))
                    else:
                        with open(os.path.join(speaker_path, textfile), 'w') as wf:
                            wf.write(transcript)
    def download(self):
        download_links = [
            "https://www.openslr.org/resources/115/bea_Amused.tar.gz",
            "https://www.openslr.org/resources/115/bea_Angry.tar.gz",
            "https://www.openslr.org/resources/115/bea_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/bea_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/bea_Sleepy.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Amused.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Angry.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/jenie_Sleepy.tar.gz",
            "https://www.openslr.org/resources/115/josh_Amused.tar.gz",
            "https://www.openslr.org/resources/115/josh_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/josh_Sleepy.tar.gz",
            "https://www.openslr.org/resources/115/sam_Amused.tar.gz",
            "https://www.openslr.org/resources/115/sam_Angry.tar.gz",
            "https://www.openslr.org/resources/115/sam_Disgusted.tar.gz",
            "https://www.openslr.org/resources/115/sam_Neutral.tar.gz",
            "https://www.openslr.org/resources/115/sam_Sleepy.tar.gz",
            "http://www.festvox.org/cmu_arctic/cmuarctic.data"
        ]
        # one target directory per link: speakers 1-4, then the transcript file
        target_directories = [
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/1",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/2",
            "EMOV-DB/3",
            "EMOV-DB/3",
            "EMOV-DB/3",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB/4",
            "EMOV-DB"
        ]
        for directory in target_directories:
            os.makedirs(directory, exist_ok=True)
        for link, target_directory in zip(download_links, target_directories):
            filename = os.path.basename(link)
            file_path = os.path.join(target_directory, filename)
            response = requests.get(link, stream=True)
            if response.status_code == 200:
                with open(file_path, 'wb') as file:
                    for chunk in response.iter_content(1024):
                        file.write(chunk)
                print(f"download succeeded: {filename}")
                # archives are extracted and deleted; the transcript file is kept
                if not filename.endswith(".data"):
                    with tarfile.open(file_path, 'r:gz') as tar:
                        tar.extractall(path=target_directory)
                    os.remove(file_path)
            else:
                print(f"download failed: {filename}")


dataset = Emov()
# dataset.download()
dataset.prepare_mfa()
# mfa validate /home/weili/data/EMOV-DB english_us_arpa english_us_arpa
# mfa g2p /home/weili/Documents/MFA/EMOV-DB/oovs_found_english_us_arpa.txt english_us_arpa /home/weili/data/EMOV/g2pped_oovs.txt --dictionary_path english_us_arpa
# mfa model add_words english_us_arpa /home/weili/data/EMOV/g2pped_oovs.txt
# mfa align /home/weili/data/EMOV-DB english_us_arpa english_us_arpa /home/weili/data/EMOV
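The four commented commands above can also be driven from Python. The following is a minimal sketch, assuming `mfa` is on the PATH; the helper names `mfa_commands` and `run_all` and all paths are hypothetical, and the subcommands and flags are only those quoted in the script above. The OOV file path is a placeholder: use the location that `mfa validate` reports.

```python
import subprocess


def mfa_commands(corpus_dir, oov_file, g2p_out, out_dir, model="english_us_arpa"):
    """Return the MFA command lines quoted above, in order (hypothetical helper)."""
    return [
        ["mfa", "validate", corpus_dir, model, model],
        ["mfa", "g2p", oov_file, model, g2p_out, "--dictionary_path", model],
        ["mfa", "model", "add_words", model, g2p_out],
        ["mfa", "align", corpus_dir, model, model, out_dir],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmds = mfa_commands(
        corpus_dir="EMOV-DB",                       # assumed corpus location
        oov_file="oovs_found_english_us_arpa.txt",  # placeholder path
        g2p_out="g2pped_oovs.txt",
        out_dir="EMOV_mfa_textgrids",
    )
    run_all(cmds, dry_run=True)
```

With `dry_run=True` the commands are only printed, which makes it easy to check the paths before anything is executed.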
I followed the guidance to add the OOVs to the dictionary. But when I ran the "mfa align xx" command, it raised an IndexError like this:
...
Collecting phone and word alignments from alignment lattices...
...
Job 3 encountered an error:
Traceback (most recent call last):
  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/abc.py", line 92, in run
    yield from self._run()
  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2389, in _run
    ) = self.cleanup_intervals(utterance, intervals)
  File "/home/weili/miniconda3/envs/mfa/lib/python3.8/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2018, in cleanup_intervals
    cur_word = word_pronunciations[words_index]
IndexError: list index out of range
Here is a similar error.
I have no idea how to solve this, since I am not very familiar with MFA. I know this problem is related to MFA rather than to the speech dataset itself. I would appreciate it if you could tell me where the problem is. Thank you so much!
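The traceback does not say which utterance failed. Before re-running the aligner, it may help to rule out trivial corpus problems. The sketch below is only a diagnostic under assumptions (the helper name `check_corpus` is hypothetical, and it is not a known fix for this `IndexError`): it lists `.wav` files with no matching `.lab` file and `.lab` files whose transcript is empty.

```python
import os


def check_corpus(corpus_dir):
    """Return (wavs_without_lab, empty_labs) found under corpus_dir."""
    missing, empty = [], []
    for root, _dirs, files in os.walk(corpus_dir):
        names = set(files)
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            path = os.path.join(root, name)
            if ext == ".wav" and stem + ".lab" not in names:
                missing.append(path)
            elif ext == ".lab":
                with open(path, encoding="utf-8") as f:
                    # a transcript made only of whitespace counts as empty
                    if not f.read().strip():
                        empty.append(path)
    return missing, empty


if __name__ == "__main__":
    missing, empty = check_corpus("EMOV-DB")  # assumed corpus location
    print(f"{len(missing)} wav files without a .lab, {len(empty)} empty .lab files")
```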
I tried to solve this problem by adding the "--clean" flag after the validate command.
Attached is my script to download and process the dataset (get_emov.py.txt). It seems that most of the WAV files can be converted. Thank you very much!
Thanks a lot. I added your class in a file here. To use your class, I think the sequence of commands would be the following. I will add these to the README so that people can easily extract MFA alignments:
In a Python terminal:
from emov_mfa_alignment import Emov
dataset = Emov()
dataset.download()
dataset.prepare_mfa()
Then in a shell terminal:
mfa align EMOV-DB/ english_us_arpa english_us_arpa EMOV_mfa_textgrids
Then your "convert" function removes the non-verbal vocalizations that occur before or after the whole sentence:
from emov_mfa_alignment import Emov
dataset = Emov()
dataset.convert()
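One way to read this step: take the word tier of each MFA TextGrid, find the first and last spoken word, and keep only the audio between them. The sketch below illustrates that idea on plain Python data. It is not the `convert` implementation from the attached script, and the names `speech_span` and `trim_samples` as well as the set of silence labels are assumptions. Intervals are `(start, end, mark)` tuples such as those read from a TextGrid tier.

```python
SILENCE_MARKS = {"", "sil", "sp", "spn"}  # assumed labels for non-speech intervals


def speech_span(intervals):
    """Return (start, end) in seconds from the first to the last spoken word,
    or None if every interval is silence."""
    spoken = [(s, e) for s, e, mark in intervals if mark not in SILENCE_MARKS]
    if not spoken:
        return None
    return spoken[0][0], spoken[-1][1]


def trim_samples(samples, sample_rate, span):
    """Slice a sequence of audio samples to the given (start, end) span."""
    start, end = span
    return samples[int(round(start * sample_rate)):int(round(end * sample_rate))]


if __name__ == "__main__":
    words = [(0.0, 0.8, ""), (0.8, 1.1, "author"), (1.1, 1.3, "of"), (1.3, 2.0, "")]
    span = speech_span(words)
    audio = list(range(20))  # stand-in for 2 s of audio at 10 Hz
    print(span, trim_samples(audio, 10, span))
```

This only trims what lies outside the first and last word. Laughter or yawning that the aligner absorbed into a word interval, or that occurs in the middle of a sentence, would not be removed by this approach.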
If we wanted to be a bit more of a perfectionist, we could at least parametrize the output path and show progress during the different processing steps (with the tqdm library from PyPI). But it is already very nice to have a working script for this :)
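As a rough sketch of that suggestion, the loop that writes the `.lab` files could take an output directory and wrap its iteration in `tqdm`. The function name `write_lab_files` and its parameters are hypothetical; `tqdm` is used when installed, with a plain loop as fallback.

```python
import os

try:
    from tqdm import tqdm  # optional progress bar (pip install tqdm)
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback: iterate without a progress bar."""
        return iterable


def write_lab_files(wav_dir, label_to_transcript, output_dir=None):
    """Write one .lab transcript per .wav in wav_dir; return the number written.

    output_dir defaults to wav_dir, so each .lab sits next to its .wav.
    """
    output_dir = output_dir or wav_dir
    os.makedirs(output_dir, exist_ok=True)
    wavs = sorted(f for f in os.listdir(wav_dir) if f.endswith(".wav"))
    written = 0
    for audio in tqdm(wavs, desc=f"writing .lab files for {wav_dir}"):
        stem = os.path.splitext(audio)[0]
        label = stem.split("_")[-1]  # same label convention as prepare_mfa
        transcript = label_to_transcript.get(label)
        if transcript is None:
            continue  # no transcript for this label: skip instead of raising
        with open(os.path.join(output_dir, stem + ".lab"), "w") as wf:
            wf.write(transcript)
        written += 1
    return written
```

Returning the number of files written gives a quick check that every speaker directory was processed as expected.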
Thank you so much for providing this dataset. I am trying to use it for speech synthesis, but the non-verbal sounds strongly affect the results. Is there a preprocessed version of EmoV-DB in which the laughter and yawns have been removed? Or could you show me in detail how to remove them using the Gentle toolkit?