Speech Accent Detection

Humans speak a language with an accent, and a particular accent reflects a person's linguistic background. This model detects a speaker's accent from an audio recording. The results could be used to identify accents, help English-learning students reduce their accents, and track improvement through training.

Outline

About

English is a global language and is becoming a must-have skill for most people. As English becomes part of different nations, it changes based on location and on the mother tongues of the people in each area, so we find American English, British English, Indian English, and other varieties. One of the most significant differences between these varieties is the accent.

Accent detection would allow us to assess a student's accent level and arrange further training with a native speaker, or help a school that wants to hire a teacher evaluate that teacher's accent.

Objectives

Dependencies:

The model runs on Ubuntu 18.04. The required Python libraries are listed in requirements.txt. In addition, it needs:

Dataset

  1. The George Mason University Speech Accent Archive dataset contains around 3,500 audio files from speakers from over 100 countries.

    All speakers in the dataset read from the same passage:

    "Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."

  2. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92).

    This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out 400 sentences, which were selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive. The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of newspaper texts selected based on a greedy algorithm that increases the contextual and phonetic coverage.

  3. Mozilla Voice data. The Mozilla Voice data contains tens of thousands of files of native and non-native speakers speaking different sentences.

The dataset contained .mp3 audio files, which were converted to .wav audio files.
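
The conversion script itself is not shown in this README; a minimal sketch of such a conversion using pydub (an assumed library choice, requiring ffmpeg, with hypothetical directory names) could look like this:

```python
# Minimal sketch: batch-convert .mp3 files to .wav with pydub (assumed library choice;
# requires ffmpeg on the system). The project's actual conversion script may differ.
from pathlib import Path
from pydub import AudioSegment

def convert_mp3_dir(src_dir: str, dst_dir: str) -> None:
    """Convert every .mp3 under src_dir to a 16 kHz mono .wav in dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for mp3_path in Path(src_dir).glob("*.mp3"):
        audio = AudioSegment.from_mp3(str(mp3_path))
        audio = audio.set_frame_rate(16000).set_channels(1)  # assumed preprocessing choices
        audio.export(str(out / (mp3_path.stem + ".wav")), format="wav")

convert_mp3_dir("data/mp3", "data/wav")  # hypothetical directory names
```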

Mel Frequency Cepstrum Coefficients (MFCC)

The audio files are vectorized by creating MFCCs. MFCCs are meant to mimic the biological process of humans creating sound to produce phonemes and the way humans perceive these sounds.

Phonemes are base units of sounds that combine to make up words for a language. Non-native English speakers will use different phonemes than native speakers. The phonemes non-native speakers use will also be unique to their native language(s). By identifying the difference in phonemes, we will be able to differentiate between accents.
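
In practice, MFCC features are usually extracted with a library call. A minimal sketch using librosa is shown below; the 16 kHz sample rate, 13 coefficients, and file name are assumptions, not necessarily the settings used in this repository.

```python
# Sketch: extract MFCC features from a .wav file with librosa.
# The 16 kHz sample rate and 13 coefficients are assumptions.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    signal, sr = librosa.load(wav_path, sr=16000)                # load and resample
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)

features = extract_mfcc("data/wav/speaker_0001.wav")  # hypothetical file name
print(features.shape)
```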

Overview

  1. Bin the raw audio signal
    The continuous signal is turned into a matrix by binning the audio signal into frames. On short time scales, we assume that audio signals do not change very much. Longer frames vary too much, and shorter frames do not provide enough signal. The standard is to bin the raw audio signal into 20-40 ms frames.

    The following steps are applied to every one of the frames, and a set of coefficients is determined for each frame (a code sketch of the full pipeline follows this list):

  2. Calculate the periodogram power estimates
    This process models how the cochlea interprets sounds by vibrating at different locations based on the incoming frequencies. The periodogram is an analog for this process, as it measures spectral density at different frequencies. First, we take the Discrete Fourier Transform of every frame. The periodogram power estimate is then P_i(k) = |S_i(k)|^2 / N, where S_i(k) is the N-point DFT of frame i.

  3. Apply Mel filterbank and sum energies in each filter
    The cochlea can't differentiate between frequencies that are very close to each other. This problem is amplified at higher frequencies, meaning that greater ranges of frequencies will be increasingly interpreted as the same pitch. So, we sum up the signal at various increasing ranges of frequencies to get a measure of the energy density in each range.

    This filterbank is a set of 26 triangular filters. These filters are vectors that are mostly zero, except for a small range of the spectrum. First, we convert frequencies to the Mel Scale (which turns the actual tone of a frequency to its perceived frequency). Then we multiply each filter with the power spectrum and add the resulting coefficients to obtain the filter bank energies. In the end, we will have a single coefficient for each filter.

  4. Take the log of all filter energies
    We take the log of the previously calculated filterbank energies because humans can differentiate between low-frequency sounds better than they can between high-frequency sounds. The shape of the matrix hasn't changed, so we still have 26 coefficients.

  5. Take Discrete Cosine Transform (DCT) of the log filterbank energies
    Because the standard is to create overlapping filterbanks, these energies are correlated, and we use the DCT to decorrelate them. The higher DCT coefficients are then dropped, which has been shown to improve model performance, leaving us with 13 cepstral coefficients.
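
As referenced above, the five steps can be sketched end to end in plain NumPy/SciPy. This is an illustrative implementation of the standard recipe, not the repository's actual feature-extraction code; the 25 ms frames, 10 ms step, 512-point FFT, 26 filters, and 13 kept coefficients are conventional defaults assumed here.

```python
# Illustrative MFCC pipeline following steps 1-5 above (standard recipe, not this repo's code).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_ms=25, step_ms=10,
         nfft=512, n_filters=26, n_ceps=13):
    """signal: 1-D float array, assumed to be at least one frame long."""
    # 1. Bin the raw signal into 25 ms frames with a 10 ms step, and window each frame.
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // step
    frames = np.stack([signal[i * step: i * step + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # 2. Periodogram power estimate: P_i(k) = |S_i(k)|^2 / N, with S_i the N-point DFT.
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # 3. Mel filterbank: 26 triangular filters, summing the energy under each filter.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    energies = power @ fbank.T

    # 4. Take the log of the filterbank energies.
    log_energies = np.log(energies + 1e-10)

    # 5. DCT to decorrelate; keep the first 13 cepstral coefficients.
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_ceps]
```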

Resources for learning about MFCCs:
1) Pratheeksha Nair's Medium Article
2) Haytham Fayek's Personal Website
3) Practical Cryptography

Models

From the Keras FAQ: "A Keras model has two modes: training and testing. Regularization mechanisms, such as Dropout and L1/L2 weight regularization, are turned off at testing time.

Besides, the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss."
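
This behaviour can be seen directly in Keras: Dropout is applied only when a model is called in training mode. A minimal sketch with hypothetical layer sizes:

```python
# Sketch: Dropout is active in training mode and disabled in inference mode.
# Layer sizes here are hypothetical.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(13,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])

x = np.random.rand(1, 13).astype("float32")
print(model(x, training=True))   # dropout active: output varies between calls
print(model(x, training=False))  # dropout off: deterministic, as used for evaluation
```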

FFNN (Feed-Forward Neural Network)

The FFNN model architecture:
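
As an illustration only (not the exact architecture referenced above), a feed-forward classifier over flattened MFCC features could be built in Keras as follows; the layer sizes and the two-class (native vs. non-native) output are assumptions.

```python
# Hypothetical FFNN sketch over flattened MFCC features (layer sizes are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

def build_ffnn(input_dim, n_classes=2):
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```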

Classification Results

CNN (Convolution Neural Network)

The CNN model architecture:
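
Again as an illustration only, a small 2-D CNN that treats the MFCC matrix as a one-channel image might look like the sketch below; filter counts, input shape, and the number of classes are assumptions.

```python
# Hypothetical CNN sketch treating the MFCC matrix as a one-channel image (parameters are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_mfcc=13, n_frames=300, n_classes=2):
    model = keras.Sequential([
        layers.Input(shape=(n_mfcc, n_frames, 1)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```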

Classification Results

LSTM (Long Short-Term Memory)
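
As an illustration only, an LSTM classifier that reads the MFCC frames as a time sequence could look like this in Keras; the unit counts, input shape, and two-class output are assumptions.

```python
# Hypothetical LSTM sketch reading MFCC frames as a time sequence (parameters are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_frames=300, n_mfcc=13, n_classes=2):
    model = keras.Sequential([
        layers.Input(shape=(n_frames, n_mfcc)),   # (time steps, features per step)
        layers.LSTM(128),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```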

Classification Results

Incorrect Classifications

Going back and listening to the files where my model failed led to two conclusions:

While it would be best if the model could also correctly classify these blended accents, the mixed accents may not pose a severe problem. Speech recognition systems may not have a problem picking up what these speakers are saying. For example, a speech recognition system trained mainly on US data may be able to pick up reasonably well on a speaker from the UK who has spent a fair amount of time in the US. As long as the model is classifying speakers with more traditional UK accents, we can build another speech recognition model for these speakers.

Future Work

Try to classify more accents, instead of only native and non-native. Ultimately, we would like to train the CNN to classify most accents, given more datasets.

Unfortunately, the current dataset is not large enough for multi-accent detection.