snakers4 / open_stt

Open STT

Some benchmarks on the datasets #5

Closed: snakers4 closed this issue 4 years ago

snakers4 commented 5 years ago

Below I will post some of the results on the public part of the dataset, both train and validation.

Hope this will inspire the community to share their results and models.

snakers4 commented 5 years ago

@akreal As you requested, here are some results from a more or less well-trained model (csv + feather formats): share_results_1.zip

You can open the feather file like this:

import pandas as pd
df = pd.read_feather('../data/share_results_1.feather')

Overall, this model is not overfitted and there is no post-processing yet.
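As a hedged illustration (not from the original comment) of how the shared results file could be inspected, here is a minimal sketch that aggregates CER per dataset; the dataset and cer column names are assumptions about the file's schema, not a documented format.

import pandas as pd

# Load the shared results (feather format).
df = pd.read_feather('../data/share_results_1.feather')

# Hypothetical column names: a 'dataset' label and a per-utterance 'cer' value.
per_dataset = (
    df.groupby('dataset')['cer']
      .agg(['mean', 'median', 'count'])
      .sort_values('mean')
)
print(per_dataset)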

akreal commented 5 years ago

Perfect, thank you!

snakers4 commented 5 years ago

As you can see, the model is not fully fitted yet (we are still in the exploratory phase), but it already works perfectly on some of the easier datasets.

[image]

Obviously I exclude the following datasets from the file

snakers4 commented 5 years ago

Now if we exclude "bad" files from here, we will get more interesting results. I cannot say that all of these files have poor annotation, but the majority do.

share_results_v02.zip

[image]

snakers4 commented 5 years ago

Almost finished collecting v05 and searching for hyper-params; will be posting new benchmarks and new data soon.

m1ckyro5a commented 5 years ago

@snakers4 What model did you use for the benchmark?

snakers4 commented 5 years ago

@m1ckyro5a A wav2letter-inspired fork of a fork of deepspeech.pytorch.

m1ckyro5a commented 5 years ago

@snakers4 How about deepspeech2? Which model is better?

snakers4 commented 5 years ago

It is hard to tell yet. For us, performance is currently limited more by the data than by the model. Of course we compared some models side by side (CNN, RNN), only to find that RNNs were a bit better with the same number of weight updates, but slower in general.

Some benchmarks we ran on LibriSpeech: network_bench.xlsx

snakers4 commented 5 years ago

From now on I will structure the benchmark files a bit.

Please note that the exclusion files in #7 were previously also based on these benchmarks.

All charts show CER.
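For reference, CER here is the character error rate: the character-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch of that standard definition (my own illustration, not code from this repository):

def cer(reference: str, hypothesis: str) -> float:
    # Character-level Levenshtein distance via dynamic programming.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer('добрый день', 'добрый ден'))  # one deletion over 11 characters ~ 0.09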

Dataset benchmark v05

File: [attachment]
File: [attachment]
Model: CNN trained with CTC loss, tuning with phonemes

Youtube

TED talks are much cleaner. [image: youtube]

Audio books

Notice the second normal bump. [image: books]

TTS

[image: tts]

Academic datasets

[image: academic]

ASR datasets

Pranks are very noisy by default. [image: asr]

Radio

Quite a good fit as well. [image: radio]

Strict exclude file for distillation

An idea on how to set thresholds:

CLEAN_THRESHOLDS = {
    # very strict conditions, datasets are clean, no problem
    'tts_russian_addresses_rhvoice_4voices':0.2,
    'private_buriy_audiobooks_2':0.1,

    # strict conditions, datasets vary
    'public_youtube700':0.2,
    'public_youtube1120':0.2,
    'public_youtube1120_hq':0.2,
    'public_lecture_1':0.2,
    'public_series_1':0.2,

    # strict conditions, dataset mostly clean
    'radio_2':0.2,

    # very strict conditions, datasets are dirty
    'asr_public_phone_calls_1':0.2,
    'asr_public_phone_calls_2':0.2,
    'asr_public_stories_1':0.2,
    'asr_public_stories_2':0.2,

    # mostly just to filter outliers
    'ru_tts':0.4,
    'ru_ru':0.4,
    'voxforge_ru':0.4,
    'russian_single':0.4
}
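As a hedged illustration of how such thresholds could be applied to a benchmark file (my own sketch; the dataset and cer column names and the file path are assumptions, not the actual schema of the shared files):

import pandas as pd

df = pd.read_csv('benchmark_v05_public.csv')  # hypothetical path

# Keep utterances whose CER is at or below their dataset's threshold;
# datasets not listed in CLEAN_THRESHOLDS are dropped entirely.
keep = df.apply(
    lambda row: row['cer'] <= CLEAN_THRESHOLDS.get(row['dataset'], -1.0),
    axis=1,
)
df[keep].to_csv('benchmark_v05_public_clean.csv', index=False)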
snakers4 commented 5 years ago

Also a comment: the model was not over-fitted; it was selected based on optimal generalization.

vadimkantorov commented 5 years ago

https://ru-open-stt.ams3.digitaloceanspaces.com/benchmark_v05_public.csv.zip is in fact a gzip-compressed file (not a zip-compressed one), so one should decompress it with zcat benchmark_v05_public.csv.zip > benchmark_v05_public.csv

unzipping fails with:

 $ unzip benchmark_v05_public.csv.zip
Archive:  benchmark_v05_public.csv.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of benchmark_v05_public.csv.zip or
        benchmark_v05_public.csv.zip.zip, and cannot find benchmark_v05_public.csv.zip.ZIP, period.

After gzip decompression, the first line contains some weird stuff:

$ head -n 1 benchmark_v05_public.csv
data/dataset_cleaning/benchmark_v05_public.csv0000644000175000001441656463430613513563560021050 0ustar  kerasusers
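A Python equivalent of the zcat workaround above, assuming the file really is plain gzip data despite the .zip extension:

import gzip
import shutil

# gzip.open handles the file despite its misleading .zip extension.
with gzip.open('benchmark_v05_public.csv.zip', 'rb') as f_in, \
     open('benchmark_v05_public.csv', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)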
johnnych7027 commented 4 years ago

Hi! Which datasets have speaker labels? Is there any information on which release the speaker labels will appear in? Thanks a lot!

snakers4 commented 4 years ago

We decided not to update and/or maintain these, for a number of reasons.