mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Transcription has a lot of spelling errors and wrong word segments (although sometimes phonetically correct) #1817

Closed raghavk92 closed 5 years ago

raghavk92 commented 5 years ago

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Hi, I was trying to transcribe two different audio samples. One has a bit of background music: I extracted the audio from an Apple ad in which Jonathan Ive speaks with a really clear voice over background music. I converted it to 16000 samples per second, as required by DeepSpeech, and found a lot of spelling errors in the result.

Mistakes like "evolution" being spelt "evil lution", and it's an Apple Watch ad. So how do I correct this? I tried the latest lm and trie models, but the transcription is still bad.

I'll list what I used, but please tell me what I should use.

I used the latest alpha release of DeepSpeech, 0.4.0-alpha.3, as the stable release was giving really bad results. I used the output_graph from reuben's release because the 0.3.0 models produced just gibberish, with nothing of value in the transcription; this fix was provided in the GitHub issue https://github.com/mozilla/DeepSpeech/issues/1156

Output graph from reuben's release: reuben/DeepSpeech on GitHub (a TensorFlow implementation of Baidu's DeepSpeech architecture)

The lm and trie I used are from https://github.com/mozilla/DeepSpeech/tree/master/data/lm

The alphabet.txt I used is from the 0.3.0 models release linked in the GitHub README. It may be from this link, but I am not sure right now: https://github.com/mozilla/DeepSpeech/tree/master/data

So the transcription that I get for the Apple ad: https://www.youtube.com/watch?v=6EiI5_-7liQ

transcription is : e e e in i an an an enemple agh seres for is more than an evil lution erepresents a fundamental redesin anryengineering of apple watchretaining the riginal i comicg design veloped ury find the for olsimanaging to make it fine be new display is now oven birty percen larger and is seemlessly integrated into the product the interface as been read deigned fron you tiplay providing more information with rich a detail the heard wore hand the software combine to define a very new and truly intergrated singular design novigating with the digital crown olready one of the most intricat makhalisms wit ever created has been intirely igreengineeredwith hapti feeback dilivering a presise ecannical field as idrol in addition to an obtea hasanco the is a new applepizine ilectrical hars and se to the lousutitake in electra cardia graham or easy ge to share with your doctor a momnentesichievement for a were of a divice placing a finger on the tigital crownd i eeplose cerkid with a lectrods on the bank providing dater the easy g busesanaliz your harid whole understanding hea health is a sential to ou well bei aditional features in in harmsmans in courag es ti live and overall healther or tantive life the excela romiter girescove an alfliter allow you to recall youtypes of workelse measure runs withincreased presision and tra your all day activity with great accuracy in hart selilar connectiv ity in tabu something prulyliberating the obility distaklinected with just your wach fon case music streaming and even a mergency essistence ol immediately evolable from your restch eries for is a device so powerful so postnal so liperating i con change the way ou liveach day

And the link for the other file is: https://www.youtube.com/watch?v=GnGI76__sSA

And the transcription with the VAD transcriber is:

DEBUG:root:Processing chunk 00
DEBUG:root:Running inference…
DEBUG:root:Inference took 2.720s for 5.880s audio file.
DEBUG:root:Transcript: stevies to um saye o me and heused to saye is a lut
DEBUG:root:Processing chunk 01
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.292s for 1.470s audio file.
DEBUG:root:Transcript: jonny
DEBUG:root:Processing chunk 02
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.337s for 1.620s audio file.
DEBUG:root:Transcript: is it that the idea
DEBUG:root:Processing chunk 03
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.282s for 1.530s audio file.
DEBUG:root:Transcript:
DEBUG:root:Processing chunk 04
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.772s for 3.750s audio file.
DEBUG:root:Transcript: and sometimes they wore
DEBUG:root:Processing chunk 05
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.639s for 3.180s audio file.
DEBUG:root:Transcript: really do pe
DEBUG:root:Processing chunk 06
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.918s for 4.410s audio file.
DEBUG:root:Transcript: sometimes they would tru to dreadful
DEBUG:root:Processing chunk 07
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.632s for 3.090s audio file.
DEBUG:root:Transcript: sometimes they of the air from the room
DEBUG:root:Processing chunk 08
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.638s for 3.000s audio file.
DEBUG:root:Transcript: an me liftis poth completely silent
DEBUG:root:Processing chunk 09
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.845s for 4.200s audio file.
DEBUG:root:Transcript: od crazy magninificen ideas
DEBUG:root:Processing chunk 10
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.403s for 2.010s audio file.
DEBUG:root:Transcript: whire simple ones
DEBUG:root:Processing chunk 11
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.371s for 1.890s audio file.
DEBUG:root:Transcript: hin this sufflety
DEBUG:root:Processing chunk 12
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.288s for 1.470s audio file.
DEBUG:root:Transcript: tee tal
DEBUG:root:Processing chunk 13
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.352s for 1.740s audio file.
DEBUG:root:Transcript: eatto e profound
DEBUG:root:Processing chunk 14
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.366s for 1.860s audio file.
DEBUG:root:Transcript: just i speve
DEBUG:root:Processing chunk 15
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.382s for 1.950s audio file.
DEBUG:root:Transcript: loved ydeas
DEBUG:root:Processing chunk 16
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.434s for 2.160s audio file.
DEBUG:root:Transcript: an loved maan stuff
DEBUG:root:Processing chunk 17
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.513s for 2.550s audio file.
DEBUG:root:Transcript: he treated the process
DEBUG:root:Processing chunk 18
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.094s for 5.370s audio file.
DEBUG:root:Transcript: treativeity with the rare and a wonderful reverence
DEBUG:root:Processing chunk 19
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.871s for 4.260s audio file.
DEBUG:root:Transcript: is the i think he better than any one understood
DEBUG:root:Processing chunk 20
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.017s for 5.010s audio file.
DEBUG:root:Transcript: wile ideas oltemately can be so powerful
DEBUG:root:Processing chunk 21
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.598s for 2.970s audio file.
DEBUG:root:Transcript: egin as fratile
DEBUG:root:Processing chunk 22
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.383s for 1.920s audio file.
DEBUG:root:Transcript: e fomd thoughts
DEBUG:root:Processing chunk 23
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.123s for 5.490s audio file.
DEBUG:root:Transcript: so esily mistd so easily compromise so isily josquift
DEBUG:root:Processing chunk 24
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.909s for 4.230s audio file.
DEBUG:root:Transcript: on love the way that he listened so intendly
DEBUG:root:Processing chunk 25
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.432s for 2.190s audio file.
DEBUG:root:Transcript: loved his perseption
DEBUG:root:Processing chunk 26
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.582s for 2.910s audio file.
DEBUG:root:Transcript: is remarkable sensitive ity
DEBUG:root:Processing chunk 27
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.544s for 2.700s audio file.
DEBUG:root:Transcript: nd his surgecly preciseieinion
DEBUG:root:Processing chunk 28
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.350s for 1.920s audio file.
DEBUG:root:Transcript:
DEBUG:root:Processing chunk 29
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.551s for 2.700s audio file.
DEBUG:root:Transcript: i really believe there was a beuty
DEBUG:root:Processing chunk 30
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.869s for 4.410s audio file.
DEBUG:root:Transcript: e sehela how meen his insih was
DEBUG:root:Processing chunk 31
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.456s for 2.280s audio file.
DEBUG:root:Transcript: sometimes et could spey
DEBUG:root:Processing chunk 32
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.585s for 3.030s audio file.
DEBUG:root:Transcript: as um suremany you know
DEBUG:root:Processing chunk 33
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.022s for 4.920s audio file.
DEBUG:root:Transcript: steve didn’t comfined his sensif excellent to make him products
DEBUG:root:Processing chunk 34
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.544s for 2.610s audio file.
DEBUG:root:Transcript: you a wo we travel together
DEBUG:root:Processing chunk 35
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.356s for 1.770s audio file.
DEBUG:root:Transcript: wold check hin
DEBUG:root:Processing chunk 36
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.387s for 1.920s audio file.
DEBUG:root:Transcript: t gop to my room
DEBUG:root:Processing chunk 37
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.868s for 4.260s audio file.
DEBUG:root:Transcript: nat leave my bags thery needly but te door
DEBUG:root:Processing chunk 38
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.239s for 6.390s audio file.
DEBUG:root:Transcript: with numat
DEBUG:root:Processing chunk 39
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.814s for 4.080s audio file.
DEBUG:root:Transcript: gon si on the bed
DEBUG:root:Processing chunk 40
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.061s for 5.220s audio file.
DEBUG:root:Transcript: on si on the bed next to the fhun
DEBUG:root:Processing chunk 41
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.283s for 1.470s audio file.
DEBUG:root:Transcript: wat
DEBUG:root:Processing chunk 42
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.434s for 2.130s audio file.
DEBUG:root:Transcript: n evetible fone cal
DEBUG:root:Processing chunk 43
DEBUG:root:Running inference…
DEBUG:root:Inference took 2.631s for 12.990s audio file.
DEBUG:root:Transcript: ony this hoodself soctless go
DEBUG:root:Processing chunk 44
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.308s for 1.560s audio file.
DEBUG:root:Transcript: used to joe
DEBUG:root:Processing chunk 45
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.631s for 3.150s audio file.
DEBUG:root:Transcript: lunitics a takean over the assinem
DEBUG:root:Processing chunk 46
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.576s for 2.760s audio file.
DEBUG:root:Transcript: swe shard gedioxsignment
DEBUG:root:Processing chunk 47
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.090s for 5.070s audio file.
DEBUG:root:Transcript: spending months and months working on a part of a product
DEBUG:root:Processing chunk 48
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.493s for 2.310s audio file.
DEBUG:root:Transcript: nobody with ever see
DEBUG:root:Processing chunk 49
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.290s for 1.380s audio file.
DEBUG:root:Transcript: owith the rese
DEBUG:root:Processing chunk 50
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.872s for 4.020s audio file.
DEBUG:root:Transcript: did it because we because we really believed that it was right
DEBUG:root:Processing chunk 51
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.276s for 1.410s audio file.
DEBUG:root:Transcript: cause we cared
DEBUG:root:Processing chunk 52
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.542s for 2.520s audio file.
DEBUG:root:Transcript: elieved that there was a grammidty
DEBUG:root:Processing chunk 53
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.751s for 3.570s audio file.
DEBUG:root:Transcript: umast ascensive civic responsibility
DEBUG:root:Processing chunk 54
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.452s for 2.280s audio file.
DEBUG:root:Transcript: so care wavbyyongs
DEBUG:root:Processing chunk 55
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.619s for 2.940s audio file.
DEBUG:root:Transcript: and e sot of functional imperative
DEBUG:root:Processing chunk 56
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.108s for 0.630s audio file.
DEBUG:root:Transcript:
DEBUG:root:Processing chunk 57
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.340s for 1.800s audio file.
DEBUG:root:Transcript: wok
DEBUG:root:Processing chunk 58
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.488s for 2.340s audio file.
DEBUG:root:Transcript: hoopfully appeared in evi table
DEBUG:root:Processing chunk 59
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.309s for 1.560s audio file.
DEBUG:root:Transcript: hid simple
DEBUG:root:Processing chunk 60
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.225s for 1.140s audio file.
DEBUG:root:Transcript: teasy
DEBUG:root:Processing chunk 61
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.301s for 1.500s audio file.
DEBUG:root:Transcript: really cost
DEBUG:root:Processing chunk 62
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.323s for 1.650s audio file.
DEBUG:root:Transcript: cost te soledin i
DEBUG:root:Processing chunk 63
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.460s for 2.190s audio file.
DEBUG:root:Transcript: you know i cost him most
DEBUG:root:Processing chunk 64
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.312s for 1.500s audio file.
DEBUG:root:Transcript: cared the most
DEBUG:root:Processing chunk 65
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.956s for 4.620s audio file.
DEBUG:root:Transcript: he wo in the most deeply he constantly questioned
DEBUG:root:Processing chunk 66
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.290s for 1.380s audio file.
DEBUG:root:Transcript: this good enough
DEBUG:root:Processing chunk 67
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.245s for 1.230s audio file.
DEBUG:root:Transcript: this right
DEBUG:root:Processing chunk 68
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.530s for 2.610s audio file.
DEBUG:root:Transcript: dispite all his successis
DEBUG:root:Processing chunk 69
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.404s for 2.040s audio file.
DEBUG:root:Transcript: his achievements
DEBUG:root:Processing chunk 70
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.089s for 5.220s audio file.
DEBUG:root:Transcript: never presued he never assumed thet we would get there in the end
DEBUG:root:Processing chunk 71
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.397s for 2.010s audio file.
DEBUG:root:Transcript: nideas didn’t come
DEBUG:root:Processing chunk 72
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.529s for 2.640s audio file.
DEBUG:root:Transcript: the proace it types faled
DEBUG:root:Processing chunk 73
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.778s for 3.840s audio file.
DEBUG:root:Transcript: it was with great intent with faith
DEBUG:root:Processing chunk 74
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.477s for 2.400s audio file.
DEBUG:root:Transcript: he decided to believe
DEBUG:root:Processing chunk 75
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.298s for 1.530s audio file.
DEBUG:root:Transcript: then shally
DEBUG:root:Processing chunk 76
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.317s for 1.530s audio file.
DEBUG:root:Transcript: a something greaght
DEBUG:root:Processing chunk 77
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.539s for 2.730s audio file.
DEBUG:root:Transcript: joy of getting man
DEBUG:root:Processing chunk 78
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.526s for 2.640s audio file.
DEBUG:root:Transcript: i loved is infhusiasm
DEBUG:root:Processing chunk 79
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.484s for 2.430s audio file.
DEBUG:root:Transcript: simple thelight
DEBUG:root:Processing chunk 80
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.474s for 2.370s audio file.
DEBUG:root:Transcript: ma i mixed with serilief
DEBUG:root:Processing chunk 81
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.423s for 2.130s audio file.
DEBUG:root:Transcript: the year we got there
DEBUG:root:Processing chunk 82
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.319s for 1.590s audio file.
DEBUG:root:Transcript: we got there in the end
DEBUG:root:Processing chunk 83
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.233s for 1.140s audio file.
DEBUG:root:Transcript: ahe was good
DEBUG:root:Processing chunk 84
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.448s for 2.250s audio file.
DEBUG:root:Transcript: conceise smile conye
DEBUG:root:Processing chunk 85
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.010s for 4.710s audio file.
DEBUG:root:Transcript: selebration of making something grat for everybody
DEBUG:root:Processing chunk 86
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.662s for 3.270s audio file.
DEBUG:root:Transcript: enjoying the defeat of sinisism
DEBUG:root:Processing chunk 87
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.439s for 6.600s audio file.
DEBUG:root:Transcript: rjection of reason the rejection of being told a hundred times in condo that
DEBUG:root:Processing chunk 88
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.733s for 3.570s audio file.
DEBUG:root:Transcript: so hes i think was in victory for beauty
DEBUG:root:Processing chunk 89
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.307s for 1.560s audio file.
DEBUG:root:Transcript: pperity
DEBUG:root:Processing chunk 90
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.605s for 2.970s audio file.
DEBUG:root:Transcript: he would say for givein at dham
DEBUG:root:Processing chunk 91
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.840s for 4.140s audio file.
DEBUG:root:Transcript: he was my closeess and we must loa friend
DEBUG:root:Processing chunk 92
DEBUG:root:Running inference…
DEBUG:root:Inference took 2.090s for 9.300s audio file.
DEBUG:root:Transcript: together fornerly fitteen years and he still laughed to the way i sad ali minum
DEBUG:root:Processing chunk 93
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.487s for 2.340s audio file.
DEBUG:root:Transcript: past tothe weeks
DEBUG:root:Processing chunk 94
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.968s for 4.410s audio file.
DEBUG:root:Transcript: wh we ill bing struggling to find ways to save tood by
DEBUG:root:Processing chunk 95
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.342s for 1.620s audio file.
DEBUG:root:Transcript: t smooning
DEBUG:root:Processing chunk 96
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.380s for 1.920s audio file.
DEBUG:root:Transcript: smply once who weren
DEBUG:root:Processing chunk 97
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.372s for 1.860s audio file.
DEBUG:root:Transcript: ank you staye
DEBUG:root:Processing chunk 98
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.628s for 3.000s audio file.
DEBUG:root:Transcript: f youl remarkable vision
DEBUG:root:Processing chunk 99
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.332s for 1.620s audio file.
DEBUG:root:Transcript: ichis inited
DEBUG:root:Processing chunk 100
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.319s for 1.590s audio file.
DEBUG:root:Transcript: nspired
DEBUG:root:Processing chunk 101
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.526s for 2.550s audio file.
DEBUG:root:Transcript: this extraordinary groups of people
DEBUG:root:Processing chunk 102
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.525s for 2.580s audio file.
DEBUG:root:Transcript: for the oll the weav hof men from you
DEBUG:root:Processing chunk 103
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.781s for 3.660s audio file.
DEBUG:root:Transcript: nfor all thet we will continue to learn from each other
DEBUG:root:Processing chunk 104
DEBUG:root:Running inference…
DEBUG:root:Inference took 0.200s for 1.050s audio file.
DEBUG:root:Transcript: st
DEBUG:root:Processing chunk 105
DEBUG:root:Running inference…
DEBUG:root:Inference took 1.926s for 9.900s audio file.
DEBUG:root:Transcript: ee
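For anyone reading a debug log like the one above, the timing lines can be summarized with a few lines of Python. This is only a sketch: it assumes the log keeps the exact "Inference took Xs for Ys audio file" wording shown here, and the function name `real_time_factor` is made up for illustration. It says nothing about accuracy, only about how inference time compares with audio length.

```python
import re

# Matches the timing lines as they appear in the debug log quoted in this post.
TIMING_RE = re.compile(r"Inference took ([\d.]+)s for ([\d.]+)s audio file")

def real_time_factor(log_text):
    """Return (total_inference_s, total_audio_s, ratio) for a debug log."""
    pairs = [(float(i), float(a)) for i, a in TIMING_RE.findall(log_text)]
    total_inference = sum(i for i, _ in pairs)
    total_audio = sum(a for _, a in pairs)
    # A ratio below 1.0 means inference ran faster than real time.
    ratio = total_inference / total_audio if total_audio else float("nan")
    return total_inference, total_audio, ratio

# Two timing lines copied from the log above (chunks 00 and 01).
sample = (
    "DEBUG:root:Inference took 2.720s for 5.880s audio file. "
    "DEBUG:root:Inference took 0.292s for 1.470s audio file."
)
inference_s, audio_s, rtf = real_time_factor(sample)
print(f"{inference_s:.3f}s inference for {audio_s:.3f}s audio, ratio {rtf:.2f}")
```

Because the pattern is searched across the whole text, it works whether the log entries are on separate lines or run together.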

The results are sometimes phonetically correct, but the transcription is full of spelling errors, as above.

So how should I improve this transcription? Should I use different models, and if so, where do I get them from? How can I improve this without training? I don't have annotated samples.

And if it needs training, what is the minimum amount of training it needs, and how do I train in the most minimal way possible to get a good transcription? How many samples, at minimum, would I need to annotate and train on?

I asked on Discourse but didn't get any response.

Thanks in advance, Raghav

lissyx commented 5 years ago

I used the latest alpha release of DeepSpeech, 0.4.0-alpha.3, as the stable release was giving really bad results. I used the output_graph from reuben's release because the 0.3.0 models produced just gibberish, with nothing of value in the transcription; this fix was provided in the GitHub issue #1156

Please elaborate. You took the 0.3.0 model as-is and used it with the 0.4.0-alpha.3 binaries? Or did you take any extra step? The 0.3.0 out-of-the-box model would output gibberish with those binaries because it has no softmax layer, so one needs to re-export it to make it compatible.
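As a rough illustration of why a missing softmax layer matters: code that expects per-character probabilities will misread raw, unnormalized scores. The sketch below is plain Python, not DeepSpeech code, and the three scores are made up; it only shows what the softmax step does to one time step of output.

```python
import math

def softmax(scores):
    """Turn raw scores for one time step into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three characters at a single time step.
raw_scores = [2.0, -1.0, 0.5]
probs = softmax(raw_scores)

# Raw scores can be negative and do not sum to 1; the softmax output does.
print("raw:  ", raw_scores, "sum =", sum(raw_scores))
print("probs:", probs, "sum =", sum(probs))
```

The ordering of the scores is preserved, but only the normalized values can be treated as probabilities by whatever consumes them.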

I extracted the audio from an Apple ad in which Jonathan Ive speaks with a really clear voice over background music. I converted it to 16000 samples per second, as required by DeepSpeech, and found a lot of spelling errors in the result.

What are the exact original and converted audio specs? Conversion could add artifacts that mess up recognition.

raghavk92 commented 5 years ago

@lissyx I didn't use the 0.3.0 model because of that missing softmax layer issue. I used reuben's 0.2.0-prod-ctcdecode release, which I got from another GitHub issue. The link for the model is https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode

I have given links above to all the other files I used (lm, trie).

The original audio, which I downloaded from YouTube and converted to audio with the youtube-dl package: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

The audio converted with ffmpeg (specs): RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz

I also want to know: if conversion can mess up recognition, where and how do I get the audio specs DeepSpeech will work with? I mean, how should I convert? Most of the audio I have will be at a different sample rate, mostly 44.1 kHz rather than 16 kHz, and even if I train, I will train only on the converted samples. If converted samples create problems, how do I resolve this?
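One way to double-check a file's specs before running inference is to read the WAV header with Python's standard `wave` module. This is a minimal sketch: the helper names are made up, and the target it checks (16-bit, mono, 16000 Hz) is simply the converted format reported in this thread, not an authoritative list of what DeepSpeech accepts.

```python
import os
import tempfile
import wave

def wav_specs(path):
    """Read channel count, sample width and sample rate from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }

def matches_thread_target(specs):
    """True for 16-bit mono 16000 Hz PCM, the converted format quoted above."""
    return (
        specs["channels"] == 1
        and specs["sample_width_bytes"] == 2
        and specs["sample_rate_hz"] == 16000
    )

# Demo on a generated file so the sketch runs without any real recording.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)  # one second of silence
    specs = wav_specs(demo_path)
    print(specs, matches_thread_target(specs))
```

A header check like this only confirms the container format; it cannot tell whether the resampling itself introduced artifacts.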

One more thing I want to ask: I was trying to train but couldn't find the checkpoint for reuben's release. How do I train, and what do I put in checkpoint_dir to train from the pre-release model? Do I have to use the 0.2.0 checkpoint, or can the output_graph files be used to train on top of them? And how would I use output_graph to train, if that is possible, given that the README only says to give the checkpoint directory?

kdavis-mozilla commented 5 years ago

As you appear to have a mix-and-match model, it might be easier, instead of tracking down the problem, to just wait until Monday, when we are planning to do the 0.4.0 release, and then use that.

lissyx commented 5 years ago

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

And the converted audio with ffmpeg(specs) - RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz

So you have noisy PCM 16-bit stereo 44.1 kHz audio with background music, converted to PCM 16-bit mono 16 kHz? The devil is in the details: it might also come from how you perform the ffmpeg conversion. Check on Discourse, there are some good examples.

raghavk92 commented 5 years ago

@lissyx I converted in exactly the way specified in one of the Discourse posts, with the parameters -acodec pcm_s16le -ac 1 -ar 16000. I understand a noisy sample giving errors, but I have checked at least 15 samples with interviews of Tim Cook and Elon Musk: clear, non-noisy data, yet there are still spelling errors, like "here" becoming "hear" and "evolution" becoming "evil lution". Above I only gave two samples, one with a noisy background and one without.
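For reference, the conversion described above can be scripted so that every file goes through the same flags. This is a sketch that assumes ffmpeg is installed and on PATH; the function names and file names are hypothetical, and only the three flags quoted in this post are used.

```python
import subprocess

def build_ffmpeg_command(src, dst):
    """Build the ffmpeg call with the flags quoted above."""
    return [
        "ffmpeg",
        "-i", src,
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian PCM
        "-ac", "1",              # downmix to a single channel
        "-ar", "16000",          # resample to 16000 Hz
        dst,
    ]

def convert(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)

# Hypothetical file names; the command is only printed here, not executed.
print(" ".join(build_ffmpeg_command("input_44k_stereo.wav", "output_16k_mono.wav")))
```

Keeping the command in one place makes it easier to rule out the conversion step when comparing transcriptions across files.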

How do I improve this result? I would also like to know, when I train on the pre-trained model, which files I need to put in checkpoint_dir, and which version of the checkpoint (as kdavis suggested, I can wait until Monday for the latest checkpoint, when you release 0.4.0). Does it need to contain the lm and trie, or do I need to give them as separate arguments? The README shows that I only need to give checkpoint_dir. And as I am new to training, I want to check my understanding: I can't train on top of the output graph file, only on the checkpoint files, right?

lissyx commented 5 years ago

Can you please start by explaining exactly how you run things? And use proper formatting? Your first post is barely readable; it's painful to distinguish between your statements, your questions, and your console output.

Can you verify with the basic tools, like evaluate.py and the native client? You mention the VAD transcriber; this is another element modifying the behavior...
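When comparing setups, it helps to put a number on "a lot of spelling errors". A word error rate (word-level edit distance divided by the number of reference words) is a common way to do that. The sketch below is a plain-Python illustration and not the code from evaluate.py; the reference strings are taken from the examples mentioned in this thread.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the processed reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref) if ref else float("nan")

# "evolution" split into "evil lution": one substitution plus one extra word.
print(word_error_rate("more than an evolution", "more than an evil lution"))
# A homophone swap counts as one substitution.
print(word_error_rate("come here", "come hear"))
```

Note that a phonetically plausible split such as "evil lution" is penalized twice at the word level, which is part of why such transcripts look worse than they sound.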

raghavk92 commented 5 years ago

@lissyx Sorry for not formatting it well. Please have a look at the details below:

I used the latest alpha release of DeepSpeech, 0.4.0-alpha.3. As I wrote above, I used the files from the following links:

To convert the video to audio I used the youtube-dl package, like: youtube-dl --extract-audio --audio-format wav

  1. The first audio file is an Apple Watch video with background music; I would understand a bad transcription here because of the background music. The link for the file: https://www.youtube.com/watch?v=6EiI5_-7liQ
     Transcription: e e e in i an n an a enemple agh seres for is more than an evil lution erepresents a fundamental redesin anryengineering of apple watchrtaining the riginal i comicg designveloped ury find the for olsimanaging to make it fine be new display is now oven birty percen larger and is seemlessly integrated into the product the interface as been read deigned fron you tiplay providing more information with rich a detail the heard wore hand the software combine to define a very new and truly intergrated singular design novigating with the digital crown olready one of the most intricat makhalisms wit ever created has been intirely igreengineeredwith hapti feeback dilivering a presise ecannical field as idrol in addition to an obtea hasanco the is a new applepizine ilectrical hars and se to the lousutitake in electra cardia graham or easy ge to share with your doctor a momnentesichievement for a were of a divice placing a finger on the tigital crownd i eeplose cerkid with a lectrods on the bank providing dater the easy g busesanaliz your harid whole understanding hea health is a sential to ou well bei aditional features in in harmsmans in courag es ti live and overall healther or tantive life the excela romiter girescove an alfliter allow you to recall youtypes of workelse measure runs withincreased presision and tra your all day activity with great accuracy in hart selilar connectiv ity in tabu something prulyliberating the obility distaklinected with just your wach fon case music streaming and even a mergency essistence ol immediately evolable from your restch eries for is a device so powerful so postnal so liperating i con change the way ou liveach day
     Inference took 61.270s for 182.370s audio file.

  2. The other video is a Jonathan Ive speech. It is mostly clear, but sometimes people are laughing in between. The link for the file: https://www.youtube.com/watch?v=GnGI76__sSA
     Transcription: let stevis to em saye to me and heused to save is a lot i jonny is it tog pe idea and sometimes they wore really do' pe sometimes they would truly dreadful but sometimes they took the air from the room a neylifted poth completely silent bold crazy maginicicen ideas or quiet simple wones which in thi sufflety bed detail they were uttery profound in just ac steeve loved ydeas and loved makan stuff he treated the prosess a creative atty with a rare an a wonderful reverence is the i think he better than any one understood the wile ideas altimately con be so powerful they begin as fratile bery form thoughts so easily missed so easily compromised so isily jusquift in i loved the way tha he listened so intendly a loved his peception his remarkaple sensitiveity and his surtily creisopinion i really believe there was a beauty in houe singular how ken his insihtwas even though sometimes it could stey as im surmany of iu know steve didn't confind his sensiv excellent to make him products yon a wo me travel together we would check in and i goup to my room an at leave my bags very needly by the door and i would nompact and i woul go in so under bad i would gon to un the bed next to the phone an i would wait for the inevitable pone col e tony this hadtell supsless go used to chok the the lunitics had taken over the assilemas we shared a giddiesinment spending months and months working on a part of a product that nobody would ever see on the wit her eyes but we did it because we because we really believed that it was right because we cared he believed that there was a gramity almost ascensive civic ressponsibility to care way beyond any sot of tunctional imperative no ale the work hopefully appeared in evi table appeared simple easy it really cost e costasoldin i but you know what he cost him most he cared to most he weried the most deeply he constantly quistioned isis good enough is dhis riht u dispite al his successis all his achievements he never presumed he never assumed thet we would get there i the end when the ideas didn't come and when the prote types failed it was with great intent with faith he decided to believe we would iventually make something great but the joy of getting map i loved is inthusiasm his simple delight oftan atin mixed with therilief but the year we got there we got ther in the end and he was good you cancee smiled conyou the celebration of making something grat for everybody enjoying the defeat of cinisism the rejection of reason the rejection of being told a hundred times and condo that so his i think was a victri forbety fa perity an it he would say forgivin at dam he was my closesst and wo most loae friend we wore together fornelly fifteen years and he still laughed to the way ic said aliminionfr the past to weeks wi a wing struggling to find wast to sake good by this morning i simply ant to wend by saink thank youstath thank you for your remarkable vision which is ounited an inspired this extraordinary group of people for the all the weak of mun from you and for all thet we will continue to learn from each other thank ousta

  3. This is a tim cook interview. choose this because he speaks slowly and also two people speaking and silence in between. majorly no background noise. The link for the file : https://www.youtube.com/watch?v=qvR3PEPqX9A&t and transcription: tempegir so much particnar tire le shar we ppreciated thenk yiu for colin on i wantod dig right into the results tim and i tol revenues specifically pos is is yumention that was lower than expected an tat o gaid fir for the revenue short fall here and i went togainst pecifically it the trenger sing an shina mam because you you say something interesting which it isn't just the aconomy there it's also es rising trade tentions what did you mean by that tem year if if you look at our results oh our short fall is ah over huntred persent froit pi ton in its primarily and greater child its ok with as we' look at what's going on in china the it's clear that the economy began to slow there for the second half and when i believe to be the case is the trade tentions between e nited states an china put a disinal pressure on their tonomy and so we saw as the quarter wint on a things like god traffic in our retal stores traffic an our channel parter stonees ah the report of the smart pondi indistry ah contracting ah inparticularly bad in nevembered i haven't seem to dicember never yet but i would i would guess the vacnodon be good ither and so ah des what we san and um now thrl wat o things we can do to ah turn our atu to it sort of turn our ah business around and interms of the amoth in china in imore a generally ah a cross were focusing on it if you look at i fon more an a mackral lovel i di dhe storyan i found is in a dition to the emerging market weekness which is priarily in chinon it's tead oh there's not as any subsidies is there used to be for a charrier pointi yer and were that didn't all happen yesterday for if you've been out of the market for two or three years and you come back it looks like that to you i f acts was a big 
challenge in the corter as i interestrate heighes of stordi ne ie state there's more fore capital cunning in that makes the dollar much stronger in i de translation in we knew that with going to be a factor it effected usto a by bac two hundred bases points i in it in then it' sort of ina in addition to those to things we started a programe erealide ah were we drematically lowered the battery replacement price and so we we had fort of the collection of items going on some that are mackro economic and some that ere appes pecific and were not going to sit around witing for the mac rob to change i hop that a dies in im actually octennis tiv ah but we'e gong to focus ah really i deeply on the things we can control and let inters e things perhaps here littl be adier hintrolo to mi i just wet touch o yan as pecifically ut to back to that because the trad tentions are having a secture seing iconoy there but but you see evidence that perhaps apples also getting caugh in the cross by orntems of is her evinens tha chine consumer to say you know whether's te dipute there's tention in their takin at on apple an some way as well will i certainly oh apple has not been targueted by the government and so loono me turtake wa any kind of doubt of that right up tap vhere arl reports ah sort of sparatic reports abowd uh somebody talking about not bying a products ah because we're american may be som a little bit on social media maybe a guy standing in front of store something my my personal since sis that this sa small ah kicking myn ne chinas not monelistic just like a nerica not min alissin he have people wilh different views an different ideas and so do i think anybody alected ah not to by because of that i'm sure some people did but but my sens is thof much lorgyour issue is thi slowing of the conomy and in this a betray tention thats furher pressured and gi tat od given it this is a headwin and enman more than you expect to have you taught hem just intersted to present trum or 
members uninistration is is a big important american aconomy and your singlisten this trade dispute as really impacting our busets have you have you recently talked to those mevers in isration at enconveyed that you know i i'm telling our investrous first about what we sal ah last quarter into inets bat is the way of should be a but i've had obioustly many manu discussions over ah the course of many mods to to be constructive and to give sort of my prespective on trade in the importance of i to the american acconomy as well and ah i i i feel my come that i'm being listened to an in that respect and so ii'm actually incouraged by what i've heard a most recent ly coming from the us and from china and ah hoefally will sie some change togivn it those trea tentions tim tin an they do remain heated um given the pressur seeing isyu speak a tray ors in desters in vusiness people now in the cours ahead con you ben nanigate this well you you you fokutso whath you can control and so wit what i look at this i say ah wut de ar there's some weakness outside of tin as well i would have liked tove done better and some of our developed markets and so how can we do that will the subsidies have or or fewer these days that's true but we can stort e or we had sorted a trade in per graham and wich cert it panarly eeas drehe th invironment you o keeps a yunit with some one that that wants it in the person o want o new and gits one as well and it's great for developers and in so ford this well but but we have no't really marketed at very much and that truth is to a consumer the tradian looks like a subsidy because it lowers the price of you the fone that you want and sir jus the min of an example of that and so the the h the rechal price of the ah i ponten ar inunited states is sevel foty no ah but if you havebeen to detot to trade am some plaus which many people are in order to to get that the price goes all the way down to for forty nine or loss and i and so there is a substantial 
venefit economic and inviromental from tradin werl so working on oh placing a ability to do mutly charges an and so it begins til look like more the traditional way of paing thor it through the carrier by peno taking the the rates out for for twenty for months or so and se you wind up getting a i incredilig ny pond it so much beter th what you've had for twenty thirty dollars a month tor so and ah and so we're doing that we're all so uppiing a lot of focus on the service side are stores are unbelievable at service and the at dibility people for rewaryng about transfering te dayda in o the very word to the snow fan there be something that they lost in the process and so wer rere putting a lot of temphasis on doing that indoing that well and so those are just somethings the other things ah which are not different than we fawg but did affect or revenue in the carter are things like whid ome supply constrates when an unpresedented number of new products during the carter we had new watches and ah we had on new i pe prose both of these were constraing for all or most of the corter did you think un looking back in you theatm do you think ou try to introduce too much new two fast no i think you you you know our arstale jode is we were lestings when the ready and and i that's the way it should be if if you ever stort worring about cantiblizing yourself you can talk torself into not do an molth thas in so i were allive our products were ready over that period now wold i have lihed a simidom to be ready a few unts erlierof course i would onways like that but but gunerally were it wore we're still goin o ned match yon the roadf of a shivping is een a rat and ten eask you too madis wilise in deser's get a lot of information an metrixs but i a as you guys fell outlessners going to b some changes in disclosure you're not young get the number of i tonship to y more you guys don't see at his a relevant metric so much as in the past um if that isn't the date a points thid 
investrshol be folks on what ore the dat o points yea invester she fult a good a wettion look what we did years ar go actually without the watch we'v never dislosire and so wid fy it was a just because we were secret a people it's it we looked at this in you the watches were wide range and turns of priceing it we knew that a ventually we would have a selilar wadch withere's a stainless stell verses in alominum a there's even an adition and so you begin to say what vou you is there in adding pace things up i made the comparison it sort of like you and i go into the grocery store in putting things in our cart and coming up tohe register and the person sang how many agod it dust makes sis to admam together any more because the price ranges are so wide so we didn't do it on ah watchfen the beginning we' never done it on i pad as we now sten back from lo from the fone we have fhonds being sold in eemerging markets ah like a ifhen six as for renthry hundred dolars and so you got a range from three hundred two a thousand or orin some paces over a thousand depending u pon your selection of of af d ah flash and so forth anstad this thing has lostits meaning an so we felt that at the end of the day ah we were giving andvesters in sort of pointing down to something as if it had this incredible importance to it will be on what it does that doesn't mean were never going to com ind anunit again if we think thit we can better explain resolts with talking about youts i tek i deci something about hem but but jinerally to have it onn a you know every ninety day clock of releasing this ii i don't i think it does the indvestor a discervic frankly but now we are making aditional discosures as well like for example we're going to give the gross morgen or services business you know we've never done that before services has ag grong you know an predible a mout ah we're over t we're going to have over port over ten point eght billion and when we wereport later the smot for a last corder 
that u new racor and what what drove that to may colorgaves a we ask towar was at news ig is is this is ah this is a cut of exciting for os decause so many things hit records in are the abstore ded apple music hit a ne rackors apple pay hit a new ractor our ah searchad product from the absore kit a new ractord ah a i cloud hiddene raggord and so you know it's very wide and each of the gographry es a gography's hid a porterly racer so even in china they abstore hit o cordly racer whys that it's because its driven by the installed base and aren't stall base ih grew oh young knicely yeurover yer and china is well and an as i say in the letter ah wevh picked up a hundred more milion active divices over the last wile months alone said the visisaning credible number and itwill have we get some interesting things in the pipe line i inservices ar a course we do on products is well and ah so mats fort of another way too ah grow the company to find a question ti es ounger to end the cuartier a hundred thirty billion in nhet cash that's hri am o apple has a history do um a lot of backquisition they tend to be thoe smaller m tiggest is tree billion for beats um do you think maybe given that cash position um would apple be open to maybe shiftin out thinks about acquisitions in doing aquisitions that may be investers with think or biger or more meingful you you know before for us ah we've never changed our vinacropact positions we've never said thy shelt lot by ig compinat or tha shalt love my medium company or iit has to only be indiscountry that country we always looked at it from astrategict point ef view an as what is it due for the postomer what is it di for the uther and sod the vast majority of ours had been technology and people that we think would bring a better user experieence that there's a feature or something that that we could do in the future in the vacat help on doing that but that doesit i i always been very clear is we look at it many many companies including 
vory large companaes we've al lukeed so for not to do those because we haven't found one that we say wile that's o that's a knice ir suchal table but i i'v never well wouldow am wet we do have a lot of a neckcash an i believe the firthecompanies ar stock as an appredale galue and in so ah yi you can dethit ah we're going to be bind sumstock under of the plan that we pad aut therefore quite some time i tenfhenu to much ror taxa you generous eet ti yo its ho galow toar jawe rall im mi he Inference took 368.070s for 839.401s audio file.

Now my questions:

  1. How do I get a better transcription for all these files (the one with background noise and the ones without)?

  2. How should I train, and with which checkpoints and other files, to get a better transcription?

  3. What should I do differently in training to get a better transcription for files with background noise, and what should I do for files with no background noise?

  4. What is the best way to convert the audio so that the conversion doesn't affect the transcription?

  5. Is there anything else that I missed and should try? Please tell me.

I still have to use evaluate.py. I'll use it and update my comment.

Hopefully this is better formatting.

Thanks in advance

lissyx commented 5 years ago

Thanks, that's much more readable, though transcriptions could have been in a separate gist :)

lissyx commented 5 years ago

youtube-dl package for converting video to audio and like youtube-dl --extract-audio --audio-format wav ffmpeg -i inputfile.wav -acodec pcm_s16le -ac 1 -ar 16000 output16k.wav

Interesting. I know we did a demo of live YouTube video transcription that took the audio output from the system directly and got pretty good results, likely much better than these. That was using the streaming API, but with some other VAD.

Also, I don't know what youtube-dl does when you extract audio that way. Is it copying the raw stream or doing something?

lissyx commented 5 years ago

3. The link for the file: https://www.youtube.com/watch?v=qvR3PEPqX9A&t and transcription:

youtube-dl fetches some AAC stereo 44.1kHz audio. According to youtube-dl manpage, the --extract-audio depends on ffmpeg / libavcodec, so it's already torturing the audio stream. As expected, it extracts PCM LE16 as stereo 44.1kHz.

AAC being lossy, we're likely getting some artifacts. @raghavk92 Could you try not passing --extract-audio --audio-format wav, and instead use ffmpeg to convert the original AAC directly into proper mono 16kHz PCM audio?
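One quick way to confirm what a WAV file actually contains before handing it to DeepSpeech is to read its header. This is a minimal sketch using Python's standard `wave` module. The helper names `describe_wav` and `is_deepspeech_ready` are hypothetical, and the mono / 16 kHz / 16-bit target is the format asked for in this thread rather than something read from the DeepSpeech code:

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, bits_per_sample) read from a WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth() * 8


def is_deepspeech_ready(path):
    """True if the file is mono, 16 kHz, 16-bit PCM (the format asked for in this thread)."""
    return describe_wav(path) == (1, 16000, 16)
```

For the WAV produced by `--extract-audio --audio-format wav`, `describe_wav` would be expected to report two channels at 44100 Hz, so `is_deepspeech_ready` would return `False` until the file is resampled.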

raghavk92 commented 5 years ago

This is the exact output of the youtube-dl package.

[youtube] qvR3PEPqX9A: Downloading webpage
[youtube] qvR3PEPqX9A: Downloading video info webpage
[youtube] qvR3PEPqX9A: Extracting video information
WARNING: unable to extract uploader nickname
[download] Destination: Watch CNBC's full interview with Apple CEO Tim Cook-qvR3PEPqX9A.m4a
[download] 100% of 12.96MiB in 06:44
[ffmpeg] Correcting container in "Watch CNBC's full interview with Apple CEO Tim Cook-qvR3PEPqX9A.m4a"
[ffmpeg] Destination: Watch CNBC's full interview with Apple CEO Tim Cook-qvR3PEPqX9A.wav
Deleting original file Watch CNBC's full interview with Apple CEO Tim Cook-qvR3PEPqX9A.m4a (pass -k to keep)

According to this output, it seems to download an m4a file with the extractor linked below and then convert it to WAV with the ffmpeg post-processor, also linked below. So I don't think an AAC stereo file is being fetched; an m4a file is, and I think it is then converted with ffmpeg. Do you want me to convert the m4a to WAV myself with ffmpeg? Or am I misunderstanding something you said? Please explain if I am wrong, and suggest what to do.

I think the extractor being used is this one: https://github.com/rg3/youtube-dl/blob/4bede0d8f5b6fc8d8e46ee240f808935e03eafa2/youtube_dl/extractor/youtube.py

and the post-processor for audio extraction is ffmpeg, from this link: https://github.com/rg3/youtube-dl/blob/4bede0d8f5b6fc8d8e46ee240f808935e03eafa2/youtube_dl/postprocessor/ffmpeg.py

lissyx commented 5 years ago

@raghavk92 Well, that's exactly what I'm saying. youtube-dl fetches the raw audio in the m4a, and it's AAC:

$ mediainfo Watch\ CNBC\'s\ full\ interview\ with\ Apple\ CEO\ Tim\ Cook-qvR3PEPqX9A.f140.m4a 
General
Complete name                            : Watch CNBC's full interview with Apple CEO Tim Cook-qvR3PEPqX9A.f140.m4a
Format                                   : dash
Codec ID                                 : dash (iso6/mp41)
File size                                : 13.0 MiB
Duration                                 : 13 min 59 s
Overall bit rate                         : 129 kb/s
Encoded date                             : UTC 2019-01-03 02:40:16
Tagged date                              : UTC 2019-01-03 02:40:16

Audio
ID                                       : 1
Format                                   : AAC
Format/Info                              : Advanced Audio Codec
Format profile                           : LC
Codec ID                                 : mp4a-40-2
Duration                                 : 13 min 59 s
Bit rate                                 : 128 kb/s
Channel(s)                               : 2 channels
Channel positions                        : Front: L R
Sampling rate                            : 44.1 kHz
Frame rate                               : 43.066 FPS (1024 SPF)
Compression mode                         : Lossy
Stream size                              : 12.8 MiB (99%)
Title                                    : ISO Media file produced by Google Inc. Created on: 01/02/2019.
Language                                 : English
Encoded date                             : UTC 2019-01-03 02:40:16
Tagged date                              : UTC 2019-01-03 02:40:16

And then when you ask for WAV it extracts:

$ mediainfo Watch\ CNBC\'s\ full\ interview\ with\ Apple\ CEO\ Tim\ Cook-qvR3PEPqX9A.wav
General
Complete name                            : Watch CNBC's full interview with Apple CEO Tim Cook-qvR3PEPqX9A.wav
Format                                   : Wave
File size                                : 141 MiB
Duration                                 : 13 min 59 s
Overall bit rate mode                    : Constant
Overall bit rate                         : 1 411 kb/s
Writing application                      : Lavf58.12.100

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 13 min 59 s
Bit rate mode                            : Constant
Bit rate                                 : 1 411.2 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 44.1 kHz
Bit depth                                : 16 bits
Stream size                              : 141 MiB (100%)

And in the current description of the issue, you are taking that resulting WAV and converting it again. You should try doing the ffmpeg conversion from the m4a.
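The single-step conversion suggested here can be scripted so that the m4a is always the input and no intermediate 44.1 kHz WAV is involved. This is a minimal sketch, assuming `ffmpeg` is on the PATH; the flags are the ones already quoted in this issue, while the function names are hypothetical:

```python
import subprocess


def ffmpeg_to_16k_mono_args(src, dst):
    """Build the ffmpeg argument list for a 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-i", src,               # the original m4a download, not an intermediate WAV
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16 kHz
        dst,
    ]


def convert(src, dst):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(ffmpeg_to_16k_mono_args(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces and apostrophes, like the one downloaded here. Note that youtube-dl deletes the m4a after extracting audio unless `-k` is passed, as its own output above says.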

raghavk92 commented 5 years ago

@lissyx I tried converting directly from the m4a file to a 16k WAV file with the command: ffmpeg -i Watch\ CNBC\'s\ full\ interview\ with\ Apple\ CEO\ Tim\ Cook-qvR3PEPqX9A.f140.m4a -acodec pcm_s16le -ac 1 -ar 16000 output16k.wav

I am attaching the output in a file, but the transcription is unchanged from the one I got with the previous file. Please suggest whether I should try something different.

timcookcnbcinterview.txt

lissyx commented 5 years ago

I am attaching the output in a file, but the transcription is unchanged from the one I got with the previous file. Please suggest whether I should try something different.

Can you please be more specific when you say "no change"? How did you run the transcription?

raghavk92 commented 5 years ago

@lissyx I ran this command and got the transcription:

deepspeech --model models_updated/output_graph.pbmm --alphabet models_updated/alphabet.txt --lm models_updated/lm.binary --trie models_updated/trie --audio audiofiles/output16k.wav

I am running the transcription the usual way, in a virtual environment, with deepspeech 0.4.0-alpha.3. I am not sure if this is what you asked for. If I gave the wrong information, please tell me and I'll give whatever I can.

By "no change" I mean that I compared both transcriptions (the one in my earlier post and the one I attached now). I read them side by side; I didn't compare them programmatically. There seems to be no change in the transcription from converting directly from m4a to a 16k WAV file.
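A side-by-side read can miss small differences in a transcript this long, so a programmatic comparison is a cheap cross-check. This is a minimal sketch using Python's standard `difflib`; the function names are hypothetical, and the inputs are simply the two transcripts as strings:

```python
import difflib


def transcript_similarity(old_text, new_text):
    """Word-level similarity ratio in [0, 1]; 1.0 means identical word sequences."""
    matcher = difflib.SequenceMatcher(
        None, old_text.split(), new_text.split(), autojunk=False
    )
    return matcher.ratio()


def changed_words(old_text, new_text):
    """Return (old_words, new_words) pairs for every span that differs."""
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

`autojunk=False` is set because the default heuristic treats very frequent items as junk once a sequence exceeds 200 elements, which would distort the result for a 14-minute transcript. If `changed_words` returns an empty list, the two runs really did produce the same text.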

lissyx commented 5 years ago

Honestly, I don't know. Maybe there's some weird inaudible noise in the original recording / from the original upload that breaks us? Have you tested the ffmpeg-based VAD tool?

raghavk92 commented 5 years ago

Are you referring to this: https://github.com/mozilla/DeepSpeech/tree/master/examples/ffmpeg_vad_streaming

I haven't tried the above link, but I have tried this one: https://github.com/mozilla/DeepSpeech/blob/master/examples/vad_transcriber/audioTranscript_cmd.py

I'll try ffmpeg_vad_streaming, but do you want me to pass the audio file like node ./index.js --audio <AUDIO_FILE> --model $HOME/models/output_graph.pbmm --alphabet $HOME/models/alphabet.txt (will any wav file with any sample rate work, or does this also take 16k input)?

Or do you want me to use an rtmp stream, like: node ./index.js --audio rtmp://<IP>:1935/live/teststream --model $HOME/models/output_graph.pbmm --alphabet $HOME/models/alphabet.txt? But how do I stream the YouTube audio to an rtmp stream?

lissyx commented 5 years ago

I'll try ffmpeg_vad_streaming, but do you want me to pass the audio file like node ./index.js --audio <AUDIO_FILE> --model $HOME/models/output_graph.pbmm --alphabet $HOME/models/alphabet.txt (will any wav file with any sample rate work, or does this also take 16k input)?

Help yourself and read the documentation, as well as the code; index.js is short and will answer all your questions.

lissyx commented 5 years ago

Even extracting the first 10 secs of audio converted directly from the AAC does not help.

lissyx commented 5 years ago

@raghavk92 A good point mentioned by @reuben that I forgot: the feature computation changed, so the model from the link you have will produce broken output with the 0.4.0-alpha.3 binaries.

So if you could include the full output including versions, we would all save some time ...

lissyx commented 5 years ago

@raghavk92 OK, we have now released a new 0.4.1, could you re-test on your side? Early testing here shows it improves.

raghavk92 commented 5 years ago

@lissyx Hi, yes, the transcription is greatly improved with the new 0.4.0 update that I tested. I also tested with 0.4.1 but don't know if it is an improvement over 0.4.0, as I didn't find much difference there. Thanks for the updates.

I have 3 questions regarding a few problems I am facing:

  1. We tried transfer learning on the model with a few of our own samples: 4 large audio samples (technical talks) split by voice activity detection into 740 chunks of around 5 seconds each (500 for training, 100 for dev and 140 for test), with the transcriptions in the CSV file.

So some transcriptions got better and some got worse. How many files are needed for good transfer learning to happen?

  2. While transcribing with 0.4.0 I found that when the person speaks fast the transcription goes wrong: either two words merge and form a wrong word, or they come out as two separate wrong words. How do I improve this? Will it still happen after transfer learning with people speaking fast, and how many samples are ideal?

  3. I tried transcribing files with background music but got around 75% accuracy. I removed the noise with Audacity. Procedure:

    • remove voice from audio
    • get noise profile
    • and remove noise from original sample with noise profile

The accuracy was 85% after this.

But I tried to automate this with the sox package for Ubuntu. Procedure:

(I also tried different levels of aggressiveness, like 0.3, 0.05, 0.1, etc., but there was not much change in the transcription.)

The transcription became bad. I think sox damaged the voice audio during noise reduction. Do you know a better way to reduce noise and get a better transcription? And if I need to better transcribe a file that has background music, is there any other way (would training help, and how many samples would I need)?
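The 500/100/140 split from the first question can be written out as the three CSV files that DeepSpeech training reads. This is a minimal sketch: the `wav_filename,wav_filesize,transcript` header is the column layout written by the DeepSpeech importers, while the function names, the fixed seed, and the `(wav_path, transcript)` input pairs are assumptions made for illustration:

```python
import csv
import os
import random


def split_samples(samples, n_train=500, n_dev=100, seed=0):
    """Shuffle (wav_path, transcript) pairs and split into train/dev/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


def write_csv(rows, csv_path):
    """Write one DeepSpeech-style CSV: wav_filename, wav_filesize, transcript."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in rows:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```

One caveat with this kind of split: when all 740 chunks come from the same four talks, the test set shares speakers and recording conditions with the training set, so the test score may look better than it would on new speakers.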

Thanks

lissyx commented 5 years ago

@raghavk92 All those questions would need to be on Discourse, Github issues are really only dedicated to bugs / features in the codebase.

raghavk92 commented 5 years ago

@lissyx I think there is a bug. I'll post the rest of the questions on Discourse. Bug: I had tried 0.4.1 but with the old models. I didn't realise it was a stable release, so there would be new models. So with 0.4.1 the transcription has become worse: words that were transcribed correctly earlier are now transcribed wrongly.

The file that I tested is: https://www.youtube.com/watch?v=qvR3PEPqX9A

The old and new transcription files are attached.

You will be able to see that most of the previously correct parts are transcribed wrongly with the new version.

Only very small things improved, like chinaman becoming china man, or tradeetions now correctly becoming trade tensions, but other things have gone wrong.

Attachments: 0-4-0transcription.txt 0-4-1transcription.txt

I'll give examples:

    • Correct transcription -- traffic in our channel partner stores ah the reports of the smart phone industry ah contracting
    • 0.4.0 transcription - traffic in our channel partner stones at the report of the smart pony industry or contracting
    • 0.4.1 transcription - traffic in er channel partner stones ah the reports of the smart fondants try a contracting
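A comparison like the one above is usually summarised as word error rate (WER): the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. This is a minimal sketch of that definition in plain Python; it is not the code from DeepSpeech's evaluate.py, and the function name is hypothetical:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)
```

On the 16-word excerpt above, both the 0.4.0 and the 0.4.1 hypotheses contain 5 word substitutions under this definition, so both score 5/16 = 0.3125. A single sentence is therefore too small a sample to rank the two versions; the full transcripts would need a hand-corrected reference.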

I don't know if this is a bug, but some problem has been introduced in the new version.

lissyx commented 5 years ago

I don't know if this is a bug, but some problem has been introduced in the new version.

There's way too much noise in your message for me to be able to understand anything. Once you say it's okay, then you say it's not.

raghavk92 commented 5 years ago

@lissyx I am saying that some places where 0.4.0 was correct are now wrongly transcribed. Some places that were wrong earlier have become correct (this happens less often than correct words becoming wrong).

I hope the format of the examples makes it clear where 0.4.0 was correct and 0.4.1 is now wrong.

So I don't know what the exact problem is. Shouldn't the new version improve on the older transcription without damaging the parts that were already transcribed correctly?

raghavk92 commented 5 years ago

Once you say it's okay, then you say it's not.

If you are referring to my earlier messages (before the ones I wrote today), where I said there is not much change between 0.4.0 and 0.4.1: at that time I hadn't changed the models folder, so that was wrong feedback.

lissyx commented 5 years ago

I'm sorry, but your bug is really too messy right now. You refer to a lot of different trials, and I have absolutely no idea of your system setup / training status at this point. Forget about 0.4.0, it was wrong.

raghavk92 commented 5 years ago

@lissyx I thought all the other details about the system were in my earlier posts, and that I had already given them. But I'll mention them again.

I didn't know I had to write the information with every post. Details below:

Everything in the setup is as you told me to do. If you require any other information, please tell me.

Thanks

lissyx commented 5 years ago

@lissyx I thought all the other details about the system were in my earlier posts, and that I had already given them. But I'll mention them again.

The thing is, you tested a lot of different combinations, so it's hard to know exactly what your current status is if you don't describe it.

examples are small portions from both transcripts, which are attached in my first comment today

Here again, it's complicated to track "first comment today": either make it explicit, or link to it. We are not all in the same timezone, so your "today" might be different from mine.

So I have used both 0.4.0 and 0.4.1 in different virtual environments, both deepspeech GPU.

So only native_client code, no use of any of the VAD-backed stuff?

The examples I gave are for the same YouTube video, converted by the youtube-dl package and then directly from m4a to a 16k WAV file. ( https://www.youtube.com/watch?v=qvR3PEPqX9A )

At some point, I'm starting to wonder if this video is just a bad case, maybe the audio contains inaudible noise that breaks our current models ?

raghavk92 commented 5 years ago

@lissyx Sorry for not linking my earlier post: https://github.com/mozilla/DeepSpeech/issues/1817#issuecomment-454357117

The files are in that post, and the examples (which are excerpts) are also in that post.

Yes, I used only native_client: the deepspeech command from the command line.

I don't know whether this is because it's a bad example, because some of the parts which were correct in 0.4.0 became wrong with 0.4.1.

I was telling you so that you know about the problem and it can be corrected if you see a bug.

I also wanted to let you know about the problems we are having with the current version because I am not sure which model to do my training (transfer learning) on. I am thinking of using 0.4.0, but which model version do you suggest, and why?

And if I train from scratch on the Common Voice data, which version is better to train on? Should I use the 0.4.0 files or the 0.4.1 files? We were thinking of using 0.4.0 for training because it had the better transcription. Is that a correct metric?

reuben commented 5 years ago

Don't use 0.4.0 for anything, we uploaded the wrong checkpoint and model, from a completely unrelated training job. Either use 0.3.0, or 0.4.1.

raghavk92 commented 5 years ago

@reuben Thanks for that info, but I just wanted to give one example. In 0.4.0 the transcription is:

So it's just that the wrong model is giving a correct and better result. Should we wait for the next release (will that be happening soon?) or train with this one? If we train and get similar mistakes again with 0.4.1, then a training job from scratch will cost us on AWS and won't give us a good model either.

Sorry for asking again, but do you recommend using 0.4.1 to train from scratch or just for transfer learning? Or do you recommend waiting for the next release?

Thanks

reuben commented 5 years ago

0.4.0 comes from 0.3.0 but was then trained further on Italian data. If it's working better for you, then try using 0.3.0. I recommend using 0.4.1 for any experiments, as it has a lot of small improvements that add up to a higher quality model.

reuben commented 5 years ago

Closing due to inactivity.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.