mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
Mozilla Public License 2.0

Word timestamps branch - feedback needed #1887

Closed dabinat closed 5 years ago

dabinat commented 5 years ago

I've created a version of DeepSpeech with timing information exposed for each word:

https://github.com/dabinat/DeepSpeech/tree/timing-info

My goal was to preserve backward compatibility as much as possible, so by default the DeepSpeech app functions exactly as before; you need to use the -e flag to get the extra data.

Using the -e flag produces the following output:

./deepspeech -t -e --model ../models/output_graph.pbmm --alphabet ../models/alphabet.txt --lm ../models/lm.binary --trie ../models/trie --audio ../test_files/Theresa_May_interview_on_Andrew_Marr_Show_BBC_News-short.wav

file duration: 20
word: and, timestep: 1, time: 0.02
word: now, timestep: 18, time: 0.36
word: i, timestep: 29, time: 0.58
word: am, timestep: 38, time: 0.76
word: joined, timestep: 54, time: 1.08
word: life, timestep: 71, time: 1.42
word: in, timestep: 88, time: 1.76
word: the, timestep: 94, time: 1.88
word: studio, timestep: 100, time: 2
word: by, timestep: 122, time: 2.44
word: the, timestep: 129, time: 2.58
word: prime, timestep: 136, time: 2.72
word: minister, timestep: 148, time: 2.96
word: teresa, timestep: 169, time: 3.38
word: make, timestep: 194, time: 3.88
word: good, timestep: 204, time: 4.08
word: morning, timestep: 211, time: 4.22
word: from, timestep: 224, time: 4.48
word: lin, timestep: 243, time: 4.86
word: enter, timestep: 253, time: 5.06
word: and, timestep: 292, time: 5.84
word: can, timestep: 307, time: 6.14
word: we, timestep: 316, time: 6.32
word: agree, timestep: 322, time: 6.44
word: to, timestep: 338, time: 6.76
word: start, timestep: 345, time: 6.9
word: with, timestep: 358, time: 7.16
word: it, timestep: 368, time: 7.36
word: the, timestep: 379, time: 7.58
word: one, timestep: 386, time: 7.72
word: thing, timestep: 398, time: 7.96
word: the, timestep: 405, time: 8.1
word: voters, timestep: 415, time: 8.3
word: deserving, timestep: 434, time: 8.68
word: what, timestep: 464, time: 9.28
word: you, timestep: 472, time: 9.44
word: yourself, timestep: 480, time: 9.6
word: he, timestep: 503, time: 10.06
word: said, timestep: 510, time: 10.2
word: is, timestep: 516, time: 10.32
word: going, timestep: 522, time: 10.44
word: to, timestep: 529, time: 10.58
word: be, timestep: 532, time: 10.64
word: a, timestep: 535, time: 10.7
word: very, timestep: 544, time: 10.88
word: very, timestep: 579, time: 11.58
word: important, timestep: 604, time: 12.08
word: election, timestep: 634, time: 12.68
word: is, timestep: 668, time: 13.36
word: no, timestep: 679, time: 13.58
word: son, timestep: 693, time: 13.86
word: to, timestep: 707, time: 14.14
word: bite, timestep: 712, time: 14.24
word: militis, timestep: 766, time: 15.32
word: absolutely, timestep: 795, time: 15.9
word: crucial, timestep: 824, time: 16.48
word: because, timestep: 867, time: 17.34
word: this, timestep: 881, time: 17.62
word: is, timestep: 891, time: 17.82
word: as, timestep: 914, time: 18.28
word: i, timestep: 955, time: 19.1
word: think, timestep: 962, time: 19.24
word: the, timestep: 970, time: 19.4
word: most, timestep: 978, time: 19.56
word: simple, timestep: 990, time: 19.8

cpu_time_overall=17.11157

Timings are in seconds. I cross-referenced these timings with the audio file in Adobe Audition and they are accurate. The only ones that weren't quite right were for words DeepSpeech mistranscribed, which makes sense.
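The English run above is consistent with a fixed 0.02 s per timestep (e.g. timestep 18 → 0.36 s), though the Spanish example later in this thread works out to roughly 0.0197 s per step, so the factor presumably depends on the model and audio. A minimal sketch of the conversion, with the step duration as an assumed parameter:

```python
def timestep_to_seconds(timestep, step_duration=0.02):
    """Convert a decoder timestep index to a start time in seconds.

    step_duration (seconds of audio per timestep) is an assumption:
    0.02 matches the English output above, but other runs imply a
    slightly different value, so it should be derived from the model
    rather than hard-coded.
    """
    return timestep * step_duration
```

For example, `timestep_to_seconds(18)` gives 0.36, matching the "now" entry above.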

Again, I was trying to break as little as possible, so the extended output is controlled by a variable in StreamingState to avoid passing it between functions. I did have to modify one function though: DS_SpeechToText now has an additional extendedOutput parameter. I'd be keen to hear feedback from the devs on whether this is the best way of achieving things or whether passing the variable to functions directly is preferred.

ModelState::decode functions identically to before, but most of the logic is now in ModelState::decode_raw. This returns the vector output instead of the transcription, so it is a useful function to call if you need to do additional processing.
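The decode / decode_raw split described here can be illustrated with a hypothetical Python analogue (the real code is C++; the names and the (word, timestep) shape are invented for illustration):

```python
def decode_raw(acoustic_output):
    """Return the raw decoded output as (word, timestep) pairs,
    leaving further processing (timings, formatting) to the caller."""
    return list(acoustic_output)

def decode(acoustic_output):
    """Behaves exactly as before: returns only the transcription text,
    now implemented on top of decode_raw."""
    return " ".join(word for word, _ in decode_raw(acoustic_output))
```

With this split, `decode([("and", 1), ("now", 18)])` still returns just `"and now"`, while callers that need timings use `decode_raw` directly.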

I have the following questions / requests for feedback before submitting a PR:

  1. Is the current output acceptable or would something like JSON be preferred?
  2. My background is primarily with Objective-C, not C++ and std, so optimization suggestions are appreciated.
  3. I only speak English so don't have the ability to test with other languages. It'd be helpful if someone can test with non-English languages and let me know how well it works.
reuben commented 5 years ago

This is super cool, thanks for sharing!

I have the following questions / requests for feedback before submitting a PR:

> 1. Is the current output acceptable or would something like JSON be preferred?

A machine-readable result would be nicer to have, or at least something like CSV or TSV, which balances machine and human readability.

> 2. My background is primarily with Objective-C, not C++ and std, so optimization suggestions are appreciated.

Rather than generating the formatted output in the native client library, it'd be cleaner to expose the timing information in the API, by creating a method that returns it in a data structure, and then format the output in the client. That way other users of the timing information don't need to parse the output, they can just use it directly.
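A sketch of that separation in Python terms (the type and function names here are hypothetical, not part of the DeepSpeech API): the library hands back structured per-word data, and each client formats it however it likes:

```python
from dataclasses import dataclass

@dataclass
class WordTiming:
    """Hypothetical per-word result; the real API's data structure may differ."""
    word: str
    time: float  # start time in seconds

def format_plain(words):
    """One possible client-side renderer; other clients could emit
    JSON, SRT subtitles, etc. from the same structured data."""
    return "\n".join(f"word: {w.word}, time: {w.time:g}" for w in words)

words = [WordTiming("and", 0.02), WordTiming("now", 0.36)]
print(format_plain(words))
```

The point is that consumers of the timing information never have to parse formatted text; only the clients that want printable output call a formatter.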

> 3. I only speak English so don't have the ability to test with other languages. It'd be helpful if someone can test with non-English languages and let me know how well it works.

@lissyx might be able to help here.

reuben commented 5 years ago

Re-reading my comment, I think if we expose the timing data in the API, then the client doesn't have to worry about producing machine readable output, since those cases can just call the API directly.

dabinat commented 5 years ago

Yes, I did it that way because the API already outputs the text. But I can split it up so there's a function that returns the data and a function that prints it. That way you can get it in whatever format you want.

carlfm01 commented 5 years ago

@dabinat Nice work! I'll test it for Spanish, also I'll change the .Net client to add this awesome feature 👍

carlfm01 commented 5 years ago

> The only ones that weren't quite right were for words DeepSpeech mistranscribed

I can confirm the same behavior for Spanish, for the output below with WER 0% all the timings are correct.

file duration: 8
word: que, timestep: 50, time: 0.985222
word: cerraban, timestep: 55, time: 1.08374
word: perspectiva, timestep: 79, time: 1.55665
word: al, timestep: 147, time: 2.89655
word: otro, timestep: 156, time: 3.07389
word: lado, timestep: 164, time: 3.23153
word: de, timestep: 174, time: 3.42857
word: los, timestep: 179, time: 3.52709
word: cristales, timestep: 185, time: 3.64532
word: ligeramente, timestep: 217, time: 4.27586
word: turbios, timestep: 252, time: 4.96552
word: por, timestep: 272, time: 5.35961
word: la, timestep: 280, time: 5.51724
word: humedad, timestep: 285, time: 5.61576
word: exterior, timestep: 299, time: 5.89163
dabinat commented 5 years ago

I have a new commit up with JSON output.

The output from the -e flag now looks like this:

{
    "file": {"duration":"20"},
    "words": [
        {"time":"0.020000", "word":"and"},
        {"time":"0.360000", "word":"now"},
        {"time":"0.580000", "word":"i"},
        {"time":"0.760000", "word":"am"},
        {"time":"1.080000", "word":"joined"},
        {"time":"1.420000", "word":"life"},
        {"time":"1.760000", "word":"in"},
        {"time":"1.880000", "word":"the"},
        {"time":"2.000000", "word":"studio"},
        {"time":"2.440000", "word":"by"},
        {"time":"2.580000", "word":"the"},
        {"time":"2.720000", "word":"prime"},
        {"time":"2.960000", "word":"minister"},
        {"time":"3.380000", "word":"teresa"},
        {"time":"3.880000", "word":"make"},
        {"time":"4.080000", "word":"good"},
        {"time":"4.220000", "word":"morning"},
        {"time":"4.480000", "word":"from"},
        {"time":"4.860000", "word":"lin"},
        {"time":"5.060000", "word":"enter"},
        {"time":"5.840000", "word":"and"},
        {"time":"6.140000", "word":"can"},
        {"time":"6.320000", "word":"we"},
        {"time":"6.440000", "word":"agree"},
        {"time":"6.760000", "word":"to"},
        {"time":"6.900000", "word":"start"},
        {"time":"7.160000", "word":"with"},
        {"time":"7.360000", "word":"it"},
        {"time":"7.580000", "word":"the"},
        {"time":"7.720000", "word":"one"},
        {"time":"7.960000", "word":"thing"},
        {"time":"8.099999", "word":"the"},
        {"time":"8.300000", "word":"voters"},
        {"time":"8.679999", "word":"deserving"},
        {"time":"9.280000", "word":"what"},
        {"time":"9.440000", "word":"you"},
        {"time":"9.599999", "word":"yourself"},
        {"time":"10.059999", "word":"he"},
        {"time":"10.200000", "word":"said"},
        {"time":"10.320000", "word":"is"},
        {"time":"10.440000", "word":"going"},
        {"time":"10.580000", "word":"to"},
        {"time":"10.639999", "word":"be"},
        {"time":"10.700000", "word":"a"},
        {"time":"10.880000", "word":"very"},
        {"time":"11.580000", "word":"very"},
        {"time":"12.080000", "word":"important"},
        {"time":"12.679999", "word":"election"},
        {"time":"13.360000", "word":"is"},
        {"time":"13.580000", "word":"no"},
        {"time":"13.860000", "word":"son"},
        {"time":"14.139999", "word":"to"},
        {"time":"14.240000", "word":"bite"},
        {"time":"15.320000", "word":"militis"},
        {"time":"15.900000", "word":"absolutely"},
        {"time":"16.480000", "word":"crucial"},
        {"time":"17.340000", "word":"because"},
        {"time":"17.619999", "word":"this"},
        {"time":"17.820000", "word":"is"},
        {"time":"18.279999", "word":"as"},
        {"time":"19.100000", "word":"i"},
        {"time":"19.240000", "word":"think"},
        {"time":"19.400000", "word":"the"},
        {"time":"19.559999", "word":"most"},
        {"time":"19.799999", "word":"simple"}
    ]
}

metadata_from_output now returns a vector of mapped strings (i.e. an array of dictionaries), which makes it pretty easy for other apps to use via the API, as you can just call object["word"] to get the word or any other key.

json_output_from_metadata prints it as pretty-printed JSON. It iterates through whatever keys are present instead of hard-coding them.

This means additional metadata keys (like word duration or confidence) can be added in the future without breaking any other code and without modifying the JSON output function. That's why I chose this approach instead of using structs or objects.
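The extensibility argument can be sketched like this (a Python stand-in for the C++ implementation, illustrative only): because the output function serializes whatever keys each word dictionary contains, adding a new key changes the JSON without touching the printer:

```python
import json

def json_output_from_metadata(duration, words):
    """Pretty-print the metadata as JSON. Each word is a dict of
    string keys to string values; no key names are hard-coded here."""
    return json.dumps({"file": {"duration": duration}, "words": words}, indent=4)

words = [{"time": "0.020000", "word": "and"}]
print(json_output_from_metadata("20", words))

# Adding a hypothetical new key requires no changes to the printer:
words[0]["confidence"] = "0.93"
print(json_output_from_metadata("20", words))
```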

One question: is pretty-printing the JSON output going to cause anyone problems?

lissyx commented 5 years ago

@dabinat Can you open a PR? And make sure you separate adding that to the API from exposing / using it in clients.

dabinat commented 5 years ago

@lissyx Done - see #1892 and #1893

lissyx commented 5 years ago

@dabinat Thanks, though you could have made it only one PR with two commits :)

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.