Word Boundaries and Timestamp Format

speechcatcher-asr / speechcatcher

MIT License

Hey there!

Great tool, and the accuracy is already quite high. I wonder if it would be possible to detect word boundaries, that is, the start and end timestamp of each word.

This goes hand in hand with an output format that can carry the start and end of each word.

I'd suggest using the format that Vosk uses, as many tools such as videogrep rely on its JSON output.

The format is shown here: https://github.com/alphacep/vosk-api/blob/aba84973b188bac259b2914cbb1455c6c68dd9b6/src/vosk_api.h#L174

and here:

{
    "content": "was wo du",
    "start": 5.46,
    "end": 7.02,
    "words": [
        {
            "conf": 0.674979,
            "end": 5.82,
            "start": 5.46,
            "word": "was"
        },
        {
            "conf": 0.469191,
            "end": 6.51,
            "start": 6.36,
            "word": "wo"
        },
        {
            "conf": 1.0,
            "end": 7.02,
            "start": 6.78,
            "word": "du"
        }
    ]
}

This would make it possible to detect silences between words or sentences and would give better word-level boundaries.
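
To illustrate the silence detection mentioned above, here is a minimal sketch that works on a `words` list shaped like the example. The function name `find_silences` and the `min_gap` threshold are illustrative assumptions, not part of Vosk, videogrep, or Speechcatcher.

```python
def find_silences(words, min_gap=0.2):
    """Return (start, end) gaps between consecutive words lasting at least min_gap seconds.

    `words` is assumed to be sorted by time and shaped like the "words"
    list in the example above; `min_gap` is an arbitrary illustrative threshold.
    """
    silences = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            silences.append((prev["end"], nxt["start"]))
    return silences


words = [
    {"conf": 0.674979, "end": 5.82, "start": 5.46, "word": "was"},
    {"conf": 0.469191, "end": 6.51, "start": 6.36, "word": "wo"},
    {"conf": 1.0, "end": 7.02, "start": 6.78, "word": "du"},
]

silences = find_silences(words)
print(silences)
```

With the default threshold, both pauses in the example (after "was" and after "wo") are reported; a larger `min_gap` keeps only the longer one.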

Hi!

Thanks, PRs welcome ;)

Note that the most recent version of Speechcatcher outputs timestamps of subwords/tokens in the JSON output. Since it's also fully end-to-end ASR, punctuation is part of the token vocabulary. So you would need to start from the full transcription output, segment it into words with a word tokenizer, and then backtrack the word timestamps from the subwords.
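
A rough sketch of that backtracking step could look like the following. It does not use Speechcatcher's actual JSON schema: the `(text, start, end)` token tuples, the SentencePiece-style `▁` word-start marker, the example timings, and the simple `\w+` regex standing in for a real word tokenizer are all assumptions made for illustration.

```python
import re


def words_from_tokens(tokens):
    """Map subword token timestamps back onto whole words.

    `tokens` is a hypothetical list of (text, start, end) tuples whose
    texts concatenate to the transcript once the assumed word-start
    marker U+2581 is replaced by a space.
    """
    text = ""
    spans = []  # (char_begin, char_end, start_time, end_time) per token
    for piece, start, end in tokens:
        piece = piece.replace("\u2581", " ")
        spans.append((len(text), len(text) + len(piece), start, end))
        text += piece

    words = []
    # A plain regex stands in for a proper word tokenizer; punctuation is skipped.
    for match in re.finditer(r"\w+", text):
        overlapping = [s for s in spans if s[0] < match.end() and s[1] > match.start()]
        words.append({
            "word": match.group(),
            "start": overlapping[0][2],  # start of the first token touching the word
            "end": overlapping[-1][3],   # end of the last token touching the word
        })
    return words


# Made-up token timings, purely for illustration.
tokens = [
    ("\u2581time", 0.00, 0.30),
    ("stamp", 0.30, 0.62),
    ("s", 0.62, 0.70),
    ("\u2581matter", 0.90, 1.40),
    (".", 1.40, 1.45),
]

for word in words_from_tokens(tokens):
    print(word)
```

If a token mixes letters and punctuation, the word inherits that token's full time span, so boundaries obtained this way are only as fine-grained as the tokens themselves.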

Since videogrep also supports .vtt files, just a quick note that subtitle2go supports the speechcatcher engine and can export .vtt subtitles. It builds appropriate segments from the transcription output and backtracks the token timestamps to the segment boundaries: https://github.com/lecture2go/subtitle2go/blob/master/speechcatcher_decoder.py
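
For readers unfamiliar with .vtt, here is a minimal sketch of turning timed segments into WebVTT cues. This is not subtitle2go's code; the helper names `vtt_timestamp` and `to_vtt` are made up, and the segment dictionaries are assumed to have the `content`/`start`/`end` keys used in the example earlier in this thread.

```python
def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(segments):
    """Render segments (dicts with content/start/end) as WebVTT text."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["content"])
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


segments = [{"content": "was wo du", "start": 5.46, "end": 7.02}]
print(to_vtt(segments))
```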

Word Boundaries and Timestamp Format #4