Closed: i7p9h9 closed this issue 5 years ago
Hi,
For some smaller subsets it is possible to say that they come from a single speaker; this is mostly true only for the audiobooks.
But as a general rule, we aimed to gather as much diverse data as possible.
There is currently a "speaker" field in the metadata, but for now it is mostly useless.
I will think about marking some parts of the dataset as reliably having the same speaker.
On April 27, 2019 6:02:11 PM GMT+03:00, i7p9h9 notifications@github.com wrote:
Is it possible to extract speaker IDs from your dataset for use in speaker recognition tasks?
We are planning to share a much larger dataset based on audiobooks. Please PM me (Telegram) and I will share a private metadata file from which you could extract the data you need. We are not planning to share this data publicly yet.
It would be great if the data came with dedicated directories for each speaker, e.g.
<dataset-id>/<speaker-id>/<sample-id>.wav
<dataset-id>/<speaker-id>/<sample-id>.txt
because it makes sense to separate speakers during training and testing, not just for speaker recognition but also for STT tasks.
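With a layout like the one above, a speaker-disjoint train/test split is easy to build. A minimal sketch, assuming paths follow that hypothetical <dataset-id>/<speaker-id>/<sample-id>.wav convention (this is not the actual open_stt layout):

```python
import random
from collections import defaultdict
from pathlib import Path

def speaker_disjoint_split(wav_paths, test_fraction=0.1, seed=0):
    """Split sample paths so that no speaker appears in both train and test.

    Assumes each path looks like <dataset-id>/<speaker-id>/<sample-id>.wav,
    i.e. the speaker ID is the immediate parent directory of the file.
    """
    by_speaker = defaultdict(list)
    for p in wav_paths:
        by_speaker[Path(p).parent.name].append(p)  # speaker ID = parent dir
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)          # reproducible shuffle
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train = [p for s in speakers[n_test:] for p in by_speaker[s]]
    test = [p for s in test_speakers for p in by_speaker[s]]
    return train, test
```

Splitting by speaker rather than by utterance is what prevents the model from scoring well on the test set simply by recognizing familiar voices.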
open_stt is an awesome dataset nevertheless. Are you planning on adding more languages?
Hi!
Doing exactly this is unfortunately not feasible due to the nature of the dataset (zero money invested into annotation).
But we could privately share speakers as metadata for a very limited subset of the data, if this helps. Mostly books.
I see. Well, my workaround here is throwing everything uncertain into the train set and testing on data which has speaker separation. E.g. the Common Voice dataset might be reliable enough.
If I may ask, what word error rate (WER) did you get on the entire open_stt dataset? I am currently not far below 40% (using ~3000 h of the data), which is actually not as good as I expected for so many hours of speech. :)
> Well, my workaround here is throwing everything uncertain into the train set and test on data which has speaker separation.
We have a small subset of the data (15 hours) manually annotated; we will be posting it soon.
> What word error rate (WER) did you get on the entire open_stt dataset?
Sorry for the late reply, but please refer to tickets #5 #7. Obviously these are not the best / latest models, but you can see some patterns in the distributions. You will see that the annotation quality is not consistent across the whole dataset, so it has been / will be distilled.
There have been reports that if you train ESPnet without the badly annotated data, you get a much better result.
Weeding out the bad data will be the foremost focus of our future work.
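One common way to sketch this kind of distillation (this is a hypothetical illustration, not the actual open_stt pipeline, and the transcribe function and threshold are assumptions): run an already-trained model over each sample and drop the samples whose annotation disagrees too strongly with the model's output.

```python
import difflib

def keep_well_annotated(samples, transcribe, min_ratio=0.7):
    """Filter out samples whose annotation disagrees with a model transcript.

    `samples` is a list of (audio, text) pairs; `transcribe` is a hypothetical
    ASR function mapping audio to a transcript; `min_ratio` is an arbitrary
    similarity threshold on difflib's 0..1 sequence-match ratio.
    """
    kept = []
    for audio, text in samples:
        hyp = transcribe(audio)
        # ratio() = 2 * matched_chars / total_chars, 1.0 means identical
        ratio = difflib.SequenceMatcher(None, text, hyp).ratio()
        if ratio >= min_ratio:
            kept.append((audio, text))
    return kept
```

In practice the threshold would be tuned against a manually checked subset, and a character-level error rate is often used instead of difflib's ratio; the filtering idea is the same.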
> Sorry for the late reply, but please refer to tickets #5 #7
@snakers4 no worries :)
Thanks for sharing that information. I will take a look at those issues.
Thanks for doing all this great work and providing such an easy-to-use dataset!
Hi Stefan, you mentioned that you have trained an ASR system on Russian Common Voice. Could you share your latest WER on it? I don't know much about Russian, and I trained a Russian ASR system on the roughly 60 h of Russian Common Voice data with a Kaldi chain model; the WER is about 40%, even with an LM built from the test-set text. Is that a normal result? I haven't found any benchmark on Russian Common Voice. Also, I noticed you often evaluate Russian ASR with CER; is that the more common metric for Russian? Thanks a lot!
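For context on the WER/CER distinction raised above: both are the Levenshtein edit distance between hypothesis and reference divided by the reference length, computed over words for WER and over characters for CER. For a morphologically rich language like Russian, a single wrong inflection character makes the whole word count as an error for WER, which is one reason CER is often reported alongside it. A minimal self-contained illustration (the example sentence is made up):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (words or chars)."""
    dp = list(range(len(hyp) + 1))          # dp[j] = distance(ref[:i], hyp[:j])
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution or match
            prev = cur
    return dp[-1]

# One wrong inflection character: "кошка" (cat, nom.) vs "кошки" (gen.).
ref, hyp = "кошка спит", "кошки спит"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # 1/2 = 0.5
cer = edit_distance(list(ref), list(hyp)) / len(ref)              # 1/10 = 0.1
```

The same error costs 50% WER but only 10% CER, so the two metrics can paint very different pictures of the same system.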