What is the information of your dataset in detail?

shenasa-ai / speech2text

A Deep-Learning-Based Persian Speech Recognition System

MIT License


What is the information of your dataset in detail? #8

Open mohsenoon opened 1 year ago

mohsenoon commented 1 year ago

Apart from data size, please publish other information such as duration, number of files, average length of files, quality, number of speakers, source, and method of collection.

Also, since these transcriptions are Google's speech-to-text output, it would be better to state this explicitly and report the approximate error rate. The raw outputs of a speech-to-text model can be used, with some care, to train other models, but they certainly cannot be presented as a speech-to-text dataset.

masoudMZB commented 1 year ago

Hi, [Temporary note: I will keep updating this answer as I am able to. This line will be removed once the to-do list is complete.]

Here is my to-do list for updating and editing the repo as soon as possible:

TODO

[x] Pin Issue
[ ] Edit the README to state the source of the data
[ ] Statistical and other information about the data
- [ ] number of files
- [x] duration
- [x] average length of files
- [ ] quality
- [ ] number of speakers
- [ ] method of collection
[ ] Write about the Google output results
[ ] Some suggestions on how to use this data for different tasks (STT, etc.)

Thanks to @mohsenoon for analysing the data.

masoudMZB commented 1 year ago

Update 1 : 3/6/2023

New stats for the data are ready. They are not 100% accurate, but they are accurate enough that you can rely on these numbers:

- Total duration: 1697.1423399942473 hours
- Total size: 195510797567.33728 bytes
- Mean duration: 4.834937608942991 seconds
- Mean size: 154718.00348617567 bytes
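As a rough consistency check on the figures quoted above, dividing each total by its per-file mean gives the number of files the stats imply. This is only a sketch: it assumes the two means are per-file averages taken over the same set of files, which the comment does not state explicitly, and the helper name `implied_file_count` is made up for illustration.

```python
# Figures copied from the stats above (reported there as approximate).
TOTAL_HOURS = 1697.1423399942473
TOTAL_BYTES = 195510797567.33728
MEAN_DURATION_S = 4.834937608942991
MEAN_SIZE_BYTES = 154718.00348617567


def implied_file_count(total, mean):
    """Number of files implied by a total and a per-file mean (assumed)."""
    return total / mean


# Convert hours to seconds so the units match the mean duration.
files_from_duration = implied_file_count(TOTAL_HOURS * 3600, MEAN_DURATION_S)
files_from_size = implied_file_count(TOTAL_BYTES, MEAN_SIZE_BYTES)

print(f"files implied by duration: {files_from_duration:,.0f}")
print(f"files implied by size:     {files_from_size:,.0f}")
```

Both ratios come out at roughly 1.26 million files, so the four numbers are at least consistent with one another. That says nothing about whether the totals count duplicate entries.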

hamjam commented 1 year ago

Hi Masoud, I have downloaded all parts of version 2, but after removing duplicated metadata from CSVs, the remaining dataset consists of only 625h of audio clips, not 1697h. What do you think is the problem?
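The deduplicated total described above can be reproduced along these lines. This is a minimal sketch, not the script used in the thread: the column names `path` and `duration` (in seconds) are assumptions about the metadata CSVs, and the two in-memory CSV strings are toy data made up for illustration.

```python
import csv
import io


def total_hours_deduplicated(csv_texts, key="path", duration_col="duration"):
    """Sum durations across CSV parts, counting each key only once.

    `key` and `duration_col` are hypothetical column names; adjust them
    to whatever the real metadata files use.
    """
    seen = set()
    total_seconds = 0.0
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            if row[key] in seen:
                continue  # duplicate metadata entry, skip it
            seen.add(row[key])
            total_seconds += float(row[duration_col])
    return total_seconds / 3600, len(seen)


# Toy data: b.wav appears in both parts and must be counted once.
part_a = "path,duration\na.wav,3600\nb.wav,1800\n"
part_b = "path,duration\nb.wav,1800\nc.wav,1800\n"

hours, n_files = total_hours_deduplicated([part_a, part_b])
print(hours, n_files)
```

Running the same function with and without the duplicate check on the real CSVs would show how much of the gap between the two totals comes from repeated entries.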

masoudMZB commented 1 year ago

@hamjam Hi, sorry for the late response. Could you also add the duration of data v1 and report the total duration when both versions are combined?

Then, if my information is wrong, please send a pull request for the README file to correct it.

Thanks for your attention.