lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
929 stars · 212 forks

[proposal+infoshare] apt/pip-like manager for speech dataset #128

Open dophist opened 3 years ago

dophist commented 3 years ago

Recently I came across an interesting project: https://github.com/activeloopai/Hub. It provides AI dataset (un)install/versioning/upgrade, much like what we do with pip/yum/apt for traditional software packages.

This is closely related to our current solution (those "download_and_untar_xxx.py" scripts + OpenSLR hosting), and I believe there is something we can learn from it to enhance lhotse's usability.

The hosting part may shift from OpenSLR to blob/object storage services (AWS/Azure/Aliyun and so on), which would reduce the effort of maintaining the OpenSLR data hosting server and protect it from DDoS-like abuse (I've heard some careless person in China spawned hundreds of download threads against OpenSLR and pulled down the whole kaldi website).

Say an end-user of K2 needs only one line of Python or a single bash command to get LibriSpeech, something like:

lhotse.dataset_install('librispeech', 'v1.0', 'AWS', '/home/kaldi/database_warehouse/')
vinnie-the-pooh@ubuntu$ lhotse install librispeech

This is conceptually much friendlier to beginners, and behind the scenes the essential work is pretty much the same as the download_and_untar_xxx.py scripts. With a bootstrapped design, I believe contributors can make this better and easier over time.
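
A rough sketch of what such a one-liner might do under the hood. The `dataset_install()` helper, the registry entry, the mirror name, and the checksum below are hypothetical placeholders, not an existing lhotse API:

```python
import hashlib
import tarfile
import urllib.request
from pathlib import Path

# Illustrative registry: (corpus name, version) -> mirror URLs and expected md5.
# The URL points at LibriSpeech's train-clean-100 on OpenSLR; the md5 is a placeholder.
DATASET_REGISTRY = {
    ("librispeech", "v1.0"): {
        "urls": {"openslr": "https://www.openslr.org/resources/12/train-clean-100.tar.gz"},
        "md5": "<expected-md5-here>",
    },
}


def dataset_install(name: str, version: str, mirror: str, target_dir: str) -> Path:
    """Download, verify, and unpack one corpus archive - essentially what the
    existing download_and_untar_xxx.py scripts do, behind a single call."""
    entry = DATASET_REGISTRY[(name, version)]
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    url = entry["urls"][mirror]
    archive = target / Path(url).name
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    # Verify the archive before unpacking (chunked, to handle large files).
    md5 = hashlib.md5()
    with open(archive, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != entry["md5"]:
        raise ValueError(f"Checksum mismatch for {archive}")
    with tarfile.open(archive) as tar:
        tar.extractall(path=target)
    return target


# dataset_install("librispeech", "v1.0", "openslr", "/home/kaldi/database_warehouse/")
```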

Another benefit of doing this is that datasets MAY evolve; bringing centralized versioning and clear install/upgrade paths to speech dataset management may be a good thing in the future.

This functionality may also apply to other speech-related resources such as lexicons, vocabularies, text normalization rewrite grammars, standard benchmark test sets, and so on.

I know you guys are busy preparing the next-gen kaldi release, so the proposal is not urgent at all, just a thought to share :)

danpovey commented 3 years ago

Interesting...

> The hosting part may shift from OpenSLR to blob/object storage services (AWS/Azure/Aliyun and so on), which would reduce the effort of maintaining the OpenSLR data hosting server...

The reason we chose DigitalOcean is that it's way cheaper than AWS, Azure and Google Cloud; the cost for egress per gigabyte is much less. We (well, Yenda takes care of this) already spend over $1000 per month, so we can't really afford those other options.

> Another benefit of doing this is that datasets MAY evolve; bringing centralized versioning and clear install/upgrade paths to speech dataset management may be a good thing in the future.

Mm, the versioning is worth thinking about.

jtrmal commented 3 years ago

I agree, it might be interesting. y.

jtrmal commented 3 years ago

I'm actually thinking about using DVC for the versioning on OpenSLR, but I haven't bitten the bullet yet.
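
For illustration only: if OpenSLR corpora were tracked in a DVC repository, a consumer could pin a specific tagged revision of a corpus file. The repository URL and file paths below are hypothetical placeholders; only the dvc.api calls are real DVC features.

```python
# Hypothetical illustration: fetching a DVC-tracked corpus file at a pinned revision.
# The repository URL and paths are placeholders; dvc.api is DVC's real Python API.
import dvc.api

# Resolve the storage URL of a file as it existed at tag "v1.0" of the (hypothetical) repo.
url = dvc.api.get_url(
    path="corpora/librispeech/train-clean-100.tar.gz",  # placeholder path
    repo="https://github.com/openslr/corpora-dvc",      # placeholder repo
    rev="v1.0",
)
print(url)

# Or stream a small tracked file (e.g. a manifest of checksums) directly at that revision.
with dvc.api.open(
    "corpora/librispeech/md5sums.txt",                  # placeholder path
    repo="https://github.com/openslr/corpora-dvc",      # placeholder repo
    rev="v1.0",
) as f:
    print(f.read())
```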

pzelasko commented 3 years ago

Sounds good to me too! @jtrmal @danpovey since you guys have more experience in hosting, it'd be good if you took care of the storage backend; I'm happy to provide an API in Lhotse that will work with that.

jtrmal commented 3 years ago

Maybe we're overthinking it, and exporting an md5sum for each of the packages would be enough? Perhaps we don't really care about keeping the version history, just about finding out whether we have the latest data, where "latest" means our md5sums equal the server md5sums... y.

jtrmal commented 3 years ago

I went ahead and created md5sums for all corpora shared by OpenSLR; it's always at https://www.openslr.org/resources//checksum.md5. What do you guys think? y.
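
A minimal client-side sketch of the "are we up to date" check, assuming checksum.md5 sits under each resource's directory (resource 12 is LibriSpeech on OpenSLR) and that it uses the standard md5sum output format of "<md5>  <filename>" lines:

```python
# Sketch: compare local archive md5s against OpenSLR's published checksum.md5.
# Assumes the standard `md5sum` format: "<md5>  <filename>" per line.
import hashlib
import urllib.request
from pathlib import Path


def fetch_server_md5s(resource_id: int) -> dict:
    """Download checksum.md5 for one OpenSLR resource and parse it."""
    url = f"https://www.openslr.org/resources/{resource_id}/checksum.md5"
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    checksums = {}
    for line in text.splitlines():
        if line.strip():
            md5, filename = line.split(maxsplit=1)
            # Strip md5sum's binary-mode marker ("*") if present.
            checksums[filename.strip().lstrip("*")] = md5
    return checksums


def local_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 of a (possibly large) local file in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_up_to_date(local_dir: str, resource_id: int) -> bool:
    """"Latest" simply means every local archive matches the server md5."""
    server = fetch_server_md5s(resource_id)
    local = Path(local_dir)
    return all(
        (local / name).exists() and local_md5(local / name) == md5
        for name, md5 in server.items()
    )


# Example (resource 12 is LibriSpeech):
# print(is_up_to_date("/home/kaldi/database_warehouse/librispeech", 12))
```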

pzelasko commented 3 years ago

TBH I'm not sure if having the md5 checksum helps that much by itself. I like the version history idea because it helps to replicate experiments that were performed on an older version of the dataset.

jtrmal commented 3 years ago

I understand, but it's very impractical (and expensive for OpenSLR) to keep all versions of the corpora. Y.

jtrmal commented 3 years ago

I guess keeping the data is not as big an issue as enabling its download. Y.

danpovey commented 3 years ago

Yeah I think only a tiny portion of the cost is the storage.

I don't have strong opinions about these things--certainly the md5sum can't hurt.

galv commented 3 years ago

> I like the version history idea because it helps to replicate experiments that were performed on an older version of the dataset.

One problem with the history idea is allowing "the right to be forgotten" or "data takedown". When incorporating CC0 or CC-BY data into a dataset, it is becoming considered appropriate in the community (from what I understand) to remove data that was (1) inappropriately licensed or (2) desired by the original creator not to be included in the dataset. In those cases, it is either legally or ethically wrong to maintain histories of a corpus. Mozilla's Common Voice actually incorporates this idea, removing some inappropriate data from their releases every 6 months.

For The People's Speech, a sizable and diverse (~10TB) English corpus that some people here are aware of and that is coming "real soon now", the plan is to have separate mirrors in (to start with) the USA and China, with hosting sponsored by some tech companies. They are not very hard to convince to do that, given that it helps their work to have a legally "clean" public dataset available without having to take on the legal liabilities of creating it themselves (due to point (1) above).

danpovey commented 3 years ago

Mm, we'd have to make some exceptions. Thanks for the info.
