lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
929 stars · 212 forks

[proposal+infoshare] apt/pip-like manager for speech dataset #128

Open dophist opened 3 years ago

dophist commented 3 years ago

Recently I came across an interesting project: https://github.com/activeloopai/Hub. It provides AI dataset (un)install/versioning/upgrade, much like what we do with pip/yum/apt for traditional software packages.

This is closely related to our current solution (those "download_and_untar_xxx.py" scripts + OpenSLR hosting), and I believe there is something we can learn from it to enhance lhotse's usability.

The hosting part may shift from OpenSLR to blob/object storage services (AWS/Azure/Aliyun and so on), which would reduce the effort of maintaining the OpenSLR data hosting server and protect it from DDoS-like abuse (I've heard some careless person in China spawned hundreds of download threads against OpenSLR and pulled down the whole kaldi website).

Say an end-user of K2 needs only one line of Python or a single bash command to get LibriSpeech, something like:

lhotse.dataset_install('librispeech', 'v1.0', 'AWS', '/home/kaldi/database_warehouse/')
vinnie-the-pooh@ubuntu$ lhotse install librispeech

This is conceptually much friendlier to beginners, and behind the scenes the essential work is pretty much the same as the download_and_untar_xxx.py scripts. With a bootstrapped design, I believe contributors can make this better and easier over time.
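
A rough sketch of what such a one-liner might do under the hood. The `dataset_install()` helper, the registry entry, the mirror name, and the checksum below are hypothetical placeholders, not an existing lhotse API:

```python
import hashlib
import tarfile
import urllib.request
from pathlib import Path

# Illustrative registry: (corpus name, version) -> mirror URLs and expected md5.
# The URL points at LibriSpeech's train-clean-100 on OpenSLR; the md5 is a placeholder.
DATASET_REGISTRY = {
    ("librispeech", "v1.0"): {
        "urls": {"openslr": "https://www.openslr.org/resources/12/train-clean-100.tar.gz"},
        "md5": "<expected-md5-here>",
    },
}


def dataset_install(name: str, version: str, mirror: str, target_dir: str) -> Path:
    """Download, verify, and unpack one corpus archive - essentially what the
    existing download_and_untar_xxx.py scripts do, behind a single call."""
    entry = DATASET_REGISTRY[(name, version)]
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    url = entry["urls"][mirror]
    archive = target / Path(url).name
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    # Verify the archive before unpacking (chunked, to handle large files).
    md5 = hashlib.md5()
    with open(archive, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != entry["md5"]:
        raise ValueError(f"Checksum mismatch for {archive}")
    with tarfile.open(archive) as tar:
        tar.extractall(path=target)
    return target


# dataset_install("librispeech", "v1.0", "openslr", "/home/kaldi/database_warehouse/")
```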

Another benefit of doing this is that datasets MAY evolve; bringing centralized versioning and clear install/upgrade paths to speech dataset management may be a good thing in the future.

This functionality may also apply to other speech-related resources such as lexicons, vocabularies, text normalization rewrite grammars, standard benchmark test sets, and so on.

I know you guys are busy preparing the next-gen kaldi release, so the proposal is not urgent at all, just a thought to share :)

danpovey commented 3 years ago

Interesting...

> The hosting part may shift from OpenSLR to blob/object storage services (AWS/Azure/Aliyun and so on), which would reduce the effort of maintaining the OpenSLR data hosting server...

The reason we chose DigitalOcean is that it's way cheaper than AWS, Azure and Google Cloud; the cost for egress per gigabyte is much less. We (well, Yenda takes care of this) already spend over $1000 per month, so we can't really afford those other options.

> Another benefit of doing this is that datasets MAY evolve; bringing centralized versioning and clear install/upgrade paths to speech dataset management may be a good thing in the future.

Mm, the versioning is worth thinking about.

jtrmal commented 3 years ago

I agree, it might be interesting. y.

jtrmal commented 3 years ago

I'm actually thinking about using DVC for the versioning on OpenSLR, but I haven't bitten the bullet yet.
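
For illustration only: if OpenSLR corpora were tracked in a DVC repository, a consumer could pin a specific tagged revision of a corpus file. The repository URL and file paths below are hypothetical placeholders; only the dvc.api calls are real DVC features.

```python
# Hypothetical illustration: fetching a DVC-tracked corpus file at a pinned revision.
# The repository URL and paths are placeholders; dvc.api is DVC's real Python API.
import dvc.api

# Resolve the storage URL of a file as it existed at tag "v1.0" of the (hypothetical) repo.
url = dvc.api.get_url(
    path="corpora/librispeech/train-clean-100.tar.gz",  # placeholder path
    repo="https://github.com/openslr/corpora-dvc",      # placeholder repo
    rev="v1.0",
)
print(url)

# Or stream a small tracked file (e.g. a manifest of checksums) directly at that revision.
with dvc.api.open(
    "corpora/librispeech/md5sums.txt",                  # placeholder path
    repo="https://github.com/openslr/corpora-dvc",      # placeholder repo
    rev="v1.0",
) as f:
    print(f.read())
```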

pzelasko commented 3 years ago

Sounds good to me too! @jtrmal @danpovey since you guys have more experience in hosting, it'd be good if you took care of the storage backend; I'm happy to provide an API in Lhotse that will work with that.

jtrmal commented 3 years ago

Maybe we're overthinking it, and exporting an md5sum for each of the packages would be enough? Perhaps we don't really care about keeping the version history, just about finding out whether we have the latest data, where "latest" means our md5sums equal the server md5sums... y.

jtrmal commented 3 years ago

I went ahead and created md5sums for all corpora shared by OpenSLR; it's always at https://www.openslr.org/resources//checksum.md5. What do you guys think? y.
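
A minimal client-side sketch of the "are we up to date" check, assuming checksum.md5 sits under each resource's directory (resource 12 is LibriSpeech on OpenSLR) and that it uses the standard md5sum output format of "<md5>  <filename>" lines:

```python
# Sketch: compare local archive md5s against OpenSLR's published checksum.md5.
# Assumes the standard `md5sum` format: "<md5>  <filename>" per line.
import hashlib
import urllib.request
from pathlib import Path


def fetch_server_md5s(resource_id: int) -> dict:
    """Download checksum.md5 for one OpenSLR resource and parse it."""
    url = f"https://www.openslr.org/resources/{resource_id}/checksum.md5"
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    checksums = {}
    for line in text.splitlines():
        if line.strip():
            md5, filename = line.split(maxsplit=1)
            # Strip md5sum's binary-mode marker ("*") if present.
            checksums[filename.strip().lstrip("*")] = md5
    return checksums


def local_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 of a (possibly large) local file in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_up_to_date(local_dir: str, resource_id: int) -> bool:
    """"Latest" simply means every local archive matches the server md5."""
    server = fetch_server_md5s(resource_id)
    local = Path(local_dir)
    return all(
        (local / name).exists() and local_md5(local / name) == md5
        for name, md5 in server.items()
    )


# Example (resource 12 is LibriSpeech):
# print(is_up_to_date("/home/kaldi/database_warehouse/librispeech", 12))
```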

pzelasko commented 3 years ago

TBH I'm not sure if having the md5 checksum helps that much by itself. I like the version history idea because it helps to replicate experiments that were performed on an older version of the dataset.

jtrmal commented 3 years ago

I understand, but it's very impractical (and expensive for OpenSLR) to keep all versions of the corpora. Y.

jtrmal commented 3 years ago

I guess keeping the data is not as big an issue as enabling its download. Y.

danpovey commented 3 years ago

Yeah I think only a tiny portion of the cost is the storage.

I don't have strong opinions about these things--certainly the md5sum can't hurt.

galv commented 3 years ago

> I like the version history idea because it helps to replicate experiments that were performed on an older version of the dataset.

One problem with the history idea is allowing "the right to be forgotten" or "data takedown". When incorporating CC0 or CC-BY data into a dataset, it is becoming considered appropriate in the community (from what I understand) to remove data that was (1) inappropriately licensed or (2) desired by the original creator not to be included in the dataset. In those cases, it is either legally or ethically wrong to maintain histories of a corpus. Mozilla's Common Voice actually incorporates this idea, removing some inappropriate data from their releases every 6 months.

For The People's Speech, a sizable and diverse (~10TB) English corpus that some people here are aware of and that is coming "real soon now", the plan is to have separate mirrors in (to start with) the USA and China, with hosting sponsored by some tech companies. They are not very hard to convince to do that, given that it helps their work to have a legally "clean" public dataset available without having to take on the legal liabilities of creating it themselves (due to point (1) above).

danpovey commented 3 years ago

Mm, we'd have to make some exceptions. Thanks for the info.
