tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Fix CommonVoice dataset for Speech Recognition problem #1852

Open RegaliaXYZ opened 4 years ago

RegaliaXYZ commented 4 years ago

Fixed the Common Voice data generator by adding a language-code flag to datagen.py (--language="en"; if not specified, it defaults to English) and dynamically downloading the dataset for the selected language.
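For illustration, here is a minimal sketch of how a language flag could select the archive to download. It is not the code in this PR: the use of argparse, the function names, and the URL template are all assumptions made up for this example, and the real Common Voice download location is not given in this thread.

```python
import argparse

# Hypothetical URL template (assumption): the actual download location used by
# the data generator is not stated in this thread.
_URL_TEMPLATE = "https://example.com/common-voice/{lang}.tar.gz"


def build_parser():
    """Build a parser with a --language flag that defaults to English."""
    parser = argparse.ArgumentParser(description="Common Voice datagen sketch")
    parser.add_argument(
        "--language",
        default="en",
        help="Language code of the Common Voice dataset to download.",
    )
    return parser


def dataset_url(language):
    """Return the (hypothetical) archive URL for the given language code."""
    # Accept codes such as "en" or "zh-CN"; reject empty or malformed input.
    if not language or not language.replace("-", "").isalnum():
        raise ValueError("Invalid language code: %r" % (language,))
    return _URL_TEMPLATE.format(lang=language)
```

With no arguments the parser yields "en", so existing invocations would keep downloading the English dataset.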

I also had to rework the data unpacking logic, since Mozilla changed their folder structure.

I also removed the Common Voice sub-problems (Noisy, Clean, FullTestClean), since all the previous .tsv files were merged (there are no more other-train, other-test, etc.).
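As a rough sketch of what reading one of the merged .tsv files could look like (again, not the code in this PR): the column names "path" and "sentence" are assumptions about the file layout, and read_transcripts is a hypothetical helper name.

```python
import csv
import io


def read_transcripts(tsv_file):
    """Yield (audio_path, sentence) pairs from an open TSV file object.

    Assumes a header row containing "path" and "sentence" columns; this is
    an assumption about the merged files, not something stated in the thread.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["path"], row["sentence"]


# Usage with an in-memory sample instead of a real dataset file.
sample = "client_id\tpath\tsentence\nabc\tclip1.mp3\tHello world\n"
pairs = list(read_transcripts(io.StringIO(sample)))
```

Because every split now comes from a single file format, one reader like this could serve train, dev, and test alike, which is consistent with dropping the separate sub-problems.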

googlebot commented 4 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

RegaliaXYZ commented 4 years ago

@googlebot I signed it!

googlebot commented 4 years ago

CLAs look good, thanks!
