tensorflow / lingvo

Lingvo
Apache License 2.0

About ASR dataset #143

Open alessiaatunimi opened 5 years ago

alessiaatunimi commented 5 years ago

Hi, I have some doubts about downloading and parameterizing the dataset. I successfully ran the four bash scripts in the folder lingvo/tasks/asr/tools. As long as the container stays up (so that I can access it with docker exec -it containername, keeping the same container ID), I can easily find my librispeech folder with all the data I need. But when the container goes down and I start a container again with docker run, everything is there except the dataset folder. However, when I run the model, the error I get is a segmentation fault; it doesn't say anything about the dataset being missing. Can you help me? I tried to commit the container image, but it didn't work.

jonathanasdf commented 5 years ago

The segfault can be fixed by my comment in the other issue https://github.com/tensorflow/lingvo/issues/136#issuecomment-520066943

alessiaatunimi commented 5 years ago

What about the persistence of the dataset?

jonathanasdf commented 5 years ago

See e.g. https://www.digitalocean.com/community/tutorials/how-to-share-data-between-the-docker-container-and-the-host

You should download the dataset outside of docker, then link it into the docker instance with -v, so the dataset doesn't get removed when you quit docker.
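For illustration, a minimal sketch of what linking the dataset in with -v could look like. The host path ~/librispeech and the image tag tensorflow:lingvo are assumptions, not something stated in this thread; adjust them to your setup. The script only prints the command so it can be checked before running it.

```shell
#!/bin/sh
# Sketch: keep the dataset on the host and bind-mount it into the container.

HOST_DATA_DIR="${HOST_DATA_DIR:-$HOME/librispeech}"   # assumed host location of the dataset
CONTAINER_DATA_DIR="/tmp/librispeech"                 # path the container will see
IMAGE="${IMAGE:-tensorflow:lingvo}"                   # assumed image tag

# -v HOST_PATH:CONTAINER_PATH makes the host directory visible inside the container.
# Creating HOST_DATA_DIR yourself beforehand keeps it owned by your user.
VOLUME_FLAG="-v ${HOST_DATA_DIR}:${CONTAINER_DATA_DIR}"
DOCKER_CMD="docker run --rm -it ${VOLUME_FLAG} ${IMAGE} bash"

# Print the command instead of executing it.
echo "$DOCKER_CMD"
```

Anything written under /tmp/librispeech inside the container then lands in the host directory, so it is still there after the container exits.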

alessiaatunimi commented 5 years ago

I downloaded the dataset outside docker (by running the first two bash scripts: librispeech.01.download_train.sh and librispeech.02.download_devtest.sh). However, for the other two (librispeech.03.parameterize_train.sh and librispeech.04.parameterize_devtest.sh), I think it's necessary to run them inside a docker container, isn't it? I cannot run them every time I rerun an exited container... Really sorry for the dumb questions; I really appreciate your availability and kindness.

jonathanasdf commented 5 years ago

Hm, sorry I've never actually tried the librispeech processing scripts myself :(

I think if you create an empty directory, link it into docker with -v, and then put stuff inside the directory from inside docker, it should still remain even after you exit.
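One way to check this from inside the container before starting a long preprocessing run is to test whether the target directory is actually a mount. This is only a sketch: it assumes the mountpoint utility (from util-linux) is available in the image, and is_mounted is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check whether the dataset directory is a mount before writing to it.

# is_mounted DIR: succeeds when DIR is a mount point (hypothetical helper).
is_mounted() {
  mountpoint -q "$1"
}

DATA_DIR="${DATA_DIR:-/tmp/librispeech}"

if is_mounted "$DATA_DIR"; then
  echo "$DATA_DIR is a mount: files written here are stored outside the container"
else
  echo "$DATA_DIR is not a mount: files written here stay in the container's own filesystem"
fi
```

If the second message shows up, the container was most likely started without -v for that path.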

galv commented 5 years ago

Jonathan Shen is correct.

I've used the docker container successfully without having to preprocess the librispeech data every time after exiting, since I used the -v option.


alessiaatunimi commented 5 years ago

Jonathan Shen is correct. I've used the docker container successfully without having to preprocess the librispeech data every time after exiting, since I used the -v option.

Did you:

  1. download the dataset with librispeech.01.download_train.sh and librispeech.02.download_devtest.sh outside docker
  2. build a docker container with -v
  3. run librispeech.03.parameterize_train.sh and librispeech.04.parameterize_devtest.sh inside that docker container?

I'm quite confused, I'd really appreciate your help.

drpngx commented 5 years ago

You can download from either inside or outside of docker, but you need to make sure that the mounted directory is outside of the container. So for instance you'd start the container with -v /tmp/librispeech:/tmp/librispeech (see https://github.com/tensorflow/lingvo/blob/master/lingvo/tasks/asr/tools/librispeech_lib.sh#L17 ). After you exit the container, the data will continue to be there.
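As a rough sketch of the order of operations once the container is running with the directory mounted: the four script names and the tools folder come from this thread, but how the scripts expect to be invoked (working directory, interpreter, arguments) is an assumption, and DRY_RUN and run_steps are hypothetical names. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: run the LibriSpeech preparation steps in order.

TOOLS_DIR="${TOOLS_DIR:-lingvo/tasks/asr/tools}"   # folder named in this thread
DRY_RUN="${DRY_RUN:-1}"                            # hypothetical switch: 1 = only print

STEPS="librispeech.01.download_train.sh
librispeech.02.download_devtest.sh
librispeech.03.parameterize_train.sh
librispeech.04.parameterize_devtest.sh"

# run_steps: print (or run) each step in order, stopping at the first failure.
run_steps() {
  for step in $STEPS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "bash ${TOOLS_DIR}/${step}"
    else
      bash "${TOOLS_DIR}/${step}" || return 1
    fi
  done
}

run_steps
```

Because the scripts write into the mounted directory, each step only needs to be run once; a new container started with the same -v option sees the results.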


alessiaatunimi commented 5 years ago

If I'm not wrong, you're telling me that I have to edit the line in this file https://github.com/tensorflow/lingvo/blob/master/lingvo/tasks/asr/tools/librispeech_lib.sh#L17 where ROOT is specified, changing ROOT=/tmp/librispeech to ROOT=-v /tmp/librispeech? And once that is modified, run the first two scripts again to download the dataset, and then run the other two inside the docker container? Given that I have the -v option, will the output of the preprocessing be permanent?

jonathanasdf commented 5 years ago

The -v is in the docker command, as described in https://www.digitalocean.com/community/tutorials/how-to-share-data-between-the-docker-container-and-the-host
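To make the distinction concrete, here is a small sketch that leaves librispeech_lib.sh untouched, reads the ROOT value from it, and uses that value only as the container side of the -v option. It assumes the file contains a line that is literally ROOT=/tmp/librispeech, as quoted above; the host path, the image tag tensorflow:lingvo, and the helper name container_root are hypothetical.

```shell
#!/bin/sh
# Sketch: ROOT stays a plain path in the script; -v belongs on the docker command.

# container_root FILE: print the value assigned to ROOT in FILE (hypothetical helper).
container_root() {
  sed -n 's/^ROOT=//p' "$1" | head -n 1
}

LIB_FILE="${LIB_FILE:-lingvo/tasks/asr/tools/librispeech_lib.sh}"
HOST_DATA_DIR="${HOST_DATA_DIR:-$HOME/librispeech}"   # assumed host location

if [ -f "$LIB_FILE" ]; then
  ROOT_IN_CONTAINER="$(container_root "$LIB_FILE")"
  # Print the docker command that maps the host directory onto ROOT.
  echo "docker run --rm -it -v ${HOST_DATA_DIR}:${ROOT_IN_CONTAINER} tensorflow:lingvo bash"
else
  echo "librispeech_lib.sh not found at $LIB_FILE" >&2
fi
```

The scripts keep writing to /tmp/librispeech as before; the -v option just decides where that path is backed on the host.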