vgteam / toil-vg

Distributed and cloud computing framework for vg
Apache License 2.0

Fix command not found issue #269

Closed by adamnovak 7 years ago

adamnovak commented 7 years ago

Getting a 'vg: command not found' error when running toil-vg on whole-genome data using the docker image quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run. This error doesn't show up in tests on smaller datasets with the same configuration settings and versions of toil and toil-vg. We'll need to understand what's causing it.

cmarkello commented 7 years ago

The only thing I can think of at the moment is that the vg container wasn't loaded prior to the singularity command or that there was an issue with the piped command.

The command that causes the error: singularity -q exec -H /data/Udpwork/usr/markellocj/UDP10618_run/workdir_UDP10618_7/toil-a01a4d93-d824-420e-a661-53066afca2c1-f190796504cc843dfa0efd14000001ff/tmp2WArih/0855239c-478d-459e-b7b4-fcba6cf2261d/t5gJT9U:/home/markellocj --pwd /home/markellocj docker://quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run vg call chunk_1_89774223.vg chunk_1_89774223.pu -t 4 -S UDP10618 -r 1 -c 1 -l 249250621 -o 89774223

glennhickey commented 7 years ago

Where's the log?

cmarkello commented 7 years ago

@glennhickey Just pinged you on slack.

glennhickey commented 7 years ago

OK, it's definitely Singularity that's failing to load the docker image, judging from the log (below).

If this only happens on large datasets, I'd wager it's running out of disk space (maybe memory?) when saving or converting or compressing the docker image (hence the truncated tar.gz), which it needs to do on each call to exec.

If it's anything else, you should be able to reproduce the error just with "singularity exec docker://quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run vg call"

It should also be pretty easy to check if it's a disk space issue.
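
For reference, a minimal standalone check along those lines might look like this (the image tag is the one from this thread; the cache path is the default location shown in the log below):

```bash
# Sketch: run the same image outside of Toil, as suggested above.
# If a cached layer really is truncated, the same "unexpected end of file" /
# "exec: vg: not found" messages should appear here too.
singularity exec docker://quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run vg call

# A suspiciously small or zero-length tar.gz here would point at a bad cached layer.
ls -lh "$HOME/.singularity/metadata/"
```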

INFO:toil_vg.singularity:Calling singularity with ['singularity', '-q', 'exec', '-H', '/data/Udpwork/usr/markellocj/UDP10618_run/workdir_UDP10618_7/toil-a01a4d93-d824-420e-a661-53066afca2c1-f190796504cc843dfa0efd14000001ff/tmp2WArih/0855239c-478d-459e-b7b4-fcba6cf2261d/t5gJT9U:/home/markellocj', '--pwd', '/home/markellocj', 'docker://quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run', 'vg', 'call', 'chunk_1_89774223.vg', 'chunk_1_89774223.pu', '-t', '4', '-S', 'UDP10618', '-r', '1', '-c', '1', '-l', '249250621', '-o', '89774223']

cn3123 2017-07-24 04:10:26,614 MainThread WARNING toil.leader: f/W/jobpCzFt6 gzip: /spin1/home/linux/markellocj/.singularity/metadata/sha256:e4f839f5df589b9e8ddc783d08bf538ccad836ccfccc6130c7afabd30dfad845.tar.gz: unexpected end of file
cn3123 2017-07-24 04:10:26,614 MainThread WARNING toil.leader: f/W/jobpCzFt6 tar: This does not look like a tar archive
cn3123 2017-07-24 04:10:26,614 MainThread WARNING toil.leader: f/W/jobpCzFt6 tar: Exiting with failure status due to previous errors
cn3123 2017-07-24 04:10:26,614 MainThread WARNING toil.leader: f/W/jobpCzFt6 WARNING: Bind file source does not exist on host: /etc/resolv.conf
cn3123 2017-07-24 04:10:26,614 MainThread WARNING toil.leader: f/W/jobpCzFt6 /.singularity.d/actions/exec: 8: exec: vg: not found

cmarkello commented 7 years ago

If it was something to do with disk or memory limits, then I think toil should have picked that up. Previous runs I'd done when using older versions of toil would log those types of errors.

I'm running the samples again from scratch and giving them practically unlimited jobstore space since that's the only thing that was using scratch disk space.

glennhickey commented 7 years ago

I remember noticing an overhead for each exec of a Docker image when I was doing the Singularity implementation (#242). Looked like it was always converting the images to and from Docker. If it's dumb enough to make a separate image copy or equivalent for each concurrent singularity exec, that would eat a lot of space in the caller where dozens of call jobs are run at once. (much less likely: singularity bug corrupts image when run concurrently too many times)

Toil does not manage Singularity's disk space (or Docker's for that matter), and unless your jobstore is in your home directory, they are completely independent (your log is showing the problem file is here: /spin1/home/linux/markellocj/.singularity/metadata/)

Finally, Toil writes lots of data to three places, not just the Job Store: the Job Store itself, the Work Dir, and the Output Store.

Once you're aware of where files are being written (ideally including image caches), you can use "df -h" on the command line to see how much room you have left.
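
For example, a quick sketch of that check (the Job Store, Work Dir, and Output Store paths are placeholders for the actual locations used by the run):

```bash
# Free space on each filesystem Toil writes to, plus the home directory
# that holds the default Singularity cache.
df -h "$HOME" /path/to/jobstore /path/to/workdir /path/to/outstore

# Current size of the Singularity image cache itself.
du -sh "$HOME/.singularity"
```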

cmarkello commented 7 years ago

I've checked my disk quotas and they look fine. The /spin1/home/linux/markellocj/ directory appears to be linked to my /home/markellocj/ directory, and that's using 2.5 GB out of my allotted 8 GB for the /home mount. My runs so far are looking good as far as mapping is concerned when I specify my Job Store on non-ephemeral disk space. I've previously made sure my Work Dir and Output Store point to directories with plenty of disk quota left (in the range of terabytes).

I'll keep my whole genome runs going to make sure it's not failing at the variant calling step.

adamnovak commented 7 years ago

Did you check the usage while a singularity job was running or trying to run? 8 GB could be used up pretty quickly by dumping Docker images all over the place.

See if you can put the .singularity directory (or the images) onto scratch space somehow.
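
One possible way to do that is sketched below; /path/to/scratch is a placeholder, and the symlink trick assumes Singularity follows it for its default cache location (the SINGULARITY_CACHEDIR variable mentioned later in this thread is the cleaner option):

```bash
# Sketch: relocate the default cache onto scratch space and leave a symlink behind.
mv "$HOME/.singularity" /path/to/scratch/singularity-cache
ln -s /path/to/scratch/singularity-cache "$HOME/.singularity"
```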

cmarkello commented 7 years ago

I'm not sure I follow. I don't believe the docker images are removed from the cache folder .singularity/docker after use. I'm guessing this is the case because when I run the container in interactive mode there's no new container image present within the cache folder. The last one that was made was July 14th. So I suspect that my interactively-running container is using a previously downloaded container image in that cache directory.

All of the images are tar gzipped so it's possible that they are untarred and decompressed when an image is in use. However, when I run a container interactively it doesn't appear to untar nor decompress a cached image. This is a little confusing since the Biowulf documentation on using singularity for running docker containers (at least in the interactive case) says that the cached images are ephemeral. https://hpc.nih.gov/apps/singularity.html#docker
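
For what it's worth, one way to check whether the cache actually changes during a run is to snapshot it before and after (paths assume the default cache location; the image tag is the one from this thread):

```bash
# Sketch: compare cache listings around an interactive session.
ls -l "$HOME/.singularity/docker" "$HOME/.singularity/metadata" > /tmp/cache_before.txt
singularity shell docker://quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run
ls -l "$HOME/.singularity/docker" "$HOME/.singularity/metadata" > /tmp/cache_after.txt
diff /tmp/cache_before.txt /tmp/cache_after.txt
```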

glennhickey commented 7 years ago

Adam's implying that the cache can grow during use, and that it may only run out of space during execution of a bunch of jobs.

I tried to bombard my cache with a bunch of parallel execs. It grew a bit during use, but only from about 1.2G to 2G, then shrank back down after.

It seems the cached images are shared across execs, and the cache is maintained to only keep recent images.

Each call to exec does individually decompress the image. When I run 10 parallel execs, I get 10 "gzip -dc .tar.gz" processes. From your log, one of these gzips is encountering a truncated file, and whatever's reading its output (I can't see it on ps) is getting a tar error. If this data's landing on disk, it'll eat up a lot of space (e.g. each bcftools image is 400G compressed)
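
A rough version of that experiment might look like this (the loop count is arbitrary, and vg version is just used as a lightweight command):

```bash
# Sketch: launch several execs in parallel and watch decompression and cache growth.
for i in $(seq 10); do
  singularity -q exec docker://quay.io/vgteam/vg:v1.5.0-511-g6aebc5d4-t65-run vg version &
done

# While those run, in another terminal:
#   ps aux | grep 'gzip -dc'        # roughly one gzip per concurrent exec
#   du -sh "$HOME/.singularity"     # cache size during use
wait
```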

Anyway, I agree with Adam that it's wise to move the cache to somewhere where you're sure to have lots of space (done using SINGULARITY_CACHEDIR http://singularity.lbl.gov/user-control)
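
For example (the path is a placeholder for somewhere with plenty of room):

```bash
# Point Singularity's image cache at a roomy filesystem before running anything.
export SINGULARITY_CACHEDIR=/path/to/scratch/singularity-cache
mkdir -p "$SINGULARITY_CACHEDIR"
```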

If you're still getting that error and it's not space related, I guess we'd have to suspect a bug in Singularity's cache.

cmarkello commented 7 years ago

By 400GB do you mean the bcftools image is 400MB compressed?

Thanks Glenn that makes sense. I'll set the env variable to a place with plenty of disk space.

cmarkello commented 7 years ago

Changing the SINGULARITY_CACHEDIR env variable appears to work, as long as the variable is set before the job is submitted. Moving this to finished.
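
For reference, a sketch of that ordering in a hypothetical Slurm batch script (the header and paths are illustrative, not taken from this thread):

```bash
#!/bin/bash
#SBATCH --job-name=toil-vg-wg    # illustrative; adjust for your cluster

# Set the cache location before toil-vg (and therefore singularity) starts,
# so every call to exec uses the roomy scratch directory.
export SINGULARITY_CACHEDIR=/path/to/scratch/singularity-cache
mkdir -p "$SINGULARITY_CACHEDIR"

# ... actual toil-vg invocation goes here ...
```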