seqeralabs / nf-tower

Nextflow Tower system
https://tower.nf
Mozilla Public License 2.0

Staging in large files through docker on AWS fails #253

Closed: matthpich closed this issue 3 years ago

matthpich commented 3 years ago

Hi, I am trying to run the following pipeline:

#!/usr/bin/env nextflow
refbwaindex = "s3://nextflow-tower-test/test/IGC/index/bwa/"
refbwaindexChannel      = Channel.fromPath("${refbwaindex}/*")

process ls {
    input:
        file refbwaindex from refbwaindexChannel.collect()
        """
        ls -lha
        df -h
        """
}

The refbwaindex contains the following files:

2020-10-03 16:59:18      20244 IGC.amb
2020-10-03 16:59:18  461481802 IGC.ann
2020-10-03 16:59:18 7436156148 IGC.bwt
2020-10-03 16:59:19 1859039015 IGC.pac
2020-10-03 17:01:47 3718078080 IGC.sa
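
As a quick sanity check on the numbers, here is a minimal sketch that adds up the five sizes from the listing above. The byte counts are copied from the listing; the variable names are only illustrative.

```shell
# Sizes in bytes, copied from the S3 listing above.
sizes="20244 461481802 7436156148 1859039015 3718078080"

total=0
for s in $sizes; do
    total=$((total + s))
done

# Shell arithmetic is integer-only, so use awk for the GiB conversion.
total_gib=$(awk -v b="$total" 'BEGIN { printf "%.1f", b / (1024 * 1024 * 1024) }')

echo "total bytes: $total"
echo "total GiB:   $total_gib"
```

This comes to 13474775289 bytes, roughly 12.5 GiB, which is more than either the 7.9G `/tmp` filesystem or the 9.8G container root shown in the `df -h` output further down in this post.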

The Dockerfile contains:

FROM continuumio/miniconda3
RUN apt-get update && apt-get install -y procps 
RUN     apt-get install -y wget build-essential zlib1g-dev \
&&      cd /tmp \
&&      wget https://github.com/OpenGene/fastp/archive/v0.20.0.tar.gz \
&&      tar xf v0.20.0.tar.gz \
&&      cd fastp-0.20.0 \
&&      make \
&&      make install \
&&      cd / \
&&      rm -rf /tmp/* \
&&      apt-get autoremove -y wget build-essential zlib1g-dev \
&&      rm -rf /var/lib/apt/lists/*
RUN conda install -c bioconda -c hcc -c conda-forge python=3.6 docopt pandas bwa trim-galore kneaddata trimmomatic pysam pysamstats samtools openssl=1.0 aspera-cli awscli && conda clean -a 
ENV PATH="/opt/conda/condabin:/opt/conda/bin:${PATH}"
RUN mkdir -p /tmp
WORKDIR /tmp

Yet when run from a Tower Forge AWS environment, the index files do not seem to be staged in properly, and .command.log contains:

nxf-scratch-dir ip-172-31-46-10:/tmp/nxf.FQM0IILIaj
main: line 238:    60 Killed                  /home/ec2-user/miniconda/bin/aws --region eu-west-3 s3 cp --only-show-errors "$source" "$target"
total 5.9G
drwx------ 2 root root 4.0K Oct  6 20:59 .
drwxrwxrwt 4 root root 4.0K Oct  6 20:59 ..
-rw-r--r-- 1 root root    0 Oct  6 20:59 .command.err
-rw-r--r-- 1 root root    0 Oct  6 20:59 .command.out
-rw-r--r-- 1 root root  13K Oct  6 20:30 .command.run
-rw-r--r-- 1 root root   30 Oct  6 20:30 .command.sh
-rw-r--r-- 1 root root    0 Oct  6 20:59 .command.trace
-rw-r--r-- 1 root root  20K Oct  3 14:59 IGC.amb
-rw-r--r-- 1 root root 441M Oct  3 14:59 IGC.ann
-rw-r--r-- 1 root root 291M Oct  6 20:58 IGC.bwt.73af39ba
-rw-r--r-- 1 root root 1.8G Oct  3 14:59 IGC.pac
-rw-r--r-- 1 root root 3.5G Oct  3 15:01 IGC.sa
Filesystem                                                                                        Size  Used Avail Use% Mounted on
/dev/mapper/docker-259:2-394429-37312c39a67bef126047f1aac16871a368f496f8c95fcb46ab6e1a931cbf69fe  9.8G  2.5G  6.8G  27% /
tmpfs                                                                                              64M     0   64M   0% /dev
tmpfs                                                                                             3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1                                                                                    7.9G  7.4G  350M  96% /tmp
/dev/nvme2n1                                                                                       50G   22M   50G   1% /etc/hosts
shm                                                                                                64M     0   64M   0% /dev/shm
tmpfs                                                                                             3.8G     0  3.8G   0% /proc/acpi
tmpfs                                                                                             3.8G     0  3.8G   0% /sys/firmware

Any idea how to properly load all the index files?

Many thanks for your help, and for your fantastic tools.

tbugfinder commented 3 years ago

Hi @matthpich, are you running it using Docker? Do you expect to download all files into the container filesystem, or did you configure bind mounts from the host?

matthpich commented 3 years ago

Dear @tbugfinder, the files are indeed downloaded into the container filesystem, which may be the source of the problem. Could you advise on how to set up Docker so that EBS autoscaling kicks in?

tbugfinder commented 3 years ago

Are you using Docker on an EC2 instance or using AWS Batch? You'd have to use a bind mount to map, e.g., the EBS-backed /tmp on the host to /ebstmp in the container, using e.g. -v /tmp:/ebstmp. This can be set up in the Nextflow Docker options (docker.runOptions), or with:

docker {
    enabled = true
    temp = 'auto'
}
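
For the bind-mount variant mentioned above, a minimal sketch could look like the following. It writes a small extra config file rather than touching an existing `nextflow.config`. The file name `ebs-scratch.config` and the `/tmp:/ebstmp` mapping are illustrative assumptions, and whether `docker.runOptions` is honored under AWS Batch is not confirmed in this thread (there, container mounts are normally declared through the `aws.batch.volumes` setting, which should be checked against the Nextflow documentation).

```shell
# Write a separate, hypothetical config file with the bind-mount option.
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > ebs-scratch.config <<'EOF'
docker {
    enabled = true
    // Assumption: /tmp on the host is the EBS-backed path to expose
    // inside the container as /ebstmp.
    runOptions = '-v /tmp:/ebstmp'
}
EOF

# Show what was written.
cat ebs-scratch.config
```

The extra file can then be passed with `-c ebs-scratch.config` on the `nextflow run` command line, so the main configuration stays untouched.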

matthpich commented 3 years ago

@tbugfinder thanks for the prompt reply. I am using AWS Batch.

evanfloden commented 3 years ago

Are you using EBS auto scale with the Forge setup?

matthpich commented 3 years ago

Yes, I am. Is there a specific folder in the container that is auto-expandable?

tbugfinder commented 3 years ago

I missed that...

matthpich commented 3 years ago

I get the same error with EBS auto-scale on:

main: line 263: 58 Killed /home/ec2-user/miniconda/bin/aws --region eu-west-3 s3 cp --only-show-errors "$source" "$target"

What could I be missing?

pditommaso commented 3 years ago

It should work. Could you provide the stdout/stderr of the last execution, including the directory listing and the df output?

matthpich commented 3 years ago

Hi @pditommaso, thanks for your message. I now run:

    whoami
    pwd
    echo "============"
    ls -lha
    echo "============"
    ls -lha /tmp
    echo "============"
    ls -lha /
    echo "============"
    df -h
    echo "============"
    ls -lha /etc/hosts

.command.err is empty; .command.out contains:

root
/tmp/nxf.ZGvamvfBeA
============
total 2.2G
drwx------ 2 root root 4.0K Oct  8 19:42 .
drwxrwxrwt 3 root root 4.0K Oct  8 19:41 ..
-rw-r--r-- 1 root root    0 Oct  8 19:42 .command.err
-rw-r--r-- 1 root root   38 Oct  8 19:42 .command.out
-rw-r--r-- 1 root root  12K Oct  8 19:30 .command.run
-rw-r--r-- 1 root root  190 Oct  8 19:30 .command.sh
-rw-r--r-- 1 root root    0 Oct  8 19:42 .command.trace
-rw-r--r-- 1 root root  20K Oct  1 17:24 IGC.amb
-rw-r--r-- 1 root root 441M Oct  1 17:24 IGC.ann
-rw-r--r-- 1 root root 1.8G Oct  1 18:58 IGC.pac
============
total 12K
drwxrwxrwt  3 root root 4.0K Oct  8 19:41 .
drwxr-xr-x 22 root root 4.0K Oct  8 19:41 ..
drwx------  2 root root 4.0K Oct  8 19:42 nxf.ZGvamvfBeA
============
total 80K
drwxr-xr-x  22 root root 4.0K Oct  8 19:41 .
drwxr-xr-x  22 root root 4.0K Oct  8 19:41 ..
-rw-r--r--   1 root root 1010 Oct  8 19:42 .command.log
-rwxr-xr-x   1 root root    0 Oct  8 19:41 .dockerenv
drwxr-xr-x   2 root root 4.0K Mar 12  2020 .empty
drwxr-xr-x   2 root root 4.0K Sep 20 06:10 bin
drwxr-xr-x   2 root root 4.0K Feb  1  2020 boot
drwxr-xr-x   5 root root  340 Oct  8 19:41 dev
drwxr-xr-x  43 root root 4.0K Oct  8 19:41 etc
drwxr-xr-x   3 root root 4.0K Oct  8 19:41 home
drwxr-xr-x   8 root root 4.0K Sep 20 19:05 lib
drwxr-xr-x   2 root root 4.0K Feb 24  2020 lib64
drwxr-xr-x   2 root root 4.0K Feb 24  2020 media
drwxr-xr-x   2 root root 4.0K Feb 24  2020 mnt
drwxr-xr-x   3 root root 4.0K Mar 12  2020 opt
dr-xr-xr-x 153 root root    0 Oct  8 19:41 proc
drwx------   3 root root 4.0K Sep 20 19:04 root
drwxr-xr-x   3 root root 4.0K Feb 24  2020 run
drwxr-xr-x   2 root root 4.0K Sep 20 06:10 sbin
drwxr-xr-x   2 root root 4.0K Feb 24  2020 srv
dr-xr-xr-x  13 root root    0 Oct  8 19:39 sys
drwxrwxrwt   3 root root 4.0K Oct  8 19:41 tmp
drwxr-xr-x  10 root root 4.0K Feb 24  2020 usr
drwxr-xr-x  11 root root 4.0K Feb 24  2020 var
============
Filesystem                                                                                        Size  Used Avail Use% Mounted on
/dev/mapper/docker-259:2-394429-91c88bc572c8ae08e81b0d7b40c7304180a685444e9f1812b5018e8e7c51bc82  9.8G  4.7G  4.7G  50% /
tmpfs                                                                                              64M     0   64M   0% /dev
tmpfs                                                                                             1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/nvme2n1                                                                                       50G   22M   50G   1% /etc/hosts
shm                                                                                                64M     0   64M   0% /dev/shm
/dev/nvme0n1p1                                                                                    7.9G  1.5G  6.3G  20% /home/ec2-user/miniconda
tmpfs                                                                                             1.9G     0  1.9G   0% /proc/acpi
tmpfs                                                                                             1.9G     0  1.9G   0% /sys/firmware
============
-rw-r--r-- 1 root root 126 Oct  8 19:41 /etc/hosts 

Finally, .command.log additionally contains:

nxf-scratch-dir ip-10-0-0-215:/tmp/nxf.ZGvamvfBeA
download failed: s3://danone-nextflow/references/IGC/bwa/IGC.sa to ./IGC.sa [Errno 28] No space left on device
download failed: s3://danone-nextflow/references/IGC/bwa/IGC.bwt to ./IGC.bwt [Errno 28] No space left on device
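
To see which filesystem actually backs the scratch directory when "No space left on device" appears, a minimal sketch along these lines could be run from inside the task. The `required_bytes` value is a hypothetical placeholder (here, the summed size of the five index files from the first post), and the other variable names are illustrative.

```shell
# Hypothetical requirement: total size of the inputs, in bytes.
required_bytes=13474775289

# df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
# On the second line, field 4 is the available space and field 6 the mount point.
avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
mount_point=$(df -Pk . | awk 'NR == 2 { print $6 }')
avail_bytes=$((avail_kb * 1024))

if [ "$avail_bytes" -ge "$required_bytes" ]; then
    verdict="enough space"
else
    verdict="NOT enough space"
fi

echo "current dir is on $mount_point: $avail_bytes bytes available, $verdict"
```

In the `df -h` output above, no separate filesystem is mounted on `/tmp`, so the scratch directory appears to sit on the 9.8G container root rather than on the 50G volume, which would be consistent with the failed downloads.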

pditommaso commented 3 years ago

Weird, I'll try to replicate it

matthpich commented 3 years ago

I just created a fresh account with all the prerequisites (roles, Tower Forge AWS config, EBS autoscaling on), ran the pipeline, and ended up with the same result. So I assume the problem does not come from a missing role. Unfortunately, the mystery remains...

pditommaso commented 3 years ago

I've isolated the problem. A patch will be available on Monday.

pditommaso commented 3 years ago

There was a problem with the volume mounting. It is now solved. Let us know if it now works on your side.

pichauma commented 3 years ago

It does work. You made my day! Thanks a lot for your prompt help.

And, please kindly let me know if this is the way you recommend to work with large files.

pditommaso commented 3 years ago

cool

please kindly let me know if this is the way you recommend to work with large files.

For very large datasets, it is likely better to configure an FSx shared file system instead of transferring the data from S3.