usegalaxy-eu / infrastructure-playbook

Ansible playbook for managing UseGalaxy.eu infrastructure.
MIT License

funannotate_annotate: run with `--writable-tmpfs` #1246

Closed · bgruening closed this 2 months ago

bgruening commented 2 months ago

@sanjaysrikakulam @mira-miracoli anything against this?

I see some read-only /tmp in the logs:

FileNotFoundError: [Errno 2] No such file or directory: '/data/jwd02f/main/071/349/71349627/working/output/predict_misc/protein_alignments.gff3'
galaxy@vgcnbwc-worker-c8m40g1-0000:/data/jwd02f/main/071/349/71349627$ cat outputs/tool_stderr 
-------------------------------------------------------
[Jul 02 02:04 PM]: OS: Debian GNU/Linux 10, 8 cores, ~ 41 GB RAM. Python: 3.8.15
[Jul 02 02:04 PM]: Running funannotate v1.8.15
[Jul 02 02:04 PM]: Skipping CodingQuarry as no --rna_bam passed
[Jul 02 02:04 PM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco          
  glimmerhmm   busco          
  snap         busco          
[Jul 02 02:11 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 616, in _run_server
    server.serve_forever()
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 182, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/local/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/usr/local/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/local/lib/python3.8/shutil.py", line 675, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/local/lib/python3.8/shutil.py", line 673, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000068a0f1b00000001'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 616, in _run_server
    server.serve_forever()
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 182, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/local/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/usr/local/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/local/lib/python3.8/shutil.py", line 675, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/local/lib/python3.8/shutil.py", line 673, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000068a0ce900000002'
[Jul 02 02:20 PM]: Genome loaded: 28,306 scaffolds; 775,487,987 bp; 41.27% repeats masked
[Jul 02 02:20 PM]: Mapping 557,291 proteins to genome using diamond and exonerate
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/funannotate/aux_scripts/funannotate-p2g.py", line 252, in <module>
    os.makedirs(tmpdir)
  File "/usr/local/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/tmp/p2g_65434e2d-85a3-4a9f-9652-b201aea1592c'
Traceback (most recent call last):
  File "/usr/local/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/usr/local/lib/python3.8/site-packages/funannotate/predict.py", line 1558, in main
    lib.exonerate2hints(Exonerate, hintsP)
  File "/usr/local/lib/python3.8/site-packages/funannotate/library.py", line 4600, in exonerate2hints
    with open(file, "r") as input:
FileNotFoundError: [Errno 2] No such file or directory: '/data/jwd02f/main/071/349/71349627/working/output/predict_misc/protein_alignments.gff3'
mira-miracoli commented 2 months ago

Currently we use vda as the device for /tmp, which is the same as the root disk and currently has 50 GB; however, there is a vdb device for /scratch that has 1 TB capacity.

Maybe it is safer to use that one?
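
For reference, this layout can be double-checked on a worker node with standard tools; nothing UseGalaxy.eu-specific is assumed here:

lsblk                   # vda should back / and /tmp, vdb should back /scratch
df -h / /tmp /scratch   # confirm the ~50G vs. ~1T sizes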

bgruening commented 2 months ago

How do you interpret the above error? I thought it's not writable at all?

sanjaysrikakulam commented 2 months ago

How do you interpret the above error? I thought it's not writable at all?

I interpret it the same way. Docker is configured to use /scratch/docker as its root, so all containers use it to store data, volumes, images, tmp, etc.

-- Edit -- Sorry, this is about Singularity. I think if we want to use /scratch, then it needs to be bind-mounted as tmp in every Singularity job. I don't know if there is a shortcut.
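
For illustration only, such a per-job bind mount could look roughly like this; the /scratch subdirectory, the JOB_ID variable and the cleanup step are placeholders, not our actual job script:

# Hypothetical sketch: give each job its own tmp dir on the large /scratch device
# and bind it over /tmp inside the container.
JOB_TMP="/scratch/singularity-tmp/${JOB_ID}"
mkdir -p "$JOB_TMP"
singularity exec --bind "$JOB_TMP:/tmp" \
    /cvmfs/singularity.galaxyproject.org/f/u/funannotate\:1.8.13--pyhdfd78af_0 \
    df -h /tmp
rm -rf "$JOB_TMP"   # clean up so /scratch does not fill up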

sanjaysrikakulam commented 2 months ago

To avoid storage problems and arbitrary mount points, can we use the tmp dir in the JWD, for example $job_directory/tmp, in the extra args?
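
As a rough sketch of that idea ($job_directory stands in for the JWD path Galaxy substitutes; where exactly the extra arguments get injected on our side is left out):

# Create a tmp dir inside the job working directory and point both the
# container's /tmp and TMPDIR at it, so nothing shared or memory-backed is used.
mkdir -p "$job_directory/tmp"
export TMPDIR="$job_directory/tmp"
singularity exec --bind "$job_directory/tmp:/tmp" \
    /cvmfs/singularity.galaxyproject.org/f/u/funannotate\:1.8.13--pyhdfd78af_0 \
    sh -c 'touch /tmp/probe && ls -l /tmp'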

kysrpex commented 2 months ago

To avoid storage problems and arbitrary mount points, can we use the tmp dir in the JWD, for example: $job_directory/tmp in the extra args?

+ :100: This would be best imo. For security reasons I do not think you want jobs to share /tmp. It does not seem like Singularity provides any good alternatives. There exists --writable-tmpfs, but it uses memory and provides too little space. Someone proposed --writable-scratch to use disk, but it has not been implemented.

Edit: I see you are actually enabling --writable-tmpfs via an environment variable, and moreover it is just for a single tool. I think it makes sense to have a look at this issue: https://github.com/apptainer/singularity/issues/5718. It's OK anyway; no security concerns apply, and we are for sure better off with it enabled.
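
For context, a minimal sketch of the environment-variable route, assuming Singularity's usual SINGULARITY_<FLAG> convention applies to --writable-tmpfs and that the variable is set only in the environment of this tool's jobs:

# Should be equivalent to passing --writable-tmpfs on the command line.
export SINGULARITY_WRITABLE_TMPFS=true
singularity exec \
    /cvmfs/singularity.galaxyproject.org/f/u/funannotate\:1.8.13--pyhdfd78af_0 \
    sh -c 'echo a > /test && echo "overlay is writable"'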

bgruening commented 2 months ago

Edit: I see you are actually enabling --writable-tmpfs via an environment variable, and moreover it is just for a single tool. I think it makes sense to have a look at this issue: https://github.com/apptainer/singularity/issues/5718. It's OK anyway; no security concerns apply, and we are for sure better off with it enabled.

Yes. This is really just for tools that have /tmp hardcoded or do other bad stuff. Galaxy should already set TMPDIR and co. to the JWD by default.

mira-miracoli commented 2 months ago

There exists --writable-tmpfs, but it uses memory and provides too little space.

We should just keep in mind that the OOM killer can then kill the jobs if we do not provision enough memory for the extra tmp.

Maybe I am blind here, but why do we not mount in the tmp from the JWD, as @sanjaysrikakulam suggested?

kysrpex commented 2 months ago

There exists --writable-tmpfs, but it uses memory and provides too little space.

We should just keep in mind that the OOM killer can then kill the jobs if we do not provision enough memory for the extra tmp.

Maybe I am blind here, but why do we not mount in the tmp from the JWD, as @sanjaysrikakulam suggested?

It's not either this or mounting tmp from the JWD; the two are not mutually exclusive. You may check out apptainer/singularity#798. Running with --writable-tmpfs means you get a throwaway overlay mounted on / (only 64 MB in size). Thus, if the program tries to write anywhere (no matter whether /tmp or somewhere else), it will be able to do so (up to 64 MB, or more if that location is mounted from somewhere else). I hope the following example clarifies it:

centos@vgcnbwc-worker-c36m100-0013:~$ singularity shell /cvmfs/singularity.galaxyproject.org/f/u/funannotate\:1.8.13--pyhdfd78af_0 
Singularity> df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  64.0M     12.0K     64.0M   0% /
devtmpfs                  4.0M         0      4.0M   0% /dev
tmpfs                    48.9G         0     48.9G   0% /dev/shm
/dev/vda1                49.9G      8.9G     41.0G  18% /etc/localtime
/dev/vda1                49.9G      8.9G     41.0G  18% /etc/hosts
/dev/vda1                49.9G      8.9G     41.0G  18% /home/centos
/dev/vda1                49.9G      8.9G     41.0G  18% /tmp
/dev/vda1                49.9G      8.9G     41.0G  18% /var/tmp
tmpfs                    64.0M     12.0K     64.0M   0% /etc/resolv.conf
tmpfs                    64.0M     12.0K     64.0M   0% /etc/passwd
tmpfs                    64.0M     12.0K     64.0M   0% /etc/group
Singularity> echo a > /test
bash: /test: Read-only file system
Singularity> exit
centos@vgcnbwc-worker-c36m100-0013:~$ singularity shell --writable-tmpfs /cvmfs/singularity.galaxyproject.org/f/u/funannotate\:1.8.13--pyhdfd78af_0 
Singularity> df -h
Filesystem                Size      Used Available Use% Mounted on
fuse-overlayfs           64.0M     12.0K     64.0M   0% /
devtmpfs                  4.0M         0      4.0M   0% /dev
tmpfs                    48.9G         0     48.9G   0% /dev/shm
/dev/vda1                49.9G      8.9G     41.0G  18% /etc/localtime
/dev/vda1                49.9G      8.9G     41.0G  18% /etc/hosts
/dev/vda1                49.9G      8.9G     41.0G  18% /home/centos
/dev/vda1                49.9G      8.9G     41.0G  18% /tmp
/dev/vda1                49.9G      8.9G     41.0G  18% /var/tmp
tmpfs                    64.0M     12.0K     64.0M   0% /etc/resolv.conf
tmpfs                    64.0M     12.0K     64.0M   0% /etc/passwd
tmpfs                    64.0M     12.0K     64.0M   0% /etc/group
Singularity> echo a > /test
Singularity> 

Björn is suggesting enabling this for toolshed.g2.bx.psu.edu/repos/iuc/funannotate_annotate/funannotate_annotate/.*. I do not know whether that solves the problem, but the cost of letting him try is close to zero.

mira-miracoli commented 2 months ago

Thank you, I think I understand this better now. I was confused because I thought it was only about /tmp. 64 MB should be fine memory-wise, if the tmp files the tool creates are small enough, of course.

bgruening commented 2 months ago

Thanks for merging. Can I redeploy this? We just got another bug report today about this tool and the NFS lock issue :(