metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Disk_mem not sufficient #706

Closed SilasK closed 9 months ago

SilasK commented 9 months ago

Hi all :). This thread has been super helpful for me, as I have also been having problems with really strange mem_mb and disk_mb values being set.

@SilasK has this been fixed at this point, or is this still a work in progress? I am using a slurm system and am encountering the problems above (setting the resources manually in the actual atlas command has fixed this and I am progressing slowly, but it would be nice to not have to do this if at all possible).

I am especially having problems with the gunc_download rule, which sets disk_mb to a value of 1000. Even when specifying larger amounts of mem_mb and disk_mb, this rule keeps failing saying I am out of space. I have checked all my space allocations on my node and there is ample room for this database. I have attached below the command that has got me through the first few steps of the pipeline :).

atlas run all \
--working-dir /rc_scratch/beyo2625/sctld_patho \
--config-file /rc_scratch/beyo2625/sctld_patho/config.yaml \
--jobs 20 \
--profile cluster \
--default-resources mem_mb=250000 \
--set-resources deduplicate_reads:mem_mb=80000 dram_download:disk_mb=100000 download_gunc:disk_mb=100000 download_gunc:mem_mb=100000

If you need any more information please let me know and I can submit it.

Originally posted by @benyoung93 in https://github.com/metagenome-atlas/atlas/issues/676#issuecomment-1793512169

SilasK commented 9 months ago

I never had issues with disk space. I think it is set based on the input files, which might be wrong for downloads.
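
If I remember the Snakemake defaults correctly, when neither the rule nor --default-resources sets a value, the fallback is derived from the input size, roughly:

disk_mb = max(2 * input.size_mb, 1000)

A download rule has essentially no input files, so it bottoms out at the 1000 floor, which would match the value you see for gunc_download.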

Does your slurm wrapper take this value into account?

Probably 1000 disk_mb (= 1 GB) is not enough.

Try increasing it to e.g. 50000.

Try to run it locally:

atlas run None download_gunc

And with no profile argument.
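
Something like this, with the paths from your command above (the disk_mb override is just an example value):

atlas run None download_gunc \
--working-dir /rc_scratch/beyo2625/sctld_patho \
--config-file /rc_scratch/beyo2625/sctld_patho/config.yaml \
--set-resources download_gunc:disk_mb=50000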

benyoung93 commented 9 months ago

Hi @SilasK :)

Thank you for opening the new issue. I was unsure whether to start a new one or to comment on the default-resources issue, and I chose the latter :)

Yes, so with the downloads: when I check the run and the resources, the disk_mb and disk_mib values are insanely low, ~300. This has actually been consistent across all the other download rules in the snakemake run, but interestingly nothing else. I inspected the rules and I honestly could not find anything that was making this go weird.

Does your slurm wrapper take this value into account? Probably 1000 disk_mb (= 1 GB) is not enough.

So I bumped up the disk_mb for the rule dram_download and this worked (see command below; I concede that 100gb is overkill, but it made it work, soooooo). This has, however, not seemed to work for the download_gunc rule, even with the bump up to 100gb.

atlas run all \
--working-dir /rc_scratch/beyo2625/sctld_patho \
--config-file /rc_scratch/beyo2625/sctld_patho/config.yaml \
--jobs 20 \
--profile cluster \
--default-resources mem_mb=250000 \
--set-resources deduplicate_reads:mem_mb=80000 dram_download:disk_mb=100000 download_gunc:disk_mb=100000 download_gunc:mem_mb=100000

I am now going to try running this locally as suggested. I have had to do this a few times when generating some of the conda environments as well. It's super strange: running it interactively works fine, but in the slurm job submission script, even with crazy high memory and cores, these also failed with the memory error.

I am trying to troubleshoot and get more info, but if you have any specific things to check on my cluster I am happy to do this :).
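
In the meantime, one thing I will check is what slurm actually granted the failing jobs versus what they used (the exact sacct fields available may vary by site, so this is illustrative):

sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,MaxDiskWrite,State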

Just as an aside, I am using 1 node with 2 sockets of 28 cores each, 2 threads per core (112ish total cores), 500gb of available RAM, and 20Tb of scratch storage space. The only thing (?) I could think of was that the default tmp directory may have been getting full, but that also seemed to have ample space.

Thank you for all the help I really appreciate it :).

Ben

benyoung93 commented 9 months ago

Quick update: running atlas run None download_gunc was throwing some super weird errors.

Instead, I went into the error file, activated the respective environment, and then ran the shell command located within the error:

mamba activate /rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_
gunc download_db /tmp -db progenomes &> logs/downloads/gunc_download_progenomes.log
mv /tmp/gunc_db_progenomes*.dmnd /rc_scratch/beyo2625/sctld_patho/databases/gunc_database/progenomes 2>> logs/downloads/gunc_download_progenomes.log

benyoung93 commented 9 months ago

So even running the command as above, I still get the same error (pasted below). I think it may be due to the tmp directory being used. I am now going to try again, but skip the tmp dir and have it download straight to the gunc folder in my databases.

[START] 10:57:24 2023-11-06
[INFO] DB downloading...
[INFO] DB download successful.
[INFO] Computing DB md5sum...
[INFO] md5sum check successful.
[INFO] Uncompressing file...
Traceback (most recent call last):
  File "/rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_/bin/gunc", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_/lib/python3.11/site-packages/gunc/gunc.py", line 710, in main
    gunc_database.get_db(args.path, args.database)
  File "/rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_/lib/python3.11/site-packages/gunc/gunc_database.py", line 118, in get_db
    decompress_gzip_file(gz_file_path, out_file)
  File "/rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_/lib/python3.11/site-packages/gunc/gunc_database.py", line 54, in decompress_gzip_file
    shutil.copyfileobj(f_in, f_out)
  File "/rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_/lib/python3.11/shutil.py", line 200, in copyfileobj
    fdst_write(buf)
OSError: [Errno 28] No space left on device
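
For reference, this is how I have been checking the space of the default temp location against scratch (paths as on my node):

df -h /tmp
df -h /rc_scratch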

UPDATE

So after doing this command

mamba activate /rc_scratch/beyo2625/sctld_patho/databases/conda_envs/9eabbee686222ecc222d8ca745d4c8c2_
cd /rc_scratch/beyo2625/sctld_patho/databases/gunc_database
gunc download_db \
./ \
-db progenomes
mv gunc_db_progenomes*.dmnd progenomes

Everything has completed successfully, woooo.

[INFO] DB download successful.
[INFO] Computing DB md5sum...
[INFO] md5sum check successful.
[INFO] Uncompressing file...
[INFO] Decompression complete.
[INFO] Computing DB md5sum...
[INFO] md5sum check successful.
[INFO] DB download successful.
[INFO] DB path: ./gunc_db_progenomes2.1.dmnd
[INFO]  11:23:01 Runtime: 0:11:28
[END]   11:23:01 2023-11-06

I think this is because my tmp directory on the cluster potentially (?) does not have enough space for the uncompression. I ran the atlas command (in the comment above) and my edited one with the same profile, same resources, etc., and it did not work when using the tmp directory. Any idea why this may be (maybe a size limit)?
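
One workaround I may try (the tmp path here is illustrative): on many clusters /tmp is a small node-local filesystem, so pointing temp files at scratch before running atlas could avoid this. If the rule takes its temp path from Snakemake's tmpdir resource, which follows the system temp location and therefore TMPDIR, this should propagate:

mkdir -p /rc_scratch/beyo2625/tmp
export TMPDIR=/rc_scratch/beyo2625/tmp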

On this, from my basic understanding of the download rules: why does the gunc download use the tmp dir, while the other downloads go straight to the database folder? Is there a reason for this?

Thank you for all the help so far; this has morphed from a weird disk_mb allocation issue into a weird tmp space issue, but we have solved the download error, wooooo.

SilasK commented 9 months ago

It is true, the gunc db was downloaded to the temp folder before being moved. I will fix this for the future.
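
Roughly, the shell part of the rule would go from downloading into the temp dir to downloading into the database directory directly, along the lines of your working command (a sketch with placeholder names, not the final rule; {output_dir} and {log} stand for the rule's actual output directory and log file):

gunc download_db {output_dir} -db progenomes &> {log}
mv {output_dir}/gunc_db_progenomes*.dmnd {output_dir}/progenomes 2>> {log}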

SilasK commented 9 months ago

Thank you for your help.