I will have a look over the branch this afternoon for you.
Many thanks again @douglowe
I have a strange error, which is repeatable (I deleted the whole results folder and reran the workflow, just to make sure this error was real).
The error is:
/usr/bin/bash: line 2: 20756 Killed blobtools create --fasta results/assembled_sequence/Tacola_larymna.fasta --hits results/blastn/Tacola_larymna.txt --taxrule bestsumorder --taxdump resources/taxdump --cov results/minimap/Tacola_larymna.bam results/blobtools/Tacola_larymna &> logs/blobtools/Tacola_larymna.log
[Tue Jun 4 16:12:31 2024]
Error in rule blobtools:
jobid: 258
input: resources/taxdump, resources/taxdump/citations.dmp, resources/taxdump/delnodes.dmp, resources/taxdump/division.dmp, resources/taxdump/excludedfromtype.dmp, resources/taxdump/fullnamelineage.dmp, resources/taxdump/gencode.dmp, resources/taxdump/host.dmp, resources/taxdump/images.dmp, resources/taxdump/merged.dmp, resources/taxdump/names.dmp, resources/taxdump/nodes.dmp, resources/taxdump/rankedlineage.dmp, resources/taxdump/taxidlineage.dmp, resources/taxdump/typematerial.dmp, resources/taxdump/typeoftype.dmp, results/assembled_sequence/Tacola_larymna.fasta, results/blastn/Tacola_larymna.txt, results/minimap/Tacola_larymna.bam
output: results/blobtools/Tacola_larymna/table.tsv
log: logs/blobtools/Tacola_larymna.log (check log file(s) for error details)
conda-env: /workspaces/skim2mt/.snakemake/conda/bf53d9daf317f0d585fb33bf783ba2cd_
shell:
blobtools create --fasta results/assembled_sequence/Tacola_larymna.fasta --hits results/blastn/Tacola_larymna.txt --taxrule bestsumorder --taxdump resources/taxdump --cov results/minimap/Tacola_larymna.bam results/blobtools/Tacola_larymna &> logs/blobtools/Tacola_larymna.log
blobtools filter --table results/blobtools/Tacola_larymna/table.tsv --table-fields gc,length,Tacola_larymna_cov,bestsumorder_superkingdom,bestsumorder_kingdom,bestsumorder_phylum,bestsumorder_class,bestsumorder_order,bestsumorder_family,bestsumorder_species results/blobtools/Tacola_larymna &>> logs/blobtools/Tacola_larymna.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
The logs/blobtools/Tacola_larymna.log file contains only this message:
Loading sequences from results/assembled_sequence/Tacola_larymna.fasta
[W::hts_idx_load3] The index file is older than the data file: results/minimap/Tacola_larymna.bam.csi
I am running this on my local (apple-silicon) laptop, using a dev container - that might throw up some unusual behaviour. I'll check it again tomorrow, but it would be wise to get someone else to check the workflow on their setup too.
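As an aside, a bare "Killed" from bash, as in the error above, usually means the Linux OOM killer terminated the process rather than blobtools itself failing. On a Linux host or inside the dev container, this can often be confirmed with a check along these lines:

dmesg | grep -iE 'out of memory|killed process'   # look for OOM-killer messages in the kernel log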
Hi @douglowe,
Thanks for highlighting this. I remember we had this issue during the hackathon. The blobtools create step uses quite a lot of memory when reading in the taxdump from NCBI.
I run the pipeline on our HPC, so it is rarely an issue. Perhaps we need to add a line to the documentation stating the minimum memory requirements. I will run it again with --report to test the memory usage.
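For reference, generating that report after a completed run would look something like the line below (the report filename is illustrative; detailed memory figures come from benchmark directives rather than the report itself):

snakemake --report report.html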
My log looks much the same:
Loading sequences from results/assembled_sequence/Litinga_cottini.fasta
- processing Litinga_cottini_contig0: : 1it [00:00, 25.20it/s]
[W::hts_idx_load3] The index file is older than the data file: results/minimap/Litinga_cottini.bam.csi
Loading mapping data from results/minimap/Litinga_cottini.bam as Litinga_cottini
Loading parsed taxdump
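Incidentally, the hts_idx_load3 warning in both logs is probably unrelated to the crash; it just means the .csi index predates the BAM file's current timestamp, and it can be silenced by re-indexing, e.g.:

samtools index -c results/minimap/Litinga_cottini.bam   # regenerates the .csi index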
Can you increase the amount of memory available on your system, or do we need to find a workaround for this step?
Cheers Ollie
Hello again @douglowe,
Below is the benchmark output (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#benchmark-rules) for blobtools on the sample Litinga_cottini:
cat benchmarks/Litinga_cottini_blobtools.txt | column -t
s         h:m:s    max_rss  max_vms  max_uss  max_pss  io_in   io_out  mean_load  cpu_time
186.3125  0:03:06  3130.36  4102.34  3010.57  3012.02  417.04  0.03    86.23      161.66
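For context, this file is produced by a benchmark directive on the rule; an abbreviated sketch (the path matches the file shown above, other directives omitted):

rule blobtools:
    ...
    benchmark:
        # one line per run: wall time (s and h:m:s), peak memory
        # (max_rss/max_vms/max_uss/max_pss, reported in MB), I/O,
        # mean CPU load and CPU time
        "benchmarks/{sample}_blobtools.txt"

So the max_rss column suggests blobtools peaks at roughly 3 GB of resident memory for this sample.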
Thanks for this information Ollie.
I'm trying again, limiting the number of concurrent tasks to 1 - as I recall now, I had to do that to avoid this problem.
I have been setting a memory maximum for snakemake, but only in total for the workflow. I wonder if we can set a minimum RAM requirement for specific tasks within the workflow?
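Concretely, the two levers described here would look something like this (values illustrative):

# one job at a time - sidesteps the contention entirely
snakemake --cores 1

# or give the whole workflow a memory budget; jobs that declare
# resources: mem_mb are only co-scheduled while they fit within it
snakemake --cores 4 --resources mem_mb=8000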
Yes, I remember that fixed it. I will also try to limit the memory using the resources tag https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
If you set resources to mem_mb=1000 in the blobtools rule, the memory usage appears to go down:
cat benchmarks/Litinga_cottini_blobtools.txt | column -t
s         h:m:s    max_rss  max_vms  max_uss  max_pss  io_in  io_out  mean_load  cpu_time
187.9244  0:03:07  2753.25  2977.14  2729.16  2730.53  58.79  0.03    83.77      157.69
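A minimal sketch of that change, reusing the command from the error above (rule abbreviated; the input/output layout is illustrative, not the actual skim2mt rule):

rule blobtools:
    input:
        fasta="results/assembled_sequence/{sample}.fasta",
        hits="results/blastn/{sample}.txt",
        bam="results/minimap/{sample}.bam",
    output:
        "results/blobtools/{sample}/table.tsv",
    log:
        "logs/blobtools/{sample}.log",
    resources:
        # a reservation counted against --resources mem_mb=...;
        # note it is a scheduling hint, not a hard limit on the process
        mem_mb=1000,
    shell:
        "blobtools create --fasta {input.fasta} --hits {input.hits} "
        "--taxrule bestsumorder --taxdump resources/taxdump "
        "--cov {input.bam} results/blobtools/{wildcards.sample} &> {log}"

Worth noting: because mem_mb only affects scheduling, the lower max_rss above is more likely down to reduced contention than to the declaration itself.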
I've tried setting mem_mb in the blobtools rule - it seems that I can still only run the workflow locally if I run with a single thread. Even running two threads is too much for my laptop.
Maybe this isn't a major issue though - certainly not something which should stop this PR.
The workflow does run for me - so that is good. I'll have a proper look at the code now and see if there are any changes I'd suggest.
Hi all,
I tested the changes on a fork from @pecholleyn and created a new PR here: https://github.com/o-william-white/skim2mt/pull/33. I am happy to merge and move on, but if anyone wants to test this, let me know.
Is it sensible to close this PR now?
Cheers Ollie
Hello,
Would anyone be able to test this branch before I merge? @rvosa @bwprice @dickgroenenberg @douglowe @elichad @pecholleyn
It includes some fairly large changes to remove the need for .ok files, based on suggestions by @pecholleyn. Thank you again for your time @pecholleyn! This branch also includes a multiqc output file, which should make it easier to identify samples that have worked/failed.
The only downside with this approach is that it is not possible to create a nice rule graph.
Cheers Ollie
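For readers unfamiliar with the pattern being removed: .ok files are empty sentinel files touched to mark a step as finished when a rule's real outputs are hard to declare up front. A generic before/after sketch (hypothetical rule and command, not the actual skim2mt code - the real change may differ):

# before: downstream rules depend on an empty sentinel file
rule assemble:
    output:
        touch("results/assembled_sequence/{sample}.ok")
    shell:
        "assemble_sample.sh {wildcards.sample}"  # hypothetical command

# after: depend on the real output, e.g. via a checkpoint, so that
# downstream rules can discover files at runtime; dynamic dependencies
# like this are also why a static rule graph becomes hard to draw
checkpoint assemble:
    output:
        directory("results/assembled_sequence/{sample}")
    shell:
        "assemble_sample.sh {wildcards.sample}"  # hypothetical command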