o-william-white / skim2mito

A Snakemake pipeline for the batch assembly, annotation, and phylogenetic analysis of mitochondrial genomes from genome skims
MIT License

Checkpoint fix #32

Closed o-william-white closed 4 months ago

o-william-white commented 4 months ago

Hello,

Would anyone be able to test this branch before I merge? @rvosa @bwprice @dickgroenenberg @douglowe @elichad @pecholleyn

It includes some fairly large changes to remove the need for .ok files, based on suggestions by @pecholleyn. Thank you again for your time @pecholleyn! This branch also includes a multiqc output file which should make it easier to identify samples that have succeeded or failed.

The only downside with this approach is that it is not possible to create a nice rule graph (see the sketch below for why).
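
For anyone reviewing, a minimal sketch of the checkpoint pattern this branch moves to (rule and script names are hypothetical, not the actual rules in the repo):

    import os

    # a checkpoint's output is only inspected after it has run, so
    # downstream jobs are created solely for samples that actually
    # produced an assembly - no .ok sentinel files needed
    checkpoint assemble:
        input:
            "results/reads/{sample}.fastq.gz"
        output:
            directory("results/assembled_sequence/{sample}")
        shell:
            "assemble.sh {input} {output}"  # placeholder for the real assembler

    def assembled_fasta(wildcards):
        # the DAG is re-evaluated here once the checkpoint finishes,
        # which is also why a static rule graph can no longer be drawn
        outdir = checkpoints.assemble.get(sample=wildcards.sample).output[0]
        return os.path.join(outdir, wildcards.sample + ".fasta")

    rule blastn:
        input:
            assembled_fasta
        output:
            "results/blastn/{sample}.txt"
        shell:
            "blastn -query {input} -db nt -out {output}"  # illustrative only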

Cheers Ollie

douglowe commented 4 months ago

I will have a look over the branch this afternoon for you.

o-william-white commented 4 months ago

Many thanks again @douglowe

douglowe commented 4 months ago

I have a strange error, which is repeatable (I deleted the entire results folder and reran the workflow, just to make sure this error was real).

The error is:

/usr/bin/bash: line 2: 20756 Killed                  blobtools create --fasta results/assembled_sequence/Tacola_larymna.fasta --hits results/blastn/Tacola_larymna.txt --taxrule bestsumorder --taxdump resources/taxdump --cov results/minimap/Tacola_larymna.bam results/blobtools/Tacola_larymna &> logs/blobtools/Tacola_larymna.log
[Tue Jun  4 16:12:31 2024]
Error in rule blobtools:
    jobid: 258
    input: resources/taxdump, resources/taxdump/citations.dmp, resources/taxdump/delnodes.dmp, resources/taxdump/division.dmp, resources/taxdump/excludedfromtype.dmp, resources/taxdump/fullnamelineage.dmp, resources/taxdump/gencode.dmp, resources/taxdump/host.dmp, resources/taxdump/images.dmp, resources/taxdump/merged.dmp, resources/taxdump/names.dmp, resources/taxdump/nodes.dmp, resources/taxdump/rankedlineage.dmp, resources/taxdump/taxidlineage.dmp, resources/taxdump/typematerial.dmp, resources/taxdump/typeoftype.dmp, results/assembled_sequence/Tacola_larymna.fasta, results/blastn/Tacola_larymna.txt, results/minimap/Tacola_larymna.bam
    output: results/blobtools/Tacola_larymna/table.tsv
    log: logs/blobtools/Tacola_larymna.log (check log file(s) for error details)
    conda-env: /workspaces/skim2mt/.snakemake/conda/bf53d9daf317f0d585fb33bf783ba2cd_
    shell:

        blobtools create \
            --fasta results/assembled_sequence/Tacola_larymna.fasta \
            --hits results/blastn/Tacola_larymna.txt \
            --taxrule bestsumorder \
            --taxdump resources/taxdump \
            --cov results/minimap/Tacola_larymna.bam \
            results/blobtools/Tacola_larymna &> logs/blobtools/Tacola_larymna.log
        blobtools filter \
            --table results/blobtools/Tacola_larymna/table.tsv \
            --table-fields gc,length,Tacola_larymna_cov,bestsumorder_superkingdom,bestsumorder_kingdom,bestsumorder_phylum,bestsumorder_class,bestsumorder_order,bestsumorder_family,bestsumorder_species \
            results/blobtools/Tacola_larymna &>> logs/blobtools/Tacola_larymna.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

The logs/blobtools/Tacola_larymna.log file contains only this message:

Loading sequences from results/assembled_sequence/Tacola_larymna.fasta

[W::hts_idx_load3] The index file is older than the data file: results/minimap/Tacola_larymna.bam.csi

I am running this on my local (Apple Silicon) laptop, using a dev container - that might throw up some unusual behaviour. I'll check it again tomorrow, but it would be wise to get someone else to check the workflow on their setup too.

o-william-white commented 4 months ago

Hi @douglowe,

Thanks for highlighting this. I remember we had this issue during the hackathon. The blobtools create step uses quite a lot of memory when reading in the taxdump from NCBI.

I run the pipeline on our HPC, so it is rarely an issue there. Perhaps we need to add a line to the documentation stating the minimum memory requirements. I will run it again with --report to test the memory usage.

My log looks much the same:

Loading sequences from results/assembled_sequence/Litinga_cottini.fasta
 - processing Litinga_cottini_contig0: : 1it [00:00, 25.20it/s]
[W::hts_idx_load3] The index file is older than the data file: results/minimap/Litinga_cottini.bam.csi
Loading mapping data from results/minimap/Litinga_cottini.bam as Litinga_cottini
Loading parsed taxdump

Can you increase the amount of memory available on your system, or do we need to find a workaround for this step?
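
As an aside, the [W::hts_idx_load3] warning in both of our logs looks harmless - it just means the .csi index is older than the BAM. If it becomes distracting it can be regenerated, assuming samtools is available:

    # rebuild the CSI index so its timestamp is newer than the BAM
    samtools index -c results/minimap/Litinga_cottini.bam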

Cheers Ollie

o-william-white commented 4 months ago

Hello again @douglowe,

Below is the benchmark output (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#benchmark-rules) for blobtools on the sample Litinga_cottini:

cat benchmarks/Litinga_cottini_blobtools.txt | column -t
s         h:m:s    max_rss  max_vms  max_uss  max_pss  io_in   io_out  mean_load  cpu_time
186.3125  0:03:06  3130.36  4102.34  3010.57  3012.02  417.04  0.03    86.23      161.66
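
The memory columns are in MB, so blobtools create peaks at roughly 3.1 GB resident here. For anyone wanting to reproduce this, the table comes from adding a benchmark directive to the rule, along these lines (a sketch; the path is an assumption):

    rule blobtools:
        # ... existing directives ...
        # Snakemake appends one tab-separated line of runtime and
        # memory statistics each time the rule runs
        benchmark:
            "benchmarks/{sample}_blobtools.txt"
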
douglowe commented 4 months ago

Thanks for this information Ollie.

I'm trying again with limiting the number of concurrent tasks to 1 - as I recall now, I had to do that to avoid this problem.

I have been setting a memory maximum for snakemake, but only in total for the workflow. I wonder if we can set a minimum RAM requirement for specific tasks within the workflow?
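
For reference (values illustrative), the two knobs on the command line are concurrency and the workflow-wide memory pool that per-rule mem_mb requests are counted against:

    # one job at a time, with an 8 GB workflow-wide memory pool
    snakemake --cores 1 --resources mem_mb=8000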

o-william-white commented 4 months ago

Yes, I remember that fixed it. I will also try to limit the memory using the resources tag: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
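
Something along these lines, based on the blobtools command in the error output above (a sketch - the actual rule in the repo differs, e.g. it also runs blobtools filter to write table.tsv):

    rule blobtools:
        input:
            fasta="results/assembled_sequence/{sample}.fasta",
            hits="results/blastn/{sample}.txt",
            bam="results/minimap/{sample}.bam"
        output:
            directory("results/blobtools/{sample}")
        log:
            "logs/blobtools/{sample}.log"
        # mem_mb is a declared requirement, not a hard cap: the local
        # scheduler will not start this job unless that much of the
        # global --resources mem_mb pool is free
        resources:
            mem_mb=4000
        shell:
            "blobtools create --fasta {input.fasta} --hits {input.hits} "
            "--taxrule bestsumorder --taxdump resources/taxdump "
            "--cov {input.bam} {output} &> {log}"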

o-william-white commented 4 months ago

If you set the resources to mem_mb=1000 in the blobtools rule, the memory usage appears to go down:

cat benchmarks/Litinga_cottini_blobtools.txt | column -t
s         h:m:s    max_rss  max_vms  max_uss  max_pss  io_in  io_out  mean_load  cpu_time
187.9244  0:03:07  2753.25  2977.14  2729.16  2730.53  58.79  0.03    83.77      157.69

douglowe commented 4 months ago

I've tried setting mem_mb in the blobtools rule - it seems that I can still only run the workflow locally if I run with a single thread. Even running two threads is too much for my laptop.

douglowe commented 4 months ago

Maybe this isn't a major issue though - certainly not something which should stop this PR.

The workflow does run for me - so that is good. I'll have a proper look at the code now and see if there are any changes I'd suggest.

o-william-white commented 4 months ago

Hi all,

I tested the changes on a fork from @pecholleyn and created a new PR here: https://github.com/o-william-white/skim2mt/pull/33. I am happy to merge and move on, but if anyone wants to test this, let me know.

Is it sensible to close this PR now?

Cheers Ollie