Parallelizing DeepConsensus is actually easier than that. We plan to write a full guide, but here is a short summary:
There is a `--chunk` option in `ccs`, as mentioned in the quick start, which will produce sharded outputs -- we usually do 500 shards for a full SMRT-cell. There is no need to `samtools sort` any of the BAM files, since they are already sorted by ZMW. This order MUST be the same between the `subreads_to_ccs.bam` and `ccs.fasta` files, which it is when they come out of the `ccs` and `actc` steps, so it's best not to run anything like `samtools sort` that might change that order. Beyond that, you can follow the rest of the quick start separately for each shard.
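For concreteness, here is a minimal sketch of the sharding step, assuming a 500-way split and the `--min-rq` value used in the quick start; the paths, shard count, and thread count are placeholders to adjust for your setup:

```bash
# One-time: create a .pbi index so ccs can seek into the subreads BAM for chunking.
pbindex subreads.bam

# Run one shard of CCS; launch shards 1..500 in parallel (e.g. as cluster jobs).
shard=1       # this job's shard index (1-based)
n_total=500   # total number of shards
ccs subreads.bam "ccs.${shard}of${n_total}.bam" \
    --min-rq=0.88 \
    --chunk="${shard}/${n_total}" \
    --num-threads=4
```

Each shard then goes through `actc` and DeepConsensus independently, and the per-shard outputs can be merged at the end.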
Ok, I will try the `--chunk` option without `samtools sort` now.
Update: I just released a major change to the DeepConsensus quick start with detailed guidance for parallelization across multiple machines: https://github.com/google/deepconsensus/blob/r0.2/docs/quick_start.md
Maybe just to add here: I tried to follow the guide and created a Snakemake-based workflow for this: https://github.com/WestGermanGenomeCenter/deep_snake
Maybe someone finds this useful for breaking up deepconsensus jobs. It pretty much follows the quick start guide, except that I sort the BAM file with samtools, index it with samtools, and then break the job input files into 48 parts. If you have access to a cluster, it is very helpful to submit each deepconsensus job with one GPU and one CPU core to keep system RAM usage low. I found that deepconsensus-0.2.0 would quickly use a lot of system RAM, so for me the steps below were necessary. I also have access to two compute nodes with 8 Nvidia A10 cards each.
This is a SLURM script to submit to a cluster that has Nvidia A10 GPUs; the same could be done with other GPUs.
dc.sh
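The original script isn't reproduced here, so below is a minimal sketch of what such a script might look like, assuming the `deepconsensus run` flags from the 0.2 quick start; the partition name, memory request, and per-shard file paths are all hypothetical:

```bash
#!/bin/bash
#SBATCH --job-name=deepconsensus
#SBATCH --partition=gpu          # hypothetical partition name
#SBATCH --gres=gpu:1             # one GPU per task (A10s in this setup)
#SBATCH --cpus-per-task=1        # one CPU core to keep system RAM usage low
#SBATCH --mem=32G                # adjust to your data
#SBATCH --array=1-48             # one array task per shard

# Zero-padded shard index, matching hypothetical per-shard file names.
n=$(printf "%02d" "${SLURM_ARRAY_TASK_ID}")

# Flags as in the DeepConsensus 0.2 quick start; paths are placeholders.
deepconsensus run \
  --subreads_to_ccs="${n}.subreads_to_ccs.bam" \
  --ccs_fasta="${n}.ccs.fasta" \
  --checkpoint="model/checkpoint" \
  --output="${n}.output.fastq" \
  --batch_zmws=100
```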
Submit to the SLURM scheduler:
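With the array directive above, a single submission fans out across all 48 shards; SLURM runs as many concurrently as there are free GPUs (16 A10s across the two nodes here) and queues the rest:

```bash
sbatch dc.sh
```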