google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

deepconsensus-0.2.0 uses a lot of system RAM, so maybe best to split jobs into pieces, example provided #17

Closed. jelber2 closed this issue 2 years ago.

jelber2 commented 2 years ago

Maybe someone finds this useful for breaking up deepconsensus jobs. It pretty much follows the quick start guide, except that the BAM file is sorted and indexed with samtools and the job inputs are then broken into 48 parts. If you have access to a cluster, this makes it easy to submit each deepconsensus piece with one GPU and one CPU core to keep system RAM usage low. I found that deepconsensus-0.2.0 quickly uses a lot of system RAM, so for me the steps below were necessary. I also have access to two compute nodes with 8 Nvidia A10 cards each.

# let's say you start with a BAM file called 9.bam that contains the subreads to be converted to ccs/HiFi reads
MIMALLOC_PAGE_RESET=0 MIMALLOC_LARGE_OS_PAGES=1 ccs --all -j 34 9.bam 9.ccs.bam
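# (the MIMALLOC_* variables above are optional performance tuning for the
# mimalloc memory allocator; they are not required for ccs to run)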

# use pbindex to index the BAM file if needed (https://github.com/PacificBiosciences/align-clr-to-ccs/issues/2)
pbindex 9.bam

# use actc 0.1.0 with 34 cores to align the subreads to the ccs reads
~/bin/actc-0.1.0 -j 34 9.bam 9.ccs.bam 9.subreads_to_ccs.bam

# sort the alignments
samtools sort -@34 9.subreads_to_ccs.bam > 9.subreads_to_ccs_sorted.bam

# index the alignments
samtools index -@34 9.subreads_to_ccs_sorted.bam

# make a FAI index
samtools faidx 9.subreads_to_ccs.fasta

# get the ccs read names and lengths from the sorted BAM header
samtools view -H 9.subreads_to_ccs_sorted.bam|grep "^@SQ"|cut -f 2-3 |perl -pe "s/SN://g" |perl -pe "s/LN://g" \
> regions

# break into 48 equal parts
fspec=regions
num_files=48

# Work out lines per file.
total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))

# Split the actual file, maintaining lines.
split --lines=${lines_per_file} --numeric-suffixes=1 ${fspec}
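# split names the pieces x01, x02, ..., x48 (default prefix x with two-digit
# numeric suffixes), which is what the x?? globs below pick up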

# get a BED file of the ccs regions to extract from each of the 48 parts
# also get the read names
# use the read names to get just the FASTA for the 48 ccs parts

for i in `ls x??`
do
    awk '{print $1,0,$2}' OFS='\t' ${i} > ${i}.bed
    cut -f 1 ${i} > ${i}-reads
    seqtk seq -l0 9.subreads_to_ccs.fasta |paste - - |grep -f "${i}-reads" |tr '\t' '\n' > ${i}.fasta
    samtools faidx ${i}.fasta
done
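
# optional sanity check (not in the original steps): for each piece the BED line
# count, read-name count, and FASTA record count should all match
for i in `ls x??`
do
    echo "${i}: $(wc -l < ${i}.bed) $(wc -l < ${i}-reads) $(grep -c '>' ${i}.fasta)"
done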

# break the subreads aligned to ccs reads into 48 parts

for i in `ls x??`
do
    samtools view -@20 -h 9.subreads_to_ccs_sorted.bam --regions=${i}.bed |samtools view -@20 -b > ${i}.bam
    samtools index -@20 ${i}.bam
done

This is a SLURM script to submit to a cluster with Nvidia A10 GPUs; the same approach works with other GPUs.

dc.sh

#!/bin/bash
#
#----------------------------------------------------------------
# running multiple independent jobs
#----------------------------------------------------------------
#

#  Define options for how SLURM should run the job
#----------------------------------------------------------------
#
#SBATCH --job-name=dc
#
#Number of CPU cores to use within one node
#SBATCH -c 1
#
#Define the number of hours the job should run.
#Maximum runtime is limited to 10 days, i.e. 240 hours
#SBATCH --time=0:30:00
#
#Define the amount of RAM used by your job in GigaBytes
#In shared memory applications this is shared among multiple CPUs
#SBATCH --mem=24G
#
#Do not requeue the job in case it fails.
#SBATCH --no-requeue
#
# Define number of GPUs
#SBATCH --partition gpu
#SBATCH --gres=gpu:1
#SBATCH --constraint=A10
#
#Do not export the local environment to the compute nodes
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

# load the respective software module(s) you intend to use
#----------------------------------------------------------------
module load python/3.7
module load cuda/11.2.2
module load cudnn/8.1
. "/nfs/scistore16/itgrp/jelbers/miniconda3/etc/profile.d/conda.sh"

# SLURM array task IDs are 1,2,3,4,5,6,7,8,9 whereas
# the split pieces are named x01,x02,x03,x04,x05,x06,x07,x08,x09,
# so zero-pad the task ID with this if statement
if [[ ${SLURM_ARRAY_TASK_ID} -lt 10 ]]
then
    i=`echo "${SLURM_ARRAY_TASK_ID}"|perl -pe "s/(\d+)/0\1/g"`
else
    i=${SLURM_ARRAY_TASK_ID}
fi

export PATH="/nfs/scistore12/itgrp/jelbers/.local/bin:$PATH"

cd /nfs/scistore16/itgrp/jelbers/deepconsensus_quick_start/9.all

hostname
echo "--subreads_to_ccs=x${i}.bam"
echo "--ccs_fasta=x${i}.fasta"
echo "--output=9.deepconsensus.${i}.fastq"
deepconsensus run --cpus 1 --subreads_to_ccs=x${i}.bam \
--ccs_fasta=x${i}.fasta \
--checkpoint=/nfs/scistore16/itgrp/jelbers/deepconsensus_quick_start/model/checkpoint-50 \
--output=9.deepconsensus.${i}.fastq

submit to the SLURM scheduler as an array of 48 tasks

sbatch --array=1-48 dc.sh
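
Not part of the original commands, but once all 48 array tasks finish, the per-piece FASTQs can presumably just be concatenated into a single HiFi read set:

# combine the 48 per-piece outputs into one FASTQ
cat 9.deepconsensus.??.fastq > 9.deepconsensus.fastq
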
MariaNattestad commented 2 years ago

Parallelizing DeepConsensus is actually easier than that. We plan to write a full guide, but here is a short summary: there is a --chunk option in ccs, as mentioned in the quick start, which produces sharded outputs; we usually use 500 shards for a full SMRT Cell. There is no need to run samtools sort on any of the BAM files, since they are already sorted by ZMW. This order MUST be the same between the subreads_to_ccs.bam and ccs FASTA files, which it is when they come out of the ccs and actc steps, so it's best not to run anything like samtools sort that might change that order. Beyond that, you can follow the rest of the quick start separately for each shard.
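
For reference, here is a minimal sketch of what that shard-based flow could look like for a single shard, pieced together from the commands earlier in this thread (the 1/500 chunk, the 9.chunk1.* file names, the thread count, and the checkpoint path are placeholders, and it assumes actc writes the matching FASTA alongside its output BAM, as it appears to in the steps above):

# shard at the ccs step itself instead of splitting afterwards
# (the subreads BAM needs a .pbi index from pbindex)
ccs --all -j 34 --chunk 1/500 9.bam 9.chunk1.ccs.bam

# align the subreads to this shard's ccs reads, keeping the ZMW order (no samtools sort)
actc -j 34 9.bam 9.chunk1.ccs.bam 9.chunk1.subreads_to_ccs.bam
samtools faidx 9.chunk1.subreads_to_ccs.fasta

# run deepconsensus on the shard
deepconsensus run --cpus 1 \
  --subreads_to_ccs=9.chunk1.subreads_to_ccs.bam \
  --ccs_fasta=9.chunk1.subreads_to_ccs.fasta \
  --checkpoint=model/checkpoint-50 \
  --output=9.deepconsensus.chunk1.fastq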

jelber2 commented 2 years ago

Ok, I will try the --chunk option without samtools sort now.

MariaNattestad commented 2 years ago

Update: I just released a major change to the DeepConsensus quick start with detailed guidance for parallelization across multiple machines: https://github.com/google/deepconsensus/blob/r0.2/docs/quick_start.md

daaaaande commented 1 year ago

Maybe just to add here: I tried to follow the guide and created a Snakemake-based workflow for this: https://github.com/WestGermanGenomeCenter/deep_snake