google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Chunking with deepconsensus #34

Closed gevro closed 2 years ago

gevro commented 2 years ago

Hi, is there a way to do chunking with deepconsensus? For datasets where ccs was previously run and the chunks were merged, it would help if deepconsensus could do some form of chunking itself, rather than having to redo the ccs chunking from the original subreads.

kishwarshafin commented 2 years ago

hi @gevro ,

Can you please expand more on the question?

If you are asking about the chunking scheme: currently DeepConsensus splits each CCS read into windows of a defined length and then merges them again. The CCS run that happens upstream of DeepConsensus does not preserve any chunking information for use downstream.

gevro commented 2 years ago

What I mean is that the current workflow is to run ccs with chunking, which splits the data into shards, and then you can run actc and deepconsensus on each shard in parallel.

However, many users have samples that were already run through ccs, and all we have is a final HiFi ccs BAM file. It would be nice for deepconsensus to have an option to split the input ccs BAM into shards, similar to how ccs does this, i.e. a parameter like --chunk 1/10, --chunk 2/10, --chunk 3/10, etc.
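For reference, this is roughly what that upstream sharding looks like with ccs itself today (chunk syntax as in the pbccs docs; movie names are placeholders); deepconsensus currently has no equivalent flag, which is exactly what I'm asking for:

# Each invocation writes roughly one tenth of the ZMWs to its own shard.
ccs --chunk 1/10 movie.subreads.bam movie.ccs.1.bam
ccs --chunk 2/10 movie.subreads.bam movie.ccs.2.bam
# ... and so on up to --chunk 10/10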

If that is not possible, I will just split the ccs BAM into shards myself separately.

danielecook commented 2 years ago

@gevro, we recommend running ccs with the --min-rq=0.88 flag. Running ccs in its default configuration will filter out reads that could be salvaged by DeepConsensus.
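For example, when (re)generating the ccs bam upstream (file names here are placeholders):

# Keep lower-quality CCS reads (predicted read quality >= 0.88) so DeepConsensus can rescue them.
ccs --min-rq=0.88 movie.subreads.bam movie.ccs.bam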

Here is a snippet you can use to split manually (let me know if you have a better method for doing so!). This should work with both subread and ccs bams.

# Create an environment with the PacBio BAM tools plus jq and GNU parallel.
conda create -n pacbio pbbam jq zmwfilter parallel
conda activate pacbio

# Dump the ZMW hole numbers from the BAM's .pbi index into a list.
INPUT_BAM=n1000.subreads.bam
pbindexdump ${INPUT_BAM}.pbi | jq '.reads[].holeNumber' | uniq > /tmp/zmw_list.txt

# Split the ZMW list into roughly N_CHUNKS equally sized chunk files.
N_CHUNKS=10
TOTAL_ZMWS=$(wc -l < /tmp/zmw_list.txt)
LINES_PER_CHUNK=$(( ${TOTAL_ZMWS} / (${N_CHUNKS} - 1) ))
split -da 4 \
      --additional-suffix='.txt' \
      -l ${LINES_PER_CHUNK} \
      /tmp/zmw_list.txt \
      /tmp/zmw.split

# For one chunk file, write the matching ZMWs into their own BAM.
function extract_zmws {
    # Use the digits in the chunk file name (e.g. 0003) as the output suffix.
    output_chunk=${1//[!0-9]/}
    zmwfilter --include ${1} ${INPUT_BAM} /tmp/n1000.${output_chunk}.bam
}

export INPUT_BAM
export -f extract_zmws

# Process all chunk files in parallel, one job per core.
parallel --verbose -j $(nproc) extract_zmws ::: /tmp/zmw.split*

This will give you 10 bams that you can process independently with DeepConsensus:

/tmp/n1000.0000.bam
/tmp/n1000.0001.bam
/tmp/n1000.0002.bam
/tmp/n1000.0003.bam
/tmp/n1000.0004.bam
/tmp/n1000.0005.bam
/tmp/n1000.0006.bam
/tmp/n1000.0007.bam
/tmp/n1000.0008.bam
/tmp/n1000.0009.bam
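If the ccs bam is split the same way (as noted above, the snippet works for either kind of bam), each shard can then be processed roughly as in the quick start. File names below are placeholders, the deepconsensus flag names are those from the r0.3 quick start, and the checkpoint path depends on your model download:

# Map all original subreads against one ccs shard, then polish that shard.
actc subreads.bam ccs.shard0.bam subreads_to_ccs.shard0.bam

deepconsensus run \
  --subreads_to_ccs=subreads_to_ccs.shard0.bam \
  --ccs_bam=ccs.shard0.bam \
  --checkpoint=model/checkpoint \
  --output=output.shard0.fastq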
gevro commented 2 years ago

Thanks! I will try that. My only suggestion is that I think it is technically more efficient to map all subreads to all ccs reads with actc, and then to split both the ccs and the subreads_to_ccs BAMs.

That is, it is more efficient than splitting the ccs first and then mapping ALL subreads against each chunk of ccs.
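As a rough sketch of that alternative, reusing the per-chunk ZMW lists from the snippet above (this assumes zmwfilter handles the aligned actc output the same way it handles the unaligned bams):

# Align all subreads to all ccs reads once.
actc subreads.bam ccs.bam subreads_to_ccs.bam

# Then subset both bams with the same ZMW list, e.g. for chunk 0000:
zmwfilter --include /tmp/zmw.split0000.txt ccs.bam /tmp/ccs.0000.bam
zmwfilter --include /tmp/zmw.split0000.txt subreads_to_ccs.bam /tmp/subreads_to_ccs.0000.bam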

armintoepfer commented 2 years ago

actc can chunk. That's all you need.
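For example (the exact flag spelling is assumed here to mirror the ccs convention; check actc --help):

# Produce the alignments for just the first of ten chunks.
actc --chunk 1/10 subreads.bam ccs.bam subreads_to_ccs.chunk1.bam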

gevro commented 2 years ago

I'm not sure actc chunking would work for deepconsensus.

If a user has a full (non-chunked) ccs BAM, running actc with chunking would leave the user with the full ccs BAM plus chunked actc output. But deepconsensus input requires the ccs and the actc BAMs to match in terms of their chunking.

danielecook commented 2 years ago

I just tested this approach, and it works fine.

The main downside to this approach is that the ccs bam is read in its entirety for each shard, but this may be preferable to splitting your ccs bam.
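Concretely, per shard this looks something like the following (a sketch; flag names as in the r0.3 quick start, checkpoint path is a placeholder). Note that --ccs_bam is the full, unsplit ccs bam, while --subreads_to_ccs is the per-chunk actc output:

deepconsensus run \
  --subreads_to_ccs=subreads_to_ccs.chunk1.bam \
  --ccs_bam=ccs.bam \
  --checkpoint=model/checkpoint \
  --output=output.chunk1.fastq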

Here is the code responsible for the logic of reading in subread_to_ccs and ccs bams:

https://github.com/google/deepconsensus/blob/r0.3/deepconsensus/preprocess/utils.py#L975-L983