google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.

A tutorial for running DeepConsensus #25

Closed · shengxinzhuan closed this issue 2 years ago

shengxinzhuan commented 2 years ago

My computer hardware looks like this:

OS: Ubuntu 20.04.3 LTS (x86_64)
Python version: Python 3.8.10
CPU: Intel i7-10700K (8c/16t, Comet Lake)
Memory: 32 GB
GPU: 1x NVIDIA RTX A4000 (8 GB)

Install the required packages

Create an environment for deepconsensus using conda (here via mamba)

mamba create -n deepconsensus -c bioconda -c conda-forge python=3.8 pbcore pbbam pbccs pbmm2 parallel jq gcc pycocotools bioconda::seqtk bioconda::unimap bioconda::bedtools bioconda::minimap2 bioconda::extracthifi bioconda::zmwfilter bioconda::pysam bioconda::samtools=1.10 bioconda::pyfastx=0.8.4
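
Optionally, activate the new environment and spot-check a couple of the installed tools; this is just a sanity check, not part of the pipeline.

### Sanity check: confirm the key tools resolve inside the new environment
conda activate deepconsensus
ccs --version
samtools --version | head -n 1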

Download actc for read mapping

wget https://github.com/PacificBiosciences/align-clr-to-ccs/releases/download/0.1.0/actc 
chmod u+x actc
mv actc PATH/miniconda3/envs/deepconsensus/bin  ### PATH is a placeholder for your miniconda3 install location
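
Before moving on, it may be worth confirming the binary landed on PATH and is executable (command -v is POSIX; the --help flag is an assumption about actc's CLI, though most PacBio tools print usage with it):

### Confirm actc is on PATH and runs
command -v actc
actc --help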

Install DeepConsensus [GPU] using pip

conda activate deepconsensus
pip install "deepconsensus[gpu]==0.2.0"  ### quoted so shells that glob [ ] (e.g. zsh) don't mangle it
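
A quick way to verify the install succeeded (pip show is standard; the one-line import just confirms the package loads):

### Verify the DeepConsensus install
pip show deepconsensus
python3 -c 'import deepconsensus'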

Prepare all the input files needed for DeepConsensus

Get the ccs.bam

ccs --all -j 15 raw.subreads.bam out.ccs.bam
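
As discussed later in this thread, ccs also accepts --chunk, which makes this step parallel and resumable. A minimal sketch, assuming 10 chunks; pbindex and pbmerge come from the pbbam package already installed in the environment, and chunked ccs needs a .pbi index on the input:

### Optional: run ccs in resumable chunks and merge the results
pbindex raw.subreads.bam
for i in {1..10}; do echo 'ccs --all -j 2 --chunk '${i}'/10 raw.subreads.bam out.ccs.'${i}'.bam' ; done > ccs_chunk.job
parallel -j 5 < ccs_chunk.job
pbmerge -o out.ccs.bam out.ccs.*.bam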

Get the subreads_to_ccs.bam

Tips

If you use actc to map the subreads to the ccs reads without chunking, you may encounter the following error when running deepconsensus.

I0324 19:48:00.776319 140117319313216 quick_inference.py:492] Processed a batch of 100 ZMWs in 62.39794731140137 seconds
I0324 19:48:00.808807 140117319313216 quick_inference.py:570] Processed 7000 ZMWs in 4584.726703 seconds
Process ForkPoolWorker-1061:
Traceback (most recent call last):
  File "/home/wanglab/miniconda3/envs/deepconsensus/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/home/wanglab/miniconda3/envs/deepconsensus/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/home/wanglab/miniconda3/envs/deepconsensus/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/wanglab/miniconda3/envs/deepconsensus/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes
    self._send(buf)
  File "/home/wanglab/miniconda3/envs/deepconsensus/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

This error seems to be caused by some resource hitting an upper limit as the iterations accumulate. The way to avoid it is to chunk the data when running actc.

Chunking your subreads.bam

### Generate all the command lines with the shell
for i in {1..1000}; do echo 'actc -j 1 raw.subreads.bam out.ccs.bam subreads_to_ccs.'${i}'.bam --chunk '${i}'/1000' ; done > actc_chunk.job

### Submit all jobs with GNU parallel
parallel -j 15 < actc_chunk.job

### Index each subreads_to_ccs.${i}.fasta
for i in {1..1000}; do echo 'samtools faidx subreads_to_ccs.'${i}'.fasta' ; done > samtools_index.job

parallel -j 15 < samtools_index.job
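
Before running DeepConsensus, it is worth confirming that every chunk produced its outputs; each count below should print 1000:

### Check that all chunks completed
ls subreads_to_ccs.*.bam | wc -l
ls subreads_to_ccs.*.fasta.fai | wc -l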

Get the model for DeepConsensus

mkdir deepconsensus_model && cd deepconsensus_model
wget https://storage.googleapis.com/brain-genomics-public/research/deepconsensus/models/v0.2/params.json
wget https://storage.googleapis.com/brain-genomics-public/research/deepconsensus/models/v0.2/checkpoint-50.index
wget https://storage.googleapis.com/brain-genomics-public/research/deepconsensus/models/v0.2/checkpoint-50.data-00000-of-00001
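
After the downloads, step back out of the directory, since the run step below references the model by the relative path deepconsensus_model/. Note also that --checkpoint takes the checkpoint-50 prefix, not a single file name:

cd ..
### The model is referenced by the checkpoint-50 prefix; the directory should now hold:
### checkpoint-50.data-00000-of-00001  checkpoint-50.index  params.json
ls deepconsensus_model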

Run DeepConsensus

for i in {1..1000};
do
deepconsensus run \
  --subreads_to_ccs=subreads_to_ccs.${i}.bam  \
  --ccs_fasta=subreads_to_ccs.${i}.fasta \
  --checkpoint=deepconsensus_model/checkpoint-50 \
  --output=output.${i}.fastq \
  --batch_zmws=100
done
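
Because this loop walks 1000 chunks serially, a crash partway through means starting over. A minimal resumable variant of the same loop (same flags as above) skips chunks whose output already exists:

### Resumable variant: skip chunks with a non-empty output
### (delete a chunk's partial output.${i}.fastq before rerunning if a run crashed mid-chunk)
for i in {1..1000};
do
[ -s output.${i}.fastq ] && continue
deepconsensus run \
  --subreads_to_ccs=subreads_to_ccs.${i}.bam \
  --ccs_fasta=subreads_to_ccs.${i}.fasta \
  --checkpoint=deepconsensus_model/checkpoint-50 \
  --output=output.${i}.fastq \
  --batch_zmws=100
done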

Merge the output

cat output.*.fastq > total.fastq
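
A quick sanity check on the merged file, using plain awk to count reads and total bases:

### Count reads and total bases in the merged FASTQ
awk 'NR % 4 == 2 { n++; s += length($0) } END { print n " reads, " s " bases" }' total.fastq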
MariaNattestad commented 2 years ago

Hi, thanks for sharing this! Was there anything you felt was missing from the quick start? We recently updated it to include detailed parallelization instructions, showing how to run ccs with --chunk, so please take a look!

shengxinzhuan commented 2 years ago

@MariaNattestad Hi, MariaNattestad! Thanks for the reply! I know how to run ccs with --chunk. But in China we usually receive the ccs.bam from the sequencing company, so I don't need to generate the ccs.bam again, and I moved --chunk to the actc step instead. By the way, ccs with and without --chunk took the same time in my test. If we need to generate the ccs.bam ourselves, --chunk is recommended because it can resume from the breakpoint.

MariaNattestad commented 2 years ago

Thanks for this context. I want to make sure to mention that in the quick start we do recommend running ccs yourself with --all so that DeepConsensus has the chance to rescue some reads that would have gotten filtered out as being below (usually) Q20 when ccs was run with default settings. Your yield above Q20 will often increase significantly if you are able to rerun ccs with --all.

shengxinzhuan commented 2 years ago

@MariaNattestad Thanks for the reply! I will try again if the read depth is not sufficient to assemble the genome.

MariaNattestad commented 2 years ago

@shengxinzhuan sounds good! I'll close this issue, but feel free to open a new one (so we get notified) if you have any new issues or questions. We want to understand the problems that our users face using DeepConsensus in practice!

shengxinzhuan commented 2 years ago

@MariaNattestad OK, I will open a new one if I run into a new problem. Thanks!