google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

How to improve run time? #14

Closed (MartinPippel closed this issue 2 years ago)

MartinPippel commented 2 years ago

I tried to solve my issue #10 by creating a Singularity container, and it seems to work on our compute cluster. However, it is very slow. Do you have any advice on how to speed up the deepconsensus step?

I created a toy data set with the following file sizes:

2.0G m54345U_211128_022942.chunk0.subreads.bam
60M  m54345U_211128_022942.chunk0.ccs.fasta
1.8G subreads_to_ccs.bam

I started deepconsensus (on a 24-core machine with 250 GB RAM) with the default arguments:

$SING_CMD python3 -m deepconsensus.scripts.run_deepconsensus \
  --input_subreads_aligned=subreads_to_ccs.bam \
  --input_subreads_unaligned=split/m54345U_211128_022942.chunk0.subreads.bam \
  --input_ccs_fasta=ccs/m54345U_211128_022942.chunk0.ccs.fasta \
  --output_directory=deepconsensus \
  --checkpoint=${CHECKPOINT_PATH}

After almost 7 hours of run time it is still in step 2 (2_generate_input), and it is using only a single thread. This is a snapshot from htop:

74354 pippel     20   0 70.6G 64.8G  100M R 100. 25.8  6h26:45 python3 -m deepconsensus.preprocess.generate_input --merged_datasets_path=deepconsensus/1_merge_datasets --output_path=deepconsensus/2_generate_input --input_ccs_fasta=ccs/m54345U_211128_022942.chunk0.ccs.fas
74160 pippel     20   0 70.6G 64.8G  100M S 100. 25.8  6h37:12 python3 -m deepconsensus.preprocess.generate_input --merged_datasets_path=deepconsensus/1_merge_datasets --output_path=deepconsensus/2_generate_input --input_ccs_fasta=ccs/m54345U_211128_022942.chunk0.ccs.fas

Additionally, I get the following TensorFlow warning, which might be related to my problem:

 $SING_CMD python3 -m deepconsensus.preprocess.generate_input --helpfull
2021-12-03 09:26:35.456162: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2021-12-03 09:26:35.456208: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
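The warning above only says that the CUDA runtime library could not be loaded. A quick way to see what the container actually provides is a check like the sketch below. It is not part of DeepConsensus: the helper names cuda_runtime_loadable and visible_gpus are made up for illustration, and the library name is simply the one taken from the log line. It would be run inside the container, e.g. via $SING_CMD python3.

```python
import ctypes
import importlib.util


def cuda_runtime_loadable(soname: str = "libcudart.so.11.0") -> bool:
    """Return True if the dynamic loader can open the given CUDA runtime library.

    The default soname is the one named in the TensorFlow warning.
    """
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Same failure mode as the dlerror in the log: library not found.
        return False


def visible_gpus() -> list:
    """List the GPUs TensorFlow can see (empty if none, or if TensorFlow is absent)."""
    if importlib.util.find_spec("tensorflow") is None:
        return []
    import tensorflow as tf  # imported lazily because it is slow to load

    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    print("libcudart loadable:", cuda_runtime_loadable())
    print("GPUs visible to TensorFlow:", visible_gpus())
```

If the first check fails or the second list is empty, TensorFlow falls back to the CPU, which matches the "Ignore above cudart dlerror" hint in the log.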

Is deepconsensus using a GPU? Any advice is highly appreciated.

Thanks, Martin

danielecook commented 2 years ago

The current release of DeepConsensus (v0.1.0) is a proof-of-principle version. We are planning a new release (v0.2.0) in the near future, which should greatly speed up DeepConsensus.

Please let us know if you have any further suggestions for future releases.