lbcb-sci / raven

De novo genome assembler for long uncorrected reads
MIT License

Small plasmid misassembly #40

Closed jagos01 closed 3 years ago

jagos01 commented 3 years ago

Hello Robert, I am using Raven v1.4.0 to assemble a bacterial genome (from nanopore data) that contains 1 chromosome, 1 large plasmid and 1 small plasmid. When I view the GFA file in Bandage, I can see that 3 circular contigs are generated. The chromosome and large plasmid are the correct size, but the small plasmid (~12 kb) seems to assemble as a multimer (96.6 kb) and is not output to the FASTA file. Do you have any suggestions that might allow this plasmid to assemble correctly? A large percentage of the reads (~30%) in this dataset map to the plasmid. Thanks, Scott

rvaser commented 3 years ago

Hi Scott, not sure what happened, could you please send me the raven.cereal file via mail?

Best regards, Robert

rvaser commented 3 years ago

Below is the multimer plasmid you were talking about. It consists of only 3 reads, but the multiplicity is 8. Can you please check whether there is a read longer than 12 kb mapping fully to this sequence? You can extract all sequences from the GFA with `awk '$1 ~/S/ {print ">"$2"\n"$3}' graph.gfa > seqs.fa` and then find this weird plasmid with `grep ">Utg1074" -A1 seqs.fa > plasmid.fa`.

[Image: plasmid]
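One way to run the check asked for above is to map the reads to `plasmid.fa` and look for reads longer than a single plasmid copy that align almost end to end. The sketch below is only an illustration and not something from Raven itself. It assumes the alignments are in PAF format (for example from `minimap2 -x map-ont plasmid.fa reads.fastq > reads_vs_plasmid.paf`, where both file names are hypothetical), and the 0.9 coverage cutoff for "mapping fully" is an arbitrary assumption.

```python
# Sketch: list reads longer than one plasmid copy that align almost end to end.
# Assumes PAF input; columns used are query name, query length, query start, query end.

PLASMID_LEN = 12_000       # expected single-copy size mentioned in the thread (~12 kb)
MIN_QUERY_COVERAGE = 0.9   # hypothetical cutoff for "mapping fully"


def long_fully_mapped_reads(paf_lines, min_len=PLASMID_LEN, min_cov=MIN_QUERY_COVERAGE):
    """Return {read_name: read_length} for reads longer than min_len that have
    at least one alignment covering min_cov of the read."""
    hits = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip lines that are not complete PAF records
        name = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        if qlen <= min_len:
            continue  # read is no longer than one plasmid copy
        if (qend - qstart) / qlen >= min_cov:
            hits[name] = qlen
    return hits
```

With a real file this could be called as `long_fully_mapped_reads(open("reads_vs_plasmid.paf"))`. Any read it reports spans more than one plasmid copy, which is the kind of read that could support a multimer in the graph.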

jagos01 commented 3 years ago

Yes, there are several reads longer than 12 kb mapping to the plasmid. I have re-basecalled the data with Guppy v4.4.2. Depending on how I demultiplex (guppy_barcoder or qcat), I get either 2 circular contigs (chromosome and large plasmid) or 2 circular contigs and one linear contig. The linear contig is still larger than 12 kb (~56 kb or ~64 kb). I have not extracted the sequences yet, but I suspect reads longer than 12 kb will map to the linear contig.

rvaser commented 3 years ago

Are those longer reads sequencing artefacts? Should there be only one circular plasmid of 12kbp?

jagos01 commented 3 years ago

Hello Robert, Correct, there should only be one 12kb plasmid so the longer reads must be artifacts. I will look at some of the reads and see if they can be filtered out. Thanks for your help. Scott
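If the artefact reads can be identified by name, removing them before reassembly could look roughly like the sketch below. It is a minimal illustration, not a tested workflow: it assumes plain four-line FASTQ records, and the set of read IDs to exclude is hypothetical and would have to come from your own inspection of the reads.

```python
# Sketch: drop reads with known IDs from a FASTQ stream.
# Assumes every record is exactly four lines (header, sequence, '+', qualities).

def filter_fastq(lines, exclude_ids):
    """Yield the lines of every FASTQ record whose read ID is not in exclude_ids."""
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            # The read ID is the first whitespace-delimited token after '@'.
            parts = record[0][1:].split()
            read_id = parts[0] if parts else ""
            if read_id not in exclude_ids:
                yield from record
            record = []
```

For example, `out.writelines(filter_fastq(open("reads.fastq"), artefact_ids))` would write the remaining reads to an open output file, with `reads.fastq` and `artefact_ids` standing in for your own input file and ID set.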

On Tue., Feb. 23, 2021, 3:17 a.m. Robert Vaser, notifications@github.com wrote:

Are those longer reads sequencing artefacts? Should there be only one circular plasmid of 12kbp?
