Read splitting, what subcommand do it?

nanoporetech / dorado

Oxford Nanopore's Basecaller

https://nanoporetech.com/


Read splitting, what subcommand do it? #714

Closed by Alteroldis 3 months ago

Alteroldis commented 3 months ago

Dear developers, thank you for improving dorado so quickly!

But I think there can never be too much documentation. For example, I still don't understand which subcommand performs read splitting (by mid-strand adapter or barcode). Can I run "dorado trim" and expect it to split my previously basecalled reads? Or is re-basecalling with "dorado duplex" the only way? Earlier I ran dorado duplex and then guppy_barcoder with the "--detect_mid_strand_adapter --trim_adapters" options, but after carefully reading the guppy_barcoder manual, I'm not sure it does any splitting.

tijyojwad commented 3 months ago

Hi @Alteroldis - read splitting in dorado is enabled by default. There's no separate subcommand for it; it happens during the basecaller or duplex commands. The trim command only trims adapters/barcodes.

Splitting has been turned on since dorado v0.4.0. If your data was basecalled with an earlier version, you would need to re-basecall. If it was basecalled with v0.4.0 or later, your reads should already be split. Please let us know if you see a large number of unsplit reads.
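The version rule above can be sketched as a small check. This is only an illustration: the helper names `parse_version` and `needs_rebasecall` are hypothetical, and the sketch assumes you already know which dorado version produced your reads (how you look that up is not covered here).

```python
import re

# Per the reply above: splitting is on by default from dorado v0.4.0.
SPLITTING_SINCE = (0, 4, 0)


def parse_version(text):
    """Extract (major, minor, patch) from strings such as 'v0.4.0' or '0.5.0+abc123'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_rebasecall(dorado_version):
    """Return True if the reads predate default read splitting (hypothetical helper)."""
    return parse_version(dorado_version) < SPLITTING_SINCE


for version in ("0.3.4", "v0.4.0"):
    verdict = "re-basecall" if needs_rebasecall(version) else "already split"
    print(version, "->", verdict)
```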

Alteroldis commented 3 months ago

Hi @tijyojwad, thank you for the quick reply! I will try the latest version. Earlier I did duplex basecalling with version 0.3.4.

Please let us know if you see a large number of unsplit reads.

How can you check this? I started to worry because my assembled genome is diploid (not only in size, but also in the number of identical genes) rather than a mosaic haploid, as it should be after flye. I know this can happen when there are a large number of structural variations, but at least 15 representatives of this type and at least 6 representatives of this class do not show this.

tijyojwad commented 3 months ago

How can you check this?

I usually check this by aligning the reads to a reference and checking whether there are a large number of supplementary alignments. For example, if the first part of a read aligns to one portion of the genome and the second part aligns to a completely different portion, it likely means the read is unsplit.
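One way to put a number on that check is sketched below. It assumes the alignments are available as SAM text (for example, produced by an aligner or converted from BAM) and counts how many mapped reads carry at least one supplementary alignment, using the standard SAM flag bit 0x800. The function name `supplementary_fraction` and the toy records are made up for illustration; this is not a dorado feature.

```python
# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def supplementary_fraction(sam_lines):
    """Return (reads_with_supplementary, mapped_reads, fraction) from SAM text lines."""
    mapped = set()
    with_supplementary = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED or flag & SECONDARY:
            continue  # ignore unmapped reads and secondary alignments
        mapped.add(name)
        if flag & SUPPLEMENTARY:
            with_supplementary.add(name)
    if not mapped:
        return 0, 0, 0.0
    return len(with_supplementary), len(mapped), len(with_supplementary) / len(mapped)


# Toy records (only the first columns matter here): read1 has a supplementary
# alignment on another contig, read2 maps once, read3 is unmapped.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t50M50S\t*\t0\t0\t*\t*",
    "read1\t2048\tchr2\t900\t60\t50H50M\t*\t0\t0\t*\t*",
    "read2\t16\tchr1\t500\t60\t100M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(supplementary_fraction(example))
```

The fraction is only a hint. Supplementary alignments also arise from genuine structural variation, so it is more informative to compare the value between datasets (for example, before and after re-basecalling) than to read it in isolation.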