nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Residual or excessive barcode sequences in Dorado Demux #420

Open starsyi opened 11 months ago

starsyi commented 11 months ago

I'm not sure whether the authors are aware that sequencing adaptors and barcodes are not removed completely. When tools such as Dorado Demux, Porechop, and Guppy_barcoder are used to trim adaptors and barcodes, residual sequence of 1-15 bp is often left at the 5' end, and the same happens at the 3' end. Could this be improved? For example, a raw sequencing read:

ATGTTATGTGGCTGCCTTCGTTCAGTTACGTATTGCTA ^^^ AGGTTAA ^^^ CCAAACCCAACAACCTAGATAGGC ^^^ CAGCACCT ^^^ CTGGACCTGAGGCCTCTGGAGGCTACTGATGATGCCTGCTGTGAACGCAGACACTGGTGTGATGCGATGCCTGCGCCTGCAGCGGCAGTGCCCTGGGCACGGTTTTGAGCTTGTACCCAGCGCTGCTTTTGCCTTGCTCTGTGACCCCAGGCAAGCTGCCTCACCTCTCTGGGCCAGTTTCCCCATCGTACAGTGGTGCTGCACACCCTGGCCCTGGCCCCGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTGTCTCATCAGTGCCCGGTGCTGGGTCAGGGATCGACTGAGGCTCTGAGCTAACTGGGAAACACAGTGGCCT ^^^ AGGTGCTG ^^^ GCCTATCTAGGTTGTTGGGTTTGGTGAGCCTTCCTGAATGGTT

In the read above, the adaptor, the barcode, and the sequences flanking the barcode on both sides are separated by ^^^.

Trimmed sequence:

GCACCTCTGGACCTGAGGCCTCTGGAGGCTACTGATGATGCCTGCTGTGAACGCAGACACTGGTGTGATGCGATGCCTGCGCCTGCAGCGGCAGTGCCCTGGGCACGGTTTTGAGCTTGTACCCAGCGCTGCTTTTGCCTTGCTCTGTGACCCCAGGCAAGCTGCCTCACCTCTCTGGGCCAGTTTCCCCATCGTACAGTGGTGCTGCACACCCTGGCCCTGGCCCCGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTGTCTCATCAGTGCCCGGTGCTGGGTCAGGGATCGACTGAGGCTCTGAGCTAACTGGGAAACACAGTGGCC

The trimmed sequence still contains part of the barcode flank sequence (GCACCT) at the 5' end, and one base ('T') of the insert was removed at the 3' end.

The actual insert sequence should be as follows:

CTGGACCTGAGGCCTCTGGAGGCTACTGATGATGCCTGCTGTGAACGCAGACACTGGTGTGATGCGATGCCTGCGCCTGCAGCGGCAGTGCCCTGGGCACGGTTTTGAGCTTGTACCCAGCGCTGCTTTTGCCTTGCTCTGTGACCCCAGGCAAGCTGCCTCACCTCTCTGGGCCAGTTTCCCCATCGTACAGTGGTGCTGCACACCCTGGCCCTGGCCCCGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTGTCTCATCAGTGCCCGGTGCTGGGTCAGGGATCGACTGAGGCTCTGAGCTAACTGGGAAACACAGTGGCCT
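
The comparison above can be sketched in a few lines of Python. This is only an illustration, not how Dorado or any other trimmer works: the function name `trim_offsets` is hypothetical, the sequences are a shortened toy stand-in for the read above, and it assumes the trimmed read and the expected insert share one long exact match (no basecalling errors inside the insert).

```python
from difflib import SequenceMatcher


def trim_offsets(trimmed: str, insert: str) -> tuple[int, int]:
    """Return (five_prime, three_prime) trimming errors in bases.

    Positive value: leftover non-insert bases remain at that end.
    Negative value: insert bases were trimmed away at that end.
    Hypothetical helper for illustration only.
    """
    # autojunk=False matters for DNA: on long sequences difflib would
    # otherwise treat very frequent symbols (all four bases) as junk.
    matcher = SequenceMatcher(None, trimmed, insert, autojunk=False)
    m = matcher.find_longest_match(0, len(trimmed), 0, len(insert))
    if m.size == 0:
        raise ValueError("trimmed read and insert share no sequence")

    # Bases before the shared block in each sequence.
    five_prime = m.a - m.b
    # Bases after the shared block in each sequence.
    three_prime = (len(trimmed) - (m.a + m.size)) - (len(insert) - (m.b + m.size))
    return five_prime, three_prime


# Toy stand-in: the first 24 bases of the insert shown above.
insert = "CTGGACCTGAGGCCTCTGGAGGCT"
# Same pattern as in the report: 6 flank bases left at the 5' end,
# one insert base missing at the 3' end.
trimmed = "GCACCT" + insert[:-1]

print(trim_offsets(trimmed, insert))  # (6, -1)
```

Running a check like this over many reads with known inserts would give a distribution of 5' and 3' trimming errors rather than a single anecdote.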
starsyi commented 11 months ago

I believe that obtaining the accurate insert sequence is crucial for genome assembly, sequence alignment, and the study of sequence features (especially for cfDNA, scRNA, etc.). I hope the trimming can be optimized to address this.

tijyojwad commented 11 months ago

Hi @starsyi - thank you for the feedback and for the detailed analysis. I completely agree that we should improve the accuracy of our trimming. We have an ongoing effort to add adapter trimming as well, so we'll investigate ways to enhance the accuracy of the trim positions.