nanoporetech / pychopper

A tool to identify, orient, trim and rescue full length cDNA reads

100% unclassified reads - 0 classified #12

Closed SziKayLeung closed 5 years ago

SziKayLeung commented 5 years ago

Hello,

While pychopper ran successfully (installed via conda), all the reads from my MinION run were unclassified (according to the stats.output file).

I created my cDNA using the SMARTer cDNA Synthesis kit, and so adjusted the primer.fasta file to match its 5' and 3' primers:

cDNA|1 AAGCAGTGGTATCAACGCAGAGTACATGGG

cDNA|2 GTACTCTGCGTTGATACCACTGCTT

But when I grepped the input fastq file, there were occurrences where the primers were matched at both ends of the read (find attached). I was under the impression that Pychopper looks for the specified primers at the two ends, and if it finds both in the directions indicated at both ends, it is considered classified. So in this case, both reads in my example should be considered classified?

The command I used was:

dna_classifier.py -b $REFERENCE/nanopore.primer.fasta -r $1.report.pdf -u $1.unclassified.fastq -S $1.stats.output -A $1.scores.output $RAWDATA/$1.merged.fastq $1.FL.fastq

Any guidance on this would be greatly appreciated - Thank you! match_primer.fastq.docx

bsipos commented 5 years ago

It seems that cDNA|2 is a sub-sequence of the reverse complement of cDNA|1! Under these circumstances it would be possible to detect if the read is full length, but it is impossible to figure out the strand so pychopper will not classify them.
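The overlap is easy to check by hand. The following is a minimal sketch in plain Python, using only the two primer sequences posted above; the helper `reverse_complement` is written here for illustration and is not part of pychopper.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Primer sequences exactly as posted in this issue.
CDNA1 = "AAGCAGTGGTATCAACGCAGAGTACATGGG"
CDNA2 = "GTACTCTGCGTTGATACCACTGCTT"

rc1 = reverse_complement(CDNA1)
offset = rc1.find(CDNA2)  # -1 would mean cDNA|2 does not occur in rc1

print(rc1)     # reverse complement of cDNA|1
print(offset)  # start position of cDNA|2 within it
```

Here cDNA|2 turns out to be the last 25 bases of the reverse complement of cDNA|1, so any alignment to one primer is also an alignment to the other on the opposite strand.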

SziKayLeung commented 5 years ago

Thank you for your reply and for clarifying about the need for strand direction.

Unfortunately, due to the nature of the SMARTer cDNA synthesis kit, the 5’ ends of both the plus and negative strands carry the same sequence (cDNA|1 as above). The only difference is that on the plus strand this sequence ends with ATGGG, whereas on the negative strand it is followed by polyT.

Plus strand: Motor-5’-AAGCAGTGGTATCAACGCAGAGTACATGGG../..AAAAAAAAGTACTCTGCGTTGATACCAACTGCTT-3’

Non Plus strand: Tether-3’-TTCGTCACCATAGTTGCGTCTATGTACCCC../..TTTTTTTTCATGAGACGCAACTATGGTGACGAA-5’
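To make the symmetry concrete, here is a small sketch that builds a toy read following the layout above and compares it with its reverse complement. The transcript body `INSERT` and the 8-base polyA length are made up for illustration, and this is a simplified model of the library structure, not how pychopper itself works.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACATGGG"  # cDNA|1 from this issue
SHARED = ADAPTER[:25]                       # part common to both ends
INSERT = "ACGTTGCAAGGCTTAACCGGATCG"         # hypothetical transcript body
POLY_A = "A" * 8                            # hypothetical tail length

# Plus-strand read: adapter, transcript, polyA, reverse-complemented adapter.
plus_read = ADAPTER + INSERT + POLY_A + reverse_complement(SHARED)
# The same molecule sequenced from the other strand.
minus_read = reverse_complement(plus_read)

n = len(SHARED)
same_start = plus_read[:n] == minus_read[:n]
same_end = plus_read[-n:] == minus_read[-n:]

print(same_start, same_end)
print(plus_read[n:n + 5])   # bases following the shared 5' part, plus strand
print(minus_read[n:n + 8])  # bases following the shared 5' part, minus strand
```

In this model the first and last 25 bases are identical in both orientations; the two reads only differ in what follows the shared 5’ part (ATGGG versus polyT).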

I’ve tried including just cDNA|1 and cDNA|2 with the polyA tail. However, I was only able to classify about 1% of total reads (14449 +, 10759 -, and 2096804 unclassified). Moreover, the classified reads were a mix of plus and negative, despite using the following sequences:

-Plus strand-

cDNA|1 CAGTGGTATCAACGCAGAGTACATGGG ####cDNA|1 as above

cDNA|2 AAAAAAAAGTACTCTGCGTTGATAC ####cDNA|2 as above

What are your thoughts? Is there any way around this? Thank you!

bsipos commented 5 years ago

I am afraid there is no way to orient the reads without adding unique sequences to the two primers. You could still decide whether they are full length based on the alignment scores written out by specifying -A.