Closed ligiamateiu closed 5 years ago
Hi,
Pychopper looks for the specified primers at the two ends of a read. If it finds them in orientations consistent with a read sequenced from the forward or reverse strand, it considers the read classified. In any other case (one of the primers missing, or the wrong orientations) the read is considered unclassified.
Did you generate a report when running pychopper? If yes, could you link page two here? Looking at that might shed some more light on the issue.
Botond
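The decision rule described above can be sketched roughly as follows. This is an illustration only: the primer names (SSP/VNP) come from later in this thread, and the hit representation is a hypothetical one, not pychopper's internals.

```python
def classify(start_hit, end_hit):
    """Classify a read from primer hits found near its two ends.

    start_hit / end_hit: None, or a (primer_name, strand) tuple --
    a hypothetical representation for illustration.
    A read is classified only when both primers are present in an
    orientation consistent with a forward- or reverse-strand read.
    """
    if start_hit is None or end_hit is None:
        return "unclassified"              # a primer is missing
    if start_hit == ("SSP", "+") and end_hit == ("VNP", "-"):
        return "forward"                   # sequenced on the forward strand
    if start_hit == ("VNP", "+") and end_hit == ("SSP", "-"):
        return "reverse"                   # sequenced on the reverse strand
    return "unclassified"                  # wrong primer pair or orientation
```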
Thanks for the quick reply! Yes, I did generate the report. goo.gl/Z1m1NY
Hi,
Looks like pychopper mostly finds zero or one primers in your reads. I have some more questions in order to investigate this issue:
Botond
Hi again,
Which version of cDNA kit did you use? Did you modify the protocol in any way? kit SQK-PCS108 (PCR cDNA seq). No adaptations
Are you using only the "pass" reads from basecalling which have been filtered by quality score? It's everything, which indeed might explain a part of the issue!
Is your sample dominated by short transcripts? The read-length histogram peaks at ~1 kb.
Thanks!! Ligia
Hi,
Could you please try running on the pass reads only and let me know how much is classified?
Botond
Yup, that's the next logical step! I'll get back to you when completed. Thanks again for the prompt answers!
Presumably taking the reads through Porechop would be advisable, to remove low-quality bases at either end in addition to removing adapter sequences?
wdecoster advised NanoFilt -q 7


Yes, NanoFilt -q 7 should do the job. I do not recommend running it through Porechop first (maybe after).
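The idea behind a `-q 7` filter can be sketched like this: compute a mean read quality on the error-probability scale and keep reads with mean Q &gt;= 7. This is a rough sketch of the concept only; NanoFilt's exact averaging and record handling may differ.

```python
import math

def mean_qual(qual_str, offset=33):
    # Convert each Phred+33 character to an error probability, average
    # the probabilities, and convert back to a Phred-scaled mean.
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual_str]
    return -10 * math.log10(sum(probs) / len(probs))

def quality_filter(records, min_q=7.0):
    # records: iterable of (name, sequence, quality_string) tuples.
    return [r for r in records if mean_qual(r[2]) >= min_q]
```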
Ah ok interesting, I will try that, as I am also struggling with % of classified reads, ~39%. Although I am using a custom cDNA-PCR library prep, so I thought that was why.
I'll go back to the raw fastq base called reads, and try again
Hi again, I'm now running Porechop, but I see it detects and trims the adapters listed below. Still, my understanding was that pychopper uses scikit-bio to align the reads against these adapters. Am I wrong about the process?
Porechop output (trimming adapters from read ends):
SQK-MAP006_Y_Top_SK63: GGTTGTTTCTGTTGGTGCTGATATTGCT
SQK-MAP006_Y_Bottom_SK64: GCAATATCAGCACCAACAGAAA
PCR_1_start: ACTTGCCTGTCGCTCTATCTTC
PCR_1_end: GAAGATAGAGCGACAGGCAAGT
PCR_2_start: TTTCTGTTGGTGCTGATATTGC
PCR_2_end: GCAATATCAGCACCAACAGAAA
PCR_3_start: TACTTGCCTGTCGCTCTATCTTC
PCR_3_end: GAAGATAGAGCGACAGGCAAGTA
Hi,
Pychopper uses the parasail library to align the SSP and VN primers to the read; however, it does not do any trimming. So if you run Porechop after pychopper, it will still find and trim adapters/primers.
Botond
Hey, Oh, right! I misread your porechop statement!
So after NanoFilt I retained 6.7M reads. The proportion of detected full-length transcripts with pychopper stays the same: only 25% of the reads are classified as full-length transcripts. I append the link with the new run (after NanoFilt -q 7, left) and the old one (all reads, right). https://bit.ly/2DDDjgd
Do you have any thoughts on how I can get more reads after pychopper? I assume that only the "classified" reads are meant for the downstream analysis with pinfish pipeline (https://github.com/nanoporetech/pipeline-pinfish-analysis).
Many thanks for being patient with me :)
Was the number of classified reads 25% as compared to the nanofilt retained reads or total reads?
Yes, only the classified reads go into the pinfish analyses.
Botond
Yes, it's the same proportion of reads that pass pychopper.
Just for clarification could you please paste in the absolute numbers (before filtering, after pychopper and after pychopper without filtering)?
not filtered: 10.36M reads -> classified 2.21M
after NanoFilt: 6.76M reads -> classified 2M
Thanks! I did not expect more reads to pass pychopper after filtering; however, I did expect the proportion of full-length reads to be higher in the filtered set. Indeed, it is higher by about 8% (not that much, admittedly).
Could you please run pychopper while setting the -s parameter to 96, 95 and 94. This might squeeze out some extra classified reads.
Hello @ligiamateiu (and also @callumparr)! I have just implemented a less rigorous heuristic approach for detecting full-length reads and strandedness. It can be invoked using the -x flag.
It should give you more full length reads, however it might make more mistakes. Have a go with it if interested and let me know if you are happy with the output.
Botond
Ok, I'll give it a try with -x. Btw, using -s 96, 95 or 94 did not make any difference in the number of reads that pass (??).
Hey Botond,
These are the new numbers: NanoFilt 6.76M -> 3.72M with pychopper -x
(compared with 2M obtained before) I'll use it like this for now.
Many thanks for the help! Ligia
FYI, the -x mode has an update and it is more stringent. You can set it less stringent by lowering the value of -l.
@bsipos I am unsure why, but I get fewer reads when lowering the -s cutoff in the normal mode (ignoring the recent -x flag). I would expect more reads, as it would be less stringent, if I understand correctly. If I change -s to 100, the number of returned reads greatly increases.
Also, is there any logic in the code that looks for a string of As at the 3' end? In my reverse primer there is a cell barcode between the poly-dT and the PCR sequence, so I cannot add a sequence of A's to the primer for the DNA linker.
I confirm that the most recent update to the -x flag increases the returned full-length fraction, jumping from ~39% to 49%.
Hi Botond,
Somewhat related, can the reads with only the VNP or SSP also be separated into their own files?
Cheers
@callumparr The -s parameter determines the cutoff which is used to detect "significant" alignments against the start and end of the read and its reverse complement in "classic" mode. More precisely, it is the percentile of the distribution of scores of alignments of the respective primer against random sequences. If -s is too low, the primers might be detected at multiple places, which leads to classification failure. I tuned -s on a control strand dataset and found that 98 performed well. It seems that it might not generalise well, so if you get more classified reads with -s 100 then you should use that setting.
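The percentile idea can be sketched as a toy simulation; pychopper's actual simulation, alignment function, and API differ, so treat every name below as illustrative only.

```python
import random

def score_cutoff(primer, align_score, percentile=98, n_samples=500,
                 read_len=200, seed=0):
    """Estimate a 'significant alignment' cutoff as a percentile of the
    scores of the primer aligned against random sequences (the idea
    behind pychopper's -s parameter; this function is a sketch, not
    pychopper's API).
    """
    rng = random.Random(seed)
    scores = sorted(
        align_score(primer,
                    "".join(rng.choice("ACGT") for _ in range(read_len)))
        for _ in range(n_samples)
    )
    # Index of the requested percentile in the sorted score list.
    idx = min(n_samples - 1, int(n_samples * percentile / 100))
    return scores[idx]
```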
When you say that -s 100 greatly increases the returns, is that higher than what you get in heuristic mode?
There is no logic for specifically looking for As at the 3' end so it is no problem if you have to omit them from the primer. Just out of curiosity: how did you attach the cell barcodes?
@cerdelyan The reads with only VNP or SSP cannot be separated into files at this point. The -A option can export the scores in a tsv, you could potentially use that to separate out the failed reads.
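Splitting reads by primer count from a scores table could look something like the sketch below. The column names (read_id, SSP_score, VNP_score) are assumptions for illustration; check the actual header of the -A output before using anything like this.

```python
import csv
import io

def split_by_primer_count(tsv_text):
    """Partition read IDs by how many primers scored above zero.

    tsv_text: tab-separated table with assumed columns
    read_id, SSP_score, VNP_score (hypothetical names).
    Returns {0: [...], 1: [...], 2: [...]} -- reads with no primer,
    one primer, or both primers detected.
    """
    groups = {0: [], 1: [], 2: []}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        found = sum(float(row[c]) > 0 for c in ("SSP_score", "VNP_score"))
        groups[found].append(row["read_id"])
    return groups
```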
I guess they're just UMIs, as I am not doing any partitioning. Currently the reverse primer is:
oligo-dT-V - random sequence - PCR handle, with template oligo
Then follow up with second strand RT primer from the 5’. This also includes a UMI in the middle.
Then do 1D ligation for nanopore sequencing
If I wanted to search with my entire primer sequences, this would involve placing N stretches in the fasta file.
With -s 98 (default) I get ~10% as full length
With -s lower I get even less
With -s 100 I get ~40% as full length
With -x (default stringency 0.25) I get 50% as full length
@callumparr I would not include the Ns and the As in the primer sequences if your PCR handle is at least ~ 25 bases long.
@bsipos sorry, one more question about the -g flag for alignment parameters. Are we expected to enter values for the alignment parameters, or are they determined by the initial score cutoff? Or are there default values?
Yes, the PCR handles are greater than 25 bases. In an updated library prep they're shorter, so I will have to think of something else for those.
@callumparr Yes, there are defaults for the -g flag: (match, mismatch, gap_open, gap_extend) = (1, -1, 2, 1). They determine the cutoff along with the -s parameter through simulation.
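As a rough illustration of what those four parameters mean, here is a toy Gotoh-style local alignment scorer. This is not parasail's implementation, and gap-cost conventions vary between libraries (here opening a gap costs gap_open + gap_extend for its first position).

```python
def local_align_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Smith-Waterman-style local alignment score with affine gaps,
    using the same parameter roles as pychopper's -g defaults
    (match, mismatch, gap_open, gap_extend) = (1, -1, 2, 1).
    """
    NEG = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best local score at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in a
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - gap_extend,
                          H[i][j - 1] - gap_open - gap_extend)
            F[i][j] = max(F[i - 1][j] - gap_extend,
                          H[i - 1][j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With the defaults, a perfect match of a primer inside a longer read scores its full length, while unrelated sequence scores near zero, which is what the simulated cutoff separates.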
Hi, I'm using pychopper on test data generated with the PCR cDNA kit for the MinION. Pychopper (with the default params) finds only 25% of the reads to be "full length cDNA reads". The primers are the same as the ones in the file you provide (cdna_barcodes.fas). How can I reduce the unclassified amount? What exactly are these unclassified reads?