nanoporetech / pychopper

A tool to identify, orient, trim and rescue full length cDNA reads

Unclassified reads #9

Closed ligiamateiu closed 5 years ago

ligiamateiu commented 5 years ago

Hi, I'm using pychopper on test data generated with the PCR cDNA kit for MinION. Pychopper (with the default params) finds only 25% of the reads to be "full length cDNA reads". The primers are the same as the ones in the file you provide (cdna_barcodes.fas). How can I reduce the number of unclassified reads? What exactly are these unclassified reads?

bsipos commented 5 years ago

Hi,

Pychopper is looking for the specified primers at the two ends of the read. If it finds them in an orientation consistent with a read sequenced from the forward or reverse strand, it considers the read classified. In any other case (one of the primers missing, or the primers in the wrong orientation) the read is considered unclassified.

Did you generate a report when running pychopper? If so, could you link page two here? Looking at that might shed some more light on the issue.
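
For reference, a minimal sketch of a run that also writes the report (the cdna_classifier.py entry point and the -b/-r flags are assumptions here; check cdna_classifier.py -h for the exact options in your version):

```bash
# Untested sketch: classify reads and write the PDF report.
# -b: primer/barcode fasta shipped with pychopper (cdna_barcodes.fas)
# -r: per-run report; page two shows where the primers were (not) found
cdna_classifier.py -b cdna_barcodes.fas -r report.pdf reads.fastq full_length.fastq
```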

Botond

ligiamateiu commented 5 years ago

Thanks for the quick reply! Yes, I did generate the report. goo.gl/Z1m1NY

bsipos commented 5 years ago

Hi,

Looks like pychopper mostly finds zero or one primer in your reads. I have some more questions in order to investigate this issue:

Botond

ligiamateiu commented 5 years ago

Hi again,

Thanks!! Ligia

bsipos commented 5 years ago

Hi,

Could you please try running on the pass reads only and let me know what fraction gets classified?

Botond

ligiamateiu commented 5 years ago

Yup, that's the next logical step! I'll get back to you when completed. Thanks again for the prompt answers!

callumparr commented 5 years ago

Presumably taking the reads through Porechop would be advisable, to remove the low-quality bases at either end in addition to removing adapter sequences?

ligiamateiu commented 5 years ago

wdecoster advised NanoFilt -q 7  

bsipos commented 5 years ago

Yes, NanoFilt -q 7 should do the job. I do not recommend running the reads through Porechop first (maybe after).
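
As a minimal sketch of that order (NanoFilt reads FASTQ on stdin; the pychopper invocation reuses the assumed entry point and -b/-r flags from the earlier sketch):

```bash
# Filter on mean read quality first, then classify the filtered reads.
NanoFilt -q 7 < reads.fastq > reads.q7.fastq
cdna_classifier.py -b cdna_barcodes.fas -r report_q7.pdf reads.q7.fastq full_length.q7.fastq
```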

callumparr commented 5 years ago

Ah ok, interesting, I will try that, as I am also struggling with the percentage of classified reads (~39%). Although I am using a custom cDNA-PCR library prep, so I thought that was why.

I'll go back to the raw basecalled fastq reads and try again.

ligiamateiu commented 5 years ago

Hi again, I'm now running Porechop, but I see it detects and trims the adapters listed below. Still, my understanding was that pychopper uses scikit-bio to align the reads against these adapters. Am I wrong about the process?

Porechop output: trimming adapters from read ends
SQK-MAP006_Y_Top_SK63: GGTTGTTTCTGTTGGTGCTGATATTGCT
SQK-MAP006_Y_Bottom_SK64: GCAATATCAGCACCAACAGAAA
PCR_1_start: ACTTGCCTGTCGCTCTATCTTC
PCR_1_end: GAAGATAGAGCGACAGGCAAGT
PCR_2_start: TTTCTGTTGGTGCTGATATTGC
PCR_2_end: GCAATATCAGCACCAACAGAAA
PCR_3_start: TACTTGCCTGTCGCTCTATCTTC
PCR_3_end: GAAGATAGAGCGACAGGCAAGTA

bsipos commented 5 years ago

Hi,

Pychopper uses the parasail library to align the SSP and VNP primers to the read; however, it does not do any trimming. If you run Porechop after pychopper, it will still find and trim adapters/primers.
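
If trimming is wanted, the order would therefore be pychopper first and Porechop afterwards on the classified output, roughly like this (the pychopper invocation is assumed as in the earlier sketches):

```bash
# 1) Orient and classify full-length reads; pychopper does not trim.
cdna_classifier.py -b cdna_barcodes.fas reads.q7.fastq full_length.fastq
# 2) Optionally trim adapters/primers from the classified reads afterwards.
porechop -i full_length.fastq -o full_length.trimmed.fastq
```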

Botond

ligiamateiu commented 5 years ago

Hey, Oh, right! I misread your porechop statement!

So after NanoFilt I retained 6.7M reads. The proportion of detected full-length transcripts with pychopper stays the same: only 25% of the reads are classified as full-length. I append a link comparing the new run (after NanoFilt -q 7, left) and the old one (all reads, right). https://bit.ly/2DDDjgd

Do you have any thoughts on how I can get more reads out of pychopper? I assume that only the "classified" reads are meant for the downstream analysis with the pinfish pipeline (https://github.com/nanoporetech/pipeline-pinfish-analysis).

Many thanks for being patient with me :)

bsipos commented 5 years ago

Was the 25% of classified reads calculated relative to the NanoFilt-retained reads or to the total reads?

Yes, only the classified reads go into the pinfish analyses.

Botond

ligiamateiu commented 5 years ago

Yes, it's the same proportion of reads that pass pychopper.

bsipos commented 5 years ago

Just for clarification could you please paste in the absolute numbers (before filtering, after pychopper and after pychopper without filtering)?

ligiamateiu commented 5 years ago

unfiltered reads: 10.36M -> classified 2.21M
NanoFilt-filtered: 6.76M -> classified 2M

bsipos commented 5 years ago

Thanks! I did not expect more reads to pass pychopper after filtering; however, I did expect the proportion of full-length reads to be higher in the filtered set. Indeed, it is higher by about 8% (not that much, admittedly).

Could you please run pychopper while setting the -s parameter to 96, 95 and 94? This might squeeze out some extra classified reads.
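
That sweep could be scripted along these lines (entry point and -b flag assumed as in the earlier sketches; the read count assumes plain 4-line FASTQ records):

```bash
# Try a few -s cutoffs and count how many reads end up classified.
for s in 96 95 94; do
    cdna_classifier.py -b cdna_barcodes.fas -s "$s" reads.q7.fastq full_length.s"$s".fastq
    echo "s=$s: $(( $(wc -l < full_length.s"$s".fastq) / 4 )) classified reads"
done
```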

bsipos commented 5 years ago

Hello @ligiamateiu (and also @callumparr)! I have just implemented a less rigorous, heuristic approach for detecting full-length reads and strandedness. It can be invoked using the -x flag.

It should give you more full-length reads; however, it might make more mistakes. Have a go with it if you're interested and let me know if you are happy with the output.
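
Invocation would be the same command with -x added, for example (the other flags are assumed as in the sketches above):

```bash
# Heuristic mode: less rigorous primer detection, typically more classified reads.
cdna_classifier.py -b cdna_barcodes.fas -x reads.q7.fastq full_length.heuristic.fastq
```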

Botond

ligiamateiu commented 5 years ago

Ok, I'll give it a try with -x. Btw, using -s 96, 95 or 94 did not make any difference in the number of passed reads (??).

ligiamateiu commented 5 years ago

Hey Botond, here is the new number: NanoFilt 6.76M -> 3.72M classified with pychopper -x (compared with the 2M obtained before). I'll use it like this for now.

Many thanks for the help! Ligia

bsipos commented 5 years ago

FYI, the -x mode has been updated and is now more stringent. You can make it less stringent by lowering the value of -l.
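
For example, assuming -l is the heuristic-mode stringency (a later comment in this thread quotes 0.25 as its default), a less stringent run might look like:

```bash
# 0.2 is an arbitrary illustrative value below the reported 0.25 default.
cdna_classifier.py -b cdna_barcodes.fas -x -l 0.2 reads.q7.fastq full_length.heuristic.fastq
```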

callumparr commented 5 years ago

@bsipos I am unsure why, but I get fewer reads when lowering the -s cutoff in the normal mode (ignoring the recent -x flag). I would expect more reads, as it would be less stringent, if I understand correctly. If I change -s to 100, the number of returned reads greatly increases.

Also, is there any logic in the code that looks for a string of As at the 3' end? In my reverse primer there is now a cell barcode between the poly-dT and the PCR sequence, so I cannot add a stretch of As to the primer sequence.

I confirm that the most recent update of the -x flag increases the returned full-length reads, jumping from ~39% to 49%.

cerdelyan commented 5 years ago

Hi Botond,

Somewhat related, can the reads with only the VNP or SSP also be separated into their own files?

Cheers

bsipos commented 5 years ago

@callumparr The -s parameter determines the cutoff used to detect "significant" alignments against the start and end of the read and its reverse complement in "classic" mode. More precisely, it is the percentile of the distribution of scores from aligning the respective primer against random sequences. If -s is too low, the primers might be detected in multiple places, which leads to classification failure. I tuned -s on a control strand dataset and found that 98 performed well. It seems that it might not generalise well, so if you get more classified reads with -s 100 then you should use that setting.

When you say that -s 100 greatly increases the returns, is that higher than what you get in heuristic mode?

There is no logic for specifically looking for As at the 3' end so it is no problem if you have to omit them from the primer. Just out of curiosity: how did you attach the cell barcodes?

bsipos commented 5 years ago

@cerdelyan The reads with only the VNP or SSP cannot be separated into files at this point. The -A option can export the scores to a TSV file; you could potentially use that to separate out the failed reads.
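
A rough sketch of that idea (the layout of the -A TSV is not spelled out here, so inspect its header before filtering; the other flags are assumed as in the earlier sketches):

```bash
# Export per-read primer alignment scores alongside the classification.
cdna_classifier.py -b cdna_barcodes.fas -A scores.tsv reads.fastq full_length.fastq
# Look at the header to see which columns hold the read name and per-primer scores,
# then filter on those columns (e.g. with awk) to collect the IDs of failed reads
# and extract them from the FASTQ (e.g. with seqkit grep -f ids.txt).
head -n 1 scores.tsv
```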

callumparr commented 5 years ago

I guess they're just UMIs, as I am not doing any partitioning. Currently the reverse primer is:

oligo-dT-V - random sequence - PCR handle, with template oligo

Then I follow up with a second-strand RT primer from the 5' end. This also includes a UMI in the middle.

Then I do 1D ligation for nanopore sequencing.

If I wanted to search with my entire primer sequences, this would involve placing stretches of Ns in the fasta file.

With -s 98 (default) I get ~10% as full length.
With a lower -s I get even less.
With -s 100 I get ~40% as full length.
With -x (default stringency 0.25) I get 50% as full length.

bsipos commented 5 years ago

@callumparr I would not include the Ns and the As in the primer sequences if your PCR handle is at least ~ 25 bases long.

callumparr commented 5 years ago

@bsipos Sorry, one more question, about the -g flag for alignment parameters. Are we expected to enter values for the alignment parameters, or are they determined by the initial score_cutoff? Or are there some default values?

Yes, the PCR handles are greater than 25 bases. In an update to the library prep they're shorter, so I will have to think of something else for those.

bsipos commented 5 years ago

@callumparr Yes, there are defaults for the -g flag: (match, mismatch, gap_open, gap_extend) = (1, -1, 2, 1). Together with the -s parameter, they determine the cutoff through simulation.
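
Spelling those defaults out explicitly would look something like this (entry point and -b flag assumed as in the earlier sketches; the value order is match,mismatch,gap_open,gap_extend):

```bash
# Explicit defaults: match=1, mismatch=-1, gap_open=2, gap_extend=1.
# The score cutoff is then derived from these together with -s by simulation.
cdna_classifier.py -b cdna_barcodes.fas -g "1,-1,2,1" -s 98 reads.fastq full_length.fastq
```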