mortazavilab / TALON

Technology agnostic long read analysis pipeline for transcriptomes
MIT License

Many reads fail QC during TALON run but meet primary, coverage, and identity filters #98

Open callumparr opened 2 years ago

callumparr commented 2 years ago

Using TALON v5, installed with python setup.py install, on an HPC running Debian.

Using Python version 3.6.7.

talon --f /analysisdata/fantom6/Interactome/ONT-CAGE_TALON_Callum/F6_interactome_config_run2.csv --db /analysisdata/fantom6/Interactome/ONT-CAGE_TALON_Callum/F6_interactome.db --build hg38 --threads 12 --o /analysisdata/fantom6/Interactome/ONT-CAGE_TALON_Callum/F6_interactome_run2

I kept the defaults of 0.9 for fraction aligned and 0.8 for identity.

I was rooting through the TALON QC log file because we are seeing many reads filtered out despite using cap-trap and oligo-dT, so I am fairly sure we have good-quality data. I did find a potential issue that may account for many reads having a low fraction aligned: because of my library prep, pychopper is not effectively trimming the polyA tails from the FASTQ reads. But I then saw an additional subset of alignments that were filtered out even though they were primary alignments and passed both the fraction-aligned and identity filters.
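To illustrate why an untrimmed polyA tail can push a read under the 0.9 cutoff, here is a minimal sketch that computes the fraction of a read that is aligned from its CIGAR string. It assumes that clipped bases (S and H) count as unaligned and that M, =, X and I count as aligned read bases; this is one common definition and is not necessarily the exact calculation TALON performs. The function name fraction_aligned and the example CIGAR strings are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def fraction_aligned(cigar):
    """Return aligned read bases / total read bases for a CIGAR string.

    Assumption: soft- and hard-clipped bases are unaligned; M, =, X and I
    are aligned. D, N and P consume no read bases and are ignored.
    """
    aligned = 0
    total = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=XI":
            aligned += n
            total += n
        elif op in "SH":
            total += n
    return aligned / total if total else 0.0


# A hypothetical 1000-base read whose last 150 bases (for example an
# untrimmed polyA tail) are soft-clipped falls below a 0.9 cutoff.
print(fraction_aligned("850M150S"))           # 0.85
# The same read spliced across an intron: N does not change the fraction.
print(fraction_aligned("400M2000N450M150S"))  # 0.85
```

Under this definition, a soft-clipped tail of more than 10% of the read length is enough on its own to fail the default fraction-aligned filter.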

I attach an UpSet plot of the reasons an alignment passed to TALON either passes or fails the QC step. The third column shows around 3.5M reads that fail for no reported reason.

I was looking through the TALON_label log and saw roughly 0.5M reads with evidence of internal priming, but from what I understand this does not factor into generating the TALON database.

Is there some other behind the scenes filtering going on during database generation that isn't reported in the QC log?

[Image: iPSC_rep1_run1_UpSetR]
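The check described above can be sketched in a few lines of Python: for every read that did not pass QC, list which of the three reported filters it fails, and collect the reads that fail none of them. This is only a rough sketch. It assumes the QC log is tab-separated with columns named read_ID, passed_QC, primary_mapped, fraction_aligned and identity, that the flags are written as 1/0, and that a value below the threshold counts as a failure; check all of these against the header of your own log. The rows in the demo are synthetic.

```python
import csv
import io
from collections import Counter


def _flag(value):
    """Interpret a log flag; assumes '1' or 'True' means yes."""
    return value.strip().lower() in {"1", "true"}


def _number(value):
    """Parse a float, returning None for placeholders such as 'NA'."""
    try:
        return float(value)
    except ValueError:
        return None


def failure_reasons(row, min_cov=0.9, min_identity=0.8):
    """Return the reported filters that this log row fails, as a tuple."""
    reasons = []
    if not _flag(row["primary_mapped"]):
        reasons.append("not_primary")
    cov = _number(row["fraction_aligned"])
    if cov is None or cov < min_cov:
        reasons.append("low_fraction_aligned")
    ident = _number(row["identity"])
    if ident is None or ident < min_identity:
        reasons.append("low_identity")
    return tuple(reasons)


def summarize(handle, **thresholds):
    """Count reason combinations among failed reads; list unexplained ones."""
    counts = Counter()
    unexplained = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if _flag(row["passed_QC"]):
            continue  # only failed reads are of interest here
        reasons = failure_reasons(row, **thresholds)
        counts[reasons] += 1
        if not reasons:
            unexplained.append(row["read_ID"])
    return counts, unexplained


# Synthetic demo rows (not real data); column names are assumptions.
demo = io.StringIO(
    "dataset\tread_ID\tpassed_QC\tprimary_mapped\tread_length\tfraction_aligned\tidentity\n"
    "toy\tr1\t1\t1\t1200\t0.98\t0.95\n"
    "toy\tr2\t0\t1\t1000\t0.85\t0.95\n"
    "toy\tr3\t0\t0\t900\tNA\tNA\n"
    "toy\tr4\t0\t1\t1500\t0.97\t0.93\n"
)
counts, unexplained = summarize(demo)
print(unexplained)  # ['r4']
```

For a real log, replace the StringIO object with open(path, newline="") on the QC log file. The keys of counts are the reason combinations, which correspond to the columns of an UpSet plot; the empty tuple is the "no reported reason" group.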

fairliereese commented 2 years ago

Your intuition that internal priming / the reproducibility filter should not be affecting these numbers is correct.

I'm looking into it otherwise. I've checked a log file that I have lying around and have found something similar :/ It does not seem to me that this should be happening. I will update you when I have found anything.

callumparr commented 2 years ago

Thank you for the reply and for looking into it. When I have time, I will look into which reads fail this way and what their characteristics are.

fairliereese commented 2 years ago

If you're also planning to look into it on your end, here's some code that might be useful as a starting point: https://github.com/fairliereese/220421_talon_debug/blob/master/check_talon_log.ipynb

callumparr commented 2 years ago

I looked into it a bit more and I am still at a loss as to why some reads are failing. This was consistent across multiple samples, although all were processed the same way, so it is possible that I am doing something weird.

alexandergofton commented 5 months ago

Any update on this? Following @fairliereese's advice, I've checked my own data and found the same issue.

[Image: Screenshot 2024-06-12 at 4 10 19 pm]