rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0

Improve overrepresented sequence sampling so common genome repeats show up less. #173

Open rhpvorderman opened 3 months ago

rhpvorderman commented 3 months ago

This is hard for Illumina. For nanopore I guess it is possible to only take the first 200 base pairs from either end, rather than the entire sequence. Currently most of the overrepresented sequences are common genome repeats.

Sebastien-Raguideau commented 4 days ago

Hi, just wanted to add to this: I've had an issue where low-complexity overrepresented sequences (AAAAAA....) were estimated as high as 12%. This number is erroneous and seems to be driven so high by a few long reads with long tails of low-complexity regions. I can't trust the overrepresented sequence estimate at the moment (v0.5). Best, Seb

rhpvorderman commented 3 days ago

@Sebastien-Raguideau, in the latest version, each sequence is only sampled once per read. That will lower the percentage for low-complexity repeat sequences.
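To illustrate the difference, here is a minimal sketch in plain Python. It is not sequali's actual code: the function names, the fragment length `k` and the toy reads are all made up for the example. It compares counting every occurrence of a fragment with counting each fragment at most once per read.

```python
from collections import Counter


def fragments(sequence, k):
    """Yield every overlapping fragment of length k (hypothetical helper)."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_every_occurrence(reads, k):
    """Count a fragment each time it appears, even within the same read."""
    counts = Counter()
    for read in reads:
        counts.update(fragments(read, k))
    return counts


def count_once_per_read(reads, k):
    """Count a fragment at most once per read by deduplicating with a set."""
    counts = Counter()
    for read in reads:
        counts.update(set(fragments(read, k)))
    return counts


# Toy data: two short reads and one read with a long poly-A stretch.
reads = ["ACGTACGTAC", "TTGCATGCAA", "A" * 50]
k = 6  # illustrative fragment length, not a sequali default

every = count_every_occurrence(reads, k)
once = count_once_per_read(reads, k)

# Share of all sampled fragments that are poly-A (inflated by one read).
share_of_fragments = every["AAAAAA"] / sum(every.values())
# Fraction of reads that contain poly-A at least once.
fraction_of_reads = once["AAAAAA"] / len(reads)
```

With the first approach a single low-complexity read dominates the counts; with the second, the result can be read as the fraction of reads containing the fragment.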

However, if there is a substantial number of long reads in the mix, the chance that these contain AAAAAAA etc. is quite high, as that sequence occurs a massive number of times in the human genome (and probably most eukaryotic genomes).

So I am planning to improve this feature so that it only samples the extremities of the sequences (with a configurable length). This would massively lower the number of common genome repeats, including poly-A, telomeres etc.
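A rough sketch of what sampling only the ends could look like, in plain Python. This is an assumption about the idea, not the planned implementation: `read_ends`, `end_fragments`, `end_length` and the fragment length `k` are hypothetical names, and the default of 200 just echoes the number mentioned earlier in this thread.

```python
def read_ends(sequence, end_length=200):
    """Return the parts of a read to sample: both ends, or the whole read if it is short."""
    if end_length <= 0:
        raise ValueError("end_length must be positive")
    if len(sequence) <= 2 * end_length:
        # The two ends would touch or overlap, so sample the read as a whole.
        return [sequence]
    return [sequence[:end_length], sequence[-end_length:]]


def end_fragments(sequence, k, end_length=200):
    """Collect the distinct length-k fragments found in the sampled ends only."""
    found = set()
    for part in read_ends(sequence, end_length):
        for i in range(len(part) - k + 1):
            found.add(part[i:i + k])
    return found


# Toy long read: a poly-A stretch in the middle, flanked by short unique ends.
long_read = "ACGTAC" + "A" * 1000 + "GTCAGT"

ends_only = end_fragments(long_read, k=6, end_length=6)
whole_read = end_fragments(long_read, k=6, end_length=len(long_read))
```

In this toy case the internal poly-A stretch is never sampled when only the ends are used, while sampling the whole read still picks it up. Repeats that sit at the very ends of a read would of course still be counted.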

Sebastien-Raguideau commented 3 days ago

My use case is mostly prokaryotes, so I suppose the latest version will have fixed my issue. It makes sense to me that if you want to know what fraction of reads possess an overrepresented motif, the number of times a motif is seen in a read is ignored. I'll get that latest version, thx.