Open rhpvorderman opened 3 months ago
Hi, just wanted to add to this: I've had an issue where low-complexity overrepresented sequences (AAAAAA...) were estimated as high as 12%. This number is erroneous and seems to be driven so high by a few long reads with long tails of low-complexity regions. I can't trust the overrepresented sequence estimate at the moment (v0.5). Best, Seb
@Sebastien-Raguideau, in the latest version, each sequence is only sampled once per read. That will lower the percentage for low-complexity repeat sequences.
However, if there is a substantial number of long reads in the mix, the chance that these contain AAAAAAA etc. is quite high, as that sequence has a massive number of occurrences in the human genome (and probably in most eukaryotic genomes).
So I am planning to improve this feature so that it only samples the extremities of the sequences (with a configurable length). This would massively lower the number of common genome repeats sampled, including poly-A, telomeres etc.
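To illustrate the idea of sampling only the extremities, here is a rough Python sketch. It is not the actual implementation: `read_ends`, `kmers` and the `end_length` parameter are made-up names for this example, and the 200 bp value is only an illustrative setting.

```python
def read_ends(sequence: str, end_length: int = 200) -> list[str]:
    """Return the parts of a read to sample: only its two ends.

    Hypothetical helper; end_length stands in for the configurable length.
    """
    if end_length <= 0:
        raise ValueError("end_length must be positive")
    if len(sequence) <= 2 * end_length:
        # Short read: the two ends would overlap, so use the whole read once.
        return [sequence]
    return [sequence[:end_length], sequence[-end_length:]]


def kmers(fragment: str, k: int):
    """Yield all overlapping k-mers of a fragment."""
    for i in range(len(fragment) - k + 1):
        yield fragment[i:i + k]


# A long read with a poly-A stretch in the middle.
read = "ACGT" * 50 + "A" * 1000 + "TGCA" * 50
ends = read_ends(read, end_length=200)

# The internal poly-A stretch is never sampled.
sampled = {kmer for end in ends for kmer in kmers(end, 8)}
print("AAAAAAAA" in sampled)
```

With this approach, repeats that sit in the middle of a long read no longer contribute, while adapter-like sequences near the ends are still seen.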
My use case is mostly prokaryotes, so I suppose the latest version will have fixed my issue. It makes sense to me that if you want to know what fraction of reads possess an overrepresented motif, the number of times a motif is seen in a read should be ignored. I'll get that latest version, thx.
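For anyone following along, this is roughly what counting a motif once per read looks like. It is a minimal sketch and not the tool's code; the function name `fraction_of_reads_with_kmer` and the toy reads are assumptions made for this example.

```python
from collections import Counter


def fraction_of_reads_with_kmer(reads, k):
    """For each k-mer, return the fraction of reads containing it at least once."""
    counts = Counter()
    total_reads = 0
    for read in reads:
        total_reads += 1
        # A set keeps each k-mer once per read, so repeats within a read
        # do not inflate the count.
        unique_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        counts.update(unique_kmers)
    if total_reads == 0:
        return {}
    return {kmer: count / total_reads for kmer, count in counts.items()}


reads = ["AAAAAAAAAA", "ACGTACGT", "ACGTTTTT"]
fractions = fraction_of_reads_with_kmer(reads, k=4)
print(fractions["AAAA"], fractions["ACGT"])
```

Here the poly-A read contributes "AAAA" only once even though it contains it seven times, so a few long low-complexity reads cannot dominate the estimate.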
This is hard for Illumina. For nanopore I guess it is possible to take only the first 200 base pairs from either end, rather than the entire sequence. Currently most of the overrepresented sequences are common genome repeats.