Dealing with concatemers

mattdmem commented 4 years ago

Hi,

I've noticed that with things like plasmid sequencing you sometimes get concatemers. When this happens, STRique often predicts massive repeats that span multiple copies of the plasmid, from the left flanking sequence in one copy to the right flanking sequence in another. I've attached an example: this is two plasmids concatenated. The repeat is around 100 bp, but it's predicted as 941.

Any ideas on a fix? I thought about some kind of pre-processing to split the concatemers, but the tools to do this are lacking (at least at the fast5 level).

Thanks!

giesselmann commented 4 years ago

Hi,

I think the only fix currently is to filter these out. The signal alignment in STRique is semi-global: if it matches the first prefix and the last suffix, everything in between is counted by the HMM. Pairing the 1st prefix with the 1st suffix, or the 2nd prefix with the 2nd suffix, would work, and a suffix match before a prefix match is detected and discarded. Which one you get depends only on where the signal is cleanest.
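To illustrate the pairing described above, here is a toy sketch. It is not STRique's code and works on made-up positions rather than raw signal; the function name `candidate_spans` and the coordinates are hypothetical. It only shows why a read with two plasmid copies offers one overly long pairing next to the two correct ones.

```python
from itertools import product


def candidate_spans(prefix_ends, suffix_starts):
    """List every (prefix_end, suffix_start, span) where the suffix follows the prefix.

    Toy model only: positions are arbitrary units along the read,
    not values produced by STRique.
    """
    spans = []
    for prefix_end, suffix_start in product(prefix_ends, suffix_starts):
        # A suffix located before the prefix cannot enclose a repeat, so skip it.
        if suffix_start > prefix_end:
            spans.append((prefix_end, suffix_start, suffix_start - prefix_end))
    return spans


# Hypothetical read containing two copies of the plasmid.
for prefix_end, suffix_start, span in candidate_spans([1000, 6000], [1300, 6300]):
    print(prefix_end, suffix_start, span)
```

In this toy case the pairing of the first prefix with the last suffix covers both copies, which corresponds to the inflated repeat count reported above.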

To filter, could you try using the log-probability (col. 6) and the ticks between prefix and suffix (col. 8)? The log-p is not normalized, but if you divide it by the ticks, the value should be much lower for these concatemers than for regular repeats.
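A minimal sketch of that filter, assuming the STRique output is a tab-separated table with one header line. The thread does not say whether columns 6 and 8 are counted from zero or one, so the positions are parameters (zero-based defaults here are an assumption), and the cut-off `min_per_tick` is a hypothetical value you would have to choose from your own data.

```python
import csv
import io


def per_tick_log_p(tsv_text, logp_col=6, ticks_col=8, has_header=True):
    """Yield (row, log_p / ticks) for each usable row of tab-separated text.

    logp_col and ticks_col are zero-based positions and are assumptions;
    check them against the header of your own output file.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    if has_header:
        next(reader, None)
    for row in reader:
        if len(row) <= max(logp_col, ticks_col):
            continue  # row too short to hold both columns
        try:
            log_p = float(row[logp_col])
            ticks = float(row[ticks_col])
        except ValueError:
            continue  # non-numeric entry, skip the row
        if ticks <= 0:
            continue  # avoid division by zero
        yield row, log_p / ticks


def filter_rows(tsv_text, min_per_tick, **kwargs):
    """Keep rows whose per-tick log-probability is at least min_per_tick."""
    return [
        row
        for row, score in per_tick_log_p(tsv_text, **kwargs)
        if score >= min_per_tick
    ]
```

Plotting the per-tick values for all reads before picking a cut-off is probably the safer route, since the thread gives no reference value.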

Pay

mattdmem commented 4 years ago

Thanks... It's not possible to filter them out because all the reads contain multiple repeats. I think I'm going to have to work out a way to split the reads.

giesselmann / STRique

Dealing with concatemers #14