ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Shuffle identical alignments pseudo-randomly on query name instead? #413

Open ksahlin opened 3 months ago

ksahlin commented 3 months ago

Hi @marcelm (CC @Itolstoganov)

Currently we shuffle on chunk_ID, which makes read mappings differ depending on the number of threads, or whenever a read ends up in a different chunk.

IIRC, BWA-MEM derives its pseudo-random placement from the read name. Would it be possible to do this instead of using the chunk ID, without noticeable computational overhead? I don't think it's worth implementing if the code becomes complex or if it increases runtime.
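A minimal sketch of what name-based tie-breaking could look like, written in Python for brevity. It is not strobealign's or BWA-MEM's actual code: the function name `pick_among_ties`, the use of CRC32 as the hash, and the representation of candidates as `(reference, position)` tuples are all assumptions made for illustration. The point is only that the choice depends on the read name and the set of tied candidates, not on the chunk or the thread count.

```python
import random
import zlib


def pick_among_ties(read_name, candidates):
    """Pick one of several equally good alignment candidates.

    Hypothetical helper: the result depends only on the read name and
    the candidate set, not on thread count or chunk assignment.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    # zlib.crc32 is stable across runs and processes, unlike the built-in
    # hash(), which is salted per process for strings.
    seed = zlib.crc32(read_name.encode("utf-8"))
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order in which
    # the candidates were discovered.
    return rng.choice(sorted(candidates))


# The two tied chrY positions reported in this issue, used as example input.
ties = [("chrY", 29094803), ("chrY", 29091249)]
chosen = pick_among_ties("simulated.308", ties)
print(chosen)
```

The extra cost per read is one hash over the read name, and only reads with tied candidates need it. Whether that is negligible in practice would have to be measured.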

I noticed this when running an experiment comparing symmetric and asymmetric seeds, with reads simulated from either chrX or chrY and mapped to only chrX and chrY from CHM13.

When using asymmetric seeds (2*hash_s1 - hash_s2), the read below aligns to position 29094803 on chrY when aligned as the only read, but to position 29091249 on chrY when aligned as part of a file of 100k reads (using -t 2). In both cases it has CIGAR 114=1X161=1X3=1X42=1X8=1X64=1X30=1X24=1X16=1X8=1X3=1X1=1X14= and alignment score 900. The full simulated file is too large to attach here, but I can provide it elsewhere if needed.

@simulated.308
TTCCTTTTGACTCCATTTCATTCGATTCCATTCCATTCCATTAATTTCCATTCCATTCGAGACCTTTCCATTGCAGTCTTTTCCCTTCGAGTCCATTCCGTTCGATTCCCTTCCTTTCGATTCCATTCCATTGGAGTCCGTACCAGTCGAGTCCATTCTATTCCAGTCCATTAGTTTCGACTCCATTGCATTCGAGTGCATTCCATTCCGTGGCTGTCCATTCCATTCCGTTTGATGCCATTCCATACGATTCCATTCAATTCGAGACCATTCTATACCTCTCCATTCCTTGTGGTTCGATTCCATTTCACTCTAGTCCATTCAATTCCATTGAATTCCATTCGACTCTATTCCGTTCCATTCAATTCCATTCCATTCGATTCCATTTTTTTCGAGATCCTTCCATTACACTCCCTTCCATTCCAGTGAATTCCATTCCAGTCTCTTCAGTTCTATTCCATTCCATTCGTATCGATTCCATTCAACTCCAGCCCATTCCA
+
HHIIIIHIHIHIHIHHIHIHIHIIIHHIHHIIHHHIHGGHHHHHIIHGIGGGIGIHIIIGHGHGIIIIFIGIIIHIIHGHCIHGIDIGGIHIHGIHIGGHIIIHIIIFHGIIHHGIIDIIIIHGHIHIFGIDIFIIIIIIFGFEFHIIEIIHHGDEIEEFIBFHIHIDIEIIHIIEGIIIIFDHIIGHFHIIIEHDIII>HIIDFIIIIEDHIFE@IICEDF@DIHFII?EDIIGHACIGBGHAIIIHDIIIDHIAIIHIBEFIID@IIHIGICDI6III>>BICGGIG:IIIIIBIHICBDGIIIIIBIHI@CIEICIIIICIEIIIBIIIGIIDADFA=>HAICI@IIABII<D=IBIIIIIIFIDIIGIDCBGII<ICI8IIBC9<IFIHFEIH@ID@;ICHDII;FIIACIIHIII?4AI@I;EIFII9IIIIFI:<II<HI<IG>I8DIAHIII6GE1=IIF<IIIIIBIII>IIIHI=I>CI<I<@=FEIII;@

Btw, for symmetric seeds (as currently used), the read aligns to position 44832808 on chrY with alignment score 1000 and CIGAR 223=1X38=1X237=.
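For readers unfamiliar with the distinction, here is a small illustration of what "symmetric" versus "asymmetric" means when two strobe hashes are combined into one seed hash. The asymmetric formula is the one quoted above; the symmetric variant shown here (a plain sum) and the 64-bit wrap-around are assumptions for illustration and may not match what strobealign actually uses.

```python
MASK64 = (1 << 64) - 1  # assumed: hashes are unsigned 64-bit values


def combine_symmetric(hash_s1, hash_s2):
    # Assumed symmetric combination: swapping the strobes gives the same value.
    return (hash_s1 + hash_s2) & MASK64


def combine_asymmetric(hash_s1, hash_s2):
    # Formula quoted in this issue: swapping the strobes changes the value.
    return (2 * hash_s1 - hash_s2) & MASK64


h1, h2 = 3, 5  # toy values, not real strobe hashes
print(combine_symmetric(h1, h2) == combine_symmetric(h2, h1))
print(combine_asymmetric(h1, h2) == combine_asymmetric(h2, h1))
```

With a symmetric combination the order of the two strobes does not matter; with an asymmetric one it does.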