ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
142 stars 17 forks source link

prioritize properly mapped read pairs. #317

Closed y9c closed 1 year ago

y9c commented 1 year ago

I tested a read pair that both R1 and R2 are exactly reverse complement. This sequence has multiple copies in the reference genome. When mapped the read, strobealign return a paired of location that not properly mapped (flag 81 and 161). They are located in different chromosomes. But it would be more reasonable if a properly aligned location can be returned.

LH00139:14:225FNWLT3:3:1101:9636:13914  81      Nbe.v1.s09440   36226   60      94M     Nbe.v1.s11880   21622   0       CTTGTGTGTTGGCCGAAGTTACAATAGTAACATTGAACCCTCGAATATGTTCGAAGATCTCGAAATGATCTTCCAGTTCGGGGGAGAATTCGCA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFNM:i:0  AS:i:208
LH00139:14:225FNWLT3:3:1101:9636:13914  161     Nbe.v1.s11880   21622   60      94M     Nbe.v1.s09440   36226   0       CTTGTGTGTTGGCCGAAGTTACAATAGTAACATTGAACCCTCGAATATGTTCGAAGATCTCGAAATGATCTTCCAGTTCGGGGGAGAATTCGCA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFNM:i:0  AS:i:208
ksahlin commented 1 year ago

Excellent with these kind of examples, thank you!

Could we also get the reference sequences you are mapping to? it would be easier to fix if we can reproduce it.

Also, do you have an idea of what the expected alignment should be? For example: (i) Is the read pair from a duplicated region where they should map to the different copies (so R1 and R2 should map close to each other but not to the exact same position). (ii) or are they sequenced in forward and reverse from the same short fragment (i.e., should map to the exact same reference position)

ksahlin commented 1 year ago

Btw, I should say that strobealign prioritises properly mapped reads, so this is likely a bug or some weird corner case.

y9c commented 1 year ago

Hi @ksahlin, thank you for your reply. The testing file is attached.

query_seq.zip Ref

y9c commented 1 year ago

R1 and R2 are reverse and complement, so they should be mapped to exact the same reference position, rather than two different positions.

ksahlin commented 1 year ago

I see. Thanks for providing us with data. We will have a look at what causes this.

Currently strobealign checks if two reads within a pair is proper or not. This is based on deviation from insert size, relative orientation etc. A guess is that the paired mapping to the same chromosome was too far out the insert size threshold and was deemed 'not proper pair'. All non-proper pairs are scored the same, which could be why this happens. We could have a more nuanced approach that also takes the closest ones, occurring on the same chromosome, etc if all are non-proper.

y9c commented 1 year ago

Thank you for the information. But in this example, the distance of read pair is very close and not 'far away'. I am curious why this one was deemed 'not proper pair'.

ksahlin commented 1 year ago

Yes, I hope @marcelm can take a look in the coming week.

I should have clarified it better: I meant too far away from the mean of the fragment size distribution, i.e either < mu - X*sigma or > mu +X*sigma. But we will se wht the data says.

marcelm commented 1 year ago

Interesting! I debugged this and found the following. Both reads produce 316 seeds (NAMs) and both of the lists contain both of the mapping locations that we see in the SAM output (the one on Nbe.v1.s09440 and Nbe.v1.s11880). However, the program refuses to pair the seeds up. This happens in get_best_scoring_NAM_locations().

This is the relevant code:

https://github.com/ksahlin/strobealign/blob/93be0c284667bcac04f17e469b0114d881c30a29/src/aln.cpp#L612-L618

All except one condition are fulfilled:

The result is that neither r1_r2 nor r2_r1 become true. The fix is to use <= and >= as comparison operators. I haven’t done this because there are some other places where this same or nearly the same code is repeated and I would like to factor this out into a function instead.

Also, we should think about the case that reads overlap in this way:

R1         -------->
R2   <--------

That is, the TLEN is shorter than the read length. Then the read ends should contain adapter sequence and should not contain seeds, but possibly it could happen by chance that a seed extends a little bit into the adapter.

Note that the above does not check whether TLEN is < mu-X*sigma, not sure if this is as intended.

ksahlin commented 1 year ago

Great, so <= and >= sounds good for fixing this.

Also, we should think about the case that reads overlap in this way:

Is this something that happens with PE reads? I did not implement a check for TLEN is < mu-X*sigma because I did not know about that case. Aren't a the mates read from opposite ends of a fragment?

marcelm commented 1 year ago

A molecule prepared for paired-end sequencing on an Illumina sequencer will look like this:

5'-sequencing-adapter - fragment-of-interest/insert - 3'-sequencing-adapter
                        R1---->
                                            <----R2

The arrows show where R1 and R2 are read off: When reading R1, the sequencer starts just after the 5'-sequencing-adapter. When reading R2, it starts on the opposite strand right before the 3'-sequencing-adapter. Usually, the insert is so long that R1 is a prefix of the insert and R2 is a prefix of the reverse-complemented insert. But if the insert is short, R1 contains the entire insert plus the beginning of the 3' adapter, and R2 contains the entire reverse-complemented insert plus the beginning of the reverse-complemented 5' adapter:

5'-sequencing-adapter - insert - 3'-sequencing-adapter
                        R1--------------->
            <---------------R2

When adapters have been removed using a program such as Cutadapt, R1 should be identical to the insert and R2 identical to the reverse-complemented insert and we get the case this bug report is about. But it would also be nice to handle the case with adapters still attached. We would just need to fix this when testing whether the pairing of reads makes sense. At the alignment stage, the adapters would then be soft clipped.

On another note: Maybe it would be good to change strobealign such that it actually makes computations based on the insert size (TLEN). Currently, it uses the difference between the start positions of R1 and R2 on the reference, but the insert size would be the distance between the 5' ends of the two reads. This could be an issue when processing reads of varying lengths because the estimated insert size distribution will be a bit off.

y9c commented 1 year ago

This bug remind me the similar issue in bwa (https://github.com/lh3/bwa/pull/259).

y9c commented 1 year ago

cutadapt can get rid of most of the 'readthrough', but still some exceptions.

marcelm commented 1 year ago

cutadapt can get rid of most of the 'readthrough', but still some exceptions.

Yes, sure. Sorry if it sounded like it; I didn’t want to suggest that using cutadapt would be the solution here. This needs to be fixed in strobealign.

ksahlin commented 1 year ago

On another note: Maybe it would be good to change strobealign such that it actually makes computations based on the insert size (TLEN).

Okay @marcelm , you know more about me on this one so I'll let you decide. I thought insert size was the inner distance between reads (negative in above case) and fragment length the length of the sequenced fragment (always positive), and furthermore, that TLEN was equal to fragment size.

marcelm commented 1 year ago

I understand and remember being confused about this as well. Quoting from http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html:

The main confusion is with "insert size". The name itself suggests it is the unknown gap because it is "inserted" between R1 and R2, but this is misleading. It is more accurate to think of the insert as the piece of DNA inserted between the adaptors which enable amplification and sequencing of that piece of DNA.

"Fragment size" appears to be a bit ambiguous because for some people, the fragment includes the adapters at either end, but I consider the terms insert size, fragment size and TLEN to mean the same thing.

marcelm commented 1 year ago

With the fix suggested in #319, SAM output for the test read pair is as follows:

LH00139:14:225FNWLT3:3:1101:9636:13914  83      Nbe.v1.s00140   66357181        0       94M     =       66357181        -94     TGCGAATTCTCCCCCGAACTGGAAGATCATTTCGAGATCTTCGAACATATTCGAGGGTTCAATGTTACTATTGTAACTTCGGCCAACACACAAG       FFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:0  AS:i:208
LH00139:14:225FNWLT3:3:1101:9636:13914  163     Nbe.v1.s00140   66357181        0       94M     =       66357181        94      TGCGAATTCTCCCCCGAACTGGAAGATCATTTCGAGATCTTCGAACATATTCGAGGGTTCAATGTTACTATTGTAACTTCGGCCAACACACAAG       FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:0  AS:i:208
y9c commented 1 year ago

Hi @marcelm, I quote the specifications of TLEN in the sam format document, which might be helpful.

The intention of this field is to indicate where the other end of the template has been aligned without needing to read the remainder of the SAM file. Unfortunately there has been no clear consensus on the definitions of the template mapped start and end. Thus the exact definitions are implementation-defined.

image

In this criteria, the TLEN equal to the distance between R1 original start (R1 mapping start) and R2 original start (R2 mappinig end). But I found some tools (such as hisat2) calculate the TLEN as the distance of R1 mapping start and R2 mapping start.

marcelm commented 1 year ago

Thanks for the pointer. So if we come back to the terms, only the older TLEN definition ("TLEN#1") would be equivalent to "insert size". (And of course TLEN can be negative as well.)

I fixed the TLEN computation in strobealign a while ago. I did not take into account the ambiguous scenario above at the time, but I checked the code again just now and it appears that we actually compute TLEN#2. That said, the scenario cannot arise in strobealign at the moment because reads that overlap in such a way are not treated as a proper pair.

This issue here is about the special case that R1 is identical to the reverse complement of R2, which will be fixed by #319, so to keep track of the more general issue of allowing the above "ambiguous scenario", I have opened issue #320. I suggest we continue discussion over there if necessary.

y9c commented 1 year ago

Hi @marcelm, I would like to thank you for this important update. Using the lastest branch. The ratio of properly mapped ratio increases significantly. (79.71% to 97.76%)

Before:

72320436 + 0 in total (QC-passed reads + QC-failed reads)
72320436 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
71695228 + 0 mapped (99.14% : N/A)
71695228 + 0 primary mapped (99.14% : N/A)
72320436 + 0 paired in sequencing
36160218 + 0 read1
36160218 + 0 read2
57643528 + 0 properly paired (79.71% : N/A)
71555882 + 0 with itself and mate mapped
139346 + 0 singletons (0.19% : N/A)
13351896 + 0 with mate mapped to a different chr
12795310 + 0 with mate mapped to a different chr (mapQ>=5)

After (I do not rerun all the data):

2739779 + 0 in total (QC-passed reads + QC-failed reads)
2739779 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2720020 + 0 mapped (99.28% : N/A)
2720020 + 0 primary mapped (99.28% : N/A)
2739779 + 0 paired in sequencing
1369890 + 0 read1
1369889 + 0 read2
2678499 + 0 properly paired (97.76% : N/A)
2716063 + 0 with itself and mate mapped
3957 + 0 singletons (0.14% : N/A)
23214 + 0 with mate mapped to a different chr
15532 + 0 with mate mapped to a different chr (mapQ>=5)