timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License
112 stars 24 forks source link

Sinto fragments: negative positions in output #29

Closed cflerin closed 3 years ago

cflerin commented 3 years ago

Hi @timoast ,

I have spotted a few cases where there are negative positions listed in the fragments file, especially in chrM, I think due the higher chance of finding reads overlapping the start/end of this chromosome. I tried to make a minimal example here:

Using the reads in this BAM file:

VH00445:3:AAAJTTYM5:1:1305:65210:2912   1123    chrM    1       33      32S21M  =       1       45      ATCATACTCTATTACGCAATAAACATTAACAAGTTAATGTAGCTTAATAACAA   -CC;CCCCC-CCCCCCC-CC-C;;CC;;;C;CC;CCCCCCCCCC;CCC-CCCC NM:i:0  MD:Z:21 AS:i:21 XS:i:22 XA:Z:chr16,-28715534,8S22M23S,0;        MQ:i:60 MC:Z:8S45M      ms:i:1642       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:2409:29309:43198  99      chrM    1       33      32S21M  =       1       45      ATCATACTCTATTACGCAATAAACATTAACAAGTTAATGTAGCTTAATAACAA   CCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCC;CCCC NM:i:0  MD:Z:21 AS:i:21 XS:i:22 XA:Z:chr16,-28715534,8S22M23S,0;        MQ:i:60 MC:Z:8S45M      ms:i:1794       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1305:65210:2912   1171    chrM    1       60      8S45M   =       1       -45     ATTAACAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGA   ;CCCCCCCCC-CCCCCCCCC-CCCC;CCC;C-CCCCCCCCCCCC-CCCCCCCC NM:i:0  MD:Z:45 AS:i:45 XS:i:21 MQ:i:33 MC:Z:32S21M     ms:i:1560       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:2409:29309:43198  147     chrM    1       60      8S45M   =       1       -45     ATTAACAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGA   CCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0  MD:Z:45 AS:i:45 XS:i:21 MQ:i:33 MC:Z:32S21M     ms:i:1786       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1410:62900:52436  1187    chrM    11796   60      52M     =       11833   90      ATCCTAATTTCAATATCAAACCTAATTAAACACATCAACTTCCCACTGTACA    CCCCCCCCCCCCCCCCCC;-CCCCCCCC;CCC-CCCC-CCCCCCCCCCCCCC  NM:i:0  MD:Z:52 AS:i:52 XS:i:19 MQ:i:60 MC:Z:53M        ms:i:1684       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1410:62900:52436  1107    chrM    11833   60      53M     =       11796   -90     ACTTCCCACTGTACACCACCACATCAATCAAATTCTCCTTCATTATTAGCCTC   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCC;CCC-CCCCCC;CCC-CCC NM:i:0  MD:Z:53 AS:i:53 XS:i:20 MQ:i:60 MC:Z:52M        ms:i:1650       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT

I run sinto fragments and get:

chrM    -28     40      ACTAGGCTTCGTATTGAGCCGAACAGTAGT  2

This read aligns to chrM position 1 and is soft-clipped by 32 bases (cigar: 32S21M). Looking at the code, I see there is correction for soft-clipping: https://github.com/timoast/sinto/blob/b57d735b78dc0742eccb4a4ba801a24f577fca35/sinto/fragments.py#L330-L338 and from this it makes sense how we get to -28 (0 - 32 + 4), but I don't quite understand why this correction is applied. I thought that the position reported in the bam is the start of the mapped portion of the read, which already takes any soft-clipped portions into account. Looking at this read in IGV, and using bamtools bamtobed seems to confirm this (though without the Tn5 offset).

Looking at another soft-clipped read, this time in the middle of chr1:

bam:

VH00445:3:AAAJTTYM5:1:1609:77708:31726  99      chr1    3012667 60      8S44M   =       3012708 94      CCGTATTTCTGATCAGTTCTGAGACAAGTTTTCACTTTATCTATGAAGCCCA    CC-;CCC-;CC-CCCCCCCCC-CCCC;CC-CC;;C-CCCCCCCCCCC;-CCC  NM:i:0  MD:Z:44 AS:i:44 XS:i:19 MQ:i:60 MC:Z:53M        ms:i:1608       CR:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT     CB:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT
VH00445:3:AAAJTTYM5:1:1609:77708:31726  147     chr1    3012708 60      53M     =       3012667 -94     CCACTAGGGTGCAGTCCTGTGCTGAACAAGTAACAATGGCCTGAGTGTGACAA   CC-CCCCCCCCCCCCCC;CCCC-CC;CCCCCCCC--CCCC;CCCC-CCCCCCC NM:i:0  MD:Z:53 AS:i:53 XS:i:20 MQ:i:60 MC:Z:8S44M      ms:i:1482       CR:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT     CB:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT

fragments:

chr1    3012662 3012755 ATTGAACCACATTCGGTCAGGTCACTCAAT  1

where it seems like the start position should be 3012666 + 4 = 3012670. Am I missing something about the soft-clipping correction?

timoast commented 3 years ago

Hi @cflerin, I finally got some time to look into this. You're right about the soft clippling, I honestly don't know why I added that but it seems to be a mistake. I've fixed it now, thanks for raising the issue

cflerin commented 3 years ago

Hi @timoast , thanks for looking at this. After posting this here, I also discovered that cellranger atac does the same adjustment for soft clipping in the fragments file. So I'm not totally sure why that is. But it does seem like the soft clipping adjustment isn't technically correct according to the sam spec, so probably best that it's not done. Thanks for the fix.

timoast commented 3 years ago

Yes, I think that's what I was looking at originally when writing this and copied over the same logic. But I agree that according to the SAM spec, POS should be the reference genome position of the first aligned base, so we shouldn't need to adjust for soft-clipped bases.