timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License

Modified RG tag and duplicated entry after sinto #51

Closed by Zepeng-Mu 2 years ago

Zepeng-Mu commented 2 years ago

Hi, I have a BAM file that contains this read pair, for example:

$ samtools view possorted_bam.hornet.final.bam|grep "A01040:79:H2F2YDRXY:2:2165:10782:19977"
A01040:79:H2F2YDRXY:2:2165:10782:19977  163     chr8    120623305       60      50M     =       120623620       365     ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA      FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF:      NM:i:0  MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C
A01040:79:H2F2YDRXY:2:2165:10782:19977  83      chr8    120623620       60      50M     =       120623305       -365    ATCGCTGAGAATCTGAACAAATTAAGGGTGTGGGGGTTGGGGGAGGCAGC      :F:F,F:,:FFFF,,FF,FFFFFFF:F:F:FF,:FFFFFFFF,FF:FFFF      NM:i:1  MD:Z:13A36      AS:i:45 XS:i:23 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF        CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C

Then I ran sinto filterbarcodes and got this in the output:

samtools view PBMC002.bam|grep "A01040:79:H2F2YDRXY:2:2165:10782:19977"           
A01040:79:H2F2YDRXY:2:2165:10782:19977  163     chr8    120623305       60      50M     =       120623620       365     ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA      FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF:      NM:i:0  MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-3A2DA946
A01040:79:H2F2YDRXY:2:2165:10782:19977  163     chr8    120623305       60      50M     =       120623620       365     ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA      FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF:      NM:i:0  MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-12D1C06B
A01040:79:H2F2YDRXY:2:2165:10782:19977  83      chr8    120623620       60      50M     =       120623305       -365    ATCGCTGAGAATCTGAACAAATTAAGGGTGTGGGGGTTGGGGGAGGCAGC      :F:F,F:,:FFFF,,FF,FFFFFFF:F:F:FF,:FFFFFFFF,FF:FFFF      NM:i:1  MD:Z:13A36      AS:i:45 XS:i:23 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF        CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-12D1C06B

One BAM record (the read with flag 163) is duplicated, and the two copies carry different RG tags; every RG tag in the output also has an extra suffix appended. Why does this happen? I'm worried that it will bias read counts in downstream analysis.

Thanks!
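
To gauge how widespread the duplication is before counting reads, one option is to tally records that share the same read name, flag, reference and position, ignoring the optional tags so that copies differing only in RG are still grouped together. The following is a minimal sketch, assuming tab-delimited SAM text piped from `samtools view`; `find_dup_records` is a hypothetical helper name, not part of sinto or samtools.

```shell
# Hypothetical helper (not part of sinto or samtools).
# Reads SAM text on stdin and prints, for every record seen more than once:
#   count<TAB>QNAME<TAB>FLAG<TAB>RNAME<TAB>POS
# Optional tags (including RG) are deliberately left out of the key.
find_dup_records() {
  awk -F'\t' '
    !/^@/ { seen[$1 "\t" $2 "\t" $3 "\t" $4]++ }   # skip header lines
    END   { for (k in seen) if (seen[k] > 1) print seen[k] "\t" k }
  '
}

# Usage, with the output file from this report:
#   samtools view PBMC002.bam | find_dup_records | head
```

An empty result suggests no record is repeated under this key. The key is an assumption about what counts as "the same record"; it seems a reasonable proxy here because the two mates of a pair differ in flag and position.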

Zepeng-Mu commented 2 years ago

I just saw this issue has been reported before and fixed (https://github.com/timoast/sinto/issues/17). I'm testing it using version 0.8.1 now.
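
One way to check whether the RG tag is still being rewritten after upgrading is to tabulate the distinct `RG:Z` values found in the output records and compare them with the `@RG` lines in the header. This is a sketch under the assumption that the input is tab-delimited SAM text from `samtools view`; `count_rg_values` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of sinto or samtools).
# Reads SAM text on stdin and prints "count<TAB>RG value" for each
# distinct RG:Z tag found among the optional fields (columns 12 onward).
count_rg_values() {
  awk -F'\t' '
    !/^@/ {
      for (i = 12; i <= NF; i++)
        if ($i ~ /^RG:Z:/) { n[substr($i, 6)]++; break }
    }
    END { for (rg in n) print n[rg] "\t" rg }
  '
}

# Usage, with the output file from this report:
#   samtools view PBMC002.bam | count_rg_values
#   samtools view -H PBMC002.bam | grep '^@RG'   # header read groups, for comparison
```

If the records carry suffixed values such as those shown in the first comment, they will appear here as separate read-group IDs.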

Zepeng-Mu commented 2 years ago

Indeed 0.8.1 solved this issue.

Zepeng-Mu commented 2 years ago

I tested again and found that the problem persists. In 0.8.1 the RG tag is no longer modified, but the entry is still duplicated. Setting --nproc 1 resolves the problem.
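
Because the duplicated entries in 0.8.1 are byte-identical, a possible stopgap when rerunning with --nproc 1 is too slow is to drop exact repeats from the existing output. The sketch below is not a sinto feature. It assumes the BAM is coordinate-sorted, so identical records share a locus and sit in the same run; in unsorted input it would miss repeats separated by other loci. `drop_exact_dups` and the output file name are hypothetical.

```shell
# Hypothetical helper (not part of sinto or samtools).
# Reads SAM text (with header) on stdin and writes it back out,
# printing each byte-identical record only once per locus.
drop_exact_dups() {
  awk -F'\t' '
    /^@/ { print; next }                       # pass header lines through
    {
      locus = $3 "\t" $4
      if (locus != prev) {                     # new locus: forget old records
        split("", seen)                        # portable way to clear an array
        prev = locus
      }
      if (!seen[$0]++) print                   # first copy only
    }
  '
}

# Usage (PBMC002.dedup.bam is a hypothetical output name):
#   samtools view -h PBMC002.bam | drop_exact_dups | samtools view -b -o PBMC002.dedup.bam -
```

Resetting the lookup table at each new locus keeps memory use small, at the cost of relying on the sort order.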

Zepeng-Mu commented 2 years ago

An example:

A01040:79:H2F2YDRXY:2:2106:5556:14027   163 chr3    12882167    60  50M =   12882431    314 CCTGCAGTGCCTGTCACAGGGTAAATGTTCAATAAAACCTTCTAATTCCC  FFF:FFFF:FFFF:FFFFFF:FFF:FFFFFFFFFFFF:FF,F:FFFFFFF  NM:i:0  MD:Z:50 AS:i:50 XS:i:35 CR:Z:AATGTCGGTCTCTAAG   CY:Z:F:FFFFFFFFFF:FF:   CB:Z:AATGTCGGTCTCTAAG-1 BC:Z:GCTCGTCA   QT:Z:FFFFFFF:   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2
A01040:79:H2F2YDRXY:2:2106:5556:14027   83  chr3    12882431    60  50M =   12882167    -314    ACTCTGCCTTCTGGCCCATGATATCCTCGAAGGCAAGGTGGGGGCAGTTG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:0  MD:Z:50 AS:i:50 XS:i:29 CR:Z:AATGTCGGTCTCTAAG   CY:Z:F:FFFFFFFFFF:FF:   CB:Z:AATGTCGGTCTCTAAG-1 BC:Z:GCTCGTCA   QT:Z:FFFFFFF:   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2
A01040:79:H2F2YDRXY:2:2106:5556:14027   83  chr3    12882431    60  50M =   12882167    -314    ACTCTGCCTTCTGGCCCATGATATCCTCGAAGGCAAGGTGGGGGCAGTTG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:0  MD:Z:50 AS:i:50 XS:i:29 CR:Z:AATGTCGGTCTCTAAG   CY:Z:F:FFFFFFFFFF:FF:   CB:Z:AATGTCGGTCTCTAAG-1 BC:Z:GCTCGTCA   QT:Z:FFFFFFF:   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2

When setting nproc to 1:

A01040:79:H2F2YDRXY:2:2106:5556:14027   163 chr3    12882167    60  50M =   12882431    314 CCTGCAGTGCCTGTCACAGGGTAAATGTTCAATAAAACCTTCTAATTCCC  FFF:FFFF:FFFF:FFFFFF:FFF:FFFFFFFFFFFF:FF,F:FFFFFFF  NM:i:0  MD:Z:50 AS:i:50 XS:i:35 CR:Z:AATGTCGGTCTCTAAG   CY:Z:F:FFFFFFFFFF:FF:   CB:Z:AATGTCGGTCTCTAAG-1 BC:Z:GCTCGTCA   QT:Z:FFFFFFF:   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2
A01040:79:H2F2YDRXY:2:2106:5556:14027   83  chr3    12882431    60  50M =   12882167    -314    ACTCTGCCTTCTGGCCCATGATATCCTCGAAGGCAAGGTGGGGGCAGTTG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:0  MD:Z:50 AS:i:50 XS:i:29 CR:Z:AATGTCGGTCTCTAAG   CY:Z:F:FFFFFFFFFF:FF:   CB:Z:AATGTCGGTCTCTAAG-1 BC:Z:GCTCGTCA   QT:Z:FFFFFFF:   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2

timoast commented 2 years ago

Can you try installing from the develop branch and see if you still have this issue?

timoast commented 2 years ago

@Zepeng-Mu this should be fixed in the latest release. Please reopen if you still see issues.