haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

[BUG] Possibly improper MD tag generation when running ATAC data. #162

Open LinearParadox opened 4 months ago

LinearParadox commented 4 months ago

Describe the bug: The MD tag generated by Chromap can sometimes incorrectly begin with a letter. To my understanding, this happens when the leading number should be 0, that is, when the first aligned base is a mismatch. More info can be found here:

https://github.com/macs3-project/MACS/issues/643#issuecomment-2100990401

taoliu commented 4 months ago

@mourisl I tested on chromap 0.2.6.

Here is an example of the output from samtools calmd on the SAM/BAM file generated by Chromap. It shows several types of illegal MD tags, each of which samtools corrected.

[bam_fillmd1] different MD for read 'SRR1822137.228451': '13T8GA4GT20' -> '13T8G0A4G0T20'
[bam_fillmd1] different MD for read 'SRR1822137.108273': 'T9T39' -> '0T9T39'
[bam_fillmd1] different MD for read 'SRR1822137.430621': '21T19A7C' -> '21T19A7C0'
[bam_fillmd1] different MD for read 'SRR1822137.374239': '6G8GC24^T5C3' -> '6G8G0C24^T5C3'
[bam_fillmd1] different MD for read 'SRR1822137.7023': '49G' -> '49G0'
[bam_fillmd1] different MD for read 'SRR1822137.153844': 'A7A15A25' -> '0A7A15A25'
[bam_fillmd1] different MD for read 'SRR1822137.153844': '49A' -> '49A0'
[bam_fillmd1] different MD for read 'SRR1822137.188115': 'A46' -> '0A46'
[bam_fillmd1] different MD for read 'SRR1822137.147298': 'T49' -> '0T49'
[bam_fillmd1] different MD for read 'SRR1822137.438430': '22TT3G20A1' -> '22T0T3G20A1'
[bam_fillmd1] different MD for read 'SRR1822137.3039': '32ATA1GA3T6G1' -> '32A0T0A1G0A3T6G1'
[bam_fillmd1] different MD for read 'SRR1822137.325144': '42C6G' -> '42C6G0'
[bam_fillmd1] different MD for read 'SRR1822137.254577': 'A49' -> '0A49'
[bam_fillmd1] different MD for read 'SRR1822137.68007': 'A23G1A23' -> '0A23G1A23'
[bam_fillmd1] different MD for read 'SRR1822137.435278': '14CA34' -> '14C0A34'
[bam_fillmd1] different MD for read 'SRR1822137.123068': 'A2G4A15C25' -> '0A2G4A15C25'
[bam_fillmd1] different MD for read 'SRR1822137.303383': '49A' -> '49A0'
[bam_fillmd1] different MD for read 'SRR1822137.155971': 'A31T17' -> '0A31T17'
[bam_fillmd1] different MD for read 'SRR1822137.100949': '49A' -> '49A0'
[bam_fillmd1] different MD for read 'SRR1822137.145484': 'G6C25G15' -> '0G6C25G15'
[bam_fillmd1] different MD for read 'SRR1822137.145484': '42C6T' -> '42C6T0'
[bam_fillmd1] different MD for read 'SRR1822137.310296': '22A8G7T3C3T1G' -> '22A8G7T3C3T1G0'
[bam_fillmd1] different MD for read 'SRR1822137.453342': 'T48' -> '0T48'
[bam_fillmd1] different MD for read 'SRR1822137.453342': '34C2A9^A2T' -> '34C2A9^A2T0'
...

The problem is not limited to a missing leading '0': the '0' is also missing at the end of the tag and between adjacent mismatched bases. For example, '14CA34' should be '14C0A34'. According to the Sequence Alignment/Map Optional Fields Specification, the MD string must follow this format: MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*, where + means one or more (>=1).
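
To make the rule concrete, here is a minimal Python sketch that checks an MD string against the grammar quoted above and re-inserts the missing '0' counts. The helper names `is_valid_md` and `normalize_md` are hypothetical, made up for illustration; this is not how Chromap or samtools does it (samtools calmd recomputes the tag from the reference). It also assumes uppercase bases, and it cannot recover a deletion that is directly followed by a mismatch, because without the '0' the two are indistinguishable.

```python
import re

# Grammar from the SAM optional fields specification; fullmatch() anchors it
# to the whole string.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")


def is_valid_md(md: str) -> bool:
    """Return True if the MD string follows the specification grammar."""
    return MD_PATTERN.fullmatch(md) is not None


def normalize_md(md: str) -> str:
    """Insert missing '0' counts (hypothetical helper, illustration only).

    Limitation: in a tag such as '^AC' a deletion followed by a mismatch
    cannot be told apart from a two-base deletion, so the run of letters
    after '^' is treated as one deletion.
    """
    # Split into match counts, deletions (^ followed by bases), and single
    # mismatched bases.
    tokens = re.findall(r"[0-9]+|\^[A-Z]+|[A-Z]", md)
    out = []
    prev_is_number = False
    for tok in tokens:
        is_number = tok[0].isdigit()
        # Every mismatch or deletion must be preceded by a count.
        if not is_number and not prev_is_number:
            out.append("0")
        out.append(tok)
        prev_is_number = is_number
    # The tag must also end with a count.
    if not prev_is_number:
        out.append("0")
    return "".join(out)


if __name__ == "__main__":
    # MD strings taken from the samtools calmd messages above.
    for md in ["13T8GA4GT20", "T9T39", "21T19A7C", "14CA34"]:
        print(md, is_valid_md(md), "->", normalize_md(md))
```

On the tags listed in the calmd output, this token-level repair gives the same strings that samtools reports, but it is only a sanity check and not a substitute for recomputing MD from the alignment.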

mourisl commented 4 months ago

Thanks for identifying this issue! We will fix it.

mourisl commented 3 weeks ago

Sorry for the (much) delayed response... I've pushed an update to the li_dev8 branch. Could you please check out this branch and give it a try? If it works, we will merge it into the master branch.
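
For anyone testing the branch, one possible check is to scan the SAM output for MD tags that still violate the specification grammar. The sketch below is only an illustration: `find_invalid_md` is a hypothetical helper, the two records in the usage section are made up, and it assumes uncompressed, tab-separated SAM text (a BAM file would first need converting, for example with samtools view).

```python
import re
from typing import Iterable, Iterator, Tuple

# MD grammar from the SAM optional fields specification.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")


def find_invalid_md(sam_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read name, MD string) for each record with a malformed MD tag.

    Hypothetical helper for illustration; not part of Chromap or samtools.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        # The 11 mandatory columns come first; optional TAG:TYPE:VALUE
        # fields start at index 11.
        for field in fields[11:]:
            if field.startswith("MD:Z:"):
                md = field[len("MD:Z:"):]
                if MD_PATTERN.fullmatch(md) is None:
                    yield fields[0], md


if __name__ == "__main__":
    # Made-up records for demonstration; in practice pass an open SAM file.
    example = [
        "@HD\tVN:1.6\n",
        "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\tNM:i:1\tMD:Z:T49\n",
        "read2\t0\tchr1\t200\t60\t50M\t*\t0\t0\t*\t*\tNM:i:1\tMD:Z:0T49\n",
    ]
    for name, md in find_invalid_md(example):
        print(f"{name}: invalid MD '{md}'")
```

If the fix works, a scan like this over the new output should report no records.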