t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0

Please help me understand the output format of alleyoop dump. #139

Closed realzhang closed 8 months ago

realzhang commented 8 months ago

Dear author, here are a couple of lines from the output of alleyoop dump. I do not understand the last two columns: the header says "tcCount" and "ConversionRates", but the values do not look like either. Thanks in advance.

Name    Direction       Sequence        Mismatches      tcCount ConversionRates                                                                                                                                                                         
A00358:807:HL22HDSX3:4:1114:30201:6386  1       CTTGCTAGGCCCCGGCATAGTCTCACAAGAGAGAGCTATATCTGGGTCCTTTCAGCAAAACCTTGCTAGTGTATGCAATGGTGTCAGCATTTGGAAGCC     1       24,0,0,0,0,0,21,0,0,0,0,0,25,0,0,0,1,0,26,0,0,0,0,0,0   3015121,T,N,N,11,C,37,False;
A00358:807:HL22HDSX3:4:2335:8287:8046   1       CTTGCTAGGCCCCGGCATAGTCTCACAAGAGAGAGCTATATCTGGGTCCTTTCAGCAAAACCTTGCTAGTGTATGCAATGGTGTCAGCATTTGGAAGCC     1       24,0,0,0,0,0,21,0,0,0,0,0,25,0,0,0,1,0,26,0,0,0,0,0,0   3015121,T,N,N,11,C,37,False;

t-neumann commented 8 months ago

Hi @realzhang

Yeah, the header names are a bit misleading. What they represent are the RA:Z and MP:Z strings, which encode the nucleotide conversions as documented in the supplement of the Slamdunk paper:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2849-7#Sec25

Best,

Tobi

realzhang commented 8 months ago

I had missed the supplementary file of the paper. I have copied the relevant content here:

RA:Z:
Comma-separated integer array, each position marking a specific conversion type.
[Image: conversion type table from the paper's supplement]
MP:Z:
Comma-separated array of mismatch positions, each entry consisting of 3 colon-separated
values in the format <type>:<read position>:<reference position>, where type is the same as in the RA:Z tag.

I am still confused about the "25", the "26" and the five "0"s at the end. The maximum value in the conversion table is 24, so what do 25 and 26 mean? What does each 0 stand for? And what about the "False"? Sorry for so many questions.

t-neumann commented 8 months ago

Hi, regarding the RA:Z string: every position in the array stands for one conversion type, and the value at that position is the number of occurrences of that conversion in the read. So the 24 means you have 24 A>A, the next 0 means you have 0 A>C, and so on.
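
As a minimal sketch of this reading of the array (not part of slamdunk itself): assuming the 25 values form a row-major 5x5 table over the bases A, C, G, T, N, with the reference base as the row and the read base as the column, the counts can be labelled as follows. The base order and the helper name `decode_conversion_counts` are assumptions inferred from the example in this thread.

```python
# Assumed base order, inferred from "A>A first, then A>C" in the reply above.
BASES = "ACGTN"


def decode_conversion_counts(field):
    """Map a 25-value comma-separated count array to {"ref>read": count}."""
    counts = [int(value) for value in field.split(",")]
    if len(counts) != len(BASES) ** 2:
        raise ValueError(f"expected 25 values, got {len(counts)}")
    # Row-major labels: A>A, A>C, A>G, A>T, A>N, C>A, ...
    labels = [f"{ref}>{read}" for ref in BASES for read in BASES]
    return dict(zip(labels, counts))


# tcCount column of the first read shown in this issue.
example = "24,0,0,0,0,0,21,0,0,0,0,0,25,0,0,0,1,0,26,0,0,0,0,0,0"
table = decode_conversion_counts(example)
nonzero = {label: count for label, count in table.items() if count}
print(nonzero)
# {'A>A': 24, 'C>C': 21, 'G>G': 25, 'T>C': 1, 'T>T': 26}
```

Under that assumption, the 25 and 26 in the example are counts of G>G and T>T matches, and the single 1 is the T>C conversion.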

The last column then holds, for every conversion found, the information listed below. From the RA:Z string you can see that you have 1 T>C conversion (0-based array position 16 holds a 1), and that last string tells you the following:

Reference position of the conversion, reference base of the conversion, 5' context of the reference base, 3' context of the reference base, read position of the conversion, read base of the conversion, base quality, and True/False depending on whether a SNP was called at this position, in which case the conversion should be masked.
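
The field order above can be sketched as a small parser. The `ConversionRecord` type, its field names, and the assumption that each record has exactly 8 comma-separated fields and ends with `;` are illustrative, taken from the example line in this thread rather than from slamdunk's code.

```python
from typing import NamedTuple


class ConversionRecord(NamedTuple):
    # Hypothetical field names following the order described above.
    ref_pos: int
    ref_base: str
    context_5p: str
    context_3p: str
    read_pos: int
    read_base: str
    base_quality: int
    snp_masked: bool


def parse_conversions(field):
    """Parse the last dump column into a list of ConversionRecord."""
    records = []
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the trailing ';'
        parts = chunk.split(",")
        if len(parts) != 8:
            raise ValueError(f"expected 8 fields, got {len(parts)}: {chunk!r}")
        records.append(
            ConversionRecord(
                ref_pos=int(parts[0]),
                ref_base=parts[1],
                context_5p=parts[2],
                context_3p=parts[3],
                read_pos=int(parts[4]),
                read_base=parts[5],
                base_quality=int(parts[6]),
                snp_masked=(parts[7] == "True"),
            )
        )
    return records


# ConversionRates column of the first read shown in this issue.
records = parse_conversions("3015121,T,N,N,11,C,37,False;")
print(records[0])
```

For the example read this yields one record: a T in the reference read as C at read position 11, with base quality 37 and no SNP masking.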

Hope that helps

realzhang commented 8 months ago

Oh, I see: there are 25 values, one for each cell of the conversion table (not for each position in the read). I'll close this issue, with many thanks!