kgori / sigfit

Flexible Bayesian inference of mutational signatures
GNU General Public License v3.0
33 stars 8 forks source link

Bug in build_catalogues #63

Closed gevro closed 2 years ago

gevro commented 2 years ago

Hi, I think I found a bug in build_catalogues.

In the 5th column of the input: -1 should equal transcribed strand 1 should equal untranscribed strand

However, build_catalogues only puts a mutation with strand="-1" into the transcribed category if the Ref nucleotide is a pyrimidine (C or T). But if the Ref nucleotide is a purine it puts a "-1" mutation it into the untranscribed category.

This does not match the documentation. Because "-1" should always go into the transcribed category, regardless of whether Ref is a pyrimidine or a purine.

I think this is the problematic code:

                if (var[2] %in% c("C", "T")) {
                  trinuc <- strsplit(var[4], "")[[1]]
                  alt <- var[3]
                  if (strand) {
                    strd <- var[5]
                  }
                }
                else {
                  trinuc <- rev_comp(var[4])
                  alt <- rev_comp(var[3])
                  if (strand) {
                    strd <- ifelse(var[5] == "U", "T", "U") ##THIS IS THE PROBLEM I THINK
                  }
                }

I'm not sure what the purpose of the code marked with ##. It flips a mutation from untranscribed to transcribed, and vice versa, if the central nucleotide is a purine. There's no reason to change the strand of a mutation. If a mutation is annotated as transcribed, then it should be output as transcribed, and if a mutation is annotated as untranscribed then it should be annotated as untranscribed, no? Unless I'm misunderstanding something.

kgori commented 2 years ago

Hi gevro,

This is not a bug, it's the intended behaviour.

For the purposes of mutational signatures, all mutations are reported with reference to the pyrimidine base of the complementary pair. Meaning that, for example, a mutation of G->A is encoded as C->T, which is the complementary change. To add strandedness to this representation, if the aforementioned G->A were on the transcribed strand, it would be encoded as C->T (untranscribed), since the C used as the reference base is on the untranscribed strand. The line you commented in the code takes care of this change of strandedness.

This system of referring to mutations by their pyrimidine context is described in Alexandrov, et al. 2013. “Signatures of Mutational Processes in Human Cancer.” Nature 500 (7463): 415–21..

Best, Kevin

gevro commented 2 years ago

Thanks so much for explaining!