rmpiro / decompTumor2Sig

decompTumor2Sig - Decomposition of individual tumors into mutational signatures

MutationFeatureData explained #1

Closed antoine186 closed 3 years ago

antoine186 commented 5 years ago

Hello,

I have been looking at your getGenomesFromMutFeatData() function and I can't seem to understand what the MutationFeatureData is... I have been trying to understand this object for days with no luck.

Since your function operates on it, could you please explain what the rows and columns of featureVectorList and countData are? I tried reading up on pmsignature, but with no luck.

Thanks!

rmpiro commented 4 years ago

Dear Antoine, please excuse the very late answer. Unfortunately, I did not get any notification about your question. Maybe I haven't configured my account correctly.

You do not really need this function if you have an MPF or VCF file; you can use readGenomesFromMPF() or readGenomesFromVCF() instead. The only purpose of getGenomesFromMutFeatData() is to interface with the pmsignature package. If you have already read in the mutation data with, e.g., pmsignature::readMPFile(), which returns an object of class 'MutationFeatureData' (used internally by the pmsignature package), you can convert the mutation data from there instead of reading the file again for decompTumor2Sig.

Since MutationFeatureData is specific to the pmsignature package, you will probably not need it anywhere else (and I don't believe there is any documentation on it). In case you do need it, here is a brief description:

The composition of MutationFeatureData is not so easy to understand (it took me quite some digging in the object and the pmsignature source code). It becomes a little clearer when you look at what values the rows in featureVectorList can take (here G is my MutationFeatureData object; this example corresponds to numBases=5 and trDir=TRUE):

> max(G@featureVectorList[1,])
[1] 6
> max(G@featureVectorList[2,])
[1] 4
> max(G@featureVectorList[3,])
[1] 4
> max(G@featureVectorList[4,])
[1] 4
> max(G@featureVectorList[5,])
[1] 4
> max(G@featureVectorList[6,])
[1] 2

In featureVectorList, every column corresponds to one specific mutation type (including surrounding bases). The integer in the first row is the point mutation itself (1=C>A; 2=C>G; 3=C>T; 4=T>A; 5=T>C; 6=T>G). The integers in the next four rows are the flanking bases (at -2, -1, +1, +2 bp from the mutation; 1=A, 2=C, 3=G, 4=T). The last row is the transcription strand (1=+; 2=-).

Example: the column (2,1,4,3,4,2) corresponds to AT[C>G]GT on the - strand (bases 1,4; mutation 2; bases 3,4; strand 2).

Therefore, every SNV read from an MPF file can be described by one of the columns in featureVectorList.
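The encoding described above can be spelled out in a few lines of base R. This is only a minimal sketch for the numBases=5, trDir=TRUE case; decodeFeatureColumn() is a hypothetical helper written for illustration and is not part of pmsignature or decompTumor2Sig.

```r
# Hypothetical helper (not from pmsignature or decompTumor2Sig):
# translate one column of featureVectorList into a readable string.
# Assumes numBases=5 and trDir=TRUE, i.e. a vector of 6 integers.
decodeFeatureColumn <- function(v) {
    stopifnot(length(v) == 6)
    substitutions <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
    bases <- c("A", "C", "G", "T")
    strands <- c("+", "-")
    flanks <- bases[v[2:5]]   # flanking bases at -2, -1, +1, +2
    paste0(flanks[1], flanks[2], "[", substitutions[v[1]], "]",
           flanks[3], flanks[4], " on the ", strands[v[6]], " strand")
}

# The example column from the text:
decodeFeatureColumn(c(2, 1, 4, 3, 4, 2))
# [1] "AT[C>G]GT on the - strand"
```

With a real object, something like decodeFeatureColumn(G@featureVectorList[, 1]) should work the same way, assuming the slot holds a plain integer matrix as shown in the output above.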

Note: only mutation types actually present in the data will end up in the featureVectorList. In my example:

> dim(G@featureVectorList)
[1]    6 2973

So in the data I'm currently looking at (20 breast cancer samples), there were 2973 distinct mutation patterns/types (SNVs including flanking bases and transcription strand).
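To put the 2973 into perspective, one can compute how many mutation types are possible at all under this encoding. The short calculation below is just arithmetic on the value ranges listed above (6 substitutions, 4 flanking positions with 4 bases each, 2 strands); the variable names are made up for illustration.

```r
# Number of possible mutation types for numBases=5 and trDir=TRUE
numSubstitutions <- 6   # C>A, C>G, C>T, T>A, T>C, T>G
numFlankingBases <- 4   # positions -2, -1, +1, +2
numStrands       <- 2   # + and -

maxTypes <- numSubstitutions * 4^numFlankingBases * numStrands
maxTypes
# [1] 3072

# Share of the possible types actually observed in my 20 samples
2973 / maxTypes
```

So in this data set almost all of the possible types occur at least once, but not all of them, which is why featureVectorList has fewer than 3072 columns.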

Now to countData. In my example:

> dim(G@countData)
[1]     3 25419

Three rows and 25419 columns. Let's look at their content:

> max(G@countData[1,])
[1] 2973
> max(G@countData[2,])
[1] 20
> max(G@countData[3,])
[1] 719

The maximum value in the first row is exactly the number of mutation types/features in featureVectorList! So the first row is an index into the columns of featureVectorList, and each column in countData is associated with one specific mutation type (SNVs including flanking bases and transcription strand)!

The maximum value in the second row is exactly the number of tumors I've loaded (20)!

This means each column in countData is also associated with one specific tumor/patient.

And the last row just counts how often the given mutation type (first row) is found in the given patient (second row) ...

For example, let's look at the first few columns I have in countData:

> G@countData[,1:26]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
[2,]    2    3    4    5    6    7    8    9   10    11    12    13    14    15
[3,]    3    4    2    5    5    7    1    1    4     5     3     2     3     3
     [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,]     1     1     1     1     2     2     2     2     2     2     2     2
[2,]    16    17    19    20     1     2     3     4     5     6     7     8
[3,]     2     2     3     3     3     3     5     3    12     3     5     4

So the first mutation type from the featureVectorList ...

> G@featureVectorList[,1]
[1] 1 1 1 1 1 1

... is present in 18 out of 20 patients (not in patient 1 and patient 18), with the respective number of occurrences (counts) being 3,4,2,5,5, ... and so on.
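The same lookup can be done programmatically. The sketch below rebuilds the 26 columns printed above as a plain matrix (so it runs without pmsignature) and then asks which patients carry mutation type 1; with a real object you would presumably use G@countData directly instead.

```r
# The first 26 columns of countData, copied from the output above
countData <- rbind(
    c(rep(1, 18), rep(2, 8)),                      # mutation type index
    c(2:17, 19, 20, 1:8),                          # patient index
    c(3, 4, 2, 5, 5, 7, 1, 1, 4, 5, 3, 2, 3, 3, 2, 2, 3, 3,
      3, 3, 5, 3, 12, 3, 5, 4)                     # number of occurrences
)

# Columns that refer to mutation type 1
isType1 <- countData[1, ] == 1

patientsWithType1 <- countData[2, isType1]
length(patientsWithType1)
# [1] 18

# Patients (out of 20) in which mutation type 1 was not observed
missingPatients <- setdiff(1:20, patientsWithType1)
missingPatients
# [1]  1 18
```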

From these values, it is possible to compute for each tumor/patient the fraction of mutations which are of a specific mutation type (which is what we need for later decomposing into given mutational signatures).
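As a rough illustration of that last step, the sketch below turns a countData-style matrix into a "mutation types x tumors" table of fractions. The numbers are a made-up toy example, and this is not necessarily how getGenomesFromMutFeatData() does it internally; it only shows the idea.

```r
# Toy stand-in for G@countData (values invented for illustration)
countData <- rbind(
    c(1, 1, 2, 2, 3),   # row 1: mutation type (column of featureVectorList)
    c(1, 2, 1, 2, 2),   # row 2: tumor/patient index
    c(3, 1, 1, 2, 1)    # row 3: number of occurrences
)
numTypes  <- 3          # would be ncol(G@featureVectorList)
numTumors <- 2          # number of tumors loaded

# Fill a types x tumors matrix; combinations that never occur stay 0
counts <- matrix(0, nrow = numTypes, ncol = numTumors)
counts[cbind(countData[1, ], countData[2, ])] <- countData[3, ]

# Divide each column by the tumor's total number of mutations
fractions <- sweep(counts, 2, colSums(counts), "/")
fractions

# Each tumor's fractions sum to 1
colSums(fractions)
```

Note that a tumor without any mutations would produce a division by zero here, so real code would need to handle that case.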

I hope this helps!

But as I said, you don't need to convert the mutation data from pmsignature; you can also load it directly from an MPF or VCF file for decompTumor2Sig.

Best regards, Rosario.