shahcompbio / MultiModalMuSig.jl

A Julia package for extracting mutation signatures using topic models
Other
18 stars 4 forks source link

Generating SV count matrices #10

Closed alhafidzhamdan closed 2 years ago

alhafidzhamdan commented 3 years ago

Hello there, I'm interested in using your algorithm to get SNV and SV signatures on my WGS dataset. Basic question first, do you have any tips on how best to generate the mutation type matrices to input into the package in julia, particularly the SV counts? As in these count data that you used in the example:

snv_counts = CSV.read("data/brca-eu_snv_counts.tsv", delim='\t')
sv_counts = CSV.read("data/brca-eu_sv_counts.tsv", delim='\t')

Thanks A

funnell commented 3 years ago

Hello! I'd recommend running your favourite SV caller on your set of WGS libraries, and then classify the calls into SV types based on what the caller gives you and your interests. Most of the time you'd be able to classify SVs into types like "tandem duplication", "translocation", etc. but you can also expand these by adding supplemental features. For example you could follow Serena Nik-Zainal's breast cancer SV paper and identify clustered breakpoints; then, you could have SV types like "clustered tandem duplication", "unclustered tandem duplication", "clustered translocation", "unclustered translocation", etc. For each library you can then count up the number of each of your SV types and format them into a matrix. I hope that helps, but let me know if there's anything I didn't explain clearly!

alhafidzhamdan commented 3 years ago

Hi @funnell, Thanks for getting back. I've now generated sbs_counts and sv_counts like so:

julia> sbs_counts = CSV.read("sbs_sigs.txt", DataFrame) ## I've had to change your command as i think  they might have updated their code base.

julia> sbs_counts
96×226 DataFrame
 Row │ term     BCCA1T  BCCA10T  BCCA12T  BCCA14T  BCCA15T  BCCA17T  BCCA19T  BCCA20T  BCCA21T  BCCA22T  BCCA23T  BCCA25T  BCCA26T  BCCA28T  BCCA29T  BCCA3T  BCCA30T  BCCA32T  BCCA36T  BCCA37T  BCCA38T  BCCA39T  BCCA4T  BCCA40T  BCCA41T  BCCA43T  BCCA5T  BCCA8T  BCCA9T ⋯
     │ String7  Int64   Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64   Int64    Int64    Int64    Int64    Int64    Int64    Int64   Int64    Int64    Int64    Int64   Int64   Int64  ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ A[C>A]A      10      150      153      543     1411      688       47       53      116      955       49       10      242       20       97      34      292       67       66       12     1331      148      90      552      402      405     123     479     898 ⋯
   2 │ A[C>A]C      14      121      142      437     1132      393       53       35      145      801       44       13      186       12       77      35      187       43       55       10     1222      138      90      511      325      372     106     339     794
   3 │ A[C>A]G       3       29       58      202      554      216       16        4       37      348       13        4       16       12       11      10      109        7       10        2      569       55      31      244      149      203      49     153     368
   4 │ A[C>A]T       8      120       99      317      772      328       29       43       87      542       23       12      207       10       84      20      161       54       76        3      814      114      43      377      241      255      59     286     503
   5 │ C[C>A]A       7      106      100      344      872      340       31       37       77      588       29        9      148       14       60      27      185       46       41        3      906      107      55      382      249      306     103     302     626 ⋯
   6 │ C[C>A]C       1      100      103      246      838      265       33       22       77      534       31        3      168        7       42      31       98       40       47        3      868       85      47      350      233      261      74     220     612
   7 │ C[C>A]G       0       23       61      160      503      147        9        6       39      333        8        5       10        6        9      10       65        8        6        1      557       63      23      224      163      210      46     160     375
   8 │ C[C>A]T       4      127       96      324      762      298       19       21       65      538       32        6      163       10       69      16      151       43       50        5      862       86      49      316      220      271      75     229     574
   9 │ G[C>A]A       7      100      112      359      954      356       32       26      130      716       33        7       98       12       42      27      173       36       33        8      957      111      87      378      257      284      85     289     629 ⋯
  10 │ G[C>A]C       2       75       86      253      719      202       34       27      109      448       19        6       92        9       31      19      106       30       23        4      707       75      75      331      188      291      52     191     598
  11 │ G[C>A]G       4       13       50      162      521      160       14        6       38      330        8        5       12        1       10      12       80        3        3        1      532       48      36      200      154      181      31     142     389
  12 │ G[C>A]T       0       76       71      210      560      213       21       27       56      381       20        2       92        3       33      17      108       29       20        4      598       62      42      225      138      166      53     148     393
  13 │ T[C>A]A       2       72       92      320      847      372       28       41       71      615       26        8      140       11       58      20      176       46       44        9      835       96      63      323      234      228      75     250     481 ⋯
  14 │ T[C>A]C       4       73       75      234      674      248       32       24       73      477       22        8      140       10       54      16       82       35       39        5      665       81      58      295      193      207      68     202     454
  15 │ T[C>A]G       1       18       20      105      299      107        8        7       18      227        7        1       21        4        7      14       58        2        5        0      312       23      16      142       92      114      20      93     198
  16 │ T[C>A]T       5       97      106      416     1041      395       48       50      123      696       30        7      181       20       71      29      198       49       51        8     1010      111      80      425      268      268      98     345     638
  17 │ A[C>G]A      76       65      138      490     1310      497       34       13       80      829       46      592       96       64       50      33      238       29       25        4     1363      489      68      536      872      976     131     404     920 ⋯
  18 │ A[C>G]C      40       30       64      273      722      234       27       11       68      461       39      350       50       30       17      11      121       12       17        1      705      248      45      273      459      583      61     190     515
  19 │ A[C>G]G      27       11       67      213      614      174       13        3       24      387       12      279       16       25        5       8       74        4        7        1      601      191      35      248      349      460      51     172     448
  20 │ A[C>G]T      55       55      111      426     1240      413       41       12       94      774       54      520      118       70       34      28      176       23       35        5     1198      415      96      504      832      867      81     352     820
  21 │ C[C>G]A       6       15       28       58      221       59       17       10       23       59       22       60       21        6       17      22       20       10        9        0      215       57      36       73      109      154       9      53     153 ⋯
  22 │ C[C>G]C       7        7       28       63      217       35       30        2       39       73       14       80       27        3       14      14       21        9        9        0      223       66      53       64      136      167       7      44     195
  23 │ C[C>G]G       7        3        8       31      111       16        8        0       18       44        9       45        4        4        5      12        6        1        1        0      112       25      22       40       64       85       9      15     103
  24 │ C[C>G]T      10       21       27       92      283       72       29       11       60      109       25       99       33        7       24      28       34        7       11        1      309       72      66       83      181      182      14      60     246
  ⋮  │    ⋮       ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮       ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮       ⋮        ⋮        ⋮        ⋮       ⋮       ⋮       ⋮    ⋱
  74 │ G[T>C]C      69       73      161      554     1528      438       29        9      101     1045       28      637       59       65       38      21      169       20       27        5     1490      460      62      556      980     1106     107     416    1105 ⋯
  75 │ G[T>C]G      32       26       68      239      857      182       45        8       89      282       49      276       51       26       33      35       65       15        9        3      906      172     122      347      513      618      42     164     711
  76 │ G[T>C]T     130       94      218      736     2127      685       61       29      183     1470       81      942      134      107       69      47      309       33       46       10     2247      660     127      881     1450     1457     147     643    1433
  77 │ T[T>C]A     121       71      207      811     2202      838       71       26      215     1480       77     1006      114      112       50      34      396       26       40        5     2291      719     151      935     1437     1496     181     684    1395
  78 │ T[T>C]C      95       81      163      710     1767      618       32       20      140     1219       50      786      100       81       42      20      261       37       46       12     1961      580      94      755     1203     1312     158     513    1251 ⋯
  79 │ T[T>C]G      28       36       87      264      923      253       40        9       91      319       46      321       43       35       34      28      116       27        9        3     1037      216     117      308      557      561      40     216     629
  80 │ T[T>C]T     153       88      252      965     2594      935       90       27      247     1735       79     1121      175      116       71      38      431       53       61        8     2699      891     162     1112     1751     1686     252     779    1604
  81 │ A[T>G]A      57       38       83      325      804      326       24        5       68      549       16      365       67       44       15      10      155        8       29        4      781      289      36      327      579      542      73     237     526
  82 │ A[T>G]C      22       16       44      177      511      167       16        4       36      318        9      214       20       33        9      13       82       12        7        2      474      165      26      200      306      289      42     144     334 ⋯
  83 │ A[T>G]G      34       34       84      287      869      296       36        7       95      577       31      395       49       41       19      20      119        8       24        3      846      289      54      385      542      619      54     265     568
  84 │ A[T>G]T      43       32       63      281      783      301       17        9       85      539       22      324       48       29       13      16       95       15       14        3      749      242      63      310      503      464      62     229     471
  85 │ C[T>G]A       5        5       16       26      109       28        7        2       19       33        8       37       22        6        4       5       13        7        3        0      129       23      23       43       67       69       5      17      84
  86 │ C[T>G]C       8        5       18       33      140       22       17        1       21       47        9       48       22        3       10      12        9        2        3        0      190       27      26       56       91       87       9      42     129 ⋯
  87 │ C[T>G]G       3       15       22       64      208       47       19        2       42       83       25       86       19        7        8      21       15        2        4        0      265       45      62       81      142      165      12      47     187
  88 │ C[T>G]T       6       14       17       70      209       61       20        5       24       64       18       79       37        8       22       9       19       10        6        1      222       50      48       67      117      110       6      54     111
  89 │ G[T>G]A      25        6       46      149      469      126       14        3       37      300        7      221       18       15        6      10       67        8        7        4      457      132      20      161      329      313      40     135     331
  90 │ G[T>G]C      20        7       48      130      455      113       12        2       25      286        8      198       15       19        3       8       56        2        1        0      455      137      16      159      314      365      22     108     376 ⋯
  91 │ G[T>G]G      36       29       72      234      614      185       16        7       57      475       22      258       41       22       14      18       86       10       12        2      705      205      30      281      467      502      47     189     518
  92 │ G[T>G]T      45       26       74      287      753      228       19        2       45      513       18      330       32       40       15       9      100        9       14        1      816      227      33      317      484      559      68     230     574
  93 │ T[T>G]A      25       25       64      277      740      306       25       11       78      504       24      351       50       50       16      15      118        8       14        2      773      257      41      315      534      448      57     231     428
  94 │ T[T>G]C      40       21       70      238      729      232       11        8       61      496       21      353       35       35       10       6      104        4        7        2      761      252      37      329      486      470      53     217     522 ⋯
  95 │ T[T>G]G      38       40       81      318      883      300       29        6       71      593       36      381       43       34       26      13      127       11       12        1      885      281      61      374      605      577      64     259     604
  96 │ T[T>G]T      56       46      103      479     1350      530       57       21      135      883       66      533       98       57       36      28      183       21       27        6     1339      413      86      536      877      802      85     368     804
                                                                                                                                                                                                                                                196 columns and 49 rows omitted
julia> sv_counts = CSV.read("sv_sigs.txt", DataFrame)
julia> sv_counts
19×225 DataFrame
 Row │ term              BCCA1T  BCCA10T  BCCA12T  BCCA14T  BCCA15T  BCCA17T  BCCA19T  BCCA20T  BCCA21T  BCCA22T  BCCA23T  BCCA25T  BCCA26T  BCCA28T  BCCA29T  BCCA3T  BCCA30T  BCCA32T  BCCA36T  BCCA37T  BCCA38T  BCCA39T  BCCA4T  BCCA40T  BCCA41T  BCCA43T  BCCA8T  BCCA9 ⋯
     │ String31          Int64   Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64    Int64   Int64    Int64    Int64    Int64    Int64    Int64    Int64   Int64    Int64    Int64    Int64   Int64 ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ DEL_0e00_1e03_bp       0        7        0        9        4        1        0        1       18       10        3        1        2        0        4       4        1        2        1        0        2        1       8        1        2        1       0        ⋯
   2 │ DEL_1e03_1e04_bp       0        4        1        5        8        1        3        0        6        2        2        0        3        0        1       1        1        2        1        2        0        3       1        0        1        3       2
   3 │ DEL_1e04_1e05_bp       0        3        1        0        1        1        1        3        0        0        0        0        1        0        2       0        0        0        0        1        0        1       1        0        0        0       1
   4 │ DEL_1e05_1e06_bp       0        8        5        1        1        3        1        1        0        0        0        0        3        0        3       0        0       10        4        3        0        0       0        0        0        0       3
   5 │ DEL_1e06_1e07_bp       1        9        3        1        0        2        0        0        0        0        0        0        6        0        7       0        0        8        0        0        0        4       0        0        0        0       0        ⋯
   6 │ DEL_1e07_Inf_bp        0       18        1        0        0        0        0        0        1        0        0        0        1        8        7       0        0        4        0        0        0        0       0        0        0        0       0
   7 │ DUP_0e00_1e03_bp       0        5        0        2        1        0        3        0        6        4        1        0       10        0        1       2        0        1        0        0        1        0       3        0        1        0       0
   8 │ DUP_1e03_1e04_bp       0        1        0        2        0        0        4        1        2        1        0        0        1        0        0       0        0        0        0        0        0        0       0        0        0        1       2
   9 │ DUP_1e04_1e05_bp       0        1        0        0        1        0        0        0        0        0        0        0        1        0        0       0        0        2        0        0        0        0       0        1        0        0       3        ⋯
  10 │ DUP_1e05_1e06_bp       0        3        1        5        0        3        0        0        0        0        0        0        4        0        1       0        0        2        1        2        0        1       0        0        0        0       2
  11 │ DUP_1e06_1e07_bp       0       11        6        3        0        0        0        0        0        0        0        0        4        0        5       0        0        8        1        1        0        5       0        0        0        1       1
  12 │ DUP_1e07_Inf_bp        0       13        2        0        0        0        0        0        0        0        0        0        4        3        7       0        0        4        0        0        0        2       0        0        0        0       0
  13 │ INV_0e00_1e03_bp       0       53        0        3        0        1        0        0        0        0        0        2        2       35        0       0        0        0        0        1        2        0       1        0        0        0       1        ⋯
  14 │ INV_1e03_1e04_bp       0      107        2        4        2        1        0        0        0        0        0        0        5       35        0       0        0        1        0        0        0        0       0        0        0        0       0
  15 │ INV_1e04_1e05_bp       0        2        0        1        0        0        0        0        0        0        0        0        1        0        0       0        0        1        1        0        0        0       0        0        0        0       0
  16 │ INV_1e05_1e06_bp       0       10        5        5        1        1        0        0        0        0        0        0        2        3       12       0        0        6        1        3        0        4       0        0        0        1       3
  17 │ INV_1e06_1e07_bp       0       29        6        0        0        4        0        0        0        0        0        0        5        0        7       0        0       12        0        0        0        1       0        0        0        1       4        ⋯
  18 │ INV_1e07_Inf_bp        0       33        5        0        0        1        0        0        0        0        0        0        6        1       13       0        0        3        0        0        0        2       0        0        0        3       2
  19 │ TRA                    0       15        5       16        0       16        2        4        5        6        0        0       21        8        6       0        0        8        2        0        1        6       4        2        2        1       1
                                                                                                                                                                                                                                                            197 columns omitted

However I encountered this error next:

julia> X = format_counts_mmctm(sbs_counts, sv_counts)
ERROR: MethodError: no method matching format_counts_mmctm(::DataFrame, ::DataFrame)
Stacktrace:
 [1] top-level scope
   @ REPL[75]:1

I'm not that familiar with Julia so I'll try and figure this out today but thought you might know a solution. Thanks! A

funnell commented 3 years ago

Hello,

Please forgive me, it looks like I haven't kept the documentation up to date. The current format_counts_mmctm function looks like this format_counts_mmctm(countdfs::Vector{DataFrame}, cols::Vector{Symbol}) so perhaps you can try something like

    samples = [c for c in propertynames(sbs_counts) if c != :term] # column names with counts
    X = MultiModalMuSig.format_counts_mmctm([sbs_counts, sv_counts], samples)
funnell commented 3 years ago

Hi @alhafidzhamdan did you manage to get this working?

alhafidzhamdan commented 2 years ago

Hi @funnell, sorry for taking so long. Thanks for your tweaks- i've got it working til the end of the codes used in your example. I've included here the modifications I added as per Julia's error messages at each particular step.

julia> samples = [c for c in propertynames(sbs_counts) if c != :term]
julia> X = MultiModalMuSig.format_counts_mmctm([sbs_counts, sv_counts], samples)
julia> model = MMCTM([7, 7], [0.1, 0.1], X)
julia> fit!(model, tol=1e-5)
julia> snv_signatures = DataFrame(hcat(model.ϕ[1]...), :auto)
julia> sv_signatures = DataFrame(hcat(model.ϕ[2]...), :auto)
julia> snv_signatures[!,:term] = sbs_counts[!,:term]

However this does not work.

julia> snv_signatures = melt(
           snv_signatures, :term, variable_name=:signature, value_name=:probability
       )
ERROR: UndefVarError: melt not defined
Stacktrace:
 [1] top-level scope
   @ REPL[120]:1

I continued anyway- SNV and SV probs for all samples:

snv_props = hcat(
    [model.props[i][1] for i in 1:length(model.props)]...
)
sv_props = hcat(
    [model.props[i][2] for i in 1:length(model.props)]...
)

I have a few questions from here: 1) How can I assign the rows and columns of snv_props and sv_props to the 7 signatures and samples?

julia> DataFrame(sv_props,:auto)
7×225 DataFrame
 Row │ x1        x2        x3        x4        x5        x6         x7        x8        x9        x10       x11       x12       x13       x14       x15       x16       x17       x18       x19       x20       x21       x22       x23       x24       x25       x26       x27       x28       x29       x30       x31    ⋯
     │ Float64   Float64   Float64   Float64   Float64   Float64    Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float6 ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 1.52523   0.892003  1.38285   1.21774   1.96964   1.17702    3.31342   2.00003   2.37593   0.639333  1.23368   1.36052   0.497386  1.82476   1.64151   1.96935   1.19951   3.44936   0.594972  2.39841   0.621306  1.14491   1.34344   1.42068   1.44479   0.898075  1.29263   4.83384   2.0986    0.730421  4.3736 ⋯
   2 │ 2.05933   6.87841   3.05881   3.70357   0.547784  0.5564     0.690446  1.52415   1.76329   1.48142   3.38483   2.31011   3.74832   1.35051   2.76729   5.11409   3.20587   2.34284   3.78887   3.18727   3.87339   2.98576   2.99264   1.90186   3.1843    1.19829   3.35416   2.95572   0.593893  1.25003   0.3137
   3 │ 2.81751   0.234477  2.82508   2.78725   0.942591  5.31433    2.16297   1.20338   2.48781   0.972929  3.23332   1.03998   3.77494   3.80404   3.66599   0.392087  3.16897   1.6235    3.88862   1.14087   3.48607   0.762118  2.85891   1.7637    3.32182   2.94322   3.13771   0.437131  1.36894   5.09371   0.9139
   4 │ 0.878145  0.104497  0.926065  0.813333  2.44788   0.509715   1.03269   0.134703  0.90565   0.202449  0.801     0.483705  1.38308   0.465626  0.864807  0.290392  0.85158   0.557338  1.17886   0.334046  0.950684  1.13196   0.946988  0.754451  0.900103  0.86928   0.62976   0.410374  1.87974   0.681321  0.9481
   5 │ 1.63174   2.98791   1.34295   1.26337   1.01148   1.42752    0.534973  4.497     1.6076    4.44152   1.14803   1.41413   1.31018   0.521403  1.24389   1.99183   1.15181   2.03737   1.2454    1.05877   1.22168   1.16412   1.4137    1.50506   1.20268   4.34465   1.11961   1.9914    0.495736  0.88545   0.3289 ⋯
   6 │ 1.1131    0.078837  0.474637  0.851791  2.21988   1.42198    1.55959   0.147198  0.658698  0.196302  0.726703  0.344573  0.728307  0.508837  0.99269   0.198602  0.699491  0.3788    0.649017  0.296185  1.03338   0.295105  0.697907  0.631958  0.534966  0.515907  0.964186  0.223724  2.41553   0.903489  2.2403
   7 │ 1.05325   0.3155    1.14679   0.193642  1.98399   0.0966345  2.97119   0.227853  1.92781   0.400524  0.886129  0.893194  0.410525  0.30377   0.982847  0.363358  0.927924  0.247319  0.446969  0.365857  0.40796   0.384164  0.972539  0.278447  1.0753    0.331141  0.816327  0.455562  2.01648   0.324886  4.3691
                                                                                                                                                                                                                                                                                                         195 columns omitted

2) And another naive question, from these 7 SNV and 7 SV signatures, how can I then assign them to the Cosmic signatures? For example, SNV Signature 1 to Age etc

Thanks! A

funnell commented 2 years ago

Hi @alhafidzhamdan,

Thanks for your patience and your list of modifications! I've updated the README to reflect your and some additional changes. Regarding your questions:

  1. I would use the rename! function to set the data frame column names to the samples like so: rename!(props, samples). You could also prepend a "signature" column to each data frame to label the rows with whatever names seem reasonable to you. For example, you could construct a vector of signature labels given a vector of modality labels modalities like this:
    sig_labels = vcat([["$(modalities[m])-$k" for k in 1:model.K[m]] for m in 1:model.M]...)
    props = hcat(DataFrame(signature=sig_labels), props)
  2. The way I match inferred signatures with those in COSMIC is to first compute the cosine distance between inferred and COSMIC signatures. Then I use a linear sum assignment problem solver to find the optimal unique matches (so, each COSMIC signature will be assigned to one and only one inferred signature). There are different packages out there to do this, the one I use is the R package clue (there's a function solve_LSAP).

I hope that helps!

funnell commented 2 years ago

Hi @alhafidzhamdan do you have any more questions? If not, I'll close this issue. Thanks!

alhafidzhamdan commented 2 years ago

Hi @funnell, I've yet to look into this but I foresee some difficulties on my part trying to do number 2. But i'll look into it and will re-open if I need some suggestions/help from you! Thanks and happy 2022! Al