Closed alhafidzhamdan closed 2 years ago
Hello! I'd recommend running your favourite SV caller on your set of WGS libraries, and then classify the calls into SV types based on what the caller gives you and your interests. Most of the time you'd be able to classify SVs into types like "tandem duplication", "translocation", etc. but you can also expand these by adding supplemental features. For example you could follow Serena Nik-Zainal's breast cancer SV paper and identify clustered breakpoints; then, you could have SV types like "clustered tandem duplication", "unclustered tandem duplication", "clustered translocation", "unclustered translocation", etc. For each library you can then count up the number of each of your SV types and format them into a matrix. I hope that helps, but let me know if there's anything I didn't explain clearly!
Hi @funnell, Thanks for getting back. I've now generated sbs_counts and sv_counts like so:
julia> sbs_counts = CSV.read("sbs_sigs.txt", DataFrame) ## I've had to change your command as i think they might have updated their code base.
julia> sbs_counts
96×226 DataFrame
Row │ term BCCA1T BCCA10T BCCA12T BCCA14T BCCA15T BCCA17T BCCA19T BCCA20T BCCA21T BCCA22T BCCA23T BCCA25T BCCA26T BCCA28T BCCA29T BCCA3T BCCA30T BCCA32T BCCA36T BCCA37T BCCA38T BCCA39T BCCA4T BCCA40T BCCA41T BCCA43T BCCA5T BCCA8T BCCA9T ⋯
│ String7 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ A[C>A]A 10 150 153 543 1411 688 47 53 116 955 49 10 242 20 97 34 292 67 66 12 1331 148 90 552 402 405 123 479 898 ⋯
2 │ A[C>A]C 14 121 142 437 1132 393 53 35 145 801 44 13 186 12 77 35 187 43 55 10 1222 138 90 511 325 372 106 339 794
3 │ A[C>A]G 3 29 58 202 554 216 16 4 37 348 13 4 16 12 11 10 109 7 10 2 569 55 31 244 149 203 49 153 368
4 │ A[C>A]T 8 120 99 317 772 328 29 43 87 542 23 12 207 10 84 20 161 54 76 3 814 114 43 377 241 255 59 286 503
5 │ C[C>A]A 7 106 100 344 872 340 31 37 77 588 29 9 148 14 60 27 185 46 41 3 906 107 55 382 249 306 103 302 626 ⋯
6 │ C[C>A]C 1 100 103 246 838 265 33 22 77 534 31 3 168 7 42 31 98 40 47 3 868 85 47 350 233 261 74 220 612
7 │ C[C>A]G 0 23 61 160 503 147 9 6 39 333 8 5 10 6 9 10 65 8 6 1 557 63 23 224 163 210 46 160 375
8 │ C[C>A]T 4 127 96 324 762 298 19 21 65 538 32 6 163 10 69 16 151 43 50 5 862 86 49 316 220 271 75 229 574
9 │ G[C>A]A 7 100 112 359 954 356 32 26 130 716 33 7 98 12 42 27 173 36 33 8 957 111 87 378 257 284 85 289 629 ⋯
10 │ G[C>A]C 2 75 86 253 719 202 34 27 109 448 19 6 92 9 31 19 106 30 23 4 707 75 75 331 188 291 52 191 598
11 │ G[C>A]G 4 13 50 162 521 160 14 6 38 330 8 5 12 1 10 12 80 3 3 1 532 48 36 200 154 181 31 142 389
12 │ G[C>A]T 0 76 71 210 560 213 21 27 56 381 20 2 92 3 33 17 108 29 20 4 598 62 42 225 138 166 53 148 393
13 │ T[C>A]A 2 72 92 320 847 372 28 41 71 615 26 8 140 11 58 20 176 46 44 9 835 96 63 323 234 228 75 250 481 ⋯
14 │ T[C>A]C 4 73 75 234 674 248 32 24 73 477 22 8 140 10 54 16 82 35 39 5 665 81 58 295 193 207 68 202 454
15 │ T[C>A]G 1 18 20 105 299 107 8 7 18 227 7 1 21 4 7 14 58 2 5 0 312 23 16 142 92 114 20 93 198
16 │ T[C>A]T 5 97 106 416 1041 395 48 50 123 696 30 7 181 20 71 29 198 49 51 8 1010 111 80 425 268 268 98 345 638
17 │ A[C>G]A 76 65 138 490 1310 497 34 13 80 829 46 592 96 64 50 33 238 29 25 4 1363 489 68 536 872 976 131 404 920 ⋯
18 │ A[C>G]C 40 30 64 273 722 234 27 11 68 461 39 350 50 30 17 11 121 12 17 1 705 248 45 273 459 583 61 190 515
19 │ A[C>G]G 27 11 67 213 614 174 13 3 24 387 12 279 16 25 5 8 74 4 7 1 601 191 35 248 349 460 51 172 448
20 │ A[C>G]T 55 55 111 426 1240 413 41 12 94 774 54 520 118 70 34 28 176 23 35 5 1198 415 96 504 832 867 81 352 820
21 │ C[C>G]A 6 15 28 58 221 59 17 10 23 59 22 60 21 6 17 22 20 10 9 0 215 57 36 73 109 154 9 53 153 ⋯
22 │ C[C>G]C 7 7 28 63 217 35 30 2 39 73 14 80 27 3 14 14 21 9 9 0 223 66 53 64 136 167 7 44 195
23 │ C[C>G]G 7 3 8 31 111 16 8 0 18 44 9 45 4 4 5 12 6 1 1 0 112 25 22 40 64 85 9 15 103
24 │ C[C>G]T 10 21 27 92 283 72 29 11 60 109 25 99 33 7 24 28 34 7 11 1 309 72 66 83 181 182 14 60 246
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
74 │ G[T>C]C 69 73 161 554 1528 438 29 9 101 1045 28 637 59 65 38 21 169 20 27 5 1490 460 62 556 980 1106 107 416 1105 ⋯
75 │ G[T>C]G 32 26 68 239 857 182 45 8 89 282 49 276 51 26 33 35 65 15 9 3 906 172 122 347 513 618 42 164 711
76 │ G[T>C]T 130 94 218 736 2127 685 61 29 183 1470 81 942 134 107 69 47 309 33 46 10 2247 660 127 881 1450 1457 147 643 1433
77 │ T[T>C]A 121 71 207 811 2202 838 71 26 215 1480 77 1006 114 112 50 34 396 26 40 5 2291 719 151 935 1437 1496 181 684 1395
78 │ T[T>C]C 95 81 163 710 1767 618 32 20 140 1219 50 786 100 81 42 20 261 37 46 12 1961 580 94 755 1203 1312 158 513 1251 ⋯
79 │ T[T>C]G 28 36 87 264 923 253 40 9 91 319 46 321 43 35 34 28 116 27 9 3 1037 216 117 308 557 561 40 216 629
80 │ T[T>C]T 153 88 252 965 2594 935 90 27 247 1735 79 1121 175 116 71 38 431 53 61 8 2699 891 162 1112 1751 1686 252 779 1604
81 │ A[T>G]A 57 38 83 325 804 326 24 5 68 549 16 365 67 44 15 10 155 8 29 4 781 289 36 327 579 542 73 237 526
82 │ A[T>G]C 22 16 44 177 511 167 16 4 36 318 9 214 20 33 9 13 82 12 7 2 474 165 26 200 306 289 42 144 334 ⋯
83 │ A[T>G]G 34 34 84 287 869 296 36 7 95 577 31 395 49 41 19 20 119 8 24 3 846 289 54 385 542 619 54 265 568
84 │ A[T>G]T 43 32 63 281 783 301 17 9 85 539 22 324 48 29 13 16 95 15 14 3 749 242 63 310 503 464 62 229 471
85 │ C[T>G]A 5 5 16 26 109 28 7 2 19 33 8 37 22 6 4 5 13 7 3 0 129 23 23 43 67 69 5 17 84
86 │ C[T>G]C 8 5 18 33 140 22 17 1 21 47 9 48 22 3 10 12 9 2 3 0 190 27 26 56 91 87 9 42 129 ⋯
87 │ C[T>G]G 3 15 22 64 208 47 19 2 42 83 25 86 19 7 8 21 15 2 4 0 265 45 62 81 142 165 12 47 187
88 │ C[T>G]T 6 14 17 70 209 61 20 5 24 64 18 79 37 8 22 9 19 10 6 1 222 50 48 67 117 110 6 54 111
89 │ G[T>G]A 25 6 46 149 469 126 14 3 37 300 7 221 18 15 6 10 67 8 7 4 457 132 20 161 329 313 40 135 331
90 │ G[T>G]C 20 7 48 130 455 113 12 2 25 286 8 198 15 19 3 8 56 2 1 0 455 137 16 159 314 365 22 108 376 ⋯
91 │ G[T>G]G 36 29 72 234 614 185 16 7 57 475 22 258 41 22 14 18 86 10 12 2 705 205 30 281 467 502 47 189 518
92 │ G[T>G]T 45 26 74 287 753 228 19 2 45 513 18 330 32 40 15 9 100 9 14 1 816 227 33 317 484 559 68 230 574
93 │ T[T>G]A 25 25 64 277 740 306 25 11 78 504 24 351 50 50 16 15 118 8 14 2 773 257 41 315 534 448 57 231 428
94 │ T[T>G]C 40 21 70 238 729 232 11 8 61 496 21 353 35 35 10 6 104 4 7 2 761 252 37 329 486 470 53 217 522 ⋯
95 │ T[T>G]G 38 40 81 318 883 300 29 6 71 593 36 381 43 34 26 13 127 11 12 1 885 281 61 374 605 577 64 259 604
96 │ T[T>G]T 56 46 103 479 1350 530 57 21 135 883 66 533 98 57 36 28 183 21 27 6 1339 413 86 536 877 802 85 368 804
196 columns and 49 rows omitted
julia> sv_counts = CSV.read("sv_sigs.txt", DataFrame)
julia> sv_counts
19×225 DataFrame
Row │ term BCCA1T BCCA10T BCCA12T BCCA14T BCCA15T BCCA17T BCCA19T BCCA20T BCCA21T BCCA22T BCCA23T BCCA25T BCCA26T BCCA28T BCCA29T BCCA3T BCCA30T BCCA32T BCCA36T BCCA37T BCCA38T BCCA39T BCCA4T BCCA40T BCCA41T BCCA43T BCCA8T BCCA9 ⋯
│ String31 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ DEL_0e00_1e03_bp 0 7 0 9 4 1 0 1 18 10 3 1 2 0 4 4 1 2 1 0 2 1 8 1 2 1 0 ⋯
2 │ DEL_1e03_1e04_bp 0 4 1 5 8 1 3 0 6 2 2 0 3 0 1 1 1 2 1 2 0 3 1 0 1 3 2
3 │ DEL_1e04_1e05_bp 0 3 1 0 1 1 1 3 0 0 0 0 1 0 2 0 0 0 0 1 0 1 1 0 0 0 1
4 │ DEL_1e05_1e06_bp 0 8 5 1 1 3 1 1 0 0 0 0 3 0 3 0 0 10 4 3 0 0 0 0 0 0 3
5 │ DEL_1e06_1e07_bp 1 9 3 1 0 2 0 0 0 0 0 0 6 0 7 0 0 8 0 0 0 4 0 0 0 0 0 ⋯
6 │ DEL_1e07_Inf_bp 0 18 1 0 0 0 0 0 1 0 0 0 1 8 7 0 0 4 0 0 0 0 0 0 0 0 0
7 │ DUP_0e00_1e03_bp 0 5 0 2 1 0 3 0 6 4 1 0 10 0 1 2 0 1 0 0 1 0 3 0 1 0 0
8 │ DUP_1e03_1e04_bp 0 1 0 2 0 0 4 1 2 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 2
9 │ DUP_1e04_1e05_bp 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 0 3 ⋯
10 │ DUP_1e05_1e06_bp 0 3 1 5 0 3 0 0 0 0 0 0 4 0 1 0 0 2 1 2 0 1 0 0 0 0 2
11 │ DUP_1e06_1e07_bp 0 11 6 3 0 0 0 0 0 0 0 0 4 0 5 0 0 8 1 1 0 5 0 0 0 1 1
12 │ DUP_1e07_Inf_bp 0 13 2 0 0 0 0 0 0 0 0 0 4 3 7 0 0 4 0 0 0 2 0 0 0 0 0
13 │ INV_0e00_1e03_bp 0 53 0 3 0 1 0 0 0 0 0 2 2 35 0 0 0 0 0 1 2 0 1 0 0 0 1 ⋯
14 │ INV_1e03_1e04_bp 0 107 2 4 2 1 0 0 0 0 0 0 5 35 0 0 0 1 0 0 0 0 0 0 0 0 0
15 │ INV_1e04_1e05_bp 0 2 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0
16 │ INV_1e05_1e06_bp 0 10 5 5 1 1 0 0 0 0 0 0 2 3 12 0 0 6 1 3 0 4 0 0 0 1 3
17 │ INV_1e06_1e07_bp 0 29 6 0 0 4 0 0 0 0 0 0 5 0 7 0 0 12 0 0 0 1 0 0 0 1 4 ⋯
18 │ INV_1e07_Inf_bp 0 33 5 0 0 1 0 0 0 0 0 0 6 1 13 0 0 3 0 0 0 2 0 0 0 3 2
19 │ TRA 0 15 5 16 0 16 2 4 5 6 0 0 21 8 6 0 0 8 2 0 1 6 4 2 2 1 1
197 columns omitted
However I encountered this error next:
julia> X = format_counts_mmctm(sbs_counts, sv_counts)
ERROR: MethodError: no method matching format_counts_mmctm(::DataFrame, ::DataFrame)
Stacktrace:
[1] top-level scope
@ REPL[75]:1
I'm not that familiar with Julia so I'll try and figure this out today but thought you might know a solution. Thanks! A
Hello,
Please forgive me, it looks like I haven't kept the documentation up to date. The current format_counts_mmctm
function looks like this format_counts_mmctm(countdfs::Vector{DataFrame}, cols::Vector{Symbol})
so perhaps you can try something like
samples = [c for c in propertynames(sbs_counts) if c != :term] # column names with counts
X = MultiModalMuSig.format_counts_mmctm([sbs_counts, sv_counts], samples)
Hi @alhafidzhamdan did you manage to get this working?
Hi @funnell, sorry for taking so long. Thanks for your tweaks- i've got it working til the end of the codes used in your example. I've included here the modifications I added as per Julia's error messages at each particular step.
julia> samples = [c for c in propertynames(sbs_counts) if c != :term]
julia> X = MultiModalMuSig.format_counts_mmctm([sbs_counts, sv_counts], samples)
julia> model = MMCTM([7, 7], [0.1, 0.1], X)
julia> fit!(model, tol=1e-5)
julia> snv_signatures = DataFrame(hcat(model.ϕ[1]...), :auto)
julia> sv_signatures = DataFrame(hcat(model.ϕ[2]...), :auto)
julia> snv_signatures[!,:term] = sbs_counts[!,:term]
However this does not work.
julia> snv_signatures = melt(
snv_signatures, :term, variable_name=:signature, value_name=:probability
)
ERROR: UndefVarError: melt not defined
Stacktrace:
[1] top-level scope
@ REPL[120]:1
I continued anyway- SNV and SV probs for all samples:
snv_props = hcat(
[model.props[i][1] for i in 1:length(model.props)]...
)
sv_props = hcat(
[model.props[i][2] for i in 1:length(model.props)]...
)
I have a few questions from here: 1) How can I assign the rows and columns of snv_props and sv_props to the 7 signatures and samples?
julia> DataFrame(sv_props,:auto)
7×225 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float6 ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1.52523 0.892003 1.38285 1.21774 1.96964 1.17702 3.31342 2.00003 2.37593 0.639333 1.23368 1.36052 0.497386 1.82476 1.64151 1.96935 1.19951 3.44936 0.594972 2.39841 0.621306 1.14491 1.34344 1.42068 1.44479 0.898075 1.29263 4.83384 2.0986 0.730421 4.3736 ⋯
2 │ 2.05933 6.87841 3.05881 3.70357 0.547784 0.5564 0.690446 1.52415 1.76329 1.48142 3.38483 2.31011 3.74832 1.35051 2.76729 5.11409 3.20587 2.34284 3.78887 3.18727 3.87339 2.98576 2.99264 1.90186 3.1843 1.19829 3.35416 2.95572 0.593893 1.25003 0.3137
3 │ 2.81751 0.234477 2.82508 2.78725 0.942591 5.31433 2.16297 1.20338 2.48781 0.972929 3.23332 1.03998 3.77494 3.80404 3.66599 0.392087 3.16897 1.6235 3.88862 1.14087 3.48607 0.762118 2.85891 1.7637 3.32182 2.94322 3.13771 0.437131 1.36894 5.09371 0.9139
4 │ 0.878145 0.104497 0.926065 0.813333 2.44788 0.509715 1.03269 0.134703 0.90565 0.202449 0.801 0.483705 1.38308 0.465626 0.864807 0.290392 0.85158 0.557338 1.17886 0.334046 0.950684 1.13196 0.946988 0.754451 0.900103 0.86928 0.62976 0.410374 1.87974 0.681321 0.9481
5 │ 1.63174 2.98791 1.34295 1.26337 1.01148 1.42752 0.534973 4.497 1.6076 4.44152 1.14803 1.41413 1.31018 0.521403 1.24389 1.99183 1.15181 2.03737 1.2454 1.05877 1.22168 1.16412 1.4137 1.50506 1.20268 4.34465 1.11961 1.9914 0.495736 0.88545 0.3289 ⋯
6 │ 1.1131 0.078837 0.474637 0.851791 2.21988 1.42198 1.55959 0.147198 0.658698 0.196302 0.726703 0.344573 0.728307 0.508837 0.99269 0.198602 0.699491 0.3788 0.649017 0.296185 1.03338 0.295105 0.697907 0.631958 0.534966 0.515907 0.964186 0.223724 2.41553 0.903489 2.2403
7 │ 1.05325 0.3155 1.14679 0.193642 1.98399 0.0966345 2.97119 0.227853 1.92781 0.400524 0.886129 0.893194 0.410525 0.30377 0.982847 0.363358 0.927924 0.247319 0.446969 0.365857 0.40796 0.384164 0.972539 0.278447 1.0753 0.331141 0.816327 0.455562 2.01648 0.324886 4.3691
195 columns omitted
2) And another naive question, from these 7 SNV and 7 SV signatures, how can I then assign them to the Cosmic signatures? For example, SNV Signature 1 to Age etc
Thanks! A
Hi @alhafidzhamdan,
Thanks for your patience and your list of modifications! I've updated the README to reflect your and some additional changes. Regarding your questions:
rename!
function to set the data frame column names to the samples like so: rename!(props, samples)
. You could also prepend a "signature" column to each data frame to label the rows with whatever names seem reasonable to you. For example, you could construct a vector of signature labels given a vector of modality labels modalities
like this:
sig_labels = vcat([["$(modalities[m])-$k" for k in 1:model.K[m]] for m in 1:model.M]...)
props = hcat(DataFrame(signature=sig_labels), props)
solve_LSAP
).I hope that helps!
Hi @alhafidzhamdan do you have any more questions? If not, I'll close this issue. Thanks!
Hi @funnell, I've yet to look into this but I foresee some difficulties on my part trying to do number 2. But i'll look into it and will re-open if I need some suggestions/help from you! Thanks and happy 2022! Al
Hello there, I'm interested in using your algorithm to get SNV and SV signatures on my WGS dataset. Basic question first, do you have any tips on how best to generate the mutation type matrices to input into the package in julia, particularly the SV counts? As in these count data that you used in the example:
Thanks A