Closed kw10 closed 2 years ago
Hello Kim,
I think your code is correct, although I would not expect a dog sample to show only signatures 1, 3 and 7.
We have previously noticed an issue with normalising the COSMIC signatures to genome-independent representation. When you do this, SBS5 acquires strong peaks at N[C>T]G, because there is a small fraction of SBS1 in SBS5, which gets amplified after accounting for the very low number of CpG sites in the human genome (this is mentioned in issue #64). This distortion of SBS5 might be causing sigfit to choose SBS3 instead of SBS5 for your data. It might be worth comparing the analysis you describe with the alternative approach of converting your mutation counts to human-genome-relative representation and fitting untransformed COSMIC signatures:
catalogue_converted <- convert_signatures(catalogue,
                                          opportunities_from = opps,
                                          opportunities_to = "human-genome")
samples_fit <- fit_signatures(counts = catalogue_converted,
                              cosmic_signatures_v3.2, ...)
(You can use plot_spectrum to visually compare the different representations of catalogues and signatures.)
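To illustrate how a small SBS1-like component can turn into strong N[C>T]G peaks, here is a toy sketch in base R. The four categories, the probabilities and the opportunity values are all invented for illustration, and the divide-then-renormalise step is only a simplified picture of the normalisation described above, not sigfit's actual code.

```r
# Toy signature over four hypothetical categories; the last one stands in
# for a CpG context (N[C>T]G). All numbers are invented for illustration.
toy_signature <- c(ctx_a = 0.32, ctx_b = 0.32, ctx_c = 0.32, cpg = 0.04)

# Hypothetical opportunities: the CpG-like context is far rarer in the genome.
toy_opps <- c(ctx_a = 1.0, ctx_b = 1.0, ctx_c = 1.0, cpg = 0.1)

# Genome-independent representation (assumed form): divide each category by
# its opportunity, then renormalise so the signature sums to 1 again.
toy_independent <- toy_signature / toy_opps
toy_independent <- toy_independent / sum(toy_independent)

# Side-by-side comparison of the two representations.
round(rbind(original = toy_signature, independent = toy_independent), 3)
```

In this toy case the CpG-like category rises from 0.04 to roughly 0.29 of the signature, which is the kind of amplification that can make a flat signature look peaked.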
I should also note that sometimes the COSMIC v3.2 set can be too large to allow one to make sense of the fitting results. I sometimes find it useful to restrict the fitting by excluding signatures above SBS40 or SBS30, unless you have evidence or reason to think that higher signatures might be present in your sample.
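One possible way to build such a restricted set is sketched below in base R. The matrix here is a stand-in: it assumes one signature per row with row names of the form "SBS1", "SBS7a" and so on, which should be checked against rownames() of the real signature matrix before relying on it.

```r
# Stand-in for a signature matrix with one signature per row. The row names
# and values are hypothetical; inspect the real matrix before reusing this.
toy_sigs <- matrix(1 / 96, nrow = 5, ncol = 96,
                   dimnames = list(c("SBS1", "SBS5", "SBS7a", "SBS40", "SBS84"),
                                   NULL))

# Pull out the numeric part of each name ("SBS7a" -> 7).
sig_number <- as.integer(sub("^SBS([0-9]+).*$", "\\1", rownames(toy_sigs)))

# Keep SBS1 to SBS40; drop = FALSE keeps the result a matrix.
sigs_subset <- toy_sigs[sig_number <= 40, , drop = FALSE]

rownames(sigs_subset)
```

The subset could then be supplied as the signatures to fit in place of the full set, assuming the real matrix follows the same layout.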
Finally, since your data set is so small, I think you could afford running 4 chains and having more post-warmup MCMC samples. The options below would give you 12,000 samples:
iter = 5000, warmup = 2000, chains = 4
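The 12,000 figure follows from simple arithmetic, assuming (as in Stan) that iter counts the warm-up iterations of each chain as well:

```r
iter   <- 5000  # total iterations per chain, including warm-up
warmup <- 2000  # warm-up iterations discarded from each chain
chains <- 4     # number of independent chains

# Samples retained after warm-up, pooled across all chains.
post_warmup_samples <- (iter - warmup) * chains
post_warmup_samples
```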
I hope this helps!
Best, Adrian
Hi Adrian,
Thanks for the explanations and advice, it's very helpful. In this case, signatures 1, 3 and 7 were just an example (I actually see 1, 6 and 15, which I expected). However, in some of my other data I do unexpectedly see signature 3, so I will try converting my counts for those samples and compare the results. Thanks again!
Best, Kim
On a related note, if I run signature extraction on a cohort, would I run:
catalogue <- build_catalogues(mutations)
extract_signatures(counts = catalogue,
                   nsignatures = 1:6,
                   opportunities = opps,
                   seed = 3492,
                   iter = 5000)
If I convert counts relative to human, would I first convert the catalogue and then extract signatures without providing opportunities?
catalogue <- build_catalogues(mutations)
catalogue_converted <- convert_signatures(catalogue,
                                          opportunities_from = opps,
                                          opportunities_to = "human-genome")
extract_signatures(counts = catalogue_converted,
                   nsignatures = 1:6,
                   seed = 3492,
                   iter = 5000)
Thanks! Kim
Hi Kim,
That would be correct, except you need to multiply catalogue_converted by rowSums(catalogue) before the extraction. You need to have at least as many samples as the highest value in nsignatures.
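The rescaling step can be sketched in base R as follows. The matrices are toy stand-ins with invented values, and the sketch assumes the converted catalogue holds proportions (each row summing to 1) rather than counts.

```r
# Hypothetical original counts: 2 samples (rows) x 3 categories (columns).
toy_catalogue <- matrix(c(10, 30, 60,
                           5,  5, 10),
                        nrow = 2, byrow = TRUE)

# Stand-in for the converted catalogue, assumed to hold proportions
# (each row sums to 1) rather than counts.
toy_converted <- matrix(c(0.2, 0.3, 0.5,
                          0.1, 0.4, 0.5),
                        nrow = 2, byrow = TRUE)

# Multiplying a matrix by a vector of length nrow() scales row i by element i,
# so each sample gets back its original mutation total.
toy_counts <- toy_converted * rowSums(toy_catalogue)

rowSums(toy_counts)
```

Each row of toy_counts now sums to the same total as the corresponding row of toy_catalogue, so the relative weight of each sample is preserved going into the extraction.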
My preferred approach for extraction is not to transform the catalogues, but to extract directly and then convert the resulting signatures to human representation if necessary. If you are dealing with a single species, and your opportunities do not change from sample to sample, then you don't need to use them in the extraction. In that case, the inferred signatures would be relative to your genome (i.e. the dog genome), and you could use convert_signatures on them as follows.
convert_signatures(extracted_signatures,
                   opportunities_from = opps,
                   opportunities_to = "human-genome")
However, if you did use the opportunities when extracting, then you should write opportunities_from=NULL, because the extraction model has already accounted for them. I hope this makes sense.
Best, Adrian
Hi Adrian,
Thanks for the advice. I tried running extraction this way as well, and the results look very similar. Thanks again for your help!
Best, Kim
Hello
I would like to run SigFit (single sample) for canine samples. I have read through issues #64, #58 and #56 but wanted to clarify if the following steps are correct for single sample analysis:
I have a mutation file in format:
and opportunity matrix in format
I then run
And then run fit_signatures again for the signatures found above:
and then create the plots with mcmc_samples = samples_fit_2
Is this right?
Thanks, Kim