Signature fitting for non-human data

kw10 commented 2 years ago

Hello

I would like to run SigFit (single sample) for canine samples. I have read through issues #64, #58 and #56 but wanted to clarify if the following steps are correct for single sample analysis:

I have a mutation file in format:

Sample  Ref Alt Trinuc
Sample1 C   A   CCC
Sample1 G   A   CGG
Sample1 G   A   CGA

and opportunity matrix in format

> head(opps)
     ACA      ACC      ACG      ACT      CCA      CCC 
89444615 51982186 12226057 70976798 81114094 70998488

I then run

catalogue <- build_catalogues(mutations)

# fit signatures
cosmic_converted <- convert_signatures(cosmic_signatures_v3.2,
                                       opportunities_from="human-genome")

samples_fit <- fit_signatures(counts = catalogue, 
                                   cosmic_converted,
                                   iter = 10000, 
                                   warmup = 1000, 
                                   opportunities = opps, 
                                   chains = 1, 
                                   seed = 5756)

And then run fit_signatures again for the signatures found above:

selected_signatures <- c(1, 3, 7)
samples_fit_2 <- fit_signatures(catalogue, 
                                     cosmic_converted[selected_signatures, ],
                                     iter = 5000, 
                                     warmup = 1000, 
                                     opportunities = opps, 
                                     chains = 1, 
                                     seed = 1756)

and then create the plots with mcmc_samples = samples_fit_2

Is this right?

Thanks, Kim

baezortega commented 2 years ago

Hello Kim,

I think your code is correct, although I would not expect a dog sample to show only signatures 1, 3 and 7.

We have previously noticed an issue with normalising the COSMIC signatures to genome-independent representation. When you do this, SBS5 acquires strong peaks at N[C>T]G, because there is a small fraction of SBS1 in SBS5, which gets amplified after accounting for the very low number of CpG sites in the human genome (this is mentioned in issue #64). This distortion of SBS5 might be causing sigfit to choose SBS3 instead of SBS5 for your data. It might be worth comparing the analysis you describe with the alternative approach of converting your mutation counts to human-genome-relative representation and fitting untransformed COSMIC signatures:

catalogue_converted <- convert_signatures(catalogue,
                                          opportunities_from=opps,
                                          opportunities_to="human-genome")

samples_fit <- fit_signatures(counts = catalogue_converted, 
                              cosmic_signatures_v3.2, ...)

(You can use plot_spectrum to compare visually the different representations of catalogues and signatures.)

I should also note that sometimes the COSMIC v3.2 set can be too large to allow one to make sense of the fitting results. I sometimes find it useful to restrict the fitting by excluding signatures above SBS40 or SBS30, unless you have evidence or reason to think that higher signatures might be present in your sample.

Finally, since your data set is so small, I think you could afford running 4 chains and having more post-warmup MCMC samples. The options below would give you 12,000 samples:

iter = 5000, warmup = 2000, chains = 4

I hope this helps!

Best, Adrian

kw10 commented 2 years ago

Hi Adrian,

Thanks for the explanations and advice, it's very helpful. In this case, signatures 1, 3 and 7 were just an example (I actually see 1, 6 and 15, which I expected), however, in some of my other data I do unexpectedly see signature 3, so I will try converting my counts for those samples and compare the results. Thanks again!

Best, Kim

kw10 commented 2 years ago

On a related note, if I run signture extraction on a cohort, would I run:

catalogue <- build_catalogues(mutations)

extract_signatures(counts = catalogue,
                              nsignatures = 1:6,
                              opportunities = opps,
                              seed = 3492,
                              iter = 5000)

If I convert counts relative to human, would I first convert the catalogue the extract signatures without providing opportunities?

catalogue <- build_catalogues(mutations)

catalogue_converted <- convert_signatures(catalogue,
                                          opportunities_from=opps,
                                          opportunities_to="human-genome")

extract_signatures(counts = catalogue,
                              nsignatures = 1:6,
                              seed = 3492,
                              iter = 5000)

Thanks! Kim

baezortega commented 2 years ago

Hi Kim,

That would be correct, except you need to multiply catalogue_converted by rowSums(catalogue) before the extraction. You need to have at least as many samples as the highest value in nsignatures.

My preferred approach for extraction is not to transform the catalogues, but extract directly and then convert the resulting signatures to human representation if necessary. If you are dealing with a single species, and your opportunities do not change from sample to sample, then you don't need to use them in the extraction. In that case, the inferred signatures would be relative to your genome (i.e. the dog genome), and you could use convert_signatures on them as follows.

convert_signatures(extracted_signatures,
                   opportunities_from=opps,
                   opportunities_to="human-genome")

However, if you did use the opportunities when extracting, then you should write opportunities_from=NULL, because the extraction model has already accounted for them. I hope this makes sense.

Best, Adrian

kw10 commented 2 years ago

Hi Adrian

Thanks for the advice. I tried running extraction this way as well, and the results look very similar. Thanks again for your help!

Best, Kim

kgori / sigfit

Signature fitting for non-human data #66