kgori / sigfit

Flexible Bayesian inference of mutational signatures
GNU General Public License v3.0

Fitting Samples Across Multiple Genomes #58

Closed jasonptm closed 2 years ago

jasonptm commented 2 years ago

Greetings,

We have a reasonable number of samples spread across three different genomes (hg19, hg38, and canFam3.1). We would like to fit signatures jointly to them via sigfit. However, this does not appear to be straightforward, given the different mutational opportunities of the three genomes.

Do you have any recommendations? Is there a way to make the sample mutational catalogs genome-agnostic, similar to using convert_signatures for signatures? Alternatively, is there a way to pass multiple opportunity matrices to fit_signatures? Or is there some other avenue that would be more fruitful to pursue?

I am happy to provide more details if and where that is helpful. Thanks in advance for any advice you can provide.

-Jason Turner-Maier

baezortega commented 2 years ago

Hi Jason,

For fitting signatures to different genomes simultaneously, my recommendation would be to use a matrix of mutational opportunities and a genome-agnostic version of the COSMIC signatures.

You can make the signatures genome-agnostic by doing convert_signatures(cosmic_signatures_v3, opportunities_from="human-genome")

Regarding the opportunities matrix, this should have the same dimensions as your matrix of mutation counts, with one row of 96 trinucleotide frequencies per sample. Human genomes hg19 and hg38 should have essentially the same opportunities, which you can get from sigfit using sigfit:::human_trinuc_freqs()

The frequencies in sigfit come from hg19. The difference with hg38 in terms of trinucleotide composition is negligible, but if you would like to obtain exact frequencies for hg38, you can compute them from the FASTA using the trinucleotideFrequency function in the Biostrings package. You then need to collapse and arrange the trinucleotides as you would in a mutational spectrum.
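The collapsing step mentioned above can be sketched in base R as follows. This is only an illustration, not sigfit's own code: the helper names (rev_comp, collapse_trinuc) are hypothetical, the input is a toy vector standing in for the 64 named counts that trinucleotideFrequency would return, and the category labels ("ACA>AAA" style, ordered C>A, C>G, C>T, T>A, T>C, T>G) are an assumption that should be checked against the column names of your own counts matrix.

```r
# Reverse complement of a DNA string, using base R only.
rev_comp <- function(x) {
  vapply(strsplit(chartr("ACGT", "TGCA", x), ""),
         function(s) paste(rev(s), collapse = ""), character(1))
}

# Collapse 64 strand-specific trinucleotide counts into 96 opportunities,
# one per mutation type in pyrimidine-centred notation. Each trinucleotide
# is pooled with its reverse complement, and the pooled value is repeated
# for the three possible substitutions of the central base.
# ASSUMPTION: label format and ordering are illustrative; compare them with
# colnames() of your counts matrix before use.
collapse_trinuc <- function(freqs64) {
  bases <- c("A", "C", "G", "T")
  subs <- list(C = c("A", "G", "T"), T = c("A", "C", "G"))
  out <- c()
  for (ref in c("C", "T")) {
    for (alt in subs[[ref]]) {
      for (five in bases) {
        for (three in bases) {
          tri <- paste0(five, ref, three)
          label <- paste0(tri, ">", five, alt, three)
          out[label] <- as.numeric(freqs64[[tri]]) +
            as.numeric(freqs64[[rev_comp(tri)]])
        }
      }
    }
  }
  out
}

# Toy input: every trinucleotide observed once (a placeholder for real counts).
bases <- c("A", "C", "G", "T")
grid <- expand.grid(bases, bases, bases, stringsAsFactors = FALSE)
toy_freqs <- setNames(rep(1, 64), paste0(grid[[1]], grid[[2]], grid[[3]]))

opp96 <- collapse_trinuc(toy_freqs)
```

With real data, toy_freqs would be replaced by the genome-wide counts summed over all chromosomes.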

I'm attaching an RData object containing the vector of whole-genome trinucleotide frequencies for canFam3.1, as a 96-element vector: trinuc_freqs_canfam3.1.RData.zip.

Once you have the matrix of opportunities, it would be a good idea to normalise the values so that each row sums to 1.
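As a rough sketch of the last two steps, the opportunities matrix can be assembled with one row per sample and then row-normalised. Everything named here is a placeholder: the sample names, the sample-to-genome mapping, and the random vectors standing in for the human frequencies (from sigfit:::human_trinuc_freqs()) and the canFam3.1 frequencies (from the attached RData object). The commented-out fit_signatures call assumes counts and signatures objects that are not defined in this snippet.

```r
# Placeholder opportunity vectors (hypothetical values, 96 entries each).
# In practice: human_opps <- sigfit:::human_trinuc_freqs()
# and dog_opps would be the vector loaded from the canFam3.1 RData file.
set.seed(1)
human_opps <- runif(96)
dog_opps <- runif(96)

# Hypothetical mapping from sample name to reference genome.
sample_genome <- c(s1 = "hg19", s2 = "hg38", s3 = "canFam3.1")

# hg19 and hg38 share the same vector, as suggested above.
opps_by_genome <- list(hg19 = human_opps,
                       hg38 = human_opps,
                       canFam3.1 = dog_opps)

# One row of 96 opportunities per sample, in the same order as the samples.
opps <- do.call(rbind, opps_by_genome[sample_genome])
rownames(opps) <- names(sample_genome)

# Normalise so that each row sums to 1. Dividing a matrix by a vector of
# length nrow(opps) recycles down the columns, so row i is divided by its sum.
opps <- opps / rowSums(opps)

# The matrix could then be passed alongside the counts, for example:
# fit <- sigfit::fit_signatures(counts = counts,
#                               signatures = genome_agnostic_sigs,
#                               opportunities = opps)
```

The rows of opps need to be in the same order as the rows of the counts matrix, since the two are matched by position.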

Let me know if you find any problems with this approach.

Best, Adrian

jasonptm commented 2 years ago

Hi Adrian,

Thanks so much for your response. I did not realize that a matrix of opportunities could be passed; that seems to make the task much easier.

Thanks, Jason Turner-Maier