im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

dndsout = dndscv(df) output error messages #93

Closed HendrinaS closed 9 months ago

HendrinaS commented 9 months ago

Good day,

Could you please assist me with the following?

I'm using the dNdScv package. When I first ran the script, I got the following error messages:

    dndsout = dndscv(df)
    [1] Loading the environment...
    [2] Annotating the mutations...
    Error in dndscv(df) :
      Zero coding substitutions found in this dataset. Unable to run dndscv. Common causes for this error are inputting only indels or using chromosome names different to those in the reference database (e.g. chr1 vs 1)
    In addition: Warning messages:
    1: In dndscv(df) :
      Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.
    2: In .merge_two_Seqinfo_objects(x, y) :
      The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.)

I then changed the chromosome naming style (chr1 to 1, chr2 to 2, chr3 to 3, …).
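For reference, the renaming described above can be sketched in R as follows. This is a minimal illustration, not part of the original post: it assumes the mutation table uses dndscv's five-column input layout (sampleID, chr, pos, ref, mut), and the toy values are made up.

```r
# Toy mutation table in the five-column layout (sampleID, chr, pos, ref, mut).
# All values here are invented for illustration only.
df <- data.frame(
  sampleID = c("S1", "S1", "S2"),
  chr      = c("chr1", "chr2", "chrX"),
  pos      = c(12345L, 67890L, 13579L),
  ref      = c("A", "C", "G"),
  mut      = c("T", "G", "A"),
  stringsAsFactors = FALSE
)

# Strip a leading "chr" so that "chr1" becomes "1", "chrX" becomes "X", etc.
# as.character() guards against the column having been read in as a factor.
df$chr <- sub("^chr", "", as.character(df$chr))
```

Anchoring the pattern with `^` means only a prefix is removed, so names that already lack the prefix are left unchanged.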

I then ran the script again, but now I get a different error message, shown below.

    dndsout = dndscv(df)
    [1] Loading the environment...
    [2] Annotating the mutations...
    Error in dndscv(df) :
      199 (67%) mutations have a wrong reference base. Please confirm that you are not running data from a different assembly or species.
    In addition: Warning message:
    In dndscv(df) :
      Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.

Can the dNdScv package use data mapped to hg38 instead of hg19 (GRCh37)? Is there a script I can use for hg38 data? Also, is there a script to remove duplicated mutations from my file, or will I have to do this manually (my file is big)?

Thank you for your assistance.

Hendrina

im3sanger commented 9 months ago

Hi Hendrina,

By default, dNdScv assumes that your data is mapped to GRCh37. If your mutations are in hg38, you will need to use the optional arguments in dndscv to use the correct databases and covariates. Please follow these steps:

  1. Download the hg38 reference database and the hg38 covariate files from this link: https://github.com/im3sanger/dndscv_data/tree/master/data. The files are called "RefCDS_human_GRCh38_GencodeV18_recommended.rda" and "covariates_hg19_hg38_epigenome_pcawg.rda".
  2. You can then run dNdScv using the code below:
    load("covariates_hg19_hg38_epigenome_pcawg.rda") # Loads the covs object
    dndsout = dndscv(mutations, refdb = "RefCDS_human_GRCh38_GencodeV18_recommended.rda", cv = covs)

For more information on how to run dNdScv on other species or newer assemblies you can also see this tutorial. But for hg38, the instructions above should be sufficient.

You also ask about removing duplicates. You can use mutations = unique(mutations) to remove duplicated rows in your data. The warning issued by dndscv tells you that identical mutations were found in different sampleIDs. If these are genuinely independent mutations that occurred in different patients, you can safely ignore this warning. But if your sampleIDs represent several biopsies from the same donor, identical mutations in those biopsies most likely come from the same clone. In that case, it is preferable to collapse identical mutations per donor, which you can do by changing the sampleIDs to donorIDs and then applying the "unique" function above.
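The per-donor collapsing described above could look roughly like the sketch below. The toy table and the naming scheme (donor ID is everything before the first underscore in the sample ID) are assumptions made for illustration; your own sample-to-donor mapping will probably differ.

```r
# Toy mutation table; all values are invented for illustration.
# Donor D1 has two biopsies that share the same mutation.
mutations <- data.frame(
  sampleID = c("D1_biopsyA", "D1_biopsyB", "D2_biopsyA", "D2_biopsyA"),
  chr      = c("1", "1", "1", "2"),
  pos      = c(100L, 100L, 100L, 200L),
  ref      = c("A", "A", "A", "C"),
  mut      = c("T", "T", "T", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical naming scheme: keep the part before the first underscore
# as the donor ID (e.g. "D1_biopsyA" becomes "D1").
mutations$sampleID <- sub("_.*$", "", mutations$sampleID)

# Rows that are now identical (same donor, same mutation) are collapsed.
mutations <- unique(mutations)
```

Note that unique() only drops rows that match in every column. If the table carries extra columns (for example read counts) that differ between biopsies, selecting rows with `!duplicated(mutations[, c("sampleID", "chr", "pos", "ref", "mut")])` may be a better fit.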

Best, Inigo