im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

No genes with qglobal_cv < 0.1 #69

Open kvn95ss opened 2 years ago

kvn95ss commented 2 years ago

Hello,

I ran dndscv on filtered Mutect2 output (tumor vs normal, single patient, with a PoN of 4 samples). I built the mutation list by querying the VCF file with bcftools, which gives me the columns sampleID, chr, pos, ref and mut.

I'm using the hg38 reference from the precomputed .rda file in this repo - https://github.com/im3sanger/dndscv_data/tree/master/data

I'm able to get dndscv running on my data with these commands:

    cancer_test <- read.table("CC028_dmg_test.vcf")
    cancer_processed_data = dndscv(cancer_test, ref_db="data/RefCDS_human_GRCh38.p12.rda", cv=NULL)
    sel_cv = cancer_processed_data$sel_cv
    print(head(sel_cv), digits = 3)

I get this output:

      gene_name n_syn n_mis n_non n_spl n_ind wmis_cv wnon_cv wspl_cv wind_cv
8821   KRTAP5-4     0     1     0     0     2    46.7       0       0     772
16565   TAS2R30     0     2     0     0     1    61.4       0       0     276
7412      HLA-C     0     2     0     0     1    33.4       0       0     237
13331      PSG3     0     2     0     0     1    36.7       0       0     186
9056     LILRA4     0     2     0     0     1    31.2       0       0     177
18255  USP17L18     0     1     2     0     0    13.5     357     357       0
       pmis_cv ptrunc_cv pallsubs_cv  pind_cv qmis_cv qtrunc_cv qallsubs_cv
8821  0.016677  9.61e-01    5.69e-02 7.03e-05   0.897     0.983       0.996
16565 0.000788  9.54e-01    3.55e-03 3.48e-03   0.897     0.983       0.996
7412  0.002592  9.25e-01    1.06e-02 4.03e-03   0.897     0.983       0.996
13331 0.002184  9.08e-01    8.98e-03 5.08e-03   0.897     0.983       0.996
9056  0.002970  9.14e-01    1.19e-02 5.32e-03   0.897     0.983       0.996
18255 0.067056  1.96e-05    6.39e-05 1.00e+00   0.897     0.383       0.996
      pglobal_cv qglobal_cv
8821    5.37e-05          1
16565   1.52e-04          1
7412    4.71e-04          1
13331   5.01e-04          1
9056    6.77e-04          1
18255   6.81e-04          1

But when looking for significant genes, I get no output:

    print(cancer_processed_data$sel_cv[cancer_processed_data$sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")])

<0 rows> (or 0-length row.names)

What could be the reason for this? Does this imply there are no significant genes in the data?
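For context on why every qglobal_cv in the table is 1 even though the raw pglobal_cv values look small, here is a minimal sketch. It assumes the q-values are Benjamini-Hochberg adjusted p-values and that about 20,000 genes were tested (`n_genes` is a hypothetical round number, not taken from the output). Only the six p-values shown above are used, so this approximates the full adjustment rather than reproducing it.

```r
# Six smallest pglobal_cv values, copied from the output above
p_top <- c(5.37e-05, 1.52e-04, 4.71e-04, 5.01e-04, 6.77e-04, 6.81e-04)

# Assumption: roughly 20,000 genes were tested (hypothetical round number)
n_genes <- 20000

# Benjamini-Hochberg adjustment, telling p.adjust the total number of tests
q_top <- p.adjust(p_top, method = "BH", n = n_genes)
print(q_top)

# Under these assumptions, the top-ranked gene needs a p-value below this
# threshold to reach q < 0.1
p_needed <- 0.1 / n_genes
print(p_needed)
```

Under these assumptions all six adjusted values are capped at 1, and the most significant gene would need a p-value roughly ten times smaller than the 5.37e-05 observed here.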

shaghayeghsoudi commented 1 year ago

Hey, have you been able to find an answer to your question? I am running into the exact same problem and getting no hits. Thanks

im3sanger commented 1 year ago

Hello,

Sorry for the very late reply.

Yes, this means that there are no recurrently mutated genes in your dataset reaching statistical significance. Can you explain your experimental design in more detail? From your earlier description it sounds like you are analysing data from a single patient. Is that correct? In that case it would not be unexpected not to find any significant recurrence, as this relies on finding mutations in the same gene across multiple samples or patients.

Inigo
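To illustrate what recurrence across samples means in practice, here is a small sketch that counts how many distinct samples carry a mutation in each gene. The data frame is invented toy data; the column names `sampleID` and `gene` are assumed to match the annotated mutation table that dndscv returns, so check them against your own output.

```r
# Toy annotated mutations (invented for illustration)
annot <- data.frame(
  sampleID = c("P1", "P2", "P3", "P1", "P1"),
  gene     = c("TP53", "TP53", "TP53", "KRAS", "KRAS"),
  stringsAsFactors = FALSE
)

# Number of distinct samples mutated in each gene
samples_per_gene <- tapply(annot$sampleID, annot$gene,
                           function(s) length(unique(s)))
print(sort(samples_per_gene, decreasing = TRUE))
```

In this toy table KRAS has two mutations but both come from one sample, so it is not recurrent across samples. With a single patient, every gene has a count of 1.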

shaghayeghsoudi commented 1 year ago

Hi Inigo, thanks for your reply. I am indeed working on 27 WES sarcoma tumours. They are multi-regional: for each tumour I have 3-6 regions sampled and sequenced, which I merge into one sample per tumour by removing duplicate mutations. I was expecting to find at least a few hits, as sarcomas are not normally SSMs type of tumours, but I am getting all q-values equal to one, nothing significant.
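The merging step described above could look roughly like the sketch below. The table is toy data, and `tumourID` and `regionID` are hypothetical column names. The idea is to drop the region label and keep each mutation once per tumour, so a variant seen in several regions is not counted several times.

```r
# Toy per-region calls (invented); tumourID and regionID are hypothetical names
regions <- data.frame(
  tumourID = c("T1", "T1", "T1", "T2", "T2"),
  regionID = c("R1", "R2", "R1", "R1", "R2"),
  chr      = c("1", "1", "7", "1", "1"),
  pos      = c(100L, 100L, 500L, 100L, 200L),
  ref      = c("A", "A", "G", "A", "C"),
  mut      = c("T", "T", "C", "T", "G"),
  stringsAsFactors = FALSE
)

# Drop the region label and keep one row per tumour and mutation
merged <- unique(regions[, c("tumourID", "chr", "pos", "ref", "mut")])

# Use the tumour as the sample identifier (sampleID, chr, pos, ref, mut)
names(merged)[1] <- "sampleID"
print(merged)
```

The chr 1, position 100 mutation appears in two regions of T1 and is kept once, while the same change in T2 stays as a separate row because it belongs to a different tumour.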

im3sanger commented 1 year ago

Hello,

Thank you. Apologies, I had not realised that there were questions from separate users.

Can you confirm what value of theta you are getting? (dndsout$nbreg$theta).

Lack of significance can be caused by datasets that are too small or that do not have sufficient recurrence for any gene to reach significance. However, it is always important to check that your theta value is not very low (<<1). Very low theta values mean that there is very high variation in the density of synonymous mutations across genes. This typically reflects problems with the mutation calls, such as recurrent artefacts or SNP contamination. Large variation in the density of mutations across genes (high overdispersion) makes dNdScv more conservative (a gene needs to have more mutations to emerge from the noise) and results in less significance.

If your dataset has good theta values (>1, or ideally >3) and your mutation calls are reliable, then the lack of significance may reflect insufficient power (small datasets or insufficient recurrence).

Best, Inigo
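To make the effect of theta concrete, here is a small sketch. It assumes theta is the dispersion (size) parameter of a negative binomial model, for which the variance of a count with mean mu is mu + mu^2 / theta. The mean of 2 and the two theta values are arbitrary illustrative numbers, not taken from any dataset.

```r
# Variance of a negative binomial count with mean mu and dispersion theta
nb_variance <- function(mu, theta) mu + mu^2 / theta

mu <- 2  # arbitrary expected count for a gene (illustrative)

var_low  <- nb_variance(mu, theta = 0.1)  # very low theta: high overdispersion
var_high <- nb_variance(mu, theta = 5)    # good theta: close to Poisson

# Probability of observing more than 5 mutations by chance under each theta
tail_low  <- pnbinom(5, size = 0.1, mu = mu, lower.tail = FALSE)
tail_high <- pnbinom(5, size = 5,   mu = mu, lower.tail = FALSE)

print(c(var_low = var_low, var_high = var_high))
print(c(tail_low = tail_low, tail_high = tail_high))
```

With the lower theta, a high count is more likely under the background model, so a gene needs more mutations to stand out. On a real run the value to inspect is dndsout$nbreg$theta, as mentioned above.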