ramiromagno / gwasrapidd

gwasrapidd: an R package to query, download and wrangle GWAS Catalog data
https://rmagno.eu/gwasrapidd/
Other
91 stars 14 forks source link

Very Small P Values are Rounded to 0 #18

Closed MattCloward closed 3 years ago

MattCloward commented 3 years ago

A few of the studies in the GWAS catalog have p values that are smaller than 1e-300 and can even be as small as 1e-500 or 1e-600. When I use the "get_associations" function, these p values are rounded to 0.

Ex: for the study "GCST001884", rs10737680 and rs10490924 have p values of 1x10-434 and 4x10-89 respectively. (Source: https://www.ebi.ac.uk/gwas/studies/GCST001884)

However, when I run the following command: get_associations(study_id = "GCST001884")@associations[["pvalue"]] These two SNPs have p values of 0e+00.

Is there anything that can be done to fix this?

ramiromagno commented 3 years ago

Hi @MattCloward: Thank you for your question.

This rounding has to do with the representation of numbers in R. See ?.Machine, particularly .Machine$double.xmin.

Now, gwasrapidd does provide you access to these p-values if you inspect the mantissa and the exponent columns, see example below. In these cases with such low p-values, you should perhaps ask yourself if you shouldn't floor the number to zero anyways... p-values of that order are, generally, hardly meaningful: for a comparison the estimated number of atoms in the universe is of the order of 10^78 to 10^82 (https://www.universetoday.com/36302/atoms-in-the-universe/amp/), so the probability of picking one specific atom at random is 10^{-82}...

If you really want to deal with all these p-values, even those that are astronomically low, in R, then you can perhaps work with the log of the p-value instead, which lies in a range amenable to R's representation of numbers.

Otherwise, the only alternative, is to use a dedicated package that provides arbitrary-precision numbers, e.g., Rmpfr.

library(gwasrapidd)
library(dplyr, include.only = 'mutate')
foo <- get_associations(study_id = "GCST001884")
foo@associations[c('pvalue', 'pvalue_mantissa', 'pvalue_exponent')] %>%
  mutate(log_pvalue = log(pvalue_mantissa) + pvalue_exponent)
#> # A tibble: 19 x 4
#>      pvalue pvalue_mantissa pvalue_exponent log_pvalue
#>       <dbl>           <int>           <int>      <dbl>
#>  1 7.00e-16               7             -16     -14.1 
#>  2 9.00e-16               9             -16     -13.8 
#>  3 3.00e-11               3             -11      -9.90
#>  4 7.00e-11               7             -11      -9.05
#>  5 2.00e- 8               2              -8      -7.31
#>  6 2.00e-11               2             -11     -10.3 
#>  7 2.00e-11               2             -11     -10.3 
#>  8 9.00e-11               9             -11      -8.80
#>  9 5.00e- 9               5              -9      -7.39
#> 10 2.00e- 8               2              -8      -7.31
#> 11 0.                     4            -540    -539.  
#> 12 0.                     1            -434    -434   
#> 13 4.00e-89               4             -89     -87.6 
#> 14 1.00e-41               1             -41     -41   
#> 15 2.00e-26               2             -26     -25.3 
#> 16 2.00e-20               2             -20     -19.3 
#> 17 3.00e-15               3             -15     -13.9 
#> 18 4.00e-13               4             -13     -11.6 
#> 19 3.00e-11               3             -11      -9.90
MattCloward commented 3 years ago

Thank you very much for this detailed explanation, including work arounds and especially reasons not to do any work arounds! The p values I have been looking at are waaay more than astronomically low (4x10^22 stars in the universe) https://robertkaplinsky.com/work/stars-in-the-universe/

Thanks for your help!