kordk / torch-ecpg

(GPU accelerated) eCpG mapper
BSD 3-Clause "New" or "Revised" License

P-value output precision #36

Open kordk opened 1 year ago

kordk commented 1 year ago

P-values below a threshold of roughly 10^-8 are currently reported in the output as 0.0.

> summary(df$mt_p)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 5.960e-08 1.013e-06 2.436e-06 4.292e-06 9.954e-06 

> max(df[df$mt_p < 0.00000001, ]$mt_p)
[1] 0
> max(df[df$mt_p < 0.0000001, ]$mt_p)
[1] 5.960465e-08
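
The smallest non-zero value above is consistent with 2^-24 (about 5.9604645e-08), which is the spacing between float32 values just below 1.0. One possible explanation, stated here as an assumption and not confirmed against the torch-ecpg source, is that the p-value is obtained by subtracting a float32 CDF value from 1, so any tail probability below about 2^-25 rounds to exactly 0.0. The float32 format itself can store much smaller numbers. A minimal standard-library sketch of that effect (`to_float32` is a hypothetical helper, not part of torch-ecpg):

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# The largest float32 below 1.0 is 1 - 2**-24, so the smallest non-zero
# result of "1 - cdf" in float32 is 2**-24.
largest_below_one = to_float32(1.0 - 2.0**-24)
smallest_gap = 1.0 - largest_below_one
print("%.7e" % smallest_gap)  # 5.9604645e-08

# A CDF value closer to 1 than about 2**-25 rounds to exactly 1.0 in float32,
# so a hypothetical "1 - cdf" computation returns 0.0.
cdf32 = to_float32(1.0 - 1e-9)
print(1.0 - cdf32)  # 0.0

# float32 can store 1e-9 directly; only the subtraction from 1 loses it.
print(to_float32(1e-9) > 0.0)  # True
```

If this is the cause, changing only the output format would not recover the values reported as 0.0, because they are already zero before they are written.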

I expect there will be memory issues if higher precision is used to store p-values, alongside potential performance issues.
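
For a rough sense of scale, a float64 value takes twice the bytes of a float32 value. The sketch below assumes one stored p-value per methylation/expression locus pair and uses 1000 x 1000 loci purely as an illustrative size; `p_value_bytes` is a hypothetical helper, not a torch-ecpg function.

```python
import struct

# Standard sizes of IEEE 754 single and double precision values, in bytes.
BYTES_FLOAT32 = struct.calcsize("<f")
BYTES_FLOAT64 = struct.calcsize("<d")


def p_value_bytes(n_meth: int, n_expr: int, bytes_per_value: int) -> int:
    """Bytes needed to hold one p-value per methylation/expression pair."""
    return n_meth * n_expr * bytes_per_value


f32_bytes = p_value_bytes(1000, 1000, BYTES_FLOAT32)
f64_bytes = p_value_bytes(1000, 1000, BYTES_FLOAT64)
print(f32_bytes, f64_bytes)  # 4000000 8000000
```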

Given it may be useful for users to have access to more precise p-values, it's worth reviewing the costs and benefits of our options.

REF: https://stackoverflow.com/questions/63818676/what-is-the-machine-precision-in-pytorch-and-when-should-one-use-doubles

liamgd commented 10 months ago

All of the computation is currently done with the torch.float32 datatype, so the speed of the regression computation would remain the same. Only the speed at which the output is saved to disk would be reduced. I suspect the reduction in performance would be negligible, and storing the data as 32-bit floating point numbers would be advantageous. I will test the speed of both options (8 bit and 32 bit output).

liamgd commented 10 months ago

I have confirmed that the output dataframe uses the float32 datatype to store all values aside from the index. To display more digits, the float_format keyword argument can be used.

liamgd commented 10 months ago

Saving time for dummy data with 1000 gene expression loci, 1000 methylation loci, and 300 samples with the command tecpg -F <float-format> run mlr (no chunking or filtration):

Float format            Saving time (seconds)   Example output for numpy.float32 0.000000001234567890123456789
No parameter provided   3.0989                  1.2345679e-09
%.8f                    6.2078                  0.00000000
%.16f                   6.1495                  0.0000000012345679
%.32f                   7.8806                  0.00000000123456789236087161043542
%.8e                    6.4554                  1.23456789e-09
%.16e                   8.4926                  1.2345678923608716e-09
%.32e                   9.0908                  1.23456789236087161043542437255383e-09
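
The example column can be reproduced outside of tecpg, assuming the format string is applied printf-style (as with the float_format argument of pandas DataFrame.to_csv). This is a standard-library sketch, not the tecpg code path; `to_float32` is a hypothetical helper that stands in for numpy.float32 rounding.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Same example value as in the table, rounded to float32 precision.
value = to_float32(0.000000001234567890123456789)

# Apply each printf-style format to the float32-rounded value.
formatted = {fmt: fmt % value for fmt in ("%.8f", "%.16f", "%.8e", "%.16e")}
for fmt, text in formatted.items():
    print(fmt, text)
```

Fixed-point formats such as %.8f print small values as 0.00000000, while exponent formats such as %.8e keep the significant digits. Digits beyond roughly the eighth significant figure reflect the float32 rounding of the input, not additional precision.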

As of ca86fc2 on the development branch, there is an option to specify a float format. If none is provided, the default float format is used. @kordk Which format should be the default?

kordk commented 7 months ago

The default float format (also %.8e) is sufficient and is also the smallest in terms of characters.

liamgd commented 7 months ago

Just to clarify, is it acceptable that the issue mentioned in the original comment will continue to occur? Another option would be to create an optional flag to increase the float precision if the user desires.