samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
285 stars 243 forks source link

Rounding doubles in VCF output #793

Open nh13 opened 7 years ago

nh13 commented 7 years ago

I am trying to divine the purpose of rounding doubles for some values in VCFEncoder.java. For example 0.0825 gets rounded to 0.083, which is a significant difference. There's not a way to control those, and I am not sure if there is as sensible default. Perhaps keep at least n digits after the leading zeros? Or have at least three decimal points? @droazen @yfarjoun @tfenne what are your thoughts?

lbergelson commented 7 years ago

This has annoyed me when writing tests because the results of saving and reloading a variant context are not identical to the original in memory context. I had just assumed there was a good reason.

yfarjoun commented 7 years ago

I suspect that it's for rounding off machine errors (1->0.9999999999999) , which could cause a significant increase in storage footprint. I'm in favor of a better fix than seems to be currently implemented.

Y

On Tue, Jan 31, 2017 at 2:48 PM, Louis Bergelson notifications@github.com wrote:

This has annoyed me when writing tests because the results of saving and reloading a variant context are not identical to the original in memory context. I had just assumed there was a good reason.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/793#issuecomment-276471201, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0krhHD9_HzehZucdfK85_Z0K8N5lks5rX4_xgaJpZM4LzEC3 .

d-cameron commented 7 years ago

If possible, could the 2dp rounding of the QUAL field be looked at as well?

tfenne commented 7 years ago

I'm thinking about this again as I've just run into it, writing p-values for various tests into the info field. I think my preference would be to change the implementation to do something like the following: 1) If the value is >= 0.1 store it in normal numerical format with some fixed number of digits of precision (I might suggest 5 instead of 3) 2) If the value is < 0.1 but can be accurately represented with fixed number of digits, output in normal numerical format otherwise output in scientific notation, allowing for the same number of fixed digits.

Some examples to make clear what I'm saying:

Number Stored As
1.0 1.0
1.12345 1.12345
1.33333333333 1.33333
0.1 0.1
0.01 0.01
0.001 0.001
0.00001 0.00001 (edited by @yfarjoun as there was a 0 missing)
0.50000001 0.5
0.012345 1.2345e-2
0.0033333333 3.33333e-3

Obviously the cut-offs of 0.01 and 5 digits are somewhat arbitrary and I could see changing either one, but something like this would give a fixed number of digits of precision from the first significant digit and limit the amount of space taken up in the VCF.

nh13 commented 7 years ago

@tfenne I really like this idea. I'll take it on to submit a PR.

yfarjoun commented 5 years ago

this has been quiet for a while...trying to resurrect it.

  1. We should deal with large numbers as well
  2. We should deal with negative numbers as well
  3. I am pretty happy with 3 digits...any good reason to go to 5?
  4. We should use Math.round to remove digits, that way we are consistent with commonly accepted standards.