radmeth-adjust replaces significant p-values in scientific notation with '1' and thus hides best DMC

iromeo commented 1 year ago

'methpipe' to 'dnmtools' code refactoring introduced a critical bug in radmeth-adjust. In 1.2.1 version & current master branch the adjust step replaces significant p-values written in scientific notation with '1.0' and thus hides best DMC and mark them as totally insignificant.

Example: Fragment of DMC table after regression step:

chr5    167087304       +       CpG     0.944042        2100    390     2421    453
chr5    167087308       +       CpG     0.000236524     2109    769     2432    724
chr5    167087366       +       CpG     0.00873441      2289    238     2633    211
chr5    167087627       +       CpG     2.22045e-16     2015    473     2366    1151
chr5    167087636       +       CpG     2.22045e-16     2000    1087    2359    1797
chr5    167087645       +       CpG     1.9873e-14      1949    964     2330    1619
chr5    167087722       +       CpG     3.29441e-11     1781    1102    2106    1587
chr5    167087880       +       CpG     0.464406        1605    1237    1864    1459
chr5    167087934       +       CpG     0.779924        1723    1609    1972    1846
chr5    167087938       +       CpG     0.743816        1732    1644    1991    1885

After adjust step:

chr5    167087304       +       CpG     0.944042        0.000158571     0.0199993       2100    390     2421    453
chr5    167087308       +       CpG     0.000236524     0.000158571     0.0199993       2109    769     2432    724
chr5    167087366       +       CpG     0.00873441      0.000158571     0.0199993       2289    238     2633    211
chr5    167087627       +       CpG     1       1       1       2015    473     2366    1151
chr5    167087636       +       CpG     1       1       1       2000    1087    2359    1797
chr5    167087645       +       CpG     1       1       1       1949    964     2330    1619
chr5    167087722       +       CpG     1       1       1       1781    1102    2106    1587
chr5    167087880       +       CpG     0.464406        0.794533        0.910647        1605    1237    1864    1459
chr5    167087934       +       CpG     0.779924        0.794533        0.910647        1723    1609    1972    1846
chr5    167087938       +       CpG     0.743816        0.794533        0.910647        1732    1644    1991    1885

the bug is in is_number function, it isn't aware that number could have '-' and 'e' chars as it is scientific notation:

static bool
is_number(const string& str) {
  for (const char &c : str)
    if (c != '.' && !std::isdigit(c)) return false;
  return true;
}

Example DMC table attached. dmc.txt

I've implemented & tested a fix, will attach it as pull request

andrewdavidsmith commented 1 year ago

@iromeo Could this relate to speed of the code? I looked at your solution and I think it looks fine (at least it would be no worse), but I wonder if it would be more robust to just use istringstream and parse the number without any worries? It would incur the overhead of creating the object, copying the string, but the code already looks quite slow.

andrewdavidsmith commented 1 year ago

@iromeo I'll just pull and we can enhance this later if you think it's warranted.

iromeo commented 1 year ago

@andrewdavidsmith I agree that is_number is not optimal, but I believe that its optimizing will be minor optimization that will not affect the whole run time. I suppose that the rest part of the adjustment algorithm works much slower compared to is_number (I mean cross correlation and liptak-stouffer test)

istringstream

AFAIU the is_number check was introduced only in DNMTOOLS version and is missing in previous methpipe radmeth impl. The check is required because regression step could return NA instead of numbers. Thus we could just check the first char of string, if it is digit, then it is likely a valid number, otherwise some NA like string when we were unable to calculate p-value.

At the moment the whole adjustment step works pretty fast, compared to regression step or reads alignment. So as a user I don't see much sense in adjustment step optimization.

P.S: I could compare performance of several approaches on my WGBS samples.

andrewdavidsmith commented 1 year ago

@iromeo is there a reason this is still open or did we just forget to close it? I recall the problem was fixed, right?

iromeo commented 1 year ago

I think it is ok to close it. I use methpipe with this fix (from my fork) and everything is ok.

As for performance issue - as I mentioned before the adjustment step is much faster compared to very slow regression step, so I don't think that anything is needed to be optimised here.

andrewdavidsmith commented 1 year ago

They’re still the open PR. I can try to merge it. The code has changed much, but it should be ok.On Jul 12, 2023, at 8:46 AM, Roman Chernyatchik @.***> wrote: I think it is ok to close it. I use methpipe with this fix (from my fork) and everything is ok. As for performance issue - as I mentioned before the adjustment step is much faster compared to very slow regression step, so I don't think that anything is needed to be optimised here.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

smithlabcode / dnmtools

radmeth-adjust replaces significant p-values in scientific notation with '1' and thus hides best DMC #31