greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License
8 stars 5 forks source link

P_values of zero in the hetmech database #153

Closed ben-heil closed 3 years ago

ben-heil commented 5 years ago

SELECT * FROM dj_hetmech_app_pathcount WHERE source_id = 14421 AND target_id = 14792;

id path_count dwpc p_value metapath_id source_id target_id dgp_id
14927 2 4.22700018268129 0.0378205049612811 CrCtD 14421 14792 20495
31357 90 3.33906031212536 0.0125763977488916 CrCbGaD 14421 14792 163888
59342 1 2.67809412135379 0.0166159667835124 CbGaDrD 14421 14792 40803
84589 8 3.6438294986658 0.048865029398301 CbGaD 14421 14792 2841
127899 33 3.43243143754024 0.0119735568763205 CbGiGaD 14421 14792 55155
316012 1294 3.50943383023676 0 CcSEcCtD 14421 14792 82209

SELECT * FROM dj_hetmech_app_degreegroupedpermutation WHERE id = 82209;

id source_degree target_degree n_dwpcs n_nonzero_dwpcs nonzero_mean nonzero_sd metapath_id
82209 115 25 800 800 2.29988868478163 0.125316938712694 CcSEcCtD

We could do the math to calculate the log of the p-value initially, or we could use the Decimal package built into python to get arbitrarily large floats, then take the log of that.

In playing around with the built in floats and decimal floats, the built in floats are set to zero somewhere around 1/2^1070, while Decimal was able to represent anything, though it slowed down around 1/2^500000.

@zietzm @dhimmel do you guys have a preference?

dhimmel commented 5 years ago

Here's the hetmatpy code that calculates the p-value:

            row['p_value'] = None if row['sum'] == 0 else (
                row['nnz'] / row['n'] *
                (1 - scipy.special.gammainc(row['alpha'], row['beta'] * row['dwpc']))
            )

Let's find where the precision bottleneck is. If scipy.special.gammainc is outputting a number near 1, floating points will have less precision at that high of a number. We will also have to look at what scipy.special.gammainc supports in terms of input / output and consider any limitations of that function.

ben-heil commented 5 years ago

Since our number nonzero and our total number are the same for the example value, the bottleneck is likely in scipy's implementation of the regularized gamma function. We could use this implementation for an arbitrary precision version, but I have no idea how fast or slow it is.

We could use the scipy version to calculate the p_value, then try again with the mpmath version if it returns a p_value of zero

dhimmel commented 5 years ago

Okay let's have @zietzm take a look at mpmath and alternatives to scipy.special.gammainc. We probably want to return a float to the user in -log10 space. But any reasonable way of getting there is OKAY with me.

zietzm commented 5 years ago

If we are using the mpmath version, I'd say we might be better off transitioning to using the incomplete upper gamma function. Our formula would become

row['p_value'] = None if row['sum'] == 0 else (
                row['nnz'] / row['n'] *
                (mpmath.gammainc(
                    z=row['alpha'], 
                    a=row['beta'] * row['dwpc'], 
                    b=numpy.inf,
                    regularized=True))
            )

I think this could be preferable because more of the arithmetic would be done in whatever implementation is underlying mpmath and we could avoid subtraction, which may have some kind of weird floating point behavior.

Edit sorry the formulae are a bit different. Should be updated now

dhimmel commented 5 years ago

Okay, I think we will want to set row['nlog10_p_value'], which should return a python float. I am a bit nervous about returning an mpmath number. I still think it would be most precise to use a formulation that returns log (rather than taking it after the computation of the floating-point p-value). However, it does look like the mpmath solution will get us more precision than we currently have.

@zietzm feel free to open a PR with the change (unless you want @ben-heil to do it). Let's also make sure it's returning the same results as before. Also we should consider pulling the p-value calculation out to its own function so we can more easily test it.

zietzm commented 5 years ago

Turns out the fix is extremely simple. I have a mpmath solution ready for a PR but it isn't necessary at all. The error is coming from floating point arithmetic, though the SciPy lower function behavior isn't ideal either. See below where I use the values @ben-heil gave above

image

I will open a PR where we just make the switch from (1 - lower) to upper (regularized incomplete gamma), but I think we can stick with SciPy for now.

Edit: SciPy even makes clear this equivalence in the documentation:

The function satisfies the relation gammainc(a, x) + gammaincc(a, x) = 1 where gammaincc is the regularized upper incomplete gamma function.

ben-heil commented 5 years ago

We are no longer getting 0 p_values in the database due to imprecision. We do, however, have some metapaths with a standard deviation of zero, leading to a p_value of zero. Is this the intended behavior?

SELECT *
FROM dj_hetmech_app_degreegroupedpermutation dgp
JOIN dj_hetmech_app_pathcount pc
  ON pc.dgp_id = dgp.id
WHERE pc.p_value = 0;

I'm having trouble getting the example table to format and have to run to class, will add it later

zietzm commented 5 years ago

Zero standard deviation indicates that the nodes were not connected along the metapath in any of the permuted networks. The remedy for this would be further permuted networks. We can generate many very quickly, though I'm not sure how easy it would be to add DWPC information from these to the database.

dhimmel commented 5 years ago

We do, however, have some metapaths with a standard deviation of zero, leading to a p_value of zero.

I ran the query, and it looks like there are 16 p-values in the current database that are zero. Here are the first rows:

 id  | source_degree | target_degree | n_dwpcs | n_nonzero_dwpcs |   nonzero_mean   | nonzero_sd | metapath_id |   id   | path_count |       dwpc       | p_value | metapath_id | source_id | target_id | dgp_id 
-----+---------------+---------------+---------+-----------------+------------------+------------+-------------+--------+------------+------------------+---------+-------------+-----------+-----------+--------
  27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD         | 522775 |          1 | 7.67030208261062 |       0 | CpD         |     13364 |     14752 |     27
 403 |             8 |            13 |     800 |             110 | 5.73621913579885 |          0 | CtD         | 415353 |          1 | 5.73621913579885 |       0 | CtD         |     13779 |     14855 |    403
 595 |            17 |             5 |     400 |              63 | 5.83708705473093 |          0 | CtD         | 415510 |          1 | 5.83708705473093 |       0 | CtD         |     14025 |     14793 |    595
  27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD         | 522803 |          1 | 7.67030208261062 |       0 | CpD         |     13423 |     14752 |     27
  27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD         | 522916 |          1 | 7.67030208261062 |       0 | CpD         |     13755 |     14752 |     27
  34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD         | 522942 |          1 | 7.12099637347756 |       0 | CpD         |     13833 |     14826 |     34
  34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD         | 522950 |          1 | 7.12099637347756 |       0 | CpD         |     13856 |     14826 |     34
  27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD         | 522973 |          1 | 7.67030208261062 |       0 | CpD         |     13964 |     14828 |     27
  27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD         | 522997 |          1 | 7.67030208261062 |       0 | CpD         |     14029 |     14828 |     27
  34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD         | 523024 |          1 | 7.12099637347756 |       0 | CpD         |     14171 |     14784 |     34
  34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD         | 523029 |          1 | 7.12099637347756 |       0 | CpD         |     14186 |     14826 |     34
  34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD         | 523055 |          1 | 7.12099637347756 |       0 | CpD         |     14259 |     14826 |     34
  34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD         | 523065 |          1 | 7.12099637347756 |       0 | CpD         |     14349 |     14826 |     34
  27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD         | 523066 |          1 | 7.67030208261062 |       0 | CpD         |     14350 |     14752 |     27

n_nonzero_dwpcs is not zero for any of these values. Perhaps it's that the DWPC is the same for every permuted path, such that the sd is zero? I'd like to understand why this happens. I wonder if an easy fix is to set the p-value in this case to the percent nonzero (since all nonzeros are the same)?

ben-heil commented 5 years ago

All the p_values in the database that are the result of a zero standard deviation are from length one paths. This makes sense, as the DWPC for all length one paths with the same target degree should always be the same.

Upon looking into this, I found that many of the standard deviations that should be zero are set to NaN, or random-seeming values around 1e-7. @dhimmel suggested this could be the result of compounding float imprecision.

SELECT * 
FROM dj_hetmech_app_degreegroupedpermutation 
WHERE LENGTH(metapath_id)=3 LIMIT 30;
id source_degree target_degree n_dwpcs n_nonzero_dwpcs nonzero_mean nonzero_sd metapath_id
1 0 0 23159400 0 NaN NaN CpD
2 0 1 2662000 0 NaN NaN CpD
3 0 2 2129600 0 NaN NaN CpD
20 0 28 266200 0 NaN NaN CpD
21 0 30 266200 0 NaN NaN CpD
22 1 0 2366400 0 NaN NaN CpD
23 1 1 272000 636 8.47502086474711 4.2817194615655e-07 CpD
24 1 2 217600 1065 8.12844731798729 NaN CpD
25 1 3 81600 589 7.92571480745336 NaN CpD
26 1 4 81600 805 7.78187381474759 NaN CpD
27 1 5 108800 1347 7.67030208261062 0 CpD
28 1 6 54400 776 7.57914134773377 NaN CpD
29 1 7 54400 957 7.50206605134026 8.7240095853109e-08 CpD
30 1 8 27200 536 7.43530039854811 3.08544073706209e-07 CpD
ben-heil commented 5 years ago

Related, is it possible we have other columns with nonzero standard deviations that should be zero? There are a few metapaths in the degree grouped permutation table that are longer than length one, but still have zero sds. For example:

SELECT * 
FROM dj_hetmech_app_degreegroupedpermutation 
WHERE nonzero_sd = 0;
  id   | source_degree | target_degree | n_dwpcs | n_nonzero_dwpcs |   nonzero_mean   | nonzero_sd | metapath_id 
-------|---------------|---------------|---------|-----------------|------------------|------------|-------------
    27 |             1 |             5 |  108800 |            1347 | 7.67030208261062 |          0 | CpD
    34 |             1 |            15 |   54400 |            2100 | 7.12099637347757 |          0 | CpD
   170 |             9 |             1 |    2000 |              65 | 7.37640892424003 |          0 | CpD
   403 |             8 |            13 |     800 |             110 | 5.73621913579885 |          0 | CtD
   433 |             9 |            22 |     400 |             110 | 5.41429048170385 |          0 | CtD
   459 |            10 |            25 |     400 |             127 | 5.29769874446799 |          0 | CtD
   502 |            12 |            12 |     200 |              36 |  5.5735119405849 |          0 | CtD
   518 |            13 |             3 |     800 |              62 | 6.22662725407162 |          0 | CtD
   595 |            17 |             5 |     400 |              63 | 5.83708705473093 |          0 | CtD
  5639 |            17 |             1 |    3000 |               4 | 5.97371704797009 |          0 | CbGdD
  5802 |            26 |             2 |    2600 |               4 | 4.18410264978063 |          0 | CbGdD
  7044 |            50 |             2 |     200 |               2 | 6.13241014989191 |          0 | CbGuD
 11678 |            55 |             2 |     200 |               2 | 4.81505933426334 |          0 | CdGaD
 15665 |            12 |             1 |    1400 |               2 | 6.04933476914506 |          0 | CdGdD
 16889 |           173 |             1 |     200 |               2 | 4.71521697744948 |          0 | CdGdD
 17879 |            50 |             1 |     200 |               2 | 6.05280360808334 |          0 | CdGuD
 30761 |            26 |             1 |     800 |               2 | 6.86219977694712 |          0 | CuGdD
 31211 |            59 |             1 |     400 |               2 | 6.45248071430402 |          0 | CuGdD
 32579 |            34 |             1 |     200 |               2 | 5.90369490352837 |          0 | CuGuD
dhimmel commented 5 years ago

Notes from meeting

@ben-heil @zietzm @vincerubinetti and @dhimmel met today. Here are the takeaways from our discussion:

For nonzero_sd equal to zero, we should switch to empirically identifying the p-value rather than trying to fit a gamma distribution.

Let's look at everything with a small nonzero_sd. It may well be the case that there is clear separation of nonzero_sd that should be 0 versus those that are legitimately nonzero.

When nonzero_mean exists because n_nonzero_dwpcs is greater than zero, we should not have nonzero_sd equal to NaN.

For metapaths with length > 1 that have nonzero_sd equal to 0, we can use an empirical p-value calculation. It looks like the CdGdD metapaths with nonzero_sd=0, don't have many permuted values, which combines with the intermediate Gene node having the same degree resulting in the same permuted DWPCs.

@zietzm, it would be cool to plot a source degree by target degree heatmap to show the hurdle, alpha, and beta values for the gamma-hurdle distribution.

ben-heil commented 5 years ago

It looks like in addition to the NaNs and the nonzero values that should be zeros for the standard deviation, we also have a handful of Infinity values.

Also there's a large gap between 1e-7 standard deviations and the others. I'll post a notebook with plots later today

https://github.com/ben-heil/hetmech/blob/sd_distribution/small_standard_deviations.ipynb

dhimmel commented 3 years ago

Since we recently recomputed the database, I wanted to check back in here.

nonzero_sd = NaN

When nonzero_mean exists because n_nonzero_dwpcs is greater than zero, we should not have nonzero_sd equal to NaN.

We still have cases where nonzero_sd is NaN when n_nonzero_dwpcs > 0. These occur when n_nonzero_dwpcs is equal to 1:

PGPASSWORD=tm8ut9uzqx7628swwkb9 psql \
  --username=read_only_user \
  --dbname=connectivity_db \
  --host=search-db.het.io --port=5432 \
  --command "
SELECT *
FROM dj_hetmech_app_degreegroupedpermutation
WHERE n_nonzero_dwpcs > 0 AND nonzero_sd = 'NaN'
LIMIT 5;
"
  id  | metapath_id | source_degree | target_degree | n_dwpcs | n_nonzero_dwpcs |   nonzero_mean    | nonzero_sd 
------+-------------+---------------+---------------+---------+-----------------+-------------------+------------
 1280 | AeG         |             1 |             5 |  854000 |               1 | 8.758084551152342 |        NaN
 1289 | AeG         |             1 |            14 |  632000 |               1 | 8.243274887034808 |        NaN
 1363 | AeG         |             1 |            90 |    2000 |               1 | 7.312899092227238 |        NaN

Modifying the query for n_nonzero_dwpcs > 1, and 0 rows are found. Since the standard deviation of a single number is undefined, this NaN value is not incorrect. Another option would be to explicitly set the nonzero_sd to 0 when n_nonzero_dwpcs is 1.

Since we now use the empirical p-value "When all nonzero null DWPCs have the same positive value (standard deviation = 0)", I don't think this will cause any issues.

nonzero_sd = Infinity

It looks like in addition to the NaNs and the nonzero values that should be zeros for the standard deviation, we also have a handful of Infinity values.

I don't think we have any infinite nonzero_sd values anymore. No results for the following

PGPASSWORD=tm8ut9uzqx7628swwkb9 psql \
  --username=read_only_user \
  --dbname=connectivity_db \
  --host=search-db.het.io --port=5432 \
  --command "
SELECT *
FROM dj_hetmech_app_degreegroupedpermutation
WHERE nonzero_sd = 'Infinity'
LIMIT 5;"

Zero p-values

We still have zero p-values in the database.

PGPASSWORD=tm8ut9uzqx7628swwkb9 psql \
  --username=read_only_user \
  --dbname=connectivity_db \
  --host=search-db.het.io --port=5432 \
  --command "
SELECT *
FROM dj_hetmech_app_degreegroupedpermutation dgp
JOIN dj_hetmech_app_pathcount pc
  ON pc.dgp_id = dgp.id
WHERE pc.p_value = 0
LIMIT 30;
"

Select rows:

    id    | metapath_id | source_degree | target_degree |  n_dwpcs   | n_nonzero_dwpcs |     nonzero_mean     |      nonzero_sd       |    id     | metapath_id | source_id | target_id |  dgp_id  | path_count |        dwpc        | p_value 
----------+-------------+---------------+---------------+------------+-----------------+----------------------+-----------------------+-----------+-------------+-----------+-----------+----------+------------+--------------------+---------
 25330587 | DaGeAeG     |           496 |            11 |      68800 |           68800 |   1.5926319619531022 |  0.022615561515144177 |  14920750 | DaGeAeG     |      3969 |      2645 | 25330587 |       2821 | 2.8212962437489706 |       0
  1214139 | AeGuD       |            26 |             5 |        200 |               1 |   3.2074129619546627 |                   NaN |  19615665 | AeGuD       |       546 |      7922 |  1214139 |          1 |  4.678080440171242 |       0
  4124491 | CuGuD       |            24 |             5 |        200 |               0 |                  NaN |                   NaN |  27256409 | CuGuD       |     46428 |      7922 |  4124491 |          1 |  5.926254792525074 |       0
     6046 | AeG         |           108 |             1 |      86000 |               0 |                  NaN |                   NaN |  54912997 | AeG         |      1312 |     42918 |     6046 |          1 |  7.221738402776237 |       0
     4701 | AeG         |            58 |             6 |      76000 |               0 |                  NaN |                   NaN |  54795430 | AeG         |     32025 |     23041 |     4701 |          1 |  6.636703962395173 |       0

The first row here for DaGeAeG has a zero p-value due to imprecision. The actual p-value is super small. The remaining rows use the empiric p-value method and observe a DWPC higher than on any permuted network, hence have a p-value of zero.

While zero p-values are annoying for downstream analyses, I don't see any methodological issues here, so will go ahead and close this issues.