pnnl-predictive-phenomics / emll

GNU General Public License v2.0
4 stars 0 forks source link

Reproduction error on hackett.ipynb #4

Open mcnaughtonadm opened 5 months ago

mcnaughtonadm commented 5 months ago

In the exercise of re-implementing the EMLL example notebooks with PyTensor, we encountered a small bug in reproduction.

The cell:

with sns.plotting_context('notebook'):

    fig = plt.figure(figsize=(40,8))
    ax = fig.add_subplot(111, aspect='equal', adjustable='box')

sns.heatmap(fcc_med_measured[fcc_consistent_measured].values,
            center=0, robust=True, cmap='RdBu', cbar=True, rasterized=True, lw=1)

_= ax.set_yticks(np.arange(sorted_corr_df.shape[0]) + 0.5)
_= ax.set_xticks(np.arange(sorted_corr_df.shape[1]) + 0.5)
_ = ax.set_yticklabels(sorted_corr_df.index, rotation=0)
_ = ax.set_xticklabels(sorted_corr_df.columns)

ax.set_xlabel('Enzyme')
ax.set_ylabel('Boundary Flux')

plt.tight_layout()

yields the figure: image

whereas the original PyMC3 & Theano implementation yields the figure: image

The data populating this figure comes from the code snippet:

fccs_hpd = pm.hpd(fccs)
fcc_consistent = np.sign(fccs_hpd[:, :, 0]) == np.sign(fccs_hpd[:, :, 1])
fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)

sorted_corr_df = sort_df(corr_df)

fcc_med_measured = fcc_med.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)
fcc_consistent_measured = fcc_consistent_df.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)

which currently takes the fccs with a shape of (500,240,240) - which is the same in both new and old versions - and calculates the highest posterior density using pymc3.hpd(fccs) in the original implementation but pymc.hdi(fccs) for highest density interval in the new implementation. These methods are in all respects the same at this level of use, but replacing pm.hpd(fccs) with pm.hdi(fccs) in the above snippet returns the error:

/Users/mcna892/mambaforge/envs/emll_dev/lib/python3.11/site-packages/arviz/data/base.py:221: UserWarning: More chains (500) than draws (240). Passed array should have shape (chains, draws, *shape)
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[298], line 4
      2 fccs_hdi_repeated = np.repeat(fccs_hpd[np.newaxis, ...], 500, axis=0)
      3 fcc_consistent = np.sign(fccs_hdi_repeated[:, :, 0]) == np.sign(fccs_hdi_repeated[:, :, 1])
----> 4 fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)
      6 sorted_corr_df = sort_df(corr_df)
      8 fcc_med_measured = fcc_med.reindex(
      9     columns=sorted_corr_df.columns, index=sorted_corr_df.index)

File ~/mambaforge/envs/emll_dev/lib/python3.11/site-packages/pandas/core/frame.py:782, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    771         mgr = dict_to_mgr(
    772             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    773             # attribute "name"
   (...)
    779             copy=_copy,
    780         )
    781     else:
--> 782         mgr = ndarray_to_mgr(
    783             data,
    784             index,
    785             columns,
    786             dtype=dtype,
    787             copy=copy,
    788             typ=manager,
    789         )
    791 # For data is list-like, or Iterable (will consume into list)
    792 elif is_list_like(data):

File ~/mambaforge/envs/emll_dev/lib/python3.11/site-packages/pandas/core/internals/construction.py:336, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    331 # _prep_ndarraylike ensures that values.ndim == 2 at this point
    332 index, columns = _get_axes(
    333     values.shape[0], values.shape[1], index=index, columns=columns
    334 )
--> 336 _check_values_indices_shape_match(values, index, columns)
    338 if typ == "array":
    339     if issubclass(values.dtype.type, str):

File ~/mambaforge/envs/emll_dev/lib/python3.11/site-packages/pandas/core/internals/construction.py:420, in _check_values_indices_shape_match(values, index, columns)
    418 passed = values.shape
    419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (500, 240), indices imply (240, 240)

The problem stems from

fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)

as using r_labels for both columns and index implies a 240 by 240 dataframe, but the data in fcc_consistent is in the shape (500,240) from the sample trace used to calculate this. I tried to force a workaround which produced the incorrect image with the following:

# Initialize an empty array to store HDI results
fccs_hdi = np.zeros((240, 240, 2))

# Calculate HDI for each slice along the first dimension
for i in range(fccs.shape[1]):
    for j in range(fccs.shape[2]):
        hdi = pm.hdi(fccs[:, i, j])  # Compute HDI for each element (across the 500 dimension)
        fccs_hdi[i, j] = hdi  # Store HDI results

fcc_consistent = np.sign(fccs_hdi[:, :, 0]) == np.sign(fccs_hdi[:, :, 1])
fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)

sorted_corr_df = sort_df(corr_df)

fcc_med_measured = fcc_med.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)
fcc_consistent_measured = fcc_consistent_df.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)

but this is clearly incorrect. This issue needs to revisit the old implementation with PyMC3 to see what is going on and how we can extend this to using the new data structures present in the latest versions of PyMC with PyTensor.