pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

compute_log_likelihood for large datasets #6864

Open danjenson opened 1 year ago

danjenson commented 1 year ago

Describe the issue:

Process memory grows steadily while computing the log likelihood until it consumes all available memory (and swap). Replicated on Linux and an M1 Mac.

Linux system:

Void Linux, kernel 6.3.12_1
64 GB DDR5 RAM (64 GB swap)
24 GB RTX 4090 GPU
AMD Ryzen 9 7950X (16 cores, 32 threads)

Mac System:

16 GB memory, 8 cores

Dataset: ~161 MB total.

Reproducible code example:

#!/usr/bin/env python3
import numpy as np
import pandas as pd
import pymc as pm

def pymc_bayes(df: pd.DataFrame):
    a, b, c, i = df.a.values, df.b.values, df.c.values, df.i.values
    n_i = int(i.max() + 1)
    with pm.Model() as m:
        alpha = pm.Normal("alpha", 0, 1, shape=[n_i])
        beta_b = pm.HalfNormal("beta_b", 1)
        beta_c = pm.HalfNormal("beta_c", 1)
        beta_int = pm.Normal("beta_int", 0, 1)
        mu = alpha[i] + beta_b * b + beta_c * c + beta_int * b * c
        sigma = pm.Exponential("sigma", 1)
        a_hat = pm.Normal("a_hat", mu, sigma, observed=a)
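        # idata_kwargs={"log_likelihood": True} stores the pointwise log
        # likelihood for every observation, which is what exhausts memory here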
        idata = pm.sample(mp_ctx="spawn", idata_kwargs={"log_likelihood": True})
        idata.to_netcdf("pymc_bayes.nc")
    print("finished!")

if __name__ == "__main__":
    n, n_int = 2618018, 17  # to match the real dataset I care about
    df = pd.DataFrame(np.random.randn(n, 3), columns=["a", "b", "c"])
    df["i"] = np.random.randint(0, n_int, size=n)
    pymc_bayes(df)

Error message:

Killed by OS.

PyMC version information:

PYMC version: 5.7.2

Context for the issue:

Trying to use this with arviz.compare(...)
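
For reference, a minimal sketch of the intended downstream use, with hypothetical idata_a / idata_b holding fits of two candidate models (arviz.compare needs the pointwise log likelihood stored in each):

import arviz as az

# ranks the models, by LOO by default
cmp = az.compare({"model_a": idata_a, "model_b": idata_b})
print(cmp)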

ricardoV94 commented 1 year ago

The log-likelihood computation runs into exactly the same memory issues as the Deterministic you had in the other issue.

It's going to be a (chains × draws × len(dataset)) array of float64 numbers. That's exactly as large as the Deterministic that was consuming the whole RAM.

This is not uncommon, and it's one of the reasons we don't compute it by default. I've had to resort to libraries like dask to compute this quantity without running out of RAM (as well as the subsequent model comparison statistics in arviz).
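
For concreteness, a back-of-the-envelope calculation for the example above, assuming PyMC's default 4 chains of 1,000 draws each (an assumption; the actual defaults used may differ):

chains, draws, n_obs = 4, 1_000, 2_618_018  # n_obs from the repro script
bytes_needed = chains * draws * n_obs * 8   # 8 bytes per float64
print(f"{bytes_needed / 1e9:.1f} GB")       # ~83.8 GB, well beyond the 64 GB of RAM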

ricardoV94 commented 1 year ago

There's a related discussion here: https://github.com/pymc-devs/pymc/discussions/5371

danjenson commented 1 year ago

Ok, so is there a workaround? I see the discussion, but it seems inconclusive. It sounds like the primary solution is to work with a much smaller dataset so I don't run out of memory?

ricardoV94 commented 1 year ago

There is no workaround yet. The long-term solution is to allow the log-likelihood to be computed in dask (pymc-side), as well as the model comparison statistics (arviz-side), so that you can perform model comparison on large datasets. CC @OriolAbril

For now you can try to work with a smaller dataset, or with fewer chains/draws if the smaller run converges as well as the default number of chains/draws does.
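
A minimal sketch of that workaround, assuming the model m and idata from the reproducible example above (sampled without idata_kwargs={"log_likelihood": True}), and an arbitrary thinning factor of 10:

import pymc as pm

# keep every 10th posterior draw before computing the log likelihood
idata_thin = idata.sel(draw=slice(None, None, 10), groups="posterior")
with m:
    # the resulting log_likelihood group is 10x smaller along the draw dimension
    pm.compute_log_likelihood(idata_thin)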

OriolAbril commented 1 year ago

if your likelihood is easy-ish to compute by hand using distributions, you can also use xarray-einstats, which is already compatible with dask, to help with the process. The approach should be similar to https://oriolabrilpla.cat/en/blog/posts/2022/einstats-hmm-cmdstanpy.html but using the logpdf method instead of rvs.
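
A rough sketch of that approach for the normal likelihood in the model above, assuming a, b, c, i from the repro script and the variable/dimension names PyMC generates by default (alpha_dim_0 in particular is an assumption), with the posterior chunked so dask evaluates lazily:

import xarray as xr
from scipy import stats
from xarray_einstats.stats import XrContinuousRV

post = idata.posterior.chunk({"draw": 100})  # dask-backed posterior
obs = xr.DataArray(a, dims="obs_id")         # observed data
idx = xr.DataArray(i, dims="obs_id")         # group index per observation
b_da = xr.DataArray(b, dims="obs_id")
c_da = xr.DataArray(c, dims="obs_id")

mu = (
    post["alpha"].isel(alpha_dim_0=idx)      # vectorized indexing per observation
    + post["beta_b"] * b_da
    + post["beta_c"] * c_da
    + post["beta_int"] * b_da * c_da
)
# pointwise log likelihood with dims (chain, draw, obs_id), computed by dask
log_lik = XrContinuousRV(stats.norm, mu, post["sigma"]).logpdf(obs)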

There is also a PR open in ArviZ to compute loo/waic using dask, but it needs testing and I don't have much time to dedicate to it now. If anyone can try it out, it would be extremely helpful. https://github.com/arviz-devs/arviz/pull/2205