pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0
MCMC num_chains>1 does not work on Windows #2315

Open Garve opened 4 years ago

Garve commented 4 years ago

Hi!

I tried to implement a very simple Bayesian regression with NUTS/MCMC. It works well with a single Markov chain, but when I increase the number of chains, the program never finishes (and does not print any error message).

import torch
import pyro
import pyro.distributions as dist
from pyro.infer.mcmc import MCMC, NUTS

X = torch.FloatTensor([[float(i), 1.] for i in range(100)]).reshape(-1, 2)
y = torch.FloatTensor([float(i) for i in range(100)])

def linreg(x, y):
    w = pyro.sample('w', dist.Normal(torch.zeros(x.size(1))+5., 10.0))
    return pyro.sample('measurement', dist.Normal(x@w, 1.), obs=y)

hmc_kernel = NUTS(linreg)
posterior = MCMC(hmc_kernel, num_samples=10, warmup_steps=10, num_chains=2)
posterior.run(X, y)

If you set num_chains to 1, it will work.

Pyro 1.2.1
PyTorch 1.4.0+cpu
Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)]
Windows 10

Thanks for help!

fehiepsi commented 4 years ago

This might be a problem with multiprocessing on Windows. I can't reproduce the issue on Linux.
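
For context on why the platform could matter: on Windows the standard library's multiprocessing can only start workers with the "spawn" method, which launches a fresh interpreter and re-imports the main script in every child, while Linux (with the Python versions discussed here) defaults to "fork". Scripts that start worker processes are therefore usually wrapped in an `if __name__ == "__main__":` guard on Windows. The sketch below uses only the standard library to show which start methods are available. `describe_start_methods` is a hypothetical helper name, and whether the guard resolves this particular hang is an assumption, not something confirmed in this thread.

```python
import multiprocessing as mp
import sys


def describe_start_methods():
    """Return the platform name, the available start methods, and the current one."""
    return {
        "platform": sys.platform,
        "available": mp.get_all_start_methods(),
        # allow_none=True reports the method without fixing it as a side effect.
        "current": mp.get_start_method(allow_none=True),
    }


if __name__ == "__main__":
    # Under "spawn", child processes re-import this file, so anything that
    # starts workers (e.g. posterior.run(X, y) with num_chains > 1) belongs
    # inside this guard rather than at module top level.
    print(describe_start_methods())
```

On Windows the "available" list contains only "spawn", which is one concrete difference from the Linux setup where the issue did not reproduce.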

fehiepsi commented 4 years ago

@Garve could you add the following lines at the top of your script to see if the issue persists:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

I don't have a Windows environment, so it's hard to diagnose the problem. My best guess is that it is related to this issue.

Garve commented 4 years ago

Hi! Sorry for the late answer. No, it still doesn't work, sadly.

ecotner commented 4 years ago

Hi, I am having the same problem, but on Ubuntu 18.04, not Windows. When I run with num_chains=1, everything is fine and the progress bar fills up. When I run with num_chains=2, it displays two empty progress bars (at least in JupyterLab); after a little while a third progress bar pops up (using stderr) and immediately crashes the kernel:

[Screenshot: mcmc_error]

Sometimes the third progress bar does not pop up at all and the script just hangs forever. I tried @fehiepsi's CUDA_VISIBLE_DEVICES fix, but that did not change anything.

fehiepsi commented 4 years ago

@ecotner Could you paste part of the error message from the console? It might give us some hints. Also, did you run MCMC twice? IIRC there is a limitation (see also this topic) when using PyTorch multiprocessing in JupyterLab.

fonnesbeck commented 3 years ago

I'm getting similar behavior on Mac, Linux and Windows. The sampler just hangs when the progress bar appears for multiple chains. Single chains work fine. I have tried using CPU on all platforms and GPU on Linux.

fehiepsi commented 3 years ago

Hi @fonnesbeck, I just installed a fresh Pyro in a new conda environment on Linux. The topic model works for me in JupyterLab and Jupyter Notebook. But if I make a second MCMC run, I get [ERROR LOG CHAIN:0]Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork. This could be a hint, I guess.

fonnesbeck commented 3 years ago

OK, thanks. Interestingly, I can get it going on Linux with GPU if I remove the mp_context="spawn" flag that is recommended in the docstring. However, after 10 or so iterations on each chain the MCMC run slows down so much that it's actually much faster to run a single chain. You can see this in the screen capture below, which shows drastically different sampling rates for 2-chain and single-chain runs:

[Screenshot: Screen Shot 2021-04-09 at 3 07 13 PM]

fehiepsi commented 3 years ago

I guess that is expected if moving tensors in, out of, and across processes is costly. In our experience, making multiprocessing work with PyTorch is quite tricky, and sadly we don't know what the best practice is for MCMC (it is probably just a matter of changing a few lines of code to make MCMC run more efficiently). :(