stefanradev93 / BayesFlow

A Python library for amortized Bayesian workflows using generative neural networks.
https://bayesflow.org/
MIT License

OOM after ~ 50 epochs #160

Open sboehringer opened 3 months ago

sboehringer commented 3 months ago

When running BayesFlow to analyze regression models, I get OOM errors after about 50-60 epochs. The simulations require a data set of covariates from which a one-dimensional outcome is simulated. Here are my batched Prior/Simulator classes:

import numpy as np

class MyPrior:
    def __init__(self, prior):
        self.mu = prior['mu']
        self.sigma = prior['sigma']
        self.rng = np.random.default_rng().normal
    def single(self):
        # draw one parameter vector from the normal prior
        return self.rng(self.mu, scale=self.sigma)
    def batch(self, N):
        return [self.single() for i in range(N)]
    def __call__(self, batch_size):
        pars = self.batch(batch_size)
        return np.array(pars)

class RegressionSimulator:
    def __init__(self, dCov, model):
        self.dCov = dCov
        self.model = model
    def single(self, par):
        # simulate a one-dimensional outcome for the fixed covariate data
        d = simulateOutcome(par, self.dCov, self.model)
        return d.reshape([d.shape[0], 1])
    def batch(self, par):
        return [self.single(p) for p in par]
    def __call__(self, par, *args):
        return np.array(self.batch(par))

The simulation is set up as:

def doRegress(o, a,
    # pass-through args
    model, N, pars, Par,
    Nepochs, Nbatch, NsumD, Nit, Nval, Nho, Npost, weight,
    post_height, post_alpha, post_color):

    Log(2, Sprintf('Running regression model: %{regress}s, sample size N=%{N}d', o))

    # <p> simulate real data set
    d = simulateFromSpec(model, N = int(N))

    rSim = RegressionSimulator(d['dCov'], model['outcome']) 
    simulator = bf.simulation.Simulator(rSim)

    prior = bf.simulation.Prior(MyPrior(model['prior']))
    generative_model = bf.simulation.GenerativeModel(prior, simulator, simulator_is_batched=True)

    summary_net = bf.networks.SetTransformer(input_dim=NsumD)
    inference_net = bf.networks.InvertibleNetwork(num_params=len(Par))
    amortized_posterior = bf.amortizers.AmortizedPosterior(inference_net, summary_net)

    trainer = bf.trainers.Trainer(amortizer=amortized_posterior, generative_model=generative_model)
    losses = trainer.train_online(epochs=Nepochs, iterations_per_epoch=Nit, batch_size=Nbatch, validation_sims=Nval)
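
In case it helps with diagnosing this, a single batch drawn from the generative model can be inspected like so (just a quick sketch; the dictionary keys are the ones I believe the GenerativeModel output uses, adjust if they differ in your version):

# Inspect the shapes of one simulated batch going into the SetTransformer.
sims = generative_model(batch_size=4)
print(sims['prior_draws'].shape)   # (4, num_params)
print(sims['sim_data'].shape)      # (4, n_obs, 1)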

This is on an NVIDIA GeForce RTX 3070 with 8 GB of GPU memory. The sample size is ~300 with up to 4 covariates, so this is a small data set.

Thank you.

stefanradev93 commented 3 months ago

Can you please paste the error stack trace here so we can investigate? Do you run out of GPU memory or system RAM?

sboehringer commented 3 months ago

It is GPU memory. Please find the log below (line-wrapped as captured from a tmux session). I can also provide a self-contained example for reproduction, if that would be helpful.

debug-log-20240415.txt

sboehringer commented 3 months ago

I should add that the fix suggested in the error output (setting TF_GPU_ALLOCATOR=cuda_malloc_async) did not change anything.

vpratz commented 3 months ago

Given that the error occurs only after 50 or 60 epochs, it looks as if memory accumulates somehow, though I don't see why or where this might happen. Does the code run fast enough to test it without a GPU? If it does, could you run it on CPU only (using $ CUDA_VISIBLE_DEVICES='' ./bayesFlow.py --regress linearMVmi) and monitor RAM usage to see whether it increases with epochs? Does it run out of memory in that case as well? This would help pinpoint whether the error is GPU-related or more general.
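
One way to take the measurement would be something along these lines (just a sketch, assuming psutil is available; trainer, Nepochs, Nit and Nbatch are the objects from your script):

import os
import psutil

proc = psutil.Process(os.getpid())

# Train one epoch at a time and print the resident set size afterwards,
# so a per-epoch increase becomes visible.
for epoch in range(Nepochs):
    trainer.train_online(epochs=1, iterations_per_epoch=Nit, batch_size=Nbatch)
    print(f"Epoch {epoch + 1}: RSS = {proc.memory_info().rss / 1e9:.3f} GB")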

If this is not possible, a self-contained example for reproduction would be a great help.

stefanradev93 commented 3 months ago

Are you running the training from a script or from a Jupyter notebook?

vpratz commented 3 months ago

Are you running the training from a script or from a Jupyter notebook?

The command is at the top of the log; it is a training script.

stefanradev93 commented 3 months ago

I see. I would need to run it on my GPU workstation to reproduce the problem.

sboehringer commented 3 months ago

@vpratz: here is the memory usage when running on the CPU:

End Epoch 4: 2.564g
Beginning Epoch 5: 2.682g
~ 100/1000 Iterations: 2.690g
End Epoch 5: 2.693g

Memory from the previous epoch does not seem to be freed or reused. After that initial jump there is some further consumption of about 8 MB, after which memory usage becomes stable. This pattern repeats for subsequent epochs.

These numbers seem compatible with what I see on the GPU: at roughly 130 MB per epoch, 7.5 GB of GPU memory would be exhausted after ~57 epochs, which matches when the OOM occurs.
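
For the GPU side, per-epoch usage could be tracked in a similar way (again just a sketch, assuming a TensorFlow version that exposes get_memory_info and a single visible GPU):

import tensorflow as tf

# Print TensorFlow's current and peak GPU allocation after every epoch
# (values are reported in bytes).
for epoch in range(Nepochs):
    trainer.train_online(epochs=1, iterations_per_epoch=Nit, batch_size=Nbatch)
    info = tf.config.experimental.get_memory_info("GPU:0")
    print(f"Epoch {epoch + 1}: current = {info['current'] / 1e9:.2f} GB, "
          f"peak = {info['peak'] / 1e9:.2f} GB")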

I will put together a self-contained example for reproduction. Thank you.

sboehringer commented 3 months ago

Here is an example that I have verified reproduces the OOM: bayesFlow-debug.txt

It can be run directly without arguments, and it shouldn't touch the disk (no reads or writes).

stefanradev93 commented 3 months ago

Thanks, I will investigate whether this is a problem that specifically affects the SetTransformer!