Pitplatsch opened this issue 1 year ago
I am not really an expert on the torch aspects here, but I'll try to answer:
> Is this a known issue?

I wasn't aware of it, and I wonder why it happens. It should only load X and Z, and then Y and Z, into memory, which shouldn't be much.
> I am of the impression that this is due to the many copies of the data loaded onto the GPU for conditional independence testing. Can this be avoided, or are routines necessary to clean the data from the GPUs after conditional independence is calculated?
I would hope that the data is cleaned from the GPU, but I don't know whether it does.
> Or is pre-computing the null-distributions a solution to the problem?
No, the null distribution only pertains to distance correlation, and that is computed on the CPU.
Not sure this helps, but I would welcome any improvements on the torch part. This is the `_get_single_residuals` function in the class.
I am facing the same problem on a machine with 8 GPUs.
""" Number of devices: 8 -- Kernel partition size: 0 Number of devices: 8 -- Kernel partition size: 42625 Number of devices: 8 -- Kernel partition size: 21313 Number of devices: 8 -- Kernel partition size: 10657 Number of devices: 8 -- Kernel partition size: 5329 Number of devices: 8 -- Kernel partition size: 2665 Number of devices: 8 -- Kernel partition size: 1333 Number of devices: 8 -- Kernel partition size: 667 Number of devices: 8 -- Kernel partition size: 334 Number of devices: 8 -- Kernel partition size: 167 Number of devices: 8 -- Kernel partition size: 84 Number of devices: 8 -- Kernel partition size: 42 Number of devices: 8 -- Kernel partition size: 21 Number of devices: 8 -- Kernel partition size: 11 Number of devices: 8 -- Kernel partition size: 6 Number of devices: 8 -- Kernel partition size: 3 torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.07 GiB (GPU 0; 79.19 GiB total capacity; 57.58 GiB already allocated; 4.62 GiB free; 57.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"""
I do not understand what I should do. I even tried clearing the cache with `torch.cuda.empty_cache()`, but it doesn't help. Please help.
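As a side note, the `max_split_size_mb` hint from the error message is an allocator setting, not something to call at runtime; it has to be in place before the first CUDA allocation (safest: before importing torch). A minimal sketch; the value 128 is an arbitrary starting point, not a tigramite recommendation:

```python
import os

# Must be set before torch initializes its CUDA caching allocator,
# so place this at the very top of the script, before `import torch`.
# Smaller values reduce fragmentation at some performance cost.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Note this only mitigates fragmentation ("reserved memory >> allocated memory"); it cannot help when the allocation itself (27.07 GiB here) simply exceeds the free VRAM.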
I ran into the same problem. My dataset shape is (20000, 4), and using GPDCtorch() requires 305 GB of memory allocation, which is not usable at all. I run PCMCI with a max lag of 1.
I had the same problem. About 980 GB of memory allocation is required.
I have found a solution, but it is currently missing multi-GPU support (it uses PyTorch Lightning). I will add that to achieve feature parity with the current version and then open a pull request.
I am currently exploring the use of GPDC for causal discovery in tigramite using the pytorch implementation for increased speed for discovery of long time series with many variables and large tau_max. However, I run out of VRAM, even using modern GPUs (in my case: A100 with 80 GB of VRAM). See minimal working (crashing) example below for the technical details.
Is this a known issue? I am of the impression that this is due to the many copies of the data loaded onto the GPU for conditional independence testing. Can this be avoided, or are routines necessary to clean the data from the GPUs after conditional independence is calculated? Or is pre-computing the null-distributions a solution to the problem?
Thank you so much for your assistance and keep up the great developing work on tigramite!
Example
Example of overloading VRAM during causal discovery using GPDCtorch, run on 3650 timesteps of 5 variables with tau_max = 7.
Test run on a single NVIDIA A100 with 80 GB of VRAM.
Process crashes after less than 5 minutes.
Software used
tigramite 5.2.0.4, gpytorch 1.10, pytorch 2.0.1 (py3.10_cuda11.7_cudnn8.5.0_0)
Code
output short
output complete