marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

CUDA out of memory #136

Open qisheng-jiang opened 1 year ago

qisheng-jiang commented 1 year ago

Hi, I'm trying to run the ogbn_papers100m dataset with Marius using PARTITION_BUFFER on the main branch, but I couldn't find an example for it. So I followed the example in the eurosys_2023_artifact branch and wrote the following YAML file.

# examples/configuration/ogbn_paper100m_disk.yaml
model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: ALL
      - type: ALL
      - type: ALL
    layers:
      - - type: FEATURE
          output_dim: 128
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 128
          output_dim: 128
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 128
          output_dim: 128
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 128
          output_dim: 40
          bias: true
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
storage:
  device_type: cuda
  dataset:
    dataset_dir: ./datasets/marius/ogbn_papers100m/
  edges:
    type: FLAT_FILE
  nodes:
    type: HOST_MEMORY
  features:
    type: PARTITION_BUFFER
    options:
      num_partitions: 8192
      buffer_capacity: 3584
      prefetching: false
      fine_to_coarse_ratio: 646
      num_cache_partitions: 0
      node_partition_ordering: SEQUENTIAL
  prefetch: false
  shuffle_input: true
  full_graph_evaluation: true
training:
  batch_size: 1000
  num_epochs: 10
  pipeline:
    sync: false
    staleness_bound: 8
    batch_host_queue_size: 8
    batch_device_queue_size: 8
    batch_loader_threads: 4
    batch_transfer_threads: 4
  epochs_per_shuffle: 1
  logs_per_epoch: 10
evaluation:
  batch_size: 1000
  pipeline:
    sync: true
  epochs_per_eval: 11

I have tested the fb15k_237 example and it works well. However, it didn't work for large datasets such as ogbn_papers100m. I used the following commands for the ogbn_papers100m dataset.

$ marius_preprocess --dataset ogbn_papers100m --output_dir datasets/marius/ogbn_papers100m/ --num_partitions 8192 --sequential_train_nodes 
$ marius_train examples/configuration/ogbn_paper100m_disk.yaml 

Then a CUDAOutOfMemoryError occurred.

$ marius_train examples/configuration/ogbn_paper100m_disk.yaml 
[2023-03-10 00:33:44.267] [info] [marius.cpp:41] Start initialization
[03/10/23 00:33:47.939] Initialization Complete: 3.672s
[03/10/23 00:33:47.941] Generating Sequential Ordering
[03/10/23 00:33:47.941] Num Train Partitions: 90
[03/10/23 00:35:13.841] ################ Starting training epoch 1 ################
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 15.48 GiB (GPU 0; 23.70 GiB total capacity; 11.88 GiB already allocated; 10.78 GiB free; 12.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:578 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f6490b5e1ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1966f (0x7f64e20c066f in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x49738 (0x7f64e20f0738 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x49962 (0x7f64e20f0962 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::resize_bytes_cuda(c10::StorageImpl*, unsigned long) + 0x5d (0x7f64d277e2bd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #5: <unknown function> + 0x1558c0 (0x7f64d27808c0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #6: at::native::resize_cuda_(at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>) + 0x4d (0x7f64d277e5dd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #7: at::_ops::resize_::call(at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>) + 0x179 (0x7f64baa49c59 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::resize_output(at::Tensor const&, c10::ArrayRef<long>) + 0x51 (0x7f64ba1c8141 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: void at::native::(anonymous namespace)::index_select_out_cuda_impl<float>(at::Tensor&, at::Tensor const&, long, at::Tensor const&) + 0x19e (0x7f64928f5c2e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #10: at::native::index_select_out_cuda(at::Tensor const&, long, at::Tensor const&, at::Tensor&) + 0x49d (0x7f64927b46ad in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #11: at::native::index_select_cuda(at::Tensor const&, long, at::Tensor const&) + 0xf1 (0x7f64927b4971 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #12: <unknown function> + 0x2d636d8 (0x7f64939156d8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #13: <unknown function> + 0x2d63743 (0x7f6493915743 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #14: at::_ops::index_select::redispatch(c10::DispatchKeySet, at::Tensor const&, long, at::Tensor const&) + 0x7c (0x7f64ba5a809c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x35600bd (0x7f64bbb3d0bd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x35606e6 (0x7f64bbb3d6e6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::index_select::call(at::Tensor const&, long, at::Tensor const&) + 0x17b (0x7f64ba6269ab in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #18: GraphSageLayer::forward(at::Tensor, DENSEGraph, bool) + 0xd5 (0x7f63e3cba9a5 in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #19: GeneralEncoder::forward(c10::optional<at::Tensor>, c10::optional<at::Tensor>, DENSEGraph, bool) + 0x1937 (0x7f63e3ca6077 in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #20: Model::forward_nc(c10::optional<at::Tensor>, c10::optional<at::Tensor>, DENSEGraph, bool) + 0xa4 (0x7f63e3cc4ca4 in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #21: Model::train_batch(std::shared_ptr<Batch>, bool) + 0x114 (0x7f63e3cc5cf4 in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #22: ComputeWorkerGPU::run() + 0x1e4 (0x7f63e3ce7984 in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #23: <unknown function> + 0xd6de4 (0x7f64e22aade4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #24: <unknown function> + 0x8609 (0x7f65018df609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #25: clone + 0x43 (0x7f6501a19133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

It seems Marius still tries to allocate memory on the GPU and does not use the partition buffer. Could you please tell me whether the YAML configuration is correct? Also, are there any examples of using the partition buffer directly through the Python API rather than through YAML? m.storage.tensor_from_file seems to only support device memory.

Thanks in advance for your reply.

My environment is as follows.

$ marius_env_info
cmake:
  version: 3.20.0
cpu_info:
  num_cpus: 96
  total_memory: 503GB
cuda:
  version: '11.6'
gpu_info:
  - memory: 24GB
    name: NVIDIA GeForce RTX 3090
  - memory: 24GB
    name: NVIDIA GeForce RTX 3090
marius:
  bindings_installed: true
  install_path: /usr/local/lib/python3.8/dist-packages/marius
  version: 0.0.2
openmp:
  version: '201511'
operating_system:
  platform: Linux-5.19.0-32-generic-x86_64-with-glibc2.29
pybind:
  PYBIND11_BUILD_ABI: _cxxabi1013
  PYBIND11_COMPILER_TYPE: _gcc
  PYBIND11_STDLIB: _libstdcpp
python:
  deps:
    numpy_version: 1.24.2
    omegaconf_version: 2.3.0
    pandas_version: 2.0.0rc0
    pip_version: 20.0.2
    pyspark_version: 3.3.2
    pytest_version: 7.2.2
    torch_version: !!python/object/new:torch.torch_version.TorchVersion
      - 1.12.0+cu116
    tox_version: 4.4.6
  version: "3.8.10 (default, Nov 14 2022, 12:59:47) \n[GCC 9.4.0]"
pytorch:
  install_path: /usr/local/lib/python3.8/dist-packages/torch
  version: !!python/object/new:torch.torch_version.TorchVersion
    - 1.12.0+cu116
JasonMoho commented 1 year ago

The OOM is likely caused by the sampling configuration using all neighbors:

    train_neighbor_sampling:
      - type: ALL
      - type: ALL
      - type: ALL

In the current implementation, using all neighbors is not scalable for mini-batch training on large graphs, due to the exponential explosion in the number of sampled neighbors.
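
As a rough back-of-the-envelope (assuming an average degree of roughly 30 for ogbn_papers100m and ignoring deduplication of repeated neighbors), the deepest hop of a 1000-node batch alone reaches roughly:

$$1000 \times 30^{3} \approx 2.7 \times 10^{7} \ \text{(3 hops of ALL)} \qquad \text{vs.} \qquad 1000 \times 15 \times 10 \times 5 = 7.5 \times 10^{5} \ \text{(15-10-5 fanouts)}$$

With 128-dimensional float32 features, the former alone corresponds to on the order of 10 GB of feature tensors, which is in the same ballpark as the 15.48 GiB allocation in the trace above.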

We used uniform sampling of 15-10-5 neighbors for training in the eurosys_2023_artifact branch config.

For the main branch, the corresponding neighbor sampling config is:

    train_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5

Updating this should solve the OOM.
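
For reference, a sketch of how this slots into the ogbn_paper100m_disk.yaml above: the three UNIFORM entries replace the three ALL entries one-for-one under model.encoder, and the 15-10-5 fanouts follow the eurosys_2023_artifact config (they can be tuned).

    model:
      encoder:
        train_neighbor_sampling:
          - type: UNIFORM
            options:
              max_neighbors: 15
          - type: UNIFORM
            options:
              max_neighbors: 10
          - type: UNIFORM
            options:
              max_neighbors: 5
        layers:
          # FEATURE and GRAPH_SAGE layers unchanged from the config above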

qisheng-jiang commented 1 year ago

Thanks so much for replying. That fixed the OOM. However, there are new errors. When I set device_type: cuda, the error is:

root@c54d23ae2acd:/working_dir# marius_train examples/configuration/ogbn_paper100m_disk.yaml 
[2023-03-13 09:11:20.123] [info] [marius.cpp:41] Start initialization
[03/13/23 09:11:23.806] Initialization Complete: 3.682s
[03/13/23 09:11:23.807] Generating Sequential Ordering
[03/13/23 09:11:23.808] Num Train Partitions: 90
[03/13/23 09:12:51.593] ################ Starting training epoch 1 ################
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from launch_vectorized_kernel at ../aten/src/ATen/native/cuda/CUDALoops.cuh:98 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f7c974c41ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0xb88 (0x7f7c478c0218 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #2: void at::native::gpu_kernel<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0x31b (0x7f7c478c0deb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #3: <unknown function> + 0x18f68e2 (0x7f7c478a88e2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #4: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x20 (0x7f7c478a9b30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #5: <unknown function> + 0x1a3078d (0x7f7c6f40d78d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2d4d91b (0x7f7c48cff91b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #7: at::_ops::fill__Scalar::call(at::Tensor&, c10::Scalar const&) + 0x12b (0x7f7c6f9fa77b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::zero_(at::Tensor&) + 0x83 (0x7f7c6f40dcc3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2d4b955 (0x7f7c48cfd955 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #10: at::_ops::zero_::call(at::Tensor&) + 0x9e (0x7f7c6fd5910e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::structured_nll_loss_backward_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::OptionalTensorRef, long, long, at::Tensor const&, at::Tensor const&) + 0x3d (0x7f7c47df5b0d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #12: <unknown function> + 0x2d49a6b (0x7f7c48cfba6b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #13: <unknown function> + 0x2d49b35 (0x7f7c48cfbb35 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #14: at::_ops::nll_loss_backward::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&) + 0x94 (0x7f7c6fd338e4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x377a776 (0x7f7c71157776 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x377ae0b (0x7f7c71157e0b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::nll_loss_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&) + 0x1cd (0x7f7c6fd9e5fd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::generated::NllLossBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x23d (0x7f7c70e8d42d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x3db919b (0x7f7c7179619b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1640 (0x7f7c7178f710 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x698 (0x7f7c71790148 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x8b (0x7f7c7178790b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4f (0x7f7c9532726f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #24: <unknown function> + 0xd6de4 (0x7f7c9771bde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: <unknown function> + 0x8609 (0x7f7cb6d50609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x43 (0x7f7cb6e8a133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

When I set device_type: cpu, the error is:

root@c54d23ae2acd:/working_dir# marius_train examples/configuration/ogbn_paper100m_disk.yaml 
[2023-03-13 09:13:27.636] [info] [marius.cpp:41] Start initialization
[03/13/23 09:13:30.313] Initialization Complete: 2.676s
[03/13/23 09:13:30.314] Generating Sequential Ordering
[03/13/23 09:13:30.314] Num Train Partitions: 90
[03/13/23 09:14:58.836] ################ Starting training epoch 1 ################
terminate called after throwing an instance of 'c10::IndexError'
  what():  Target 132 is out of bounds.
Exception raised from nll_loss_out_frame at ../aten/src/ATen/native/LossNLL.cpp:226 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fd4810371ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10db8ae (0x7fd4a9c8f8ae in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x25437ee (0x7fd4ab0f77ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x25438cd (0x7fd4ab0f78cd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: at::_ops::nll_loss_forward::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x86 (0x7fd4aadef776 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x36c9202 (0x7fd4ac27d202 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x36c9813 (0x7fd4ac27d813 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: at::_ops::nll_loss_forward::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x1a1 (0x7fd4aae6d5a1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::nll_loss(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x47 (0x7fd4aa682a77 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2744cfd (0x7fd4ab2f8cfd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::_ops::nll_loss::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x1a2 (0x7fd4aaf74c52 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::nll_loss_nd(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x2eb (0x7fd4aa68cccb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2744d4d (0x7fd4ab2f8d4d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::nll_loss_nd::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x1b8 (0x7fd4aad40e28 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: at::native::cross_entropy_loss(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, double) + 0x185 (0x7fd4aa68c6a5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x274465d (0x7fd4ab2f865d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::cross_entropy_loss::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, double) + 0x1b2 (0x7fd4aaf6ad22 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: CrossEntropyLoss::operator()(at::Tensor, at::Tensor, bool) + 0xbf (0x7fd3d81d537f in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #18: Model::train_batch(std::shared_ptr<Batch>, bool) + 0x1af (0x7fd3d81dad8f in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #19: ComputeWorkerCPU::run() + 0x4ca (0x7fd3d81f826a in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #20: <unknown function> + 0xd6de4 (0x7fd4d272cde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #21: <unknown function> + 0x8609 (0x7fd4f1d61609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #22: clone + 0x43 (0x7fd4f1e9b133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Is this error caused by an incorrect preprocess command?

$ marius_preprocess --dataset ogbn_papers100m --output_dir datasets/marius/ogbn_papers100m/ --num_partitions 8192 --sequential_train_nodes