qisheng-jiang opened 1 year ago
The OOM is likely caused by the sampling configuration using all neighbors:
train_neighbor_sampling:
  - type: ALL
  - type: ALL
  - type: ALL
In the current implementation, sampling all neighbors does not scale to large graphs for mini-batch training: the number of sampled nodes grows exponentially with the number of layers.
We used uniform sampling of 15-10-5 neighbors for training in the eurosys_2023_artifact branch config.
For the main branch the corresponding config for neighbor sampling is:
train_neighbor_sampling:
  - type: UNIFORM
    options:
      max_neighbors: 15
  - type: UNIFORM
    options:
      max_neighbors: 10
  - type: UNIFORM
    options:
      max_neighbors: 5
Updating this should solve the OOM.
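To see why the ALL setting blows up memory, here is a back-of-envelope sketch. The average degree of ~30 for ogbn-papers100m (~111M nodes, ~1.6B edges) is an illustrative assumption, and `sampled_nodes` is a hypothetical helper, not part of Marius:

```python
def sampled_nodes(fanouts):
    """Upper bound on nodes touched by a multi-hop sample with the given per-layer fan-outs."""
    total, frontier = 1, 1  # start from one target node
    for f in fanouts:
        frontier *= f       # each frontier node expands to at most f neighbors
        total += frontier
    return total

# 15-10-5 uniform sampling: at most 1 + 15 + 150 + 750 nodes per target
print(sampled_nodes([15, 10, 5]))    # -> 916
# ALL neighbors with an average degree of ~30: roughly 30x larger
print(sampled_nodes([30, 30, 30]))   # -> 27931
```

In practice the ALL case is even worse than this average-degree estimate, since high-degree hub nodes dominate the sampled subgraphs.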
Thanks so much for replying. That fixes the OOM. However, there are several new errors.
When I set device_type: cuda, the error is:
root@c54d23ae2acd:/working_dir# marius_train examples/configuration/ogbn_paper100m_disk.yaml
[2023-03-13 09:11:20.123] [info] [marius.cpp:41] Start initialization
[03/13/23 09:11:23.806] Initialization Complete: 3.682s
[03/13/23 09:11:23.807] Generating Sequential Ordering
[03/13/23 09:11:23.808] Num Train Partitions: 90
[03/13/23 09:12:51.593] ################ Starting training epoch 1 ################
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from launch_vectorized_kernel at ../aten/src/ATen/native/cuda/CUDALoops.cuh:98 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f7c974c41ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0xb88 (0x7f7c478c0218 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #2: void at::native::gpu_kernel<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0x31b (0x7f7c478c0deb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #3: <unknown function> + 0x18f68e2 (0x7f7c478a88e2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #4: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x20 (0x7f7c478a9b30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #5: <unknown function> + 0x1a3078d (0x7f7c6f40d78d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2d4d91b (0x7f7c48cff91b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #7: at::_ops::fill__Scalar::call(at::Tensor&, c10::Scalar const&) + 0x12b (0x7f7c6f9fa77b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::zero_(at::Tensor&) + 0x83 (0x7f7c6f40dcc3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2d4b955 (0x7f7c48cfd955 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #10: at::_ops::zero_::call(at::Tensor&) + 0x9e (0x7f7c6fd5910e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::structured_nll_loss_backward_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::OptionalTensorRef, long, long, at::Tensor const&, at::Tensor const&) + 0x3d (0x7f7c47df5b0d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #12: <unknown function> + 0x2d49a6b (0x7f7c48cfba6b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #13: <unknown function> + 0x2d49b35 (0x7f7c48cfbb35 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #14: at::_ops::nll_loss_backward::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&) + 0x94 (0x7f7c6fd338e4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x377a776 (0x7f7c71157776 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x377ae0b (0x7f7c71157e0b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::nll_loss_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&) + 0x1cd (0x7f7c6fd9e5fd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::generated::NllLossBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x23d (0x7f7c70e8d42d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x3db919b (0x7f7c7179619b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1640 (0x7f7c7178f710 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x698 (0x7f7c71790148 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x8b (0x7f7c7178790b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4f (0x7f7c9532726f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #24: <unknown function> + 0xd6de4 (0x7f7c9771bde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: <unknown function> + 0x8609 (0x7f7cb6d50609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x43 (0x7f7cb6e8a133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
When I set device_type: cpu, the error is:
root@c54d23ae2acd:/working_dir# marius_train examples/configuration/ogbn_paper100m_disk.yaml
[2023-03-13 09:13:27.636] [info] [marius.cpp:41] Start initialization
[03/13/23 09:13:30.313] Initialization Complete: 2.676s
[03/13/23 09:13:30.314] Generating Sequential Ordering
[03/13/23 09:13:30.314] Num Train Partitions: 90
[03/13/23 09:14:58.836] ################ Starting training epoch 1 ################
terminate called after throwing an instance of 'c10::IndexError'
what(): Target 132 is out of bounds.
Exception raised from nll_loss_out_frame at ../aten/src/ATen/native/LossNLL.cpp:226 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fd4810371ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10db8ae (0x7fd4a9c8f8ae in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x25437ee (0x7fd4ab0f77ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x25438cd (0x7fd4ab0f78cd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: at::_ops::nll_loss_forward::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x86 (0x7fd4aadef776 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x36c9202 (0x7fd4ac27d202 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x36c9813 (0x7fd4ac27d813 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: at::_ops::nll_loss_forward::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x1a1 (0x7fd4aae6d5a1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::nll_loss(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x47 (0x7fd4aa682a77 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2744cfd (0x7fd4ab2f8cfd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::_ops::nll_loss::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x1a2 (0x7fd4aaf74c52 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::nll_loss_nd(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x2eb (0x7fd4aa68cccb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2744d4d (0x7fd4ab2f8d4d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::nll_loss_nd::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long) + 0x1b8 (0x7fd4aad40e28 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: at::native::cross_entropy_loss(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, double) + 0x185 (0x7fd4aa68c6a5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x274465d (0x7fd4ab2f865d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::cross_entropy_loss::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, double) + 0x1b2 (0x7fd4aaf6ad22 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: CrossEntropyLoss::operator()(at::Tensor, at::Tensor, bool) + 0xbf (0x7fd3d81d537f in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #18: Model::train_batch(std::shared_ptr<Batch>, bool) + 0x1af (0x7fd3d81dad8f in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #19: ComputeWorkerCPU::run() + 0x4ca (0x7fd3d81f826a in /usr/local/lib/python3.8/dist-packages/marius/libmarius.so)
frame #20: <unknown function> + 0xd6de4 (0x7fd4d272cde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #21: <unknown function> + 0x8609 (0x7fd4f1d61609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #22: clone + 0x43 (0x7fd4f1e9b133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
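The CPU and CUDA failures appear to be the same underlying bug: cross-entropy loss received a target class id (132) that is >= the model's output dimension. On CPU this surfaces as a readable `c10::IndexError`; on CUDA the same bounds check fires as an opaque device-side assert. Since ogbn-papers100m has 172 label classes, the decoder's output dimension in the config should be at least 172. A minimal sketch of the failing invariant (`check_targets` is a hypothetical helper, not a Marius API):

```python
# Cross-entropy (nll_loss) requires every target id t to satisfy
# 0 <= t < num_classes, where num_classes is the model's output dimension.
def check_targets(targets, num_classes):
    """Return the target ids that would trip nll_loss's bounds check."""
    return [t for t in targets if not (0 <= t < num_classes)]

# A decoder configured with too few output classes rejects label 132:
print(check_targets([5, 132, 171], num_classes=128))  # -> [132, 171]
# With the full 172 classes of ogbn-papers100m, all labels are in range:
print(check_targets([5, 132, 171], num_classes=172))  # -> []
```

Running with `CUDA_LAUNCH_BLOCKING=1`, as the CUDA trace suggests, would likewise pin the device-side assert to the loss computation.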
Is this error caused by an incorrect preprocess command?
$ marius_preprocess --dataset ogbn_papers100m --output_dir datasets/marius/ogbn_papers100m/ --num_partitions 8192 --sequential_train_nodes
Hi, I'm trying to run the ogbn_papers100m dataset using Marius with PARTITION_BUFFER on the main branch, but I cannot find an example for it. So I followed the example in the eurosys_2023_artifact branch and rewrote a YAML file. I have tested the example for fb15k_237 and it works well. However, it didn't work for large datasets such as ogbn_papers100m. I used the following commands for the ogbn_papers100m dataset. Then a CUDAOutOfMemoryError occurred. It seems Marius still tries to allocate memory on the GPU and does not use the partition buffer? Could you please tell me whether the configuration in the YAML file is correct? By the way, are there any examples of using the Python API to drive the partition buffer directly rather than going through YAML? m.storage.tensor_from_file seems to only support device memory. Thanks for replying.
My environment is the following.