marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

CUDA error: device-side assert triggered when trying to execute example scripts #80

Closed IronySuzumiya closed 2 years ago

IronySuzumiya commented 2 years ago

Describe the bug: I installed the program successfully and it passed test/cpp/end_to_end. However, when I tried to execute examples/training/scripts/fb15k_gpu.sh (and some other configs with GPU enabled), it triggered an nll_loss_backward_reduce_cuda_kernel_2d assertion failure.

To Reproduce: Steps to reproduce the behavior:

  1. Execute bash examples/training/scripts/fb15k_gpu.sh
  2. The marius_preprocess step completes without any problems.
  3. When marius_train reaches the backward pass for the first batch of the first epoch, the following error occurs:
    nfp@node19:~/marius$ bash examples/training/scripts/fb15k_gpu.sh 
    fb15k
    Downloading fb15k.tgz to output_dir/fb15k.tgz
    Extracting
    Extraction completed
    Detected delimiter: ~   ~
    Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
    Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
    Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
    Number of instance per file:[483142, 50000, 59071]
    Number of nodes: 14951
    Number of edges: 592213
    Number of relations: 1345
    Delimiter: ~    ~
    ['/home/nfp/.local/bin/marius_train', 'examples/training/configs/fb15k_gpu.ini']
    [info] [10/28/21 22:12:59.865] Start preprocessing
    [debug] [10/28/21 22:12:59.866] Initializing Model
    [debug] [10/28/21 22:12:59.866] Empty Encoder
    [debug] [10/28/21 22:12:59.866] DistMult Decoder
    [debug] [10/28/21 22:12:59.867] data/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/embeddings/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/relations/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/train/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/evaluation/ directory already exists
    [debug] [10/28/21 22:12:59.867] data/marius/edges/test/ directory already exists
    [debug] [10/28/21 22:12:59.880] Edges: DeviceMemory storage initialized
    [debug] [10/28/21 22:12:59.894] Edges shuffled
    [debug] [10/28/21 22:12:59.894] Edge storage initialized. Train: 483142, Valid: 50000, Test: 59071
    [debug] [10/28/21 22:13:00.004] Node embeddings: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.004] Node embeddings state: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.004] Node embeddings initialized: 14951
    [debug] [10/28/21 22:13:00.014] Relation embeddings: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.014] Relation embeddings state: DeviceMemory storage initialized
    [debug] [10/28/21 22:13:00.014] Relation embeddings initialized: 1345
    [debug] [10/28/21 22:13:00.014] Getting batches from edge list
    [info] [10/28/21 22:13:00.014] Training set initialized
    [debug] [10/28/21 22:13:00.014] Getting batches from edge list
    [debug] [10/28/21 22:13:00.014] Batches initialized
    [info] [10/28/21 22:13:00.015] Evaluation set initialized
    [info] [10/28/21 22:13:00.015] Preprocessing Complete: 0.149s
    [debug] [10/28/21 22:13:00.032] Loaded training set
    [info] [10/28/21 22:13:00.032] ################ Starting training epoch 1 ################
    [trace] [10/28/21 22:13:00.032] Starting Batch. ID 0, Starting Index 0, Batch Size 10000 
    [trace] [10/28/21 22:13:00.034] Batch: 0 Accumulated 11109 unique embeddings
    [trace] [10/28/21 22:13:00.034] Batch: 0 Accumulated 640 unique relations
    [trace] [10/28/21 22:13:00.034] Batch: 0 Indices sent to device
    [trace] [10/28/21 22:13:00.034] Batch: 0 Node Embeddings read
    [trace] [10/28/21 22:13:00.034] Batch: 0 Node State read
    [trace] [10/28/21 22:13:00.034] Batch: 0 Relation Embeddings read
    [trace] [10/28/21 22:13:00.034] Batch: 0 Relation State read
    [trace] [10/28/21 22:13:00.035] Batch: 0 prepared for compute
    [debug] [10/28/21 22:13:00.040] Loss: 124804.266, Regularization loss: 0.012812799
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
    /pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
    ... (the same assertion is repeated for threads [3,0,0] through [31,0,0]) ...
    Traceback (most recent call last):
    File "/home/nfp/.local/bin/marius_train", line 8, in <module>
    sys.exit(main())
    File "/home/nfp/.local/lib/python3.6/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Exception raised from launch_unrolled_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:132 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f95645bcd62 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
    frame #1: void at::native::gpu_kernel_impl<at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > const&) + 0xb37 (0x7f95665b2f27 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #2: void at::native::gpu_kernel<at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > const&) + 0x113 (0x7f95665bf333 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #3: void at::native::opmath_gpu_kernel_with_scalars<float, float, float, at::native::AddFunctor<float> >(at::TensorIteratorBase&, at::native::AddFunctor<float> const&) + 0xa9 (0x7f95665bf4c9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #4: <unknown function> + 0xe5d953 (0x7f9566592953 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #5: at::native::add_kernel_cuda(at::TensorIteratorBase&, c10::Scalar const&) + 0x15 (0x7f95665930a5 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #6: <unknown function> + 0xe5e0cf (0x7f95665930cf in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #7: at::native::structured_sub_out::impl(at::Tensor const&, at::Tensor const&, c10::Scalar const&, at::Tensor const&) + 0x40 (0x7f95a9f1ef00 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #8: <unknown function> + 0x25e52ab (0x7f9567d1a2ab in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #9: <unknown function> + 0x25e5372 (0x7f9567d1a372 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
    frame #10: at::_ops::sub_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0xb9 (0x7f95aa55d3f9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #11: <unknown function> + 0x34be046 (0x7f95ac03c046 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #12: <unknown function> + 0x34be655 (0x7f95ac03c655 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #13: at::_ops::sub_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x13f (0x7f95aa5b5b2f in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #14: <unknown function> + 0x3f299b0 (0x7f95acaa79b0 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #15: torch::autograd::generated::LogsumexpBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1dc (0x7f95abd1447c in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #16: <unknown function> + 0x3896817 (0x7f95ac414817 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x145b (0x7f95ac40fa7b in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #18: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x57a (0x7f95ac4107aa in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #19: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f95ac4081c9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #20: <unknown function> + 0xc71f (0x7f962b3ad71f in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
    frame #21: <unknown function> + 0x76db (0x7f962d01f6db in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #22: clone + 0x3f (0x7f962d35871f in /lib/x86_64-linux-gnu/libc.so.6)
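
For reference (my reading of the log, not Marius internals): the assertion t >= 0 && t < n_classes comes from PyTorch's NLL-loss CUDA kernel and fires when a target class index falls outside the number of classes. Below is a minimal, hypothetical sketch (plain PyTorch, no Marius code) that triggers the same device-side assert, and it also uses the CUDA_LAUNCH_BLOCKING=1 hint from the traceback:

    # Hypothetical sketch of the failing condition; not Marius code.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA initializes so the
                                              # assert is reported at the real call site
    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 10, device="cuda", requires_grad=True)  # 10 classes
    targets = torch.tensor([0, 3, 9, 12], device="cuda")            # 12 is out of range

    # With CUDA_LAUNCH_BLOCKING=1 the failure is reported at the offending call;
    # otherwise it can surface later, e.g. during backward(), as in the log above.
    loss = F.cross_entropy(logits, targets)
    loss.backward()  # -> RuntimeError: CUDA error: device-side assert triggered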

Expected behavior: The program works fine with the CPU configs:

nfp@node19:~/marius$ bash examples/training/scripts/fb15k_cpu.sh 
fb15k
Downloading fb15k.tgz to output_dir/fb15k.tgz
Extracting
Extraction completed
Detected delimiter: ~   ~
Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
Number of instance per file:[483142, 50000, 59071]
Number of nodes: 14951
Number of edges: 592213
Number of relations: 1345
Delimiter: ~    ~
['/home/nfp/.local/bin/marius_train', 'examples/training/configs/fb15k_cpu.ini']
[info] [10/28/21 22:19:07.259] Start preprocessing
[info] [10/28/21 22:19:08.397] Training set initialized
[info] [10/28/21 22:19:08.397] Evaluation set initialized
[info] [10/28/21 22:19:08.397] Preprocessing Complete: 1.137s
[info] [10/28/21 22:19:08.410] ################ Starting training epoch 1 ################
[info] [10/28/21 22:19:08.904] Total Edges Processed: 50000, Percent Complete: 0.099
[info] [10/28/21 22:19:09.252] Total Edges Processed: 95000, Percent Complete: 0.198
[info] [10/28/21 22:19:09.700] Total Edges Processed: 152000, Percent Complete: 0.298
[info] [10/28/21 22:19:09.998] Total Edges Processed: 190000, Percent Complete: 0.397
[info] [10/28/21 22:19:10.418] Total Edges Processed: 237000, Percent Complete: 0.496
[info] [10/28/21 22:19:10.809] Total Edges Processed: 286000, Percent Complete: 0.595
[info] [10/28/21 22:19:11.211] Total Edges Processed: 336000, Percent Complete: 0.694
[info] [10/28/21 22:19:11.567] Total Edges Processed: 383000, Percent Complete: 0.793
[info] [10/28/21 22:19:11.958] Total Edges Processed: 432000, Percent Complete: 0.893
[info] [10/28/21 22:19:12.320] Total Edges Processed: 478000, Percent Complete: 0.992
[info] [10/28/21 22:19:12.357] ################ Finished training epoch 1 ################
[info] [10/28/21 22:19:12.357] Epoch Runtime (Before shuffle/sync): 3946ms
[info] [10/28/21 22:19:12.357] Edges per Second (Before shuffle/sync): 122438.414
[info] [10/28/21 22:19:12.358] Pipeline flush complete
[info] [10/28/21 22:19:12.374] Edges Shuffled
[info] [10/28/21 22:19:12.374] Epoch Runtime (Including shuffle/sync): 3963ms
[info] [10/28/21 22:19:12.374] Edges per Second (Including shuffle/sync): 121913.195
[info] [10/28/21 22:19:12.389] Starting evaluating
[info] [10/28/21 22:19:12.709] Pipeline flush complete
[info] [10/28/21 22:19:15.909] Num Eval Edges: 50000
[info] [10/28/21 22:19:15.909] Num Eval Batches: 50
[info] [10/28/21 22:19:15.909] Auc: 0.941, Avg Ranks: 40.139, MRR: 0.336, Hits@1: 0.212, Hits@5: 0.476, Hits@10: 0.600, Hits@20: 0.707, Hits@50: 0.827, Hits@100: 0.895
[info] [10/28/21 22:19:15.920] Evaluation complete: 3531ms
[info] [10/28/21 22:19:15.931] ################ Starting training epoch 2 ################
[info] [10/28/21 22:19:16.361] Total Edges Processed: 46000, Percent Complete: 0.099
[info] [10/28/21 22:19:16.900] Total Edges Processed: 97000, Percent Complete: 0.198
[info] [10/28/21 22:19:17.424] Total Edges Processed: 156000, Percent Complete: 0.298
[info] [10/28/21 22:19:17.697] Total Edges Processed: 189000, Percent Complete: 0.397
[info] [10/28/21 22:19:18.078] Total Edges Processed: 238000, Percent Complete: 0.496
[info] [10/28/21 22:19:18.466] Total Edges Processed: 288000, Percent Complete: 0.595
[info] [10/28/21 22:19:18.825] Total Edges Processed: 336000, Percent Complete: 0.694
[info] [10/28/21 22:19:19.160] Total Edges Processed: 381000, Percent Complete: 0.793
[info] [10/28/21 22:19:19.584] Total Edges Processed: 436000, Percent Complete: 0.893
[info] [10/28/21 22:19:19.909] Total Edges Processed: 481000, Percent Complete: 0.992
[info] [10/28/21 22:19:19.928] ################ Finished training epoch 2 ################
[info] [10/28/21 22:19:19.928] Epoch Runtime (Before shuffle/sync): 3997ms
[info] [10/28/21 22:19:19.928] Edges per Second (Before shuffle/sync): 120876.16
[info] [10/28/21 22:19:19.929] Pipeline flush complete
[info] [10/28/21 22:19:19.947] Edges Shuffled
[info] [10/28/21 22:19:19.948] Epoch Runtime (Including shuffle/sync): 4016ms
[info] [10/28/21 22:19:19.948] Edges per Second (Including shuffle/sync): 120304.29
[info] [10/28/21 22:19:19.961] Starting evaluating
[info] [10/28/21 22:19:20.246] Pipeline flush complete
[info] [10/28/21 22:19:20.255] Num Eval Edges: 50000
[info] [10/28/21 22:19:20.255] Num Eval Batches: 50
[info] [10/28/21 22:19:20.255] Auc: 0.972, Avg Ranks: 21.458, MRR: 0.431, Hits@1: 0.294, Hits@5: 0.595, Hits@10: 0.719, Hits@20: 0.812, Hits@50: 0.906, Hits@100: 0.949
[info] [10/28/21 22:19:20.271] Evaluation complete: 309ms
[info] [10/28/21 22:19:20.282] ################ Starting training epoch 3 ################
[info] [10/28/21 22:19:20.694] Total Edges Processed: 47000, Percent Complete: 0.099
[info] [10/28/21 22:19:21.042] Total Edges Processed: 95000, Percent Complete: 0.198
[info] [10/28/21 22:19:21.425] Total Edges Processed: 143000, Percent Complete: 0.298
[info] [10/28/21 22:19:21.872] Total Edges Processed: 203000, Percent Complete: 0.397
^C[info] [10/28/21 22:19:22.195] Total Edges Processed: 244000, Percent Complete: 0.496
[info] [10/28/21 22:19:22.561] Total Edges Processed: 288000, Percent Complete: 0.595
[info] [10/28/21 22:19:22.971] Total Edges Processed: 342000, Percent Complete: 0.694
[info] [10/28/21 22:19:23.266] Total Edges Processed: 380000, Percent Complete: 0.793
[info] [10/28/21 22:19:23.747] Total Edges Processed: 438000, Percent Complete: 0.893
[info] [10/28/21 22:19:24.101] Total Edges Processed: 479142, Percent Complete: 0.992
...

Environment: I tried this on two machines and got the same error on both.
Platform: Linux (Ubuntu 18.04 LTS)
Python version: 3.6.9
PyTorch version: 1.10.0+cu102; 1.10.0+cu113
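
For completeness, the same details can be gathered with a short snippet like this (hypothetical, just for reference):

    import platform, sys
    import torch

    print("Platform:", platform.platform())
    print("Python:", sys.version.split()[0])
    print("PyTorch:", torch.__version__, "CUDA:", torch.version.cuda)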

JasonMoho commented 2 years ago

Thanks for reporting this; I'll see if I can reproduce it on my end.

Also, I noticed that debug logs are being printed for the GPU example. Did you modify the example configuration file to enable debug logging? I just want to make sure, because those shouldn't be printed for that example.

JasonMoho commented 2 years ago

Okay, I was able to reproduce this exact error when using PyTorch 1.10.

I downgraded to PyTorch 1.9 and was able to run the example successfully. Could you try that and see if it works?

IronySuzumiya commented 2 years ago

I only modified log_level to trace; the other config options are unchanged. OK, I'll try that later. Thanks!

JasonMoho commented 2 years ago

Sounds good. It looks like this issue is related to a known PyTorch bug: https://github.com/pytorch/pytorch/issues/66872

I'll update the system requirements to note that PyTorch 1.10 is not currently supported and leave this issue open until there's a fix or workaround.
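
In the meantime, a rough guard like the following (just a sketch, assuming the regression is specific to the 1.10 series; not part of Marius) could catch the unsupported version up front:

    import torch

    # Hypothetical check: refuse to run on the 1.10 series until the upstream fix lands.
    major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
    if (major, minor) == (1, 10):
        raise RuntimeError("PyTorch 1.10 hits a known NLL-loss CUDA regression; "
                           "use 1.9.x or 1.8.2 LTS instead")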

IronySuzumiya commented 2 years ago

It works with PyTorch 1.8.2 LTS. Thanks for the help!