quiver-team / torch-quiver

PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.
https://torch-quiver.readthedocs.io/en/latest/
Apache License 2.0
293 stars, 36 forks

Test Serving Fail device = rank%len(device_list) ZeroDivisionError: integer division or modulo by zero #161

Closed LukeLIN-web closed 1 year ago

LukeLIN-web commented 1 year ago

I have 3 V100 GPUs.

$ cd examples/serving/reddit
$ python prepare_data.py
$ python reddit_serving.py
root@df7d0cf2237d:~/share/torch-quiver/examples/serving/reddit# python prepare_data.py 
Downloading https://data.dgl.ai/dataset/reddit.zip
Extracting /root/share/torch-quiver/examples/serving/reddit/data/raw/reddit.zip
Processing...
Done!
neighbour_num file not exists
intermediate_path created
Traceback (most recent call last):
  File "/root/share/torch-quiver/examples/serving/reddit/prepare_data.py", line 41, in <module>
    quiver.generate_neighbour_num(node_num, edge_index, sizes, intermediate_path, parallel=True, mode='GPU', device_list=device_list, num_proc=4, reverse=reverse)
  File "/root/share/torch-quiver/srcs/python/quiver/generate_neighbour_num.py", line 27, in generate_neighbour_num
    parallel_generate_neighbour_num_GPU(node_num, edge_index, sizes, resultl_path, device_list, num_proc, reverse, sample)
  File "/root/share/torch-quiver/srcs/python/quiver/generate_neighbour_num.py", line 65, in parallel_generate_neighbour_num_GPU
    mp.spawn(
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/root/share/torch-quiver/srcs/python/quiver/generate_neighbour_num.py", line 38, in single_generate_neighbour_num
    device = rank%len(device_list)
ZeroDivisionError: integer division or modulo by zero

When I run this separately:

        device_list = [i for i in range(torch.cuda.device_count())]
        print(device_list)

it prints [0,1,2], but inside the program device_list is empty, [].

This seems to be because quiver cannot detect any GPU:

  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

LukeLIN-web commented 1 year ago

When I try to run python reddit_serving.py, it shows:

Process Process-30:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/share/torch-quiver/srcs/python/quiver/serving.py", line 120, in cpu_sampler_worker_loop
    os.sched_setaffinity(0, [2*(cpu_offset+rank)])
OSError: [Errno 22] Invalid argument
Process Process-35:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/share/torch-quiver/examples/serving/reddit/reddit_serving.py", line 53, in print_result
    os.sched_setaffinity(0, [2*(cpu_offset+rank)])
OSError: [Errno 22] Invalid argument
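
One possible reading of the OSError above: on Linux, `os.sched_setaffinity` fails with Errno 22 when the requested CPU is not one the process is allowed to use, so `2*(cpu_offset+rank)` may be pointing past the CPUs available in this container. A minimal sketch of a guard, assuming that is the cause; `pick_cpu` and `pin_current_process` are hypothetical helpers, not part of quiver:

```python
import os


def pick_cpu(cpu_offset, rank, allowed):
    """Choose a CPU id for this worker from the set of allowed CPU ids.

    Keeps the original 2*(cpu_offset+rank) choice when that CPU is allowed,
    otherwise wraps around the sorted list of allowed CPUs.
    """
    cpus = sorted(allowed)
    if not cpus:
        raise RuntimeError("no CPUs are available to this process")
    wanted = 2 * (cpu_offset + rank)
    if wanted in allowed:
        return wanted
    return cpus[wanted % len(cpus)]


def pin_current_process(cpu_offset, rank):
    """Pin the calling process to one allowed CPU (Linux only)."""
    allowed = os.sched_getaffinity(0)  # CPUs this process may run on
    cpu = pick_cpu(cpu_offset, rank, allowed)
    os.sched_setaffinity(0, [cpu])
    return cpu
```

With 4 allowed CPUs (0 to 3), `pick_cpu(0, 5, {0, 1, 2, 3})` wraps the out-of-range request for CPU 10 back onto CPU 2 instead of raising.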

LukeLIN-web commented 1 year ago

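This is because the example code sets os.environ['CUDA_VISIBLE_DEVICES'] = '6, 7'. My machine only has 3 GPUs (indices 0 to 2), so no GPU is visible to the program and device_list ends up empty.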
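
For anyone hitting the same error, here is a minimal sketch of a sanity check that would have surfaced the problem earlier. The helper names `check_visible_devices` and `pick_device` are hypothetical and not part of quiver, and the check only handles integer indices (not GPU UUIDs):

```python
def check_visible_devices(env_value, physical_count):
    """Parse a CUDA_VISIBLE_DEVICES-style string of integer indices.

    Returns (requested, missing), where `missing` lists the requested
    indices that do not exist on a machine with `physical_count` GPUs.
    """
    requested = [int(tok) for tok in env_value.split(",") if tok.strip()]
    missing = [i for i in requested if not 0 <= i < physical_count]
    return requested, missing


def pick_device(rank, device_list):
    """Same rank-to-device mapping as in the traceback, with a clear error."""
    if not device_list:
        raise RuntimeError(
            "device_list is empty: no GPU is visible, check CUDA_VISIBLE_DEVICES"
        )
    return device_list[rank % len(device_list)]


# The setting from this issue, on a machine with 3 GPUs:
requested, missing = check_visible_devices("6, 7", 3)
print(requested, missing)
```

In practice the fix is to set CUDA_VISIBLE_DEVICES to indices that exist on the machine (for 3 GPUs, something like '0,1,2') or to remove the assignment, before anything initializes CUDA.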