pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

IndexError when initializing torch_geometric.nn.Sequential during multiprocessing #9371

Open nehSgnaiL opened 1 month ago

nehSgnaiL commented 1 month ago

🐛 Describe the bug

Problem description:

Hello,

I recently encountered an IndexError when initializing torch_geometric.nn.Sequential in a multiprocessing environment. My suspicion is that, due to the shared nature of multiprocessing, the ID of the Sequential module may be the same across multiple processes, leading to conflicts and incorrect indexing.

I would greatly appreciate any suggestions on how to address this issue. :)

Code to reproduce:


def construct_model(args):
    import torch_geometric.nn as gnn

    gcn_type = 'chebconv'
    input_args = 'x, edge_index, edge_weight'
    channel_list = [[32, 16, 1], [32, 1]]
    for t in range(100):
        gcn_layers = []  # note: accumulates across the channel_list loop below
        for channel in channel_list:
            for i in range(len(channel) - 1):
                _gcn = gnn.ChebConv(in_channels=channel[i], out_channels=channel[i + 1], K=3)
                gcn_layers.append(
                    (_gcn, 'x, edge_index, edge_weight -> x')
                )
            # first pass builds a 2-module Sequential, second pass a 3-module one
            built_layers = gnn.Sequential(input_args=input_args, modules=gcn_layers)
            print(channel, gcn_layers, built_layers)  # repr() triggers the IndexError

if __name__ == '__main__':
    import torch.multiprocessing as mp
    from tqdm import tqdm

    num_partitions = 100
    num_processes = 2
    ctx = mp.get_context("spawn")
    with ctx.Pool(num_processes) as pool:
        with tqdm(total=num_partitions) as pbar:
            for i, res in enumerate(pool.imap_unordered(construct_model, [1 for i in range(num_partitions)])):  # set up the pool
                pbar.update()
            # pool.apply_async(run_model, args=(args,))
        pbar.close()
        pool.close()  # close the pool
        pool.join()  # join the pool

Error message:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/miniforge3/envs/deeplearning/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/test.py", line 19, in construct_model
    print(channel, gcn_layers, built_layers)
  File "/tmp/torch_geometric.nn.sequential_8f1555_pbw5o1ex.py", line 645, in __repr__
    module_reprs = [
  File "/tmp/torch_geometric.nn.sequential_8f1555_pbw5o1ex.py", line 646, in <listcomp>
    f'  ({i}) - {self[i]}: {self._module_descs[i]}'
  File "/tmp/torch_geometric.nn.sequential_8f1555_pbw5o1ex.py", line 639, in __getitem__
    return getattr(self, self._module_names[idx])
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/test.py", line 31, in <module>
    for i, res in enumerate(pool.imap_unordered(construct_model, [1 for i in range(num_partitions)])):  # set up the pool
  File "/opt/miniforge3/envs/deeplearning/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
IndexError: list index out of range

Versions

PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Anaconda gcc) 11.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.9.19 (main, Mar 21 2024, 17:11:28)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-1010-nvidia-lowlatency-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.1
[pip3] torch_geometric==2.5.2
[pip3] triton==2.2.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] mkl-service               2.4.0            py39h5eee18b_1
[conda] mkl_fft                   1.3.8            py39h5eee18b_0
[conda] mkl_random                1.2.4            py39hdb19cb5_0
[conda] numpy                     1.26.4           py39h5f9d8c6_0
[conda] numpy-base                1.26.4           py39hb5e798b_0
[conda] pyg                       2.5.2           py39_torch_2.2.0_cu121    pyg
[conda] pytorch                   2.2.1           py3.9_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchtriton               2.2.0                      py39    pytorch
devanshamin commented 1 month ago

Potential cause

I suspect the issue is caused by loading the wrong Sequential module, likely due to the random uid used to generate the /tmp/torch_geometric.nn.sequential_{uid}.py file.

Explanation

In your construct_model function, you create Sequential modules with 2 and then 3 layers, since gcn_layers accumulates across the channel_list loop. If the current Sequential module is,

Sequential(
  (0) - ChebConv(32, 16, K=3, normalization=sym): x, edge_index, edge_weight -> x
  (1) - ChebConv(16, 1, K=3, normalization=sym): x, edge_index, edge_weight -> x
)

and the /tmp/torch_geometric.nn.sequential_{uid}.py file loaded via module_from_template contains a Sequential module,

Sequential(
  (0) - ChebConv(32, 16, K=3, normalization=sym): x, edge_index, edge_weight -> x
  (1) - ChebConv(16, 1, K=3, normalization=sym): x, edge_index, edge_weight -> x
  (2) - ChebConv(32, 1, K=3, normalization=sym): x, edge_index, edge_weight -> x
)

then you might get an IndexError when executing the print statement, since the loaded Sequential module has a len(self) of 3 but _module_names contains only 2 values.
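
Here is a minimal sketch of that failure mode (illustrative only, not PyG's actual generated code): a class whose __repr__ was generated for three modules is paired with an instance that only registered two module names, which reproduces the IndexError from the traceback above.

class GeneratedSequential:
    # Hypothetical stand-in for the class emitted into
    # /tmp/torch_geometric.nn.sequential_{uid}.py; all names are illustrative.
    NUM_MODULES = 3  # the cached class was generated for a 3-module Sequential

    def __init__(self, module_names, module_descs):
        self._module_names = module_names  # per-instance state: only 2 entries
        self._module_descs = module_descs

    def __getitem__(self, idx):
        return self._module_names[idx]  # fails once idx exceeds the list

    def __repr__(self):
        module_reprs = [
            f'  ({i}) - {self[i]}: {self._module_descs[i]}'
            for i in range(self.NUM_MODULES)  # iterates 3 slots, not 2
        ]
        return 'Sequential(\n' + '\n'.join(module_reprs) + '\n)'

seq = GeneratedSequential(['conv1', 'conv2'], ['x -> x', 'x -> x'])
print(seq)  # IndexError: list index out of range, as in the traceback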

nehSgnaiL commented 1 month ago

Thanks for your hint.

In my case, torch_geometric.nn.Sequential can get mixed up and raise this error whenever many instances are created, with or without multiprocessing, because I set the same random seed at the beginning of my function, so the generated uids can collide.

Right now, my temporary workaround is to use torch.nn.ModuleDict instead of torch_geometric.nn.Sequential and to make some adjustments to my code.
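
For reference, a minimal sketch of that workaround, assuming the layer shapes from the reproduction script above; the ChebStack name and structure are illustrative, not from the issue.

import torch.nn as nn
import torch_geometric.nn as gnn

class ChebStack(nn.Module):
    # Hypothetical wrapper: hold the ChebConv layers in a ModuleDict and chain
    # them manually, avoiding gnn.Sequential's generated template module.
    def __init__(self, channels, K=3):
        super().__init__()
        self.convs = nn.ModuleDict({
            f'conv{i}': gnn.ChebConv(channels[i], channels[i + 1], K=K)
            for i in range(len(channels) - 1)
        })

    def forward(self, x, edge_index, edge_weight=None):
        for conv in self.convs.values():  # insertion order is preserved
            x = conv(x, edge_index, edge_weight)
        return x

model = ChebStack([32, 16, 1])  # mirrors the first entry of channel_list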

devanshamin commented 1 month ago

Your original code should work fine if you install PyG from source (pip install git+https://github.com/pyg-team/pytorch_geometric.git). The recent PyG 2.6.0 development version contains a refactored Sequential module.

Take a look at #9369.
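
After installing from source, you can confirm that the refactored version is active:

import torch_geometric
print(torch_geometric.__version__)  # expect a 2.6.0 development build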