Convolutional module which inherits from MessagePassing throws OSError when reloading from read-only disk

brainsqueeze commented 8 months ago

🐛 Describe the bug

I have a custom convolutional layer which inherits from torch_geometric.nn.MessagePassing. It is part of a GNN that was trained and is now being reloaded in an AWS Lambda serverless environment. I am not able to post a full code example because it is a work project, but the layer is initialized like this:

from typing import List, Optional, Union, Tuple

import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.nn.aggr import Aggregation

class MyConv(MessagePassing):
    propagate_type = {'x': Tuple[torch.Tensor, torch.Tensor]}

    def __init__(
        self,
        in_channels: Tuple[int, int],
        out_channels: int,
        dropout_rate: float = 0.0,
        aggregation: Optional[Union[str, List[str], Aggregation]] = "add",
        skip_connections: bool = False,
        normalize: bool = False
    ):
        # The OSError is raised inside this call (see the traceback below)
        super().__init__(aggr=aggregation)
        ...

Reloading this locally works just fine. However, reloading in an environment with read-only storage throws an OSError. Here is the stack trace from AWS CloudWatch:

LAMBDA_WARNING: Unhandled exception. The most likely cause is an issue in the function code. However, in rare cases, a Lambda runtime update can cause unexpected function behavior. For functions using managed runtimes, runtime updates can be triggered by a function change, or can be applied automatically. To determine if the runtime has been updated, check the runtime version in the INIT_START log entry. If this error correlates with a change in the runtime version, you may be able to mitigate this error by temporarily rolling back to the previous runtime version. For more information, see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
[ERROR] OSError: [Errno 30] Read-only file system: '/home/sbx_user1051'
Traceback (most recent call last):
  ...
  File "/var/task/candidgraph/nn/conv.py", line 41, in __init__
    super().__init__(aggr=aggregation)
  File "/var/lang/lib/python3.11/site-packages/torch_geometric/nn/conv/message_passing.py", line 170, in __init__
    module = module_from_template(
  File "/var/lang/lib/python3.11/site-packages/torch_geometric/template.py", line 27, in module_from_template
    os.makedirs(instance_dir, exist_ok=True)
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 225, in makedirs
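
The traceback shows os.makedirs failing inside module_from_template while MessagePassing.__init__ tries to create a directory under the (read-only) home directory. One possible workaround is to point PyG at a writable location before any MessagePassing subclass is constructed. The following is only a minimal sketch: it assumes that the installed torch_geometric version reads its cache location from a PYG_HOME environment variable with a default of ~/.cache/pyg, which should be verified against that version, and the helper names resolve_writable_dir and configure_pyg_home are hypothetical.

```python
import os
import tempfile
from typing import Optional


def resolve_writable_dir(preferred: str, fallback: Optional[str] = None) -> str:
    """Return `preferred` if it can be created and written to, else `fallback`.

    Hypothetical helper, not part of torch_geometric.
    """
    if fallback is None:
        # On AWS Lambda the temp directory (/tmp) is writable.
        fallback = os.path.join(tempfile.gettempdir(), "pyg_cache")
    try:
        os.makedirs(preferred, exist_ok=True)
        if os.access(preferred, os.W_OK):
            return preferred
    except OSError:
        # Covers "[Errno 30] Read-only file system" and similar failures.
        pass
    os.makedirs(fallback, exist_ok=True)
    return fallback


def configure_pyg_home() -> str:
    """Point PyG's cache at a writable directory.

    ASSUMPTION: torch_geometric honors the PYG_HOME environment variable
    and otherwise defaults to ~/.cache/pyg. Check this for your version.
    """
    preferred = os.path.expanduser(os.path.join("~", ".cache", "pyg"))
    cache_dir = resolve_writable_dir(preferred)
    os.environ["PYG_HOME"] = cache_dir
    return cache_dir
```

In a Lambda handler module, configure_pyg_home() would need to run before the model is instantiated, ideally before torch_geometric is imported. Falling back to the temp directory only when the default location is unusable keeps local behavior unchanged.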

Versions

Collecting environment information...
PyTorch version: 2.2.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.11.6 (main, Feb 28 2024, 17:45:57) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] (64-bit runtime)
Python platform: Linux-5.10.209-218.812.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
lscpu: failed to determine number of CPUs: /sys/devices/system/cpu/possible: No such file or directory

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0+cpu
[pip3] torch_geometric==2.5.0
[pip3] torch_scatter==2.1.2+pt22cpu
[pip3] torch_sparse==0.6.18+pt22cpu
[conda] Could not collect

rusty1s commented 8 months ago

Thanks for reporting. Let me try to fix this.

rusty1s commented 8 months ago

Will be fixed via https://github.com/pyg-team/pytorch_geometric/pull/9032, and part of torch-geometric==2.5.1 (soon).

pyg-team / pytorch_geometric

Convolutional module which inherits from MessagePassing throws OSError when reloading from read-only disk #9025
