thuml / depyf

depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
https://depyf.readthedocs.io
MIT License

[Bug]: using nn module stack for file name leads to too-long file names in some cases #72

Closed: gilfree closed this issue 1 day ago

gilfree commented 1 week ago

Your current environment

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS
Clang version: Could not collect

Python version: 3.10.6
Python platform: Linux
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB

Nvidia driver version: 550.120
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
....

Versions of relevant libraries:
[pip3] mypy==1.13.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.1.3
[pip3] pytorch-lightning==2.4.0
[pip3] pytorch-triton==3.1.0+cf34004b8a
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.0.dev20241105+cu121
[pip3] torchmetrics==1.5.1
[pip3] torchvision==0.20.1
[pip3] triton==3.1.0
[conda] No relevant packages

🐛 Describe the bug

It seems you are patching the lazy_format_graph_code method and using the name passed to it to compose a file name. Since that name is based on the model's nn_module_stack, in some cases it can become very long, which causes an OSError. The name is generated by this function: first_call_function_nn_module_stack

This is the problematic line:

https://github.com/thuml/depyf/blob/ee7d231482ff877aa33b02ca2ae7390365572072/depyf/explain/patched_lazy_format_graph_code.py#L39

Truncating the file name to 255 characters would probably do, or os.pathconf(filepath, 'PC_NAME_MAX') could be used to determine the limit.
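
A minimal sketch of such a truncation helper (the function name and parameters are illustrative, not depyf's API):

import os

def clamp_file_name(directory: str, name: str, suffix: str = ".py") -> str:
    # Hypothetical helper: clamp a generated file name to the filesystem's
    # limit, falling back to 255 bytes when the limit cannot be queried
    # (e.g. on platforms without os.pathconf).
    try:
        max_len = os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        max_len = 255  # common default on Linux filesystems
    if len(name) + len(suffix) > max_len:
        name = name[: max_len - len(suffix)]
    return name + suffix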

youkaichao commented 1 week ago

can you give an example with code?

gilfree commented 1 week ago

OK, it's a bit more involved: the method you are patching is also used by torch.export, and there the name is based on nn_module_stack.

I am exporting and compiling the model, and the export also ran under depyf's prepare_debug context, something like the code below, which is probably not something you intended. So if you prefer to close this as not supported, I'm fine with that, but it would be nice if it worked, since it would also allow debugging exports.

import torch
import depyf


class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True),
            num_layers=6,
        )

    def forward(self, x):
        return self.encoder(x)


class WrappedModel(torch.nn.Module):
    def __init__(self):
        super(WrappedModel, self).__init__()
        self.model = MyModel()

    def forward(self, x):
        return self.model(x)


class WrappedModel2(torch.nn.Module):
    def __init__(self):
        super(WrappedModel2, self).__init__()
        self.model = WrappedModel()

    def forward(self, x):
        return self.model(x)


model = WrappedModel2()
x = torch.randn(1, 10, 8)

# Both the compile and the export run under the same prepare_debug context,
# so torch.export also goes through the patched lazy_format_graph_code.
with depyf.prepare_debug('depyf'):
    model2 = torch.compile(model, fullgraph=True)
    model2(x)
    exported = torch.export.export(model, (x,))
    model = exported.module()
youkaichao commented 1 week ago

Can you please open a PR to address it?

If the name is too long, I don't know how to truncate it properly. E.g., do you want to keep the suffix of the file name? Which part should be truncated?
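
For illustration only (not part of depyf, names made up), one possible scheme is to keep the head of the name and its suffix and replace the removed middle with a short hash, so distinct long names still map to distinct files:

import hashlib

def shorten_name(name: str, max_len: int = 255) -> str:
    # Illustrative scheme: preserve the start of the name and its extension,
    # and stand in a short hash for the truncated middle.
    if "." in name:
        stem, ext = name.rsplit(".", 1)
        suffix = "." + ext
    else:
        stem, suffix = name, ""
    if len(stem) + len(suffix) <= max_len:
        return stem + suffix
    digest = hashlib.sha1(stem.encode()).hexdigest()[:8]
    keep = max_len - len(suffix) - len(digest) - 1
    return f"{stem[:keep]}_{digest}{suffix}"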