Open · kkjh0723 opened this issue 11 months ago
@kkjh0723 I think it might break with gradient checkpointing? Not sure there is a workaround; possibly using non-reentrant mode?
I got the same error trying to run both `--grad-checkpointing` and `--torchcompile`, but since PyTorch 2.1.0 `--torchcompile` now works with `--accum-freq > 1` as the next best option.
@EIFY did you try forcing non-reentrant checkpointing? Could look to change the default if that works...
@rwightman No I haven't tried that.
In that regard, the good news is that https://github.com/pytorch/pytorch/issues/79887, which affected https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/open_clip/transformer.py#L320-L322, is now fixed, so we should be able to do e.g.

```python
if self.grad_checkpointing and not torch.jit.is_scripting():
    x = checkpoint(r, x, None, None, attn_mask, use_reentrant=False)
```
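As a minimal, self-contained illustration of what `use_reentrant=False` does (a toy module, not the open_clip block): the non-reentrant implementation recomputes activations in backward without the reentrant autograd machinery, and composes better with newer autograd features.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual block standing in for a transformer resblock."""
    def __init__(self, dim=8):
        super().__init__()
        self.lin = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.lin(x))

block = Block()
x = torch.randn(2, 8, requires_grad=True)
# use_reentrant=False selects the non-reentrant checkpoint implementation;
# activations are discarded in forward and recomputed during backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([2, 8])
```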
The bad news is that, other than that, `grad_checkpointing` is either delegated to the vision/text trunks without argument support:
https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/open_clip/model.py#L260-L263
or not supported at all:
https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/open_clip/modified_resnet.py#L161-L164
So fairly involved changes would be necessary. I will try doing the easy part and see if it at least gets past that when I get a chance.
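For illustration only, the kind of plumbing implied here might look like the following sketch. The extra `use_reentrant` parameter and the attribute names are hypothetical, not the actual open_clip API (which only takes `enable`):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Trunk(torch.nn.Module):
    """Toy vision/text trunk; sketches forwarding a use_reentrant flag
    through set_grad_checkpointing (hypothetical extension)."""
    def __init__(self, dim=8, depth=2):
        super().__init__()
        self.resblocks = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(depth)
        )
        self.grad_checkpointing = False
        self.use_reentrant = True

    def set_grad_checkpointing(self, enable=True, use_reentrant=False):
        # hypothetical signature: the real method takes only `enable`
        self.grad_checkpointing = enable
        self.use_reentrant = use_reentrant

    def forward(self, x):
        for r in self.resblocks:
            if self.grad_checkpointing and not torch.jit.is_scripting():
                x = checkpoint(r, x, use_reentrant=self.use_reentrant)
            else:
                x = r(x)
        return x

trunk = Trunk()
trunk.set_grad_checkpointing(True, use_reentrant=False)
y = trunk(torch.randn(2, 8, requires_grad=True))
```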
@rwightman OK, so it turned out that `use_reentrant=False` doesn't help. It still breaks at the same point:

```
[2023-11-08 12:56:29,383] [0/0] torch._utils_internal: [INFO] CompilationMetrics(frame_key='1', co_name='forward', co_filename='/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/model.py', co_firstlineno=256, cache_size=0, guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, fail_reason="'NNModuleVariable' object has no attribute 'get_name'")
Traceback (most recent call last):
(...)
torch._dynamo.exc.InternalTorchDynamoError: 'NNModuleVariable' object has no attribute 'get_name'

from user code:
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/model.py", line 274, in forward
    image_features = dim_scale_img * self.encode_image(image, normalize=self.normalize) if image is not None else None
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/model.py", line 239, in encode_image
    features = self.visual(image)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/transformer.py", line 486, in forward
    x = self.transformer(x)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jason-chou/.local/lib/python3.10/site-packages/open_clip/transformer.py", line 319, in forward
    x = checkpoint(r, x, None, None, attn_mask, use_reentrant=False)
```
Is there any update on this? I am facing the same issue.
Hello,

While attempting to use the `--torchcompile` option to train a CLIP ViT-B-32 model, I got an error. Below is the script to run training.

And I got the below error message. How can I fix this issue? Note that my PyTorch version is 2.1.0, and no error occurs when I run the above script without the `--torchcompile` option.