mlfoundations / open_clip

An open source implementation of CLIP.

Error loading ViT-L-14-quickgelu (metaclip_fullcc) model with version v2.27.0+ #966

Open aivarasbaranauskas opened 2 days ago

aivarasbaranauskas commented 2 days ago

Hello. I am getting a pickle.UnpicklingError exception when loading the ViT-L-14-quickgelu model (metaclip_fullcc pretrained) with open_clip version 2.27.0+.

Code that's throwing the exception:

self._model, _, _ = open_clip.create_model_and_transforms(
    'ViT-L-14-quickgelu',
    pretrained='metaclip_fullcc',
    cache_dir='/tmp/open_clip',
)

The exception:

[...]
    self._model, _, _ = open_clip.create_model_and_transforms(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/open_clip/factory.py", line 414, in create_model_and_transforms
    model = create_model(
            ^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/open_clip/factory.py", line 320, in create_model
    load_checkpoint(model, checkpoint_path)
  File "/app/.venv/lib/python3.12/site-packages/open_clip/factory.py", line 169, in load_checkpoint
    state_dict = load_state_dict(checkpoint_path, device=device, weights_only=weights_only)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/open_clip/factory.py", line 139, in load_state_dict
    checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=weights_only)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/torch/serialization.py", line 1359, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray.scalar was not an allowed global by default. Please use `torch.serialization.add_safe_globals([scalar])` to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

It seems the exception should be caught here: https://github.com/mlfoundations/open_clip/blob/185071e086e5950ea6ca6c25fe393d5d906aeefa/src/open_clip/factory.py#L138-L141 But pickle.UnpicklingError does not inherit from TypeError, so that except clause never fires.
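To illustrate the class-hierarchy point (stdlib only, no torch needed):

```python
import pickle

# pickle.UnpicklingError derives from pickle.PickleError -> Exception,
# not from TypeError, so an `except TypeError` clause around torch.load
# will not catch a weights_only load failure.
print(issubclass(pickle.UnpicklingError, TypeError))
print(issubclass(pickle.UnpicklingError, pickle.PickleError))
```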

rwightman commented 2 days ago

@aivarasbaranauskas Hmm, I wasn't expecting any of the default pretrained checkpoints to require pickle. I enabled that flag by default for safety, and that TypeError is supposed to catch old versions of torch that don't have the weights_only arg, not checkpoints that fail to load.

Thinking about this, whether to allow the pickle checkpoints or re-write them and replace with different ones on the HF hub... HMM

bryant1410 commented 2 days ago

I ran into this. This was my workaround:

import _codecs
import numpy as np
import torch

torch.serialization.add_safe_globals([np.dtype, np.dtypes.Float64DType,
                                      np.core.multiarray.scalar,  # noqa
                                      _codecs.encode])
rwightman commented 2 days ago

@bryant1410 thanks, obviously not great to need to do that, have you encountered any other checkpoints that break? I might just disable the weights_only=True default and stick with False for now...

bryant1410 commented 2 days ago

I have only tested the following groups of checkpoints: OpenAI's, Apple's DFN5B's, and MetaCLIP's. MetaCLIP checkpoints are the only ones that needed something like this.

I haven't looked into why MetaCLIP needs these. Maybe they can be easily removed (e.g., maybe they depend on NumPy for some arrays when they could just use PyTorch tensors instead). Still, I think it's somewhat reasonable to add these safe globals, as they seem common and safe in principle. I do wonder whether _codecs.encode could somehow be exploited, though even then this seems safer than weights_only=False.

We can keep a list of reasonable globals added, such as the ones here, and when issues pop up we can evaluate adding more. What do you think?
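The allowlist idea mirrors how restricted unpicklers work in general. A minimal stdlib sketch (the `ALLOWED_GLOBALS` set and `safe_loads` helper are hypothetical, not open_clip or torch API):

```python
import fractions
import io
import pickle

# Hypothetical allowlist, in the spirit of torch's safe-globals list.
ALLOWED_GLOBALS = {("builtins", "complex")}

class AllowlistUnpickler(pickle.Unpickler):
    """Only resolves globals that are explicitly allowlisted."""
    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"GLOBAL {module}.{name} is not an allowed global")

def safe_loads(data):
    return AllowlistUnpickler(io.BytesIO(data)).load()

# An allowlisted type round-trips fine.
print(safe_loads(pickle.dumps(complex(1, 2))))

# Anything else is rejected instead of being silently constructed.
try:
    safe_loads(pickle.dumps(fractions.Fraction(1, 3)))
except pickle.UnpicklingError as e:
    print("rejected:", e)
```

When an unsupported global shows up in a real checkpoint, the error names it, which is how the entries in the workaround above were collected.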

rwightman commented 2 days ago

@bryant1410 looks like _codecs.encode is on pytorch main now, so sure, I'll add those globals then

https://github.com/pytorch/pytorch/blob/a8b912f39d36bd2e6d204808d866439d0075f1a5/torch/_weights_only_unpickler.py#L161-L172

bryant1410 commented 2 days ago

Sounds good. Thanks.

BTW, not sure if the globals I added are the most optimal. E.g., maybe there's a single numpy import that could be added instead. Didn't try much. I just added them based on the error messages I was getting.

rwightman commented 1 day ago

@bryant1410 @aivarasbaranauskas I merged a fix for this, and added the bigG that was missing. Can someone confirm before I make another release?

bryant1410 commented 1 day ago

Is this enough of a test: https://colab.research.google.com/drive/1oHIkYiEGQIt8PNQa4u4b_IzbNc5n_b9O?usp=sharing (it worked there)?

I'm mostly using a private forked version of this library, which is not kept up-to-date with upstream, so not sure how to test it otherwise (please lmk).

rwightman commented 1 day ago

@bryant1410 thanks, I did some basic validation tests of my own too, just wanted another confirm

aivarasbaranauskas commented 1 day ago

Thanks for the quick fix. It works now when using numpy v1.*, but it still fails with numpy v2.*. One of the types that should be allowlisted was moved to a different namespace in numpy v2:

DeprecationWarning: numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.

I have tried adding numpy._core.multiarray.scalar to torch's serialization safe_globals, but that did not work either.
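That failure mode makes sense if the checkpoint was written under numpy 1.x: pickle records the dotted module path at dump time, so the stream references the old path. A loose stdlib analogy (module and class names here are hypothetical stand-ins):

```python
import pickle
import sys
import types

# Simulate a module rename like numpy.core -> numpy._core: the pickle
# stream names the module path that existed at dump time.
oldmod = types.ModuleType("legacy_core")

class Scalar:  # stand-in for numpy.core.multiarray.scalar
    pass

Scalar.__module__ = "legacy_core"
oldmod.Scalar = Scalar
sys.modules["legacy_core"] = oldmod

data = pickle.dumps(Scalar())  # stream now references "legacy_core.Scalar"

del sys.modules["legacy_core"]  # simulate the rename/removal
try:
    pickle.loads(data)
except ModuleNotFoundError as e:
    print("load fails:", e)
```

Allowlisting the object under its new path doesn't help, because the unpickler still has to resolve the old path baked into the checkpoint bytes.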

rwightman commented 1 day ago

oh dammit numpy, yeah the numpy 2.0 rename breaks that idea :( I'm not sure it's possible to work around, as pickle needs the matching namespace afaik.

So, I guess I do need to re-write some checkpoints and point to new locations if I want to keep the load safe.