urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Unable to load model in offline mode #123


apeuvotepf commented 4 months ago

Hello

I am unable to load a model in offline mode (i.e., from a local directory). Surprisingly, this works for urchade/gliner_multi but not for urchade/gliner_multi-v2.1. Other models have not been tested.
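
A minimal reproduction, following the call shown in the traceback below (the local directory path is hypothetical):

from gliner import GLiNER

# Loading from a local copy with no network access.
# Works for a local copy of urchade/gliner_multi; fails for urchade/gliner_multi-v2.1.
model = GLiNER.from_pretrained("/path/to/gliner_multi-v2.1", local_files_only=True)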

Error

The following error occurs:

Traceback (most recent call last):
  File "/home/users/apeuvot/GliNER/evaluate.py", line 105, in <module>
    model = load_model(options.model_path)
  File "/home/users/apeuvot/GliNER/evaluate.py", line 13, in load_model
    model = GLiNER.from_pretrained(path, local_files_only=True)
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/huggingface_hub/hub_mixin.py", line 420, in from_pretrained
    instance = cls._from_pretrained(
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/gliner/model.py", line 409, in _from_pretrained
    gliner = cls(config, tokenizer=tokenizer, encoder_from_pretrained=False, 
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/gliner/model.py", line 38, in __init__
    tokenizer = AutoTokenizer.from_pretrained(config.model_name, 
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 794, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1138, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/users/apeuvot/miniconda3/envs/gliner/lib/python3.9/site-packages/transformers/utils/hub.py", line 441, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like microsoft/mdeberta-v3-base is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
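
For context, the model's JSON config stores the backbone encoder as a Hub ID rather than a local path, which is why AutoTokenizer.from_pretrained attempts to reach the Hub. An illustrative excerpt (only the model_name field matters here):

{
  "model_name": "microsoft/mdeberta-v3-base",
  ...
}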

Temporary Fix

Below is a modified version of gliner/model.py that enables offline loading. The root cause is that the local_files_only argument was not propagated to all of the functions involved in loading the model. This solution may not be ideal, since the handling differs between models (config.model_name must be rewritten for urchade/gliner_multi-v2.1 but not for urchade/gliner_multi, whose config already stores a local path), but it fixes the issue temporarily.

Modified Code

The following code shows the modifications:

class GLiNER(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config: GLiNERConfig, 
                        model: Optional[Union[BaseModel, BaseORTModel]] = None,
                        tokenizer: Optional[Union[str, AutoTokenizer]] = None, 
                        words_splitter: Optional[Union[str, WordsSplitter]] = None, 
                        data_processor: Optional[Union[SpanProcessor, TokenProcessor]] = None, 
                        encoder_from_pretrained: bool = True,
+                        local_files_only: bool = False
                        ):
        super().__init__()
        self.config = config

        if tokenizer is None and data_processor is None:
-            tokenizer = AutoTokenizer.from_pretrained(config.model_name)
+            tokenizer = AutoTokenizer.from_pretrained(config.model_name, local_files_only=local_files_only)

        # Existing code...

    @classmethod
    def _from_pretrained(
            cls,
            *,
            model_id: str,
            revision: Optional[str],
            cache_dir: Optional[Union[str, Path]],
            force_download: bool,
            proxies: Optional[Dict],
            resume_download: bool,
            local_files_only: bool,
            token: Union[str, bool, None],
            map_location: str = "cpu",
            strict: bool = False,
            load_tokenizer: Optional[bool]=False,
            resize_token_embeddings: Optional[bool]=True,
            load_onnx_model: Optional[bool]=False,
            onnx_model_file: Optional[str] = 'model.onnx',
            compile_torch_model: Optional[bool] = False,
            **model_kwargs,
    ):
        # Existing code...

        if load_tokenizer:
-            tokenizer = AutoTokenizer.from_pretrained(model_dir)
+            tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=local_files_only)
        else:
            tokenizer = None
        config_ = json.load(open(config_file))
        config = GLiNERConfig(**config_)

+        if local_files_only and config.model_name in ["microsoft/mdeberta-v3-base", "microsoft/deberta-v3-large"]:
+            # For urchade/gliner_multi, config.model_name is already the local path.
+            config.model_name = os.path.dirname(os.path.dirname(model_dir)) + "/" + config.model_name

        add_tokens = ['[FLERT]', config.ent_token, config.sep_token]

        if not load_onnx_model:
-            gliner = cls(config, tokenizer=tokenizer, encoder_from_pretrained=False,
+            gliner = cls(config, tokenizer=tokenizer, encoder_from_pretrained=False,
+                         local_files_only=local_files_only)

        # Existing code...

        return gliner
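
Note that the config.model_name rewrite above assumes the backbone encoder has been saved two directory levels above the GLiNER model directory, mirroring the Hub IDs; this layout is an assumption of the patch, not documented behaviour. A minimal sketch of how the path resolves (paths are hypothetical):

# Hypothetical on-disk layout assumed by the os.path.dirname(...) rewrite:
#   /models/urchade/gliner_multi-v2.1/    <- model_dir resolved by _from_pretrained
#   /models/microsoft/mdeberta-v3-base/   <- backbone config.json and tokenizer files
import os

model_dir = "/models/urchade/gliner_multi-v2.1"
backbone = os.path.dirname(os.path.dirname(model_dir)) + "/" + "microsoft/mdeberta-v3-base"
print(backbone)  # -> /models/microsoft/mdeberta-v3-base

The patch also relies on os already being imported in gliner/model.py.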

Please consider addressing this issue in the next release to ensure better support for offline model loading.

Ingvarstep commented 4 months ago

Hi @apeuvotepf, we will definitely consider it for the next releases. Thank you for pointing out the importance of offline mode and for your proposed temporary fix. To make it more general we need to take more aspects of the current implementation into account, but it should be addressed in the next release.

moritzwilksch commented 4 months ago

xref #108