urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0
1.48k stars 127 forks

Network connection required? #108

Closed (davidress-ILW closed 5 months ago)

davidress-ILW commented 6 months ago

I am running version 0.2.2 of GLiNER from a Conda prompt, and it appears that a network connection is required to load the model, even if it has been loaded before. The following code is from a simple Python script that loads the pretrained medium model. It runs fine when there is a network connection, but otherwise fails in urllib3\connection.py on line 203 with a socket.gaierror.

from gliner import GLiNER

model_name = "medium"
model = GLiNER.from_pretrained(f"urchade/gliner_{model_name}-v2.1")

Is there no way to load the model from disk if it has been downloaded before? Is the model not saved after it is downloaded? If it is saved after downloading, where is it stored in the environment?

Thanks!
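
On the "where is it saved" question: the model comes from the Hugging Face Hub, and as far as I know recent versions of huggingface_hub keep downloads in a cache directory chosen from the environment variables HF_HUB_CACHE, HF_HOME and XDG_CACHE_HOME, falling back to ~/.cache/huggingface/hub. The sketch below mimics that lookup with the standard library only. The helper names are hypothetical and this is an approximation of the library's behaviour, not its actual code, so older versions or custom setups may differ.

```python
import os
from pathlib import Path


def default_hub_cache(env=None):
    """Approximate the directory where huggingface_hub caches downloads.

    Assumed lookup order: HF_HUB_CACHE, then HF_HOME/hub, then
    XDG_CACHE_HOME/huggingface/hub, then ~/.cache/huggingface/hub.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    cache_root = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache_root).expanduser() / "huggingface" / "hub"


def repo_cache_dir(repo_id, env=None):
    """Cached model repos sit in folders named models--<org>--<name>."""
    return default_hub_cache(env) / ("models--" + repo_id.replace("/", "--"))


# Prints the folder where the medium model would be expected after a download.
print(repo_cache_dir("urchade/gliner_medium-v2.1"))
```

If that folder exists and contains snapshots, the weights were saved; the failure reported here would then come from a later lookup rather than from the model not being stored.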

moritzwilksch commented 5 months ago

I'm experiencing the same issue. gliner seems to always try to download the deberta model from the HF hub. I came up with this hack to circumvent it: download the gliner and deberta models manually, then use this code to load them:

from gliner import GLiNER
import torch
from dataclasses import dataclass

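@dataclass
class HackedConfig:
    # All these attributes are copy-pasted from the gliner_config.json file.
    # They carry no type annotations, so they are plain class attributes
    # rather than dataclass fields; attribute access still works.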
    model_name = "/home/path/to/models/microsoft--deberta-v3-large/"
    name = "token level gliner large"
    max_width = 100
    hidden_size = 512
    dropout = 0.1
    fine_tune = True
    subtoken_pooling = "first"
    span_mode = "token_level"
    num_steps = 6000
    train_batch_size = 8
    eval_every = 1000
    warmup_ratio = 0.1
    scheduler_type = "linear"
    loss_alpha = 0.75
    loss_gamma = 0
    loss_reduction = "sum"
    lr_encoder = "5e-6"
    lr_others = "7e-6"
    weight_decay_encoder = 0.01
    weight_decay_other = 0.01
    root_dir = "gliner_logs"
    train_data = "../../data/unie_ner1.json"
    val_data_dir = "none"
    prev_path = "logs_large/model_120000"
    save_total_limit = 10
    size_sup = -1
    max_types = 30
    shuffle_types = True
    random_drop = True
    max_neg_type_ratio = 1
    max_len = 768
    freeze_token_rep = False
    log_dir = "logs_final"

model = GLiNER(HackedConfig())
model.load_state_dict(
    torch.load(
        "/home/path/to/models/models--knowledgator--gliner-multitask-large-v0.5/pytorch_model.bin",
        map_location=torch.device("cpu"),  # for the gpu poor
    )
)

moritzwilksch commented 5 months ago

Actually, the easier version is editing the gliner_config.json file in the model directory and changing model_name to point to the correct deberta model 🎉
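
For anyone who would rather script that edit than do it by hand, here is a minimal sketch using only the standard library. The function name is hypothetical, and the only thing it assumes about gliner_config.json is that it is a JSON object with a "model_name" key, as described above. All other keys are written back unchanged.

```python
import json
from pathlib import Path


def point_config_at_local_encoder(config_path, encoder_dir):
    """Set "model_name" in a gliner_config.json to a local directory.

    Returns the previous value so the change can be undone.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    previous = config.get("model_name")
    config["model_name"] = str(encoder_dir)
    # Write the whole config back; every other key keeps its value.
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

Call it with the path to your own gliner_config.json and the directory holding the manually downloaded deberta model. Keeping the returned value around makes it easy to restore the original Hub identifier later.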

davidress-ILW commented 5 months ago

@moritzwilksch - thank you! That worked great!

rolisz commented 5 months ago

Can we get a proper fix for this? You shouldn't have to hack around with editing the gliner_config.json file.

urchade commented 5 months ago

The next version should fix that, @rolisz.

conraddonau-kfst commented 4 months ago

@moritzwilksch @davidress-ILW could you elaborate on how exactly you changed the config file to allow offline loading? I've tried about 100 different things with no success so far.

moritzwilksch commented 4 months ago

What does your config look like? You should locate the file, for example gliner-multitask-large-v0.5/gliner_config.json, open it with a text editor, and then set "model_name" to the path where your deberta model resides, e.g. "model_name": "/tmp/microsoft--deberta-v3-large/"

conraddonau-kfst commented 4 months ago

@moritzwilksch do you mean the path to the mdeberta model directory that contains the refs, snapshots, and blobs folders? Or should I download the model in some other way?

moritzwilksch commented 4 months ago

Ahh, you're right. You should be able to use the gliner_config.json in gliner-multitask-large-v0.5/snapshots/some-long-hash/ and have it point to the mdeberta directory that contains the refs, snapshots, and blobs directories.
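
Since the config can sit either at the top of the model directory or inside snapshots/some-long-hash/, a small helper can list every copy so none is missed when editing. This is only a sketch: the function name is hypothetical, and the layout it searches is assumed from the description in this thread.

```python
from pathlib import Path


def find_gliner_configs(model_dir):
    """List gliner_config.json files at the top level and under snapshots/<hash>/."""
    model_dir = Path(model_dir)
    candidates = [model_dir / "gliner_config.json"]
    # One snapshot folder per revision hash; sorted for a stable order.
    candidates += sorted(model_dir.glob("snapshots/*/gliner_config.json"))
    return [path for path in candidates if path.is_file()]
```

If more than one file comes back, it is probably safest to apply the same "model_name" change to each of them.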

moritzwilksch commented 4 months ago

This is what my directory layout looks like on an offline box:

➜  ~ tree /tmp/gliner-multitask-large-v0.5/
/tmp/gliner-multitask-large-v0.5/
├── gliner_config.json
├── gliner_multitask_performance.png
├── pytorch_model.bin
├── README.md
├── refs
│   └── main
└── snapshots
    └── 18e0e4330dedf2a60dd863f098e8d5322bb2256a

4 directories, 5 files
➜  ~ tree /tmp/microsoft--deberta-v3-large/
/tmp/microsoft--deberta-v3-large/
├── config.json
├── generator_config.json
├── pytorch_model.bin
├── pytorch_model.generator.bin
├── README.md
├── refs
│   └── main
├── snapshots
│   └── 64a8c8eab3e352a784c658aef62be1662607476f
├── spm.model
├── tf_model.h5
└── tokenizer_config.json

4 directories, 9 files
➜  ~ cat /tmp/gliner-multitask-large-v0.5/gliner_config.json
{
  "model_name": "/tmp/microsoft--deberta-v3-large/",
...
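
A quick sanity check before going offline can save some trial and error. The sketch below reads a gliner_config.json and reports whether "model_name" points at an existing local directory that contains a config.json, as in the deberta listing above. The function name is hypothetical, and the config.json check is an assumption based on that listing; it does not prove that loading will succeed.

```python
import json
from pathlib import Path


def check_offline_config(config_path):
    """Return a list of problems that would likely force a network lookup."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    model_name = config.get("model_name")
    if not model_name:
        return ['"model_name" is missing from the config']
    encoder_dir = Path(model_name)
    if not encoder_dir.is_dir():
        # A Hub identifier such as "microsoft/deberta-v3-large" ends up here.
        return [f'"model_name" is not a local directory: {model_name}']
    if not (encoder_dir / "config.json").is_file():
        return [f"no config.json found in {encoder_dir}"]
    return []
```

An empty list means the path at least looks like a local model directory; anything else names the first thing to fix.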