The original version used AllenNLP (which causes some dependency issues), so I had to reimplement some layers.
Recently, I found a way to convert the weights back. I can upload the best large model to HF if you need it.
Thanks for your assistance!
So, the issue is that the current HF-based framework cannot correctly load a checkpoint from the previous AllenNLP version.
Could you share code (and requirements) that is compatible with the previously released checkpoint, or a model that is compatible with the current code?
Just feel free to share the resources at your convenience.
I have uploaded the large weights from the paper to HF under gliner_large. Can you test it and tell me if it gives the expected results? @yyDing1
It seems there are still some errors, and after updating the model, the results are as follows:
##############################################
step: 0
Table for zero-shot benchmark
CrossNER_AI : 50.4%
CrossNER_literature : 46.3%
CrossNER_music : 48.8%
CrossNER_politics : 47.3%
CrossNER_science : 46.1%
mit-movie : 48.9%
mit-restaurant : 36.2%
Average : 46.3%
##############################################
That is weird. Do you have the latest version of gliner installed?
Also, do you have the same behaviour on medium and small?
OK, you're right. There is a bug somewhere: I cannot reproduce the score with the new implementation, but I can with the old one. I will try to fix it.
The problem comes from the forward pass of the LSTM in AllenNLP, which is a bit different from what I have implemented.
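For reference, an illustrative sketch (not the actual GLiNER code, names are made up): AllenNLP's seq2seq wrappers pack the padded batch by sequence length before running the LSTM, so padded positions never influence the hidden states, whereas a plain nn.LSTM applied to the padded tensor does see the padding in the backward direction.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class MaskedBiLSTM(nn.Module):
    """Toy mask-aware BiLSTM: packs the padded batch before the LSTM,
    roughly what AllenNLP's Seq2Seq wrappers do internally."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim); mask: (batch, seq_len), 1 = real token
        lengths = mask.sum(dim=1).cpu()
        packed = pack_padded_sequence(x, lengths, batch_first=True,
                                      enforce_sorted=False)
        packed_out, _ = self.lstm(packed)
        out, _ = pad_packed_sequence(packed_out, batch_first=True,
                                     total_length=x.size(1))
        return out


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 5, 8)
    mask = torch.tensor([[1, 1, 1, 1, 1],
                         [1, 1, 1, 0, 0]])
    enc = MaskedBiLSTM(8, 16)
    packed_out = enc(x, mask)
    naive_out, _ = enc.lstm(x)  # same LSTM run directly on the padded batch
    # The backward direction of the shorter sequence differs once padding
    # is fed to the LSTM instead of being packed away.
    print((packed_out[1, :3] - naive_out[1, :3]).abs().max())
```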
Yes, refactoring is valuable, but it often comes with unexpected bugs.
Could you share with me the original allennlp code and the corresponding large version checkpoint, along with the requirements?
I want to test the performance of GLiNER under the OOD setting without the "misc" label.
here is the code:
the requirements:
allennlp==2.8.0
flair==0.11.3
transformers==4.29.1
torch==1.10.1+cu111
tqdm==4.62.3
seqeval
thank you for finding the bug
Thanks for your quick response and help; I have finally managed to reproduce the results reported in the paper.
##############################################
step: 0
Table for zero-shot benchmark
CrossNER_AI : 57.2%
CrossNER_literature : 64.4%
CrossNER_music : 69.6%
CrossNER_politics : 72.6%
CrossNER_science : 62.6%
mit-movie : 57.2%
mit-restaurant : 42.8%
Average : 60.9%
##############################################
Hi @yyDing1, I am curious how you did it, because the performance of the models on HF seems lower. Did you use the original script in Archive.zip?
I used the checkpoint released here; I found it in the historical commits of your GitHub repository, which I had followed before :)
I think I have found the problem. It was caused by a mismatch between flair versions, which leads to a different tokenizer indexation.
Could you try again to evaluate the weights on HF, please?
Hi, there still exist some bugs; the results of the HF version are as follows:
##############################################
step: 0
Table for all datasets except crossNER
Table for zero-shot benchmark
CrossNER_AI : 50.8%
CrossNER_literature : 46.2%
CrossNER_music : 50.0%
CrossNER_politics : 47.6%
CrossNER_science : 46.8%
mit-movie : 51.0%
mit-restaurant : 35.2%
Average : 46.8%
##############################################
I tried to debug and found some differences between the HF version and the allennlp version. The significant difference first appears in the output of the flair embedding (i.e., the deberta model); my flair version on the HF side is 0.13.1. Specifically,
the input_ids in HF version:
tensor([ 1, 128001, 7416, 507, 128001, 1957, 507, 128001, 995, ...], device='cuda:0')
the input_ids in allennlp version:
tensor([ 1, 128002, 7416, 128002, 1957, 128002, 995, ...], device='cuda:0')
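In case it helps to localise this, here is a small sketch (not part of the repo, and the backbone name is an assumption) for dumping how each environment's flair tokenizer indexes its special and added tokens. If the added vocabulary is shifted between the two environments, that would explain ids like 128001 vs 128002 above.

```python
import flair
import transformers
from flair.embeddings import TransformerWordEmbeddings

# Backbone name is an assumption; replace it with whatever your config uses.
emb = TransformerWordEmbeddings("microsoft/deberta-v3-large")
tok = emb.tokenizer

print("flair:", flair.__version__, "| transformers:", transformers.__version__)
print("vocab size:", len(tok))
print("special tokens:", list(zip(tok.all_special_tokens, tok.all_special_ids)))
# Tokens added on top of the base vocabulary get the highest ids, so a
# different number or order of added tokens shifts these ids.
print("added vocab:", sorted(tok.get_added_vocab().items(), key=lambda kv: kv[1]))
```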
Ok, maybe it is better to upload a completely new weight using the new version 🥺
Can you guys share a minimal script for reproduction? I have found the datasets used, but I can't seem to get get_for_all_path to run. The dataset file is called IE_INSTRUCTIONS.zip.
I first load the model and then call get_for_all_path for evaluation.
To import get_for_all_path, use the following command:
from gliner.modules.run_evaluation import get_for_all_path
To run the evaluation, execute:
get_for_all_path(model, 0, args.log_dir, "IE_INSTRUCTIONS/NER")
Note that you should first download the pre-trained checkpoint and modify the prev_path in the model configuration.
https://github.com/urchade/GLiNER/blob/9d11910b2f91adf2d574948159e5a8e676d8291c/config_large.yaml#L25
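For anyone arriving later, a minimal end-to-end sketch of the above using the current HF-style loading API; the Hub id, log directory, and device handling are placeholders rather than values from the original comment:

```python
from gliner import GLiNER
from gliner.modules.run_evaluation import get_for_all_path

# Placeholder Hub id; use whichever released checkpoint you want to test.
model = GLiNER.from_pretrained("urchade/gliner_large")
model = model.to("cuda").eval()  # drop .to("cuda") to run on CPU

# Arguments follow the call shown above: (model, step, log_dir, data_dir),
# where data_dir points at the NER folder of the unzipped IE_INSTRUCTIONS data.
get_for_all_path(model, 0, "logs", "IE_INSTRUCTIONS/NER")
```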
Hey @yyDing1 or @Shai-Meital, could you tell me where I can find the benchmark datasets in the correct format to run get_for_all_path?
Hi @th0rntwig, you can download the dataset here: https://drive.google.com/file/d/1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT/view
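If it helps, the file can also be fetched and unpacked programmatically (a sketch; gdown is an extra dependency, and the extraction layout is an assumption):

```python
import zipfile

import gdown  # pip install gdown

url = "https://drive.google.com/uc?id=1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT"
gdown.download(url, "IE_INSTRUCTIONS.zip", quiet=False)

# Assumed to unpack into the IE_INSTRUCTIONS/ folder that get_for_all_path
# points at in the snippet above.
with zipfile.ZipFile("IE_INSTRUCTIONS.zip") as zf:
    zf.extractall(".")
```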
Thank you :)
Ok, maybe it is better to upload a completely new weight using the new version 🥺
Hey @urchade did you update weights on HF for medium and large? We are unable to reproduce the published table.
For example: the published results say gliner medium gets 59.7 on literature in Table 1. Our attempts can only get 53 on literature for the hugging face published gliner medium. Generally this is the difference in performance that is observed across all datasets when we attempt to reproduce the benchmarks.
Hi @robcaulk, the result for GLiNER-M should be around 55-56 on OOD benchmark.
Unfortunately, the weights of the model published in the paper are not compatible with the current version, as it uses AllenNLP and an earlier version of PyTorch (see code here: https://github.com/urchade/GLiNER/issues/54#issuecomment-2043118319). I tried to convert the weights, but it results in other bugs causing large drops in performance (see earlier in this thread). The models on HF are a reproduction using the current codebase.
Could you please provide me with more information about your versions (hyperparameters, training data, ...)? @robcaulk
@urchade for the CrossNER dataset, we are using the latest huggingface medium model:
https://huggingface.co/urchade/gliner_medium-v1
With the latest GLiNER github sources. We are using the evaluation you posted in this repo, so the hyperparameters remain unchanged https://github.com/urchade/GLiNER/blob/main/gliner/modules/run_evaluation.py#L92
We are using the evaluation following this approach: https://github.com/urchade/GLiNER/issues/54#issuecomment-2059161074
And using the data you posted here: https://github.com/urchade/GLiNER/issues/54#issuecomment-2066742863
Here we (together with @th0rntwig) compare the Huggingface version to the results reported in the paper.
Oh, I see. Thank you for the evaluations! The models on HF are not the same as the ones reported in the paper but a reproduction, due to the problem I mentioned earlier. But from the table, it seems the medium version has quite bad performance for its size (this is likely due to a bad run). I will investigate that and upload a better model soon. For the large and small versions, the difference in results is not significant enough, so I will keep them as they are. As the earlier version of GLiNER is too difficult to use, I will update the table in the paper with the latest results (HF version).
I just trained a medium-sized GLiNER and got 54.6 on average:
It is now on HF @robcaulk
Thanks, we will use this as the comparison!
Hi, I'm having some difficulty reproducing the OOD results.
I load the gliner_large model released here and pass it to get_for_all_path for evaluation. However, my results are nearly 7% lower than the reported ones (60.9%) in the paper. Could the issue be that the released checkpoint is not compatible with the current code framework? What am I missing? Do you have any ideas?