urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Reproduction questions #54

Closed · yyDing1 closed this issue 7 months ago

yyDing1 commented 7 months ago

Hi, I'm having some difficulty reproducing the OOD results.

I load the gliner_large model released here and pass it to get_for_all_path for evaluation. However, my results are nearly 7 points lower than the average reported in the paper (60.9%).

##############################################
step: 0
Table for zero-shot benchmark
CrossNER_AI         : 53.0%
CrossNER_literature : 57.7%
CrossNER_music      : 64.6%
CrossNER_politics   : 56.2%
CrossNER_science    : 58.0%
mit-movie           : 54.9%
mit-restaurant      : 33.5%
Average             : 54.0%
##############################################

Could the issue be that the released checkpoint is not compatible with the current code framework? What am I missing? Do you have any ideas?

urchade commented 7 months ago

The original version used AllenNLP (which causes some dependency issues), so I had to reimplement some layers.

Recently, I found a way to convert the weights back. I can upload the best large model to HF if you need it.
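
For readers curious what such a conversion involves: since the layers were reimplemented, the saved parameters have to be renamed to match the new module structure. A minimal sketch, assuming the two implementations differ only in parameter names (the file names and the mapping below are hypothetical, not the actual ones used):

import torch

# Hypothetical mapping from old (AllenNLP-era) parameter names to the
# reimplemented layer names; the real mapping depends on both codebases.
NAME_MAP = {
    "rnn._module.weight_ih_l0": "rnn.weight_ih_l0",
    "rnn._module.weight_hh_l0": "rnn.weight_hh_l0",
}

old_state = torch.load("old_allennlp_checkpoint.pt", map_location="cpu")
new_state = {NAME_MAP.get(name, name): tensor for name, tensor in old_state.items()}
torch.save(new_state, "converted_checkpoint.pt")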

yyDing1 commented 7 months ago

Thanks for your assistance!

So the issue is that a checkpoint from the previous allennlp version cannot be loaded correctly with the current HF framework.

Could you share code (and requirements) that is compatible with the previously released checkpoint, or a model that is compatible with the current code?

Just feel free to share the resources at your convenience.

urchade commented 7 months ago

I have uploaded the large weight from the paper to HF under gliner_large. Can you test it and tell me if it gives the expected results? @yyDing1

[image attachment]

yyDing1 commented 7 months ago

It seems there are still some issues; after updating the model, the results are as follows:

##############################################
step: 0
Table for zero-shot benchmark
CrossNER_AI         : 50.4%
CrossNER_literature : 46.3%
CrossNER_music      : 48.8%
CrossNER_politics   : 47.3%
CrossNER_science    : 46.1%
mit-movie           : 48.9%
mit-restaurant      : 36.2%
Average             : 46.3%
##############################################

urchade commented 7 months ago

That is weird. Do you have the latest version of gliner installed?

Also, do you have the same behaviour on medium and small?

urchade commented 7 months ago

Ok, you're right. There is a bug somewhere: I cannot reproduce the score with the new implementation, but I can with the old one. I will try to fix it.

[image attachment]

urchade commented 7 months ago

The problem comes from the forward pass of the LSTM in AllenNLP, which is a bit different from what I have implemented.
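
For context, AllenNLP's seq2seq encoder wrappers pack padded sequences before running the LSTM, while a plain nn.LSTM also consumes the padding time steps, so the two can disagree on padded batches (most visibly in the backward direction). A small illustration of that effect, not the actual GLiNER code:

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=8, batch_first=True, bidirectional=True)
x = torch.randn(2, 5, 8)          # batch of 2 sequences, padded to length 5
lengths = torch.tensor([5, 3])    # the second sequence has 2 padding positions

# Plain forward pass: the LSTM also runs over the padding time steps.
plain_out, _ = lstm(x)

# Packed forward pass (AllenNLP-style): padding is skipped entirely.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
packed_out, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=5)

# The outputs differ on the valid positions of the padded sequence.
print((plain_out[1, :3] - packed_out[1, :3]).abs().max())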

yyDing1 commented 7 months ago

Yes, refactoring is valuable, but it often comes with unexpected bugs.

Could you share with me the original allennlp code and the corresponding large version checkpoint, along with the requirements?

I want to test the performance of GLiNER under the OOD setting without the "misc" label.

urchade commented 7 months ago

Here is the code:

Archive.zip

the requirements:

allennlp==2.8.0
flair==0.11.3
transformers==4.29.1
torch==1.10.1+cu111
tqdm==4.62.3
seqeval
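
These can be dropped into a requirements.txt as-is; note that the torch pin carries a CUDA 11.1 local version tag, so it has to be installed from PyTorch's wheel index rather than plain PyPI, e.g.:

pip install allennlp==2.8.0 flair==0.11.3 transformers==4.29.1 tqdm==4.62.3 seqeval
pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html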

Thank you for finding the bug!

yyDing1 commented 7 months ago

Thanks for your quick response and help; I have finally managed to reproduce the results reported in the paper.

##############################################
step: 0
Table for zero-shot benchmark
CrossNER_AI         : 57.2%
CrossNER_literature : 64.4%
CrossNER_music      : 69.6%
CrossNER_politics   : 72.6%
CrossNER_science    : 62.6%
mit-movie           : 57.2%
mit-restaurant      : 42.8%
Average             : 60.9%
##############################################

urchade commented 7 months ago

Hi @yyDing1, I am curious how you did it, because the performance of the models on HF seems lower. Did you use the original script in Archive.zip?

yyDing1 commented 7 months ago

I used the checkpoint released here. I found it in the historical commits of your GitHub repository, which I had followed before :)

urchade commented 7 months ago

I think I have found the problem. It was caused by a mismatch between flair versions, which leads to different tokenizer indexing.

Could you please try again to evaluate the weights on HF?

yyDing1 commented 7 months ago

Hi, there still seem to be some bugs; the results of the HF version are as follows:

##############################################
step: 0
Table for all datasets except crossNER
Table for zero-shot benchmark
CrossNER_AI         : 50.8%
CrossNER_literature : 46.2%
CrossNER_music      : 50.0%
CrossNER_politics   : 47.6%
CrossNER_science    : 46.8%
mit-movie           : 51.0%
mit-restaurant      : 35.2%
Average             : 46.8%
##############################################

I tried to debug and found some differences between the HF version and the allennlp version.

The first significant divergence appears in the output of the flair embedding (i.e. the deberta model); my flair version on the HF side is 0.13.1. Specifically:

the input_ids in the HF version:

tensor([     1, 128001,   7416,    507, 128001,   1957,    507, 128001,    995, ...], device='cuda:0')

the input_ids in the allennlp version:

tensor([     1, 128002,   7416, 128002,   1957,  128002,    995, ...], device='cuda:0')
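
For reference, ids like 128001 and 128002 appear to sit just above the DeBERTa-v3 base vocabulary, so they correspond to tokens added on top of the backbone tokenizer; which id an added token receives depends on how many tokens the surrounding library has already registered, which is exactly the kind of thing a flair version change can shift. A quick way to inspect this, using illustrative token strings rather than whatever GLiNER actually adds:

from transformers import AutoTokenizer

# Backbone tokenizer of gliner_large (the model name here is an assumption).
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
print(len(tok))  # size of the base vocabulary; added tokens start right after

# Tokens registered on top of the base vocabulary get consecutive ids, so
# adding them in a different order, or adding an extra one, shifts every id.
tok.add_tokens(["[EXTRA_A]", "[EXTRA_B]"])
print(tok.convert_tokens_to_ids(["[EXTRA_A]", "[EXTRA_B]"]))
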
urchade commented 7 months ago

Ok, maybe it is better to upload a completely new weight using the new version 🥺

Shai-Meital commented 7 months ago

Can you guys share a minimal script for reproduction? I have found the datasets used, but I can't seem to get get_for_all_path to run. The dataset file is called IE_INSTRUCTIONS.zip.

yyDing1 commented 7 months ago

I first load the model, then call get_for_all_path for evaluation.

To import get_for_all_path, use the following command:

from gliner.modules.run_evaluation import get_for_all_path

To utilize it, execute:

get_for_all_path(model, 0, args.log_dir, "IE_INSTRUCTIONS/NER")

Note that you should first download the pre-trained checkpoint and modify the prev_path in the model configuration. https://github.com/urchade/GLiNER/blob/9d11910b2f91adf2d574948159e5a8e676d8291c/config_large.yaml#L25
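
Putting the pieces together, a minimal end-to-end evaluation script could look like the sketch below. It assumes the current HF codebase, where a released checkpoint can be loaded with GLiNER.from_pretrained (the Hub repo id is an assumption); the original AllenNLP-era checkpoint is instead picked up through prev_path in the config, as noted above.

import torch
from gliner import GLiNER
from gliner.modules.run_evaluation import get_for_all_path

# Load a released checkpoint from the Hugging Face Hub (repo id assumed).
model = GLiNER.from_pretrained("urchade/gliner_large-v1")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

# Evaluate on the zero-shot benchmark: current step, directory for the result
# log (assumed to exist), and the path to the unzipped IE_INSTRUCTIONS/NER data.
get_for_all_path(model, 0, "logs", "IE_INSTRUCTIONS/NER")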

th0rntwig commented 7 months ago

Hey @yyDing1 or @Shai-Meital, could you tell me where I can find the benchmark datasets in the correct format to run get_for_all_path?

urchade commented 7 months ago

Hi @th0rntwig, you can download the dataset here: https://drive.google.com/file/d/1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT/view

th0rntwig commented 7 months ago

Thank you :)

robcaulk commented 6 months ago

> Ok, maybe it is better to upload a completely new weight using the new version 🥺

Hey @urchade did you update weights on HF for medium and large? We are unable to reproduce the published table.

For example, Table 1 in the paper reports 59.7 on literature for gliner medium, but our attempts only reach 53 on literature with the gliner medium published on Hugging Face. Roughly this gap in performance is what we observe across all datasets when we attempt to reproduce the benchmarks.

urchade commented 6 months ago

Hi @robcaulk, the result for GLiNER-M should be around 55-56 on the OOD benchmark.

Unfortunately, the weights of the model published in the paper are not compatible with the current version, as it uses AllenNLP and an earlier version of PyTorch (see code here: https://github.com/urchade/GLiNER/issues/54#issuecomment-2043118319). I tried to convert the weights, but this introduced other bugs that cause large drops in performance (see earlier in this thread). The models on HF are reproductions using the current codebase.

[image attachment]

urchade commented 6 months ago

Could you please provide more information about your setup (versions, hyperparameters, training data, ...)? @robcaulk

robcaulk commented 6 months ago

@urchade for the CrossNER dataset, we are using the latest huggingface medium model:

https://huggingface.co/urchade/gliner_medium-v1

With the latest GLiNER GitHub sources. We are using the evaluation you posted in this repo, so the hyperparameters remain unchanged: https://github.com/urchade/GLiNER/blob/main/gliner/modules/run_evaluation.py#L92

We are using the evaluation following this approach: https://github.com/urchade/GLiNER/issues/54#issuecomment-2059161074

And using the data you posted here: https://github.com/urchade/GLiNER/issues/54#issuecomment-2066742863

Here we (with @th0rntwig) compare the Hugging Face version to the results reported in the paper:

[image: comparison table]

urchade commented 6 months ago

Oh, I see. Thank you for the evaluations! The models on HF are not the same as the ones reported in the paper but a reproduction, due to the problem I mentioned earlier. But from the table, it seems like the Medium version has quite bad performance for its size (this is likely due to a bad run). I will investigate that and upload a better model soon. For the large and small versions, the difference in results is not significant enough, so I will keep them as they are. As the earlier version of GLiNER is too difficult to use, I will update the table in the paper with the latest results (HF version).

urchade commented 6 months ago

I just trained a medium-sized GLiNER and got 54.6 on average.

It is now on HF @robcaulk

robcaulk commented 6 months ago

Thanks, we will use this as the comparison!