Regarding the issue of fine-tuning on a specific domain

urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024

https://arxiv.org/abs/2311.08526

Apache License 2.0

1.48k stars, 127 forks

Regarding the issue of fine-tuning on a specific domain #64

Closed: QuangTQV closed this issue 5 months ago

QuangTQV commented 7 months ago

Dear author, does examples/finetune.ipynb already include negative entity sampling? If not, how can we adjust it to incorporate negative entity sampling?

urchade commented 7 months ago

It already includes in-batch negative sampling.
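
For readers unfamiliar with the term, the general idea of in-batch negative sampling can be sketched as follows: entity types that appear elsewhere in the same batch, but not in a given example, are offered to that example as negative labels. This is only an illustration of the concept, not GLiNER's actual implementation; the function name `in_batch_negatives`, the `max_negatives` parameter, and the `[start, end, label]` span layout are assumptions made for this sketch.

```python
import random


def in_batch_negatives(batch, max_negatives=2, seed=0):
    """Pick negative entity types for each example from the rest of the batch.

    Hypothetical helper for illustration only; not taken from the GLiNER code.
    Each example is assumed to carry a "ner" list of [start, end, label] spans.
    """
    rng = random.Random(seed)
    # Every entity type that occurs anywhere in this batch.
    batch_types = sorted({label for ex in batch for _, _, label in ex["ner"]})
    results = []
    for ex in batch:
        positives = sorted({label for _, _, label in ex["ner"]})
        # Types seen in the batch but absent from this example act as negatives.
        candidates = [t for t in batch_types if t not in positives]
        negatives = rng.sample(candidates, min(max_negatives, len(candidates)))
        results.append({"positives": positives, "negatives": negatives})
    return results


batch = [
    {"ner": [[0, 0, "token"]]},
    {"ner": [[1, 2, "exchange"], [4, 4, "wallet"]]},
    {"ner": [[3, 3, "token"], [5, 5, "protocol"]]},
]
sampled = in_batch_negatives(batch)
```

Each entry of `sampled` lists the labels an example really contains alongside a few labels it does not contain, so the model is also trained to predict "no entity" for those types.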

QuangTQV commented 7 months ago

thanks ^^

QuangTQV commented 7 months ago

How can I bias the GLiNER model toward my specific domain data? My data domain is easily confused with other domains. For example, "harryporter price" is a question about a cryptocurrency price, but the model could mistakenly interpret it as a book or something else.

urchade commented 7 months ago

The solution is to fine-tune the model on your specialized domain. You can, for instance, generate synthetic data for that.
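
As a rough sketch of what one synthetic training record might look like, the snippet below builds a token-level example for the "harryporter price" case from this thread. It assumes the fine-tuning notebook expects records with a `tokenized_text` list and a `ner` list of `[start_token, end_token, label]` spans with an inclusive end index; check this against examples/finetune.ipynb before relying on it. The helper `make_example`, the whitespace tokenization, and the label name `cryptocurrency` are hypothetical choices made for illustration.

```python
def make_example(text, entities):
    """Build one training record from a sentence and (phrase, label) pairs.

    Hypothetical helper: uses naive whitespace tokenization and assumes the
    record layout {"tokenized_text": [...], "ner": [[start, end, label], ...]}
    with inclusive end indices.
    """
    tokens = text.split()
    ner = []
    for phrase, label in entities:
        phrase_tokens = phrase.split()
        n = len(phrase_tokens)
        # Find the first position where the phrase occurs as a token sequence.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == phrase_tokens:
                ner.append([start, start + n - 1, label])
                break
        else:
            raise ValueError(f"phrase {phrase!r} not found in {text!r}")
    return {"tokenized_text": tokens, "ner": ner}


example = make_example(
    "what is the harryporter price today",
    [("harryporter", "cryptocurrency")],
)
# example["ner"] is [[3, 3, "cryptocurrency"]]
```

Generating many such records from templates that reflect how users actually phrase queries in the domain is one way to produce the synthetic data mentioned above.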

QuangTQV commented 7 months ago

I know I should fine-tune on my specific domain data, but my dataset is very small compared to the pre-trained model's data. I'm afraid the model won't become biased toward my data. Do you have any suggestions for a good fine-tuning approach? My data consists of entities within the blockchain domain.

urchade commented 7 months ago

Even with a small dataset it should work. How many samples do you have exactly? I have read about someone fine-tuning with 20-30 samples and getting strong performance in their domain.

QuangTQV commented 7 months ago

I have 500 samples for each entity, and I need to extract about 8 entities.

wjbmattingly commented 7 months ago

I did some testing last week and generated 70 synthetic examples to bias the model toward classifying different kinds of labels associated with bird nesting and dietary habits. It works quite well. If your real-world data is fairly consistent, this helps too. You will want to adjust the number of steps in the fine-tune notebook accordingly.
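
One simple way to reason about the number of steps is to derive it from the dataset size, the batch size, and the number of passes over the data. The sketch below does only that arithmetic; the batch size of 8 and the epoch counts are hypothetical values chosen for illustration, not settings taken from the notebook.

```python
import math


def training_steps(num_examples, batch_size, epochs):
    """Total optimizer steps for a given number of passes over the data."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 70 synthetic examples, as in the comment above (batch size and epochs assumed).
small_run = training_steps(num_examples=70, batch_size=8, epochs=10)

# 500 samples for each of 8 entity types, assuming the samples do not overlap.
larger_run = training_steps(num_examples=500 * 8, batch_size=8, epochs=3)
```

With these assumed settings, `small_run` is 90 steps and `larger_run` is 1500 steps. A small dataset needs far fewer steps, and training for much longer risks overfitting to the handful of examples.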