urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Create finetuning script/notebook #8

Closed urchade closed 7 months ago

drewskidang commented 8 months ago

Waiting patiently, but if you had an example of the data format, that would be great.

urchade commented 8 months ago

Hi @drewskidang, the data is a JSON file containing a list of dictionaries with the keys tokenized_text and ner. E.g.:

[
        {
            "tokenized_text": [
                "Alice", "loves", "programming", "in", "Python", ".", "She", "also", "uses", "Lua", "for", "game", "development", "."
            ],
            "ner": [
                [0, 0, "Person"],
                [4, 4, "ProgrammingLanguage"],
                [10, 10, "ProgrammingLanguage"],
                [11, 12, "Activity"]
            ]
        },
        {
            "tokenized_text": [
                "Bob", "is", "a", "Java", "developer", "working", "at", "TechCorp", ".", "He", "enjoys", "machine", "learning", "."
            ],
            "ner": [
                [0, 0, "Person"],
                [3, 3, "ProgrammingLanguage"],
                [7, 7, "Organization"],
                [11, 12, "Field"]
            ]
        }
]
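For reference, here is a minimal sketch (plain Python; the helper names are mine, not part of the GLiNER library) that checks a dataset against this format and recovers the surface text of a span. Note that the span indices are inclusive token positions:

```python
import json

def validate_gliner_records(records):
    """Check that records follow the fine-tuning format described above:
    a list of dicts with 'tokenized_text' (list of strings) and 'ner'
    (list of [start, end, label] spans with inclusive token indices)."""
    for i, rec in enumerate(records):
        tokens = rec["tokenized_text"]
        assert isinstance(tokens, list), f"record {i}: tokenized_text must be a list"
        assert all(isinstance(t, str) for t in tokens), f"record {i}: tokens must be strings"
        for start, end, label in rec["ner"]:
            # Indices are inclusive, so 'end' must stay within the token list.
            assert 0 <= start <= end < len(tokens), f"record {i}: span [{start}, {end}] out of range"
            assert isinstance(label, str), f"record {i}: label must be a string"
    return True

def span_text(record, span):
    """Recover the surface string of a [start, end, label] span (inclusive indices)."""
    start, end, _label = span
    return " ".join(record["tokenized_text"][start : end + 1])

# Small example mirroring the format above.
data = [
    {
        "tokenized_text": ["Alice", "loves", "programming", "in", "Python", "."],
        "ner": [[0, 0, "Person"], [4, 4, "ProgrammingLanguage"]],
    }
]

validate_gliner_records(data)
print(span_text(data[0], data[0]["ner"][1]))  # prints "Python"
```

Loading a training file is then just `validate_gliner_records(json.load(open("train.json")))`.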
drewskidang commented 8 months ago

Thank you!! I saw your ATG repo as well. Do you have plans to release that too?

urchade commented 8 months ago

@drewskidang Thanks for your interest in ATG. I can work on releasing a pretrained model, but I am not sure if it would be useful since public datasets for entity and relation extraction are poor and small-scale.

The good news is that ATG can be extended similarly to GLiNER, by conditioning the graph generation on the entity and/or relation types (both ATG and GLiNER use span-based representation). However, it requires building an IE dataset with diverse entity and relation types (equivalent to NuNer and Pile-NER datasets).

drewskidang commented 8 months ago

Thank you! I'm trying to build something in the legal domain.

deepanshu2207 commented 8 months ago

Thanks for opening this issue.

urchade commented 7 months ago

Preprocessing for Pile-NER/NuNER is available at https://github.com/urchade/GLiNER/tree/main/data

drewskidang commented 7 months ago

@urchade thank you!!! Also, is that the data you used to pretrain it? Any guidance on how to implement ATG would be great as well.

urchade commented 7 months ago

Hi @drewskidang, the results from the paper (https://arxiv.org/pdf/2311.08526.pdf) are from pre-training on Pile-NER. Results for NuNER are also available on Hugging Face (https://huggingface.co/urchade). Pile-NER yields stronger results as its texts are longer.

I will try to work on ATG in the coming days.

urchade commented 7 months ago

A minimal finetuning notebook is available: https://github.com/urchade/GLiNER/blob/main/examples/finetune.ipynb

deepanshu2207 commented 7 months ago

Thank you!