urchade closed this issue 7 months ago
Hi @drewskidang, the data is a JSON file containing a list of dictionaries, each with the keys tokenized_text and ner. E.g.:
[
{
"tokenized_text": [
"Alice", "loves", "programming", "in", "Python", ".", "She", "also", "uses", "Lua", "for", "game", "development", "."
],
"ner": [
[0, 0, "Person"],
[4, 4, "ProgrammingLanguage"],
[9, 9, "ProgrammingLanguage"],
[11, 12, "Activity"]
]
},
{
"tokenized_text": [
"Bob", "is", "a", "Java", "developer", "working", "at", "TechCorp", ".", "He", "enjoys", "machine", "learning", "."
],
"ner": [
[0, 0, "Person"],
[3, 3, "ProgrammingLanguage"],
[7, 7, "Organization"],
[11, 12, "Field"]
]
}
]
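To sanity-check a file in this format before training, something like the following may help. It is a minimal sketch, assuming each span is a pair of token indices with an inclusive end (as [0, 0, "Person"] for a single token suggests); check_record is a hypothetical helper written for illustration, not part of GLiNER.

```python
import json


def check_record(record):
    """Return (label, entity_text) pairs for one record; raise ValueError on a bad span.

    Hypothetical helper: assumes spans are [start, end, label] with an inclusive end.
    """
    tokens = record["tokenized_text"]
    entities = []
    for start, end, label in record["ner"]:
        # Both indices must point at existing tokens, and start must not exceed end.
        if not (0 <= start <= end < len(tokens)):
            raise ValueError(
                f"span [{start}, {end}] is out of range for {len(tokens)} tokens"
            )
        # end is inclusive, so slice up to end + 1.
        entities.append((label, " ".join(tokens[start:end + 1])))
    return entities


# Small inline sample in the same shape as the data described above.
sample = json.loads(
    '[{"tokenized_text": ["Bob", "is", "a", "Java", "developer", "."],'
    ' "ner": [[0, 0, "Person"], [3, 3, "ProgrammingLanguage"]]}]'
)

for record in sample:
    for label, text in check_record(record):
        print(f"{label}: {text}")
```

Printing the recovered text next to each label makes off-by-one span errors easy to spot by eye, since a shifted index shows up as the wrong word.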
Thank you!! I saw your ATG repo as well. Do you have plans to ramp that up as well?
@drewskidang Thanks for your interest in ATG. I can work on releasing a pretrained model, but I am not sure if it would be useful since public datasets for entity and relation extraction are poor and small-scale.
The good news is that ATG can be extended similarly to GLiNER, by conditioning the graph generation on the entity and/or relation types (both ATG and GLiNER use span-based representations). However, it requires building an IE dataset with diverse entity and relation types (equivalent to the NuNer and Pile-NER datasets).
Thank you! I'm trying to build something in the legal domain.
Thanks for opening this issue.
Preprocessing for Pile-NER/NuNer is available at https://github.com/urchade/GLiNER/tree/main/data
@urchade thank you!!! Also, is that the data you used to pretrain it? If you have any guidance on how to implement ATG as well, that would be great.
Hi @drewskidang, the results from the paper (https://arxiv.org/pdf/2311.08526.pdf) are from pre-training with Pile-NER. Results for NuNer are also available on Hugging Face (https://huggingface.co/urchade). Pile-NER yields stronger results because its texts are longer.
I will try to work on ATG in the coming days.
A minimal fine-tuning notebook is available: https://github.com/urchade/GLiNER/blob/main/examples/finetune.ipynb
Thank you!
Waiting patiently, but if you have an example of the data format, that would be great.