Multilingual - synthetic dataset

jackboyla / GLiREL

Generalist and Lightweight Model for Relation Extraction (Extract any relationship types from text)

Apache License 2.0


Multilingual - synthetic dataset #9

Open SylvainVerdy opened 1 month ago

SylvainVerdy commented 1 month ago

Hi @jackboyla,

Thanks for your work!

I was wondering if you plan to extend the dataset to other languages (French, Spanish, German, for example) to build multilingual models.

Regards,

Sylvain

jackboyla commented 1 month ago

hey @SylvainVerdy thank you!

I’d love to but unfortunately I don’t have time at the moment. But if you’re interested, the code I used to create the dataset is at https://github.com/jackboyla/GLiREL/tree/main/data/dataset-generation

You’ll need to find a multilingual dataset of interest, annotate entities with spaCy or another NER model, and then annotate relations with an LLM.
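A minimal sketch of the three steps above, not the actual code in data/dataset-generation. `ner_fn` and `llm_fn` are hypothetical callables standing in for spaCy (or another NER model) and for whichever LLM you pick, and the output record layout (`tokenized_text`, `ner`, `relations`, `relation_text`) is an assumption that should be checked against the repo's data format.

```python
import itertools


def build_relation_prompt(tokens, head, tail, labels):
    """Build one prompt asking the LLM to label a (head, tail) entity pair."""
    text = " ".join(tokens)
    options = ", ".join(labels)
    return (
        f"Text: {text}\n"
        f"Head entity: {head['text']} ({head['type']})\n"
        f"Tail entity: {tail['text']} ({tail['type']})\n"
        f"Choose one relation from [{options}] or answer 'none'."
    )


def annotate_example(tokens, ner_fn, llm_fn, labels):
    """Annotate one tokenized sentence with entities, then relations.

    ner_fn(tokens) -> list of dicts with 'start', 'end', 'type', 'text'
                      (hypothetical wrapper around an NER model).
    llm_fn(prompt) -> str (hypothetical wrapper around an LLM call).
    """
    entities = ner_fn(tokens)
    relations = []
    # Ask about every ordered pair, since relations are directional.
    for head, tail in itertools.permutations(entities, 2):
        answer = llm_fn(build_relation_prompt(tokens, head, tail, labels)).strip()
        # Keep only answers that match a known label; drop 'none' and noise.
        if answer in labels:
            relations.append({"head": head, "tail": tail, "relation_text": answer})
    return {"tokenized_text": tokens, "ner": entities, "relations": relations}
```

Because the NER and LLM steps are passed in as functions, the same loop can be reused per language by swapping in a language-specific NER model. Note that the number of LLM calls grows quadratically with the number of entities in a sentence.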

chrishokamp commented 1 month ago

@SylvainVerdy the Tower* LLMs from Unbabel could be a good thing to try, but do note the non-commercial license: https://huggingface.co/Unbabel

SylvainVerdy commented 1 month ago

Thanks @jackboyla and @chrishokamp for your quick feedback. I'll check out your links.

chrishokamp commented 1 month ago

cheers @SylvainVerdy one more to look at https://pieter.ai/trans-tokenization/ https://huggingface.co/Tweeties

SylvainVerdy commented 1 month ago

Hello, I've found a few resources:

https://arxiv.org/abs/2104.08655 (DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction).

If you only want a multilingual NER corpus:

SemEval 2023 Task 2: MultiCoNER II (https://multiconer.github.io/dataset)

Sorry to bother you again. I have a question about the model: can I train it in full fine-tuning mode with supervised data? I saw in the YAML file that you have to put unseen_relations_type in eval. Is there any obligation to follow this instruction?

jackboyla commented 1 month ago

@SylvainVerdy thanks for sharing those resources!

to answer your question, there’s no requirement for the eval relations to be “unseen”, unless you’re interested in observing zero-shot performance.

For supervised learning, you will need to adjust the train.py file because right now the script will try to split the data into train/eval, ensuring that no relation types overlap between train and eval.

This function performs that split and may need to be removed if you already have a train/eval split that you’re happy with: https://github.com/jackboyla/GLiREL/blob/7f86480c5f9007a6dd7f3b3cd42c07ec5db81c6e/train.py#L79
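To illustrate the difference, here is a rough sketch of the two kinds of split. It is not the function in train.py: the function names, the `relations` / `relation_text` keys, and the choice to drop examples that mix seen and unseen relation types are all assumptions made for this example.

```python
import random


def relation_types(example):
    """Set of relation labels in one example (assumed record layout)."""
    return {rel["relation_text"] for rel in example["relations"]}


def zero_shot_split(examples, num_unseen, seed=0):
    """Hold out `num_unseen` relation types so train and eval share none."""
    rng = random.Random(seed)
    all_types = sorted(set().union(*(relation_types(e) for e in examples)))
    unseen = set(rng.sample(all_types, num_unseen))
    train, eval_ = [], []
    for ex in examples:
        types = relation_types(ex)
        if types and types <= unseen:
            eval_.append(ex)   # only held-out relation types
        elif not (types & unseen):
            train.append(ex)   # no held-out relation types
        # Examples mixing seen and unseen types are dropped in this sketch.
    return train, eval_


def supervised_split(examples, eval_fraction=0.1, seed=0):
    """Plain random split; relation types may appear on both sides."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

For supervised fine-tuning, something like `supervised_split` (or simply loading your own train/eval files) would take the place of the zero-shot split.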

jackboyla commented 1 month ago

hey again @SylvainVerdy , I’ve updated the finetuning notebook. This will work with datasets where we don’t care about zero-shot improvement, just continuing training 😊 https://github.com/jackboyla/GLiREL/blob/main/examples/finetune.ipynb

SylvainVerdy commented 1 month ago

thank you so much, @jackboyla! 😍