hitz-zentroa / GoLLIE

Guideline following Large Language Model for Information Extraction
https://hitz-zentroa.github.io/GoLLIE/
Apache License 2.0

[TASK] Another language support for GoLLIE (more specifically Vietnamese) #9

Open NoAtmosphere0 opened 9 months ago

NoAtmosphere0 commented 9 months ago

Hi GoLLIE research team, I am currently in a group of Vietnamese university students who want to present your paper for an upcoming seminar in our "Introduction to Natural Language Processing" course. Our task is to summarize and explain the contents of your paper to our fellow students and lecturers.

To make it easier to understand for our classmates, we are interested in training GoLLIE using Vietnamese datasets. If it's possible, we would greatly appreciate it if you could provide us with some instructions on how to proceed with this. We sincerely enjoyed reading your paper and believe that it would greatly benefit our presentation.

Here are some datasets for the named-entity-recognition subtask that I found on Hugging Face:

We would be extremely grateful for any guidance or assistance with this effort. Please feel free to reach out if you have any questions or need more information from us. We are more than willing to cooperate to make this collaboration successful.

ikergarcia1996 commented 9 months ago

Hi @NoAtmosphere0!

I believe the easiest way to achieve this would be to fine-tune one of the GoLLIE checkpoints on a Vietnamese dataset. WikiANN and Polyglot NER seem like the best candidates, since they use the same labels as CoNLL03. To fine-tune the model with either of these datasets, you should:

  1. Duplicate the CoNLL03 config and save the copy as a new Wikiann/Polyglot .json file: https://github.com/hitz-zentroa/GoLLIE/blob/main/configs/data_configs/conll03_config.json. Replace the values of "train_file", "dev_file", and "test_file" with the paths to your datasets in .conll format (.tsv).
  2. Modify the generate data script: https://github.com/hitz-zentroa/GoLLIE/blob/main/bash_scripts/generate_data.sh. Remove all the existing config files from it and add the ones you created in step 1. Then run the script.
  3. Modify the GoLLIE7B config file: https://github.com/hitz-zentroa/GoLLIE/blob/main/configs/model_configs/GoLLIE-7B_CodeLLaMA.yaml. Remove all the tasks and add the ones you just created. Change the model from codellama/CodeLlama-7b-hf to HiTZ/GoLLIE-7B.
  4. After training, the output folder will contain the new LoRA adapters for GoLLIE. You can load them with the load_model function found here: https://github.com/hitz-zentroa/GoLLIE/blob/main/src/model/load_model.py.
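
As a rough illustration of step 1, the sketch below writes NER sentences to a tab-separated file and derives a new data config from an existing one. It is not taken from the GoLLIE repository. The helper names `write_conll` and `derive_config` are hypothetical, the one-token-per-line `token<TAB>tag` layout with a blank line between sentences is an assumption about what the CoNLL loader expects, and the config is assumed to be a JSON object with "train_file", "dev_file", and "test_file" as top-level keys. Check all of this against conll03_config.json and the data loader before relying on it.

```python
import json
from pathlib import Path


def write_conll(sentences, path):
    """Write (tokens, tags) pairs as token<TAB>tag lines.

    Assumption: one token per line, a blank line after each sentence.
    """
    with open(path, "w", encoding="utf-8") as f:
        for tokens, tags in sentences:
            if len(tokens) != len(tags):
                raise ValueError("tokens and tags must have the same length")
            for token, tag in zip(tokens, tags):
                f.write(f"{token}\t{tag}\n")
            f.write("\n")


def derive_config(base_config, out_config, train_file, dev_file, test_file):
    """Copy a data config and point its three split paths at new files.

    Assumption: the three keys live at the top level of the JSON object.
    Every other key is carried over unchanged.
    """
    config = json.loads(Path(base_config).read_text(encoding="utf-8"))
    config.update(
        {
            "train_file": str(train_file),
            "dev_file": str(dev_file),
            "test_file": str(test_file),
        }
    )
    Path(out_config).write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

A call could look like `derive_config("configs/data_configs/conll03_config.json", "configs/data_configs/wikiann_vi_config.json", "data/vi/train.tsv", "data/vi/dev.tsv", "data/vi/test.tsv")`, where the output config name and the `data/vi/` paths are placeholders. Any other dataset-specific fields in the copied config would still need to be reviewed by hand.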

A significant concern here is the proficiency of LLaMA2/CodeLLaMA in Vietnamese. The model might not be very adept for that particular language, and unfortunately, there's a limited selection of multilingual LLMs available.

NoAtmosphere0 commented 9 months ago

Hi @ikergarcia1996!

Thank you for your prompt response and helpful instructions. We will follow the steps that you have outlined in your response to train GoLLIE and also keep in mind your concerns about the proficiency of LLaMA2/CodeLLaMA in Vietnamese.

We will leave this issue open so we can keep you updated on our progress, and we will let you know if we have any questions or need further assistance. Thanks again for your support!

brunoalano commented 7 months ago

@NoAtmosphere0 Did you make any progress on this?