iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

notebook for NER #21

Closed jwijffels closed 3 years ago

jwijffels commented 3 years ago

Hello there @iPieter, I need to evaluate RobBERT and BERTje on a named entity recognition task on 18th-19th century Dutch texts. Is there a notebook somewhere where I can follow the flow from an IOB-tagged dataset to finetuning and applying the model?

iPieter commented 3 years ago

Thank you for your interest in RobBERT!

I have added a notebook here and uploaded the model pdelobelle/robbert-v2-dutch-ner to the huggingface repository.

You should be able to use the model yourself with this code snippet:

from transformers import RobertaTokenizer, RobertaForTokenClassification

# Load the tokenizer and the finetuned token-classification (NER) model from the Hugging Face hub
tokenizer = RobertaTokenizer.from_pretrained('pdelobelle/robbert-v2-dutch-ner', force_download=True)
model = RobertaForTokenClassification.from_pretrained('pdelobelle/robbert-v2-dutch-ner', return_dict=True, force_download=True)
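
A quick way to sanity-check the model is the token-classification pipeline (this snippet is illustrative and not from the notebook; the example sentence is made up):

from transformers import pipeline

# Wrap the model and tokenizer above in a NER pipeline and tag a sample sentence
nlp = pipeline('ner', model=model, tokenizer=tokenizer)
print(nlp('Pieter woont in Leuven en werkt aan RobBERT.'))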

If you're interested in experimenting a bit, Maarten van Gompel has also created proycon/robbert-ner-cased-sonar1-nld and proycon/robbert2-ner-cased-sonar1-nld from the SoNaR corpus. Depending on your task, these may or may not work better.

jwijffels commented 3 years ago

Thanks for your time and for the notebook, which shows how to score with an existing model. Do you also have a notebook for training on your own corpus? I started training today using BERTje at https://colab.research.google.com/drive/16zr_LJOfVqPquGV8Idk1y1XhjyFDewJV#scrollTo=8U5qqfuNqcdu but couldn't find a resource that showcases finetuning a NER model. It would be great if I could compare to RobBERT, and I'm also planning to compare to a regular CRF model.

iPieter commented 3 years ago

Ok, I went over your question a bit too quickly. If you want to finetune RobBERT yourself for NER, I highly recommend Hugging Face's run_ner.py script.

It is not a notebook, but it shouldn't be too difficult to use regardless.

One final suggestion: if your labels match, don't start from pdelobelle/robbert-v2-dutch-base but from pdelobelle/robbert-v2-dutch-ner, since transfer learning would make a lot of sense there.
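
If your labels do not match and you start from the base model instead, the setup could look roughly like this (a minimal sketch, not from the original thread; the label list is a placeholder for your own IOB tags):

from transformers import RobertaTokenizer, RobertaForTokenClassification

# Placeholder IOB label set; replace with the tags from your own corpus
labels = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']

# Start from the base model and attach a freshly initialised token-classification head
model = RobertaForTokenClassification.from_pretrained(
    'pdelobelle/robbert-v2-dutch-base',
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = RobertaTokenizer.from_pretrained('pdelobelle/robbert-v2-dutch-base')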

jwijffels commented 3 years ago

Thanks, the categories in my data do not match the categories of pdelobelle/robbert-v2-dutch-ner. Was pdelobelle/robbert-v2-dutch-ner trained using https://github.com/huggingface/transformers/tree/master/examples/token-classification?

iPieter commented 3 years ago

Yes, although I slightly adapted the original script (from many commits ago) to run multiple hyperparameter experiments with a custom tracking framework.

Nevertheless, that script is really good, and following the readme (and perhaps stepping through the code with a debugger) certainly gives some insights. The way they structure the datasets might be different, for example.
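
For reference, CoNLL/IOB-style data typically has one token and its tag per line, with blank lines between sentences; the exact column layout run_ner.py expects depends on the transformers version (newer versions also accept JSON or CSV via the datasets library), so check its readme. A made-up fragment:

Pieter B-PER
woont O
in O
Leuven B-LOC
. O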

jwijffels commented 3 years ago

Ok, thanks. Will give it a try.