projecte-aina / lm-catalan

Official source for the Catalan language models and resources made within the AINA project.
MIT License

Catalan Language Models & Datasets 💬

A repository for the AINA project.

Models 🤖

Ǎguila-7B is a 7B-parameter LLM that has been trained on a mixture of Spanish, Catalan and English data, adding up to a total of 26B tokens. It uses as a starting point the Falcon-7b model, a state-of-the-art English language model that was openly released just a few months ago by the Technology Innovation Institute. Read more here.
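
As a rough sketch of how the model could be loaded for text generation with the Transformers library (the Hub id projecte-aina/aguila-7b, the trust_remote_code flag and the prompt are assumptions; check the model card for the exact loading instructions):

# Illustrative sketch only: load Ǎguila-7B for causal text generation.
# The model id "projecte-aina/aguila-7b" and the need for trust_remote_code
# are assumptions; see the model card for the exact instructions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/aguila-7b"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # a 7B model usually needs half precision to fit on one GPU
    device_map="auto",
    trust_remote_code=True,       # Falcon-style models may ship custom modelling code
)

prompt = "El mercat del barri és"  # placeholder Catalan prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))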

RoBERTa-base-ca-v2 and BERTa are transformer-based masked language models for the Catalan language. They are based on the RoBERTa base model and have been trained on a medium-size corpus collected from publicly available corpora and crawlers.

longformer-base-4096-ca-v2 is the Longformer version of the roberta-base-ca-v2 masked language model for the Catalan language. These models allow us to process larger contexts (up to 4096 tokens) as input without the need for additional aggregation strategies. The pretraining of this model started from the roberta-base-ca-v2 checkpoint and continued with MLM training on both short and long documents in Catalan.
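
A minimal fill-mask sketch with the Longformer variant, assuming it is published on the Hub as projecte-aina/longformer-base-4096-ca-v2 (the long input below is just an illustration of fitting well over 512 tokens in a single pass):

# Illustrative sketch: fill-mask with the Catalan Longformer on a long input.
# The Hub id "projecte-aina/longformer-base-4096-ca-v2" is an assumption; see the model card.
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

model_id = "projecte-aina/longformer-base-4096-ca-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# A long document (here just a repeated sentence) can be fed directly:
# inputs of up to 4096 tokens are accepted, so no chunking or aggregation is needed.
long_text = "La capital de Catalunya és <mask>. " + "Barcelona és una ciutat mediterrània. " * 100
pipeline = FillMaskPipeline(model, tokenizer)
for prediction in pipeline(long_text, top_k=5):
    print(prediction["token_str"], prediction["score"])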

See results achieved on several tasks below.

Usage example ⚗️

For the RoBERTa-base-ca-v2 model:

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint

# Load the tokenizer and the masked language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-ca-v2')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-base-ca-v2')
model.eval()

# Build a fill-mask pipeline and print the predicted tokens for the masked position
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

Tokenization and pretraining 🧩

The training corpus has been tokenized using a byte-level version of Byte-Pair Encoding (BPE), as used in the original RoBERTa model, with a vocabulary size of 52,000 tokens.

The RoBERTa-ca-v2 pretraining consists of masked language model training that follows the approach employed for the RoBERTa base model, with the same hyperparameters as in the original work. With 16 NVIDIA V100 GPUs of 16GB DDRAM, training lasted a total of 48 hours for BERTa and 96 hours for RoBERTa-ca-v2.
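
As a quick, illustrative way to inspect the tokenizer described above (same model id as in the usage example):

# Illustrative sketch: inspect the byte-level BPE tokenizer described above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-ca-v2')

# The vocabulary size should match the 52,000 tokens mentioned above.
print(tokenizer.vocab_size)

# Subword pieces produced for a Catalan sentence.
print(tokenizer.tokenize("El pa amb tomàquet és un esmorzar típic."))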

Word embeddings (FastText) 🔤

Generated from a curated corpus of over 10GB of high-quality text.

Word Embeddings (more efficient Floret version)

Trained on an extensive Catalan textual corpus of over 34GB of data using the floret method.
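
A minimal sketch of querying the FastText embeddings, assuming they have been downloaded locally as a standard fastText .bin file (the filename below is a placeholder):

# Illustrative sketch: query the Catalan FastText word embeddings.
# "ca_fasttext.bin" is a placeholder for the file downloaded from Zenodo;
# adjust it to the actual filename you obtain.
import fasttext

model = fasttext.load_model("ca_fasttext.bin")  # placeholder path
vector = model.get_word_vector("muntanya")      # dense vector for a word
print(vector.shape)
print(model.get_nearest_neighbors("muntanya", k=5))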

Training corpora

The training corpora consist of several corpora gathered from web crawling and from public corpora.

roberta-base-ca-v2

Corpus Size in GB
Catalan Crawling 13.00
Wikipedia 1.10
DOGC 0.78
Catalan Open Subtitles 0.02
Catalan Oscar 4.00
CaWaC 3.60
Cat. General Crawling 2.50
Cat. Government Crawling 0.24
Cat. News Agency 0.42
Padicat 0.63
RacoCatalà 8.10
Nació Digital 0.42
Vilaweb 0.06
Tweets 0.02

BERTa and Word embeddings

Corpus Size in GB
DOGC 0.801
Cat. Open Subtitles 0.019
Cat. OSCAR 4
CaWac 3.6
Cat. Wikipedia 0.98
Cat. General Crawling 2.6
Cat. Government Crawling 0.247
Cat. News Agency 0.447

To obtain a high-quality training corpus, each corpus has been preprocessed with a pipeline of operations including, among others, sentence splitting, language detection, filtering of badly formed sentences and deduplication of repetitive content. Document boundaries were kept during the process. Finally, the corpora were concatenated and a further global deduplication across them was applied.

The Catalan Textual Corpus can be found at the following link: https://doi.org/10.5281/zenodo.4519348.
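
As a simplified illustration of the global deduplication step described above (hash-based and on whole documents; the actual pipeline is more elaborate than this sketch):

# Simplified illustration of global deduplication across concatenated corpora:
# drop documents whose normalized text has already been seen. The real
# preprocessing pipeline (sentence splitting, language detection, filtering)
# goes well beyond this sketch.
import hashlib

def deduplicate(documents):
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize lightly so trivially different copies still collide.
        key = hashlib.sha256(" ".join(doc.split()).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

corpora = ["Bon dia a tothom.", "Bon  dia a tothom.", "Una frase diferent."]
print(deduplicate(corpora))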

Fine-tuned models 🧗🏼‍♀️🏇🏼🤽🏼‍♀️🏌🏼‍♂️🏄🏼‍♀️

Fine-tuned from the BERTa model:

For a complete list, refer to the following link: https://huggingface.co/projecte-aina/

Fine-tuning

The fine-tuning scripts for the downstream tasks are available at the following link: https://github.com/projecte-aina/club.
They are based on the HuggingFace Transformers library.
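
The actual scripts live in the club repository above; the following is only a rough sketch of the general pattern with the Transformers Trainer, using a tiny placeholder dataset instead of one of the CLUB tasks:

# Rough sketch of the fine-tuning pattern; the real scripts are in
# https://github.com/projecte-aina/club. The two-sentence dataset and binary
# labels below are placeholders, not one of the CLUB tasks.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "BSC-TeMU/roberta-base-ca-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train_data = Dataset.from_dict({
    "text": ["El Barça va guanyar el partit.", "El Parlament aprova els pressupostos."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=train_data).train()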

spaCy models

Trained pipelines for Catalan are available in spaCy: https://spacy.io/models/ca
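
A minimal usage sketch, assuming the small Catalan pipeline ca_core_news_sm has been installed:

# Illustrative sketch: run a Catalan spaCy pipeline for tokenization, POS and NER.
# Assumes the small Catalan pipeline has been installed with:
#   python -m spacy download ca_core_news_sm
import spacy

nlp = spacy.load("ca_core_news_sm")
doc = nlp("La Mercè és la festa major de Barcelona.")

for token in doc:
    print(token.text, token.pos_)
for ent in doc.ents:
    print(ent.text, ent.label_)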

Datasets 🗂️

name task link
ancora-ca-ner Named Entity Recognition https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
ancora-ca-pos Part of Speech tagging https://huggingface.co/datasets/universal_dependencies
STS-ca Semantic Textual Similarity https://huggingface.co/datasets/projecte-aina/sts-ca
TeCla Text Classification https://huggingface.co/datasets/projecte-aina/tecla
TECa Textual Entailment https://huggingface.co/datasets/projecte-aina/teca
VilaQuAD Extractive Question Answering https://huggingface.co/datasets/projecte-aina/vilaquad
ViquiQuAD Extractive Question Answering https://huggingface.co/datasets/projecte-aina/viquiquad
CatalanQA Extractive Question Answering https://huggingface.co/datasets/projecte-aina/catalanqa
xquad-ca Extractive Question Answering https://huggingface.co/datasets/projecte-aina/xquad-ca

For a complete list, refer to: https://huggingface.co/projecte-aina/

For a complete list of the datasets in Zenodo, refer to: https://zenodo.org/communities/catalan-ai/
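
A minimal sketch of loading one of the datasets above with the datasets library (split names, columns and any extra arguments vary per dataset; check each dataset card):

# Illustrative sketch: load one of the datasets listed above from the Hugging Face Hub.
# Depending on the dataset and the datasets library version, a configuration name
# or trust_remote_code argument may also be required; check the dataset card.
from datasets import load_dataset

dataset = load_dataset("projecte-aina/ancora-ca-ner")
print(dataset)  # shows the available splits and their features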

CLUB: Catalan Language Understanding Benchmark 🏆

The CLUB benchmark consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA).

Results ✅

Model NER (F1) POS (F1) STS (Combined) TC (Accuracy) TE (Accuracy) QA (VilaQuAD) (F1/EM) QA (ViquiQuAD) (F1/EM) QA (CatalanQA) (F1/EM) QA (XQuAD-Ca)* (F1/EM)
RoBERTa-base-ca-v2 89.45 99.09 79.07 74.26 83.14 87.74/72.58 88.72/75.91 89.50/76.63 73.64/55.42
BERTa 88.94 99.10 80.19 73.65 79.26 85.93/70.58 87.12/73.11 89.17/77.14 69.20/51.47
mBERT 87.36 98.98 74.26 69.90 74.63 82.78/67.33 86.89/73.53 86.90/74.19 68.79/50.80
XLM-RoBERTa 88.07 99.03 61.61 70.14 33.30 86.29/71.83 86.88/73.11 88.17/75.93 72.55/54.16

*: Trained on CatalanQA, tested on XQuAD-Ca.

For more information, refer to the following link: https://club.aina.bsc.es/
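
The QA columns report SQuAD-style F1 and exact match (EM); as an illustration, they can be computed with the evaluate library on a toy prediction:

# Illustrative sketch: compute the SQuAD-style F1 / exact-match (EM) scores used in
# the QA columns of the table above, here on a single toy prediction.
import evaluate

squad_metric = evaluate.load("squad")
predictions = [{"id": "1", "prediction_text": "Barcelona"}]
references = [{"id": "1", "answers": {"text": ["Barcelona"], "answer_start": [0]}}]
print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}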

Demos

Cite 📣

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Contact 📧

📋 We are interested in (1) extending our corpora to train larger models and (2) training and evaluating the models on other tasks.
For questions regarding this work, contact us at aina@bsc.es