import spacy

# Load the medium Catalan model (installation commands below)
nlp = spacy.load("ca_core_web_md")
doc = nlp("Setze jutges d'un jutjat mengen fetge d'un penjat.")
for t in doc:
    print(t.text, "\t", t.lemma_)  # token text and its lemma
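The same pipeline also exposes part-of-speech tags, morphology and named entities; a minimal sketch (the sentence is our own example):

import spacy

nlp = spacy.load("ca_core_web_md")
doc = nlp("El president de la Generalitat viu a Barcelona.")
# Coarse part-of-speech tag and morphological features per token
for t in doc:
    print(t.text, t.pos_, t.morph)
# Named entities found by the NER component
for ent in doc.ents:
    print(ent.text, ent.label_)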
ca_base_bsc_trf contains training improvements that resolve some issues with clitic tokenization and lemmatization. This is the last release before the 3.1 BSC models, which will introduce new components and improvements.
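To check the clitic handling, you can compare tokens and lemmas on a sentence with pronominal clitics; a minimal sketch (the example sentence is our own, and it assumes ca_base_bsc_trf has been installed from the wheel below):

import spacy

nlp = spacy.load("ca_base_bsc_trf")
# Pronominal clitics such as "-me'l" should be split off and lemmatized correctly
doc = nlp("Porta-me'l quan puguis.")
for t in doc:
    print(t.text, "\t", t.lemma_)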
ca_base_web_trf & ca_core_web_trf contenen un transformer basat en RoBERTA com a base per un entrenament multitasca dels diferents components. La versió "core" conté, a més, vectors FastText per mesurar la similitud semàntica. La versió "base" també pot mesurar la similitud semàntica, però ho fa a partir de NER, dependències i altres informacions.
ca_core_web_lg, en canvi, fa servir els vectors FastText com a base per a l'entrenament dels components, de manera que no necesita transformers o GPU.
This is a pre-production release; the official release will follow shortly.
For now, the syntax_iterators used for chunking, the lemmatization dictionaries and some other components are embedded in the code rather than in the usual spacy/lang and spacy-lookups-data directories. After the official release we will be able to move each component to its proper place.
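Those syntax_iterators drive doc.noun_chunks; a minimal sketch (the sentence is our own example):

import spacy

nlp = spacy.load("ca_core_web_md")
doc = nlp("Els setze jutges del jutjat mengen fetge de penjat.")
# noun_chunks is produced by the embedded syntax_iterators
for chunk in doc.noun_chunks:
    print(chunk.text, "\t", chunk.root.dep_)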
Public release of the Catalan spaCy 3.0 models
spaCy usage basics: https://spacy.io/usage/spacy-101
ca_base_web_trf & ca_core_web_trf contain a Catalan RoBERTa-based transformer as a common backbone for multitask training of the different components. The latter one ("core") also contains FastText embeddings to measure lexical similarity, although the "base" version can also measure semantic similarity, but using NER, dependency and other information, not directly on a dedicated distance matrix.
ca_core_web_lg, on the other hand, uses FastText embeddings as a training backbone, so it doesn't need transformers or GPUs.
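A minimal similarity sketch (the model name and sentences are just example choices; any of the models shipping vectors, e.g. ca_core_web_md, will do):

import spacy

nlp = spacy.load("ca_core_web_md")
doc1 = nlp("El gat dorm al sofà.")
doc2 = nlp("El gos descansa al llit.")
# Cosine similarity over the FastText-derived document vectors
print(doc1.similarity(doc2))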
These are pre-production releases; an "official" spaCy release is forthcoming. For now, the syntax_iterators (for chunking), the lemmatization dictionaries and other components and tweaks are embedded in code, outside of the usual spacy/lang directories and the spacy-lookups-data package, under config/functions.py and the lemmas/ directory (created when the project is run). We provide the training code for the "base" release using spaCy's project structure and facilities (https://spacy.io/usage/projects); you can clone it directly from this GitHub repo. The training data will be downloaded when you initialize the project with:
python -m spacy project assets
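After fetching the assets, training runs through spaCy's project runner. The workflow name below is an assumption; check the project.yml for the commands this project actually defines:

python -m spacy project run all  # "all" is a hypothetical workflow name; see project.yml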
pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.7/ca_base_bsc_trf-3.2.7-py3-none-any.whl
pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.6.2/ca_base_web_trf-3.2.6-py3-none-any.whl
pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.6.2/ca_core_web_trf-3.2.6-py3-none-any.whl
pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.6.2/ca_core_web_md-3.2.6-py3-none-any.whl
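Once installed, you can check which components each pipeline ships; a minimal sketch (any of the models above works):

import spacy

nlp = spacy.load("ca_core_web_trf")
# Names of the components in the pipeline (tagger, parser, ner, ...)
print(nlp.pipe_names)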
Based on the BERTa transformer, AnCora corpus annotations and UDEP treebanks, all merged into single training/dev corpora to enable simultaneous multi-task training. https://github.com/TeMU-BSC/spacy/releases/download/3.2.6/ANCORA_ca.zip
bsc/roberta-base-ca-cased @ Hugging Face, a RoBERTa transformer pretrained on the 1,760-million-token Catalan Text Corpus. BibTeX citation:
@inproceedings{armengol-estape-etal-2021-multilingual,
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
author = "Armengol-Estap{\'e}, Jordi and
Carrino, Casimiro Pio and
Rodriguez-Penagos, Carlos and
de Gibert Bonet, Ona and
Armentano-Oller, Carme and
Gonzalez-Agirre, Aitor and
Melero, Maite and
Villegas, Marta",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.437",
doi = "10.18653/v1/2021.findings-acl.437",
pages = "4933--4946",
}
From version 3.6 of the Catalan Universal Dependencies (https://universaldependencies.org/ca/) project treebank, with changes for pronouns and multi-word tokenization
Adaptation of the French lemmatizer, using word lists and corpus frequencies developed in-house.
https://github.com/TeMU-BSC/spacy/releases/download/v3.2.4lemmas/lemmas.zip
From the original AnCora corpus (https://doi.org/10.5281/zenodo.4529299)
From FastText word embeddings: https://doi.org/10.5281/zenodo.4522040
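A minimal sketch of accessing these vectors directly (the model and word are our own example choices):

import spacy

nlp = spacy.load("ca_core_web_md")
token = nlp("jutge")[0]
# True if the token has a FastText-derived vector, then its dimensionality
print(token.has_vector, token.vector.shape)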
"token_acc":0.9996689501,
"tag_acc":0.9866830883,
"pos_acc":0.9864785119,
"morph_acc":0.9722713864,
"lemma_acc":0.9679711664,
"dep_uas":0.9409872785,
"dep_las":0.9182501866,
"ents_p":0.9153339605,
"ents_r":0.9136150235,
"ents_f":0.9144736842,
"sents_p":0.9861538462,
"sents_r":0.9922600619,
"sents_f":0.9891975309,
"speed":4129.6607627658