With the release of spaCy version 3.2, which includes the enhancements developed at TEMU-BSC in addition to the latest language model, this repo will be updated shortly with experimental versions, improvements, in-house models and functionalities added to the platform by the AINA project:

[CA] spaCy 3.0 model for the Catalan language

spaCy usage guide:

To use our models, after installing:

import spacy

# Load the installed Catalan pipeline
nlp = spacy.load("ca_core_web_md")
doc = nlp("Setze jutges d'un jutjat mengen fetge d'un penjat.")

# Print each token next to its lemma
for t in doc:
    print(t.text, "\t", t.lemma_)
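Beyond lemmas, the same `Doc` exposes part-of-speech tags and dependency labels. The following is a minimal sketch, assuming the model package is installed under the name `ca_core_web_md` as in the snippet above; it returns an empty list when the package is not found. The helper names `model_is_installed` and `describe_tokens` are illustrative and not part of spaCy.

```python
import importlib.util


def model_is_installed(name: str) -> bool:
    """Return True if a pip-installed package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def describe_tokens(text: str, model_name: str = "ca_core_web_md"):
    """Return (text, lemma, POS, dependency) tuples, or [] if the model is missing."""
    if not model_is_installed(model_name):
        return []
    import spacy  # imported lazily, only once we know the model package exists

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(t.text, t.lemma_, t.pos_, t.dep_) for t in doc]


if __name__ == "__main__":
    sentence = "Setze jutges d'un jutjat mengen fetge d'un penjat."
    for row in describe_tokens(sentence):
        print("\t".join(row))
```

Checking for the package first keeps the failure mode quiet when the wheel has not been installed yet; calling `spacy.load` directly would raise an error instead.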

Official Catalan spaCy models from Explosion are now available as of version 3.1.0.

BSC will continue to produce its own, more experimental models, for quicker development cycles and for testing new capabilities.


New (and last) version for spaCy 3.0

This is a pre-production release. The official release will follow very soon.

For now, the syntax_iterators used for chunking, the lemmatization dictionaries and some other components are embedded in the code, rather than in the usual spacy/lang and spacy_lookups_data directories. After the official release we will be able to move each component to its proper directory.
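In spaCy, a language's syntax iterators are what back `doc.noun_chunks`, so the chunking mentioned here can be exercised as in the sketch below. It assumes a pipeline named `ca_core_web_md` that includes a dependency parser; the function name `noun_chunks` is only illustrative, and the function returns an empty list when the model package is not installed.

```python
import importlib.util


def noun_chunks(text: str, model_name: str = "ca_core_web_md"):
    """Return (chunk text, root lemma) pairs, or [] if the model is missing."""
    if importlib.util.find_spec(model_name) is None:
        return []
    import spacy  # imported lazily, only when the model package is present

    nlp = spacy.load(model_name)
    doc = nlp(text)
    # doc.noun_chunks relies on the language's syntax iterators and on the parse
    return [(chunk.text, chunk.root.lemma_) for chunk in doc.noun_chunks]


if __name__ == "__main__":
    for chunk_text, root_lemma in noun_chunks(
        "Setze jutges d'un jutjat mengen fetge d'un penjat."
    ):
        print(chunk_text, "\t", root_lemma)
```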

[EN] spaCy 3.0 releases

Public release of the Catalan spaCy 3.0 models

spaCy usage basics:


pip install

versions 3.2.6

These are pre-production releases; an "official" spaCy release will be forthcoming. For now, the syntax_iterators (for chunking), the lemmatization dictionaries and other components and tweaks are embedded in the code, outside the usual spacy/lang directories and the spacy-lookups-data package, under the config/ directory and the lemmas/ directory (which will be created when the project is run). We provide the training code for the "base" release, using spaCy's project structure and facilities. You can clone it directly from this GitHub repo. The training data is downloaded when you initialize the project with:

python -m spacy project assets


Definitive BSC 3.0 model:

pip install

Base model without word vectors:

pip install

Core model with word embeddings for lexical similarity:

pip install

Core model without the BERTa transformer, but with FastText embeddings:

pip install

Non-wheel, gzipped versions also available at


Based on the BERTa transformer, AnCora corpus annotations and UDEP treebanks, all merged into single training/dev corpora to enable simultaneous multi-task training.


bsc/roberta-base-ca-cased @ Hugging Face, a RoBERTa transformer pretrained on the 1,760-million-token Catalan Text Corpus. BibTeX citation:

Dependency Treebank, XPOS, sentence segmentation

From version 3.6 of the Catalan Universal Dependencies project treebank, with changes for pronouns and multi-word tokenization.


Adaptation of the French lemmatizer, using word lists and corpus frequencies developed in-house.
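To illustrate how word lists and corpus frequencies can work together, here is a minimal sketch of a frequency-based lookup lemmatizer. It is not the BSC implementation: the table contents, the frequency counts and the names `WORD_LIST`, `LEMMA_FREQ` and `lemmatize` are all invented for this example.

```python
from typing import Dict, List

# Toy data, invented for illustration only: surface form -> candidate lemmas.
WORD_LIST: Dict[str, List[str]] = {
    "jutges": ["jutge", "jutjar"],  # ambiguous: noun plural or verb form
    "mengen": ["menjar"],
}

# Toy corpus frequencies for each lemma (hypothetical counts).
LEMMA_FREQ: Dict[str, int] = {"jutge": 120, "jutjar": 45, "menjar": 300}


def lemmatize(form: str) -> str:
    """Pick the most frequent candidate lemma; fall back to the form itself."""
    candidates = WORD_LIST.get(form.lower())
    if not candidates:
        return form  # unknown form: return it unchanged
    return max(candidates, key=lambda lemma: LEMMA_FREQ.get(lemma, 0))


if __name__ == "__main__":
    for word in ["jutges", "mengen", "fetge"]:
        print(word, "\t", lemmatize(word))
```

A frequency tie-break like this ignores context; a lemmatizer that also consults the part-of-speech tag can choose between candidates such as `jutge` and `jutjar` more reliably.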

Named Entity Recognition

From the original AnCora corpus.

Word vectors ("core" model only)

From FastText word embeddings:
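As background on what such vectors enable: spaCy's `similarity` methods compare vectors by cosine similarity by default. The sketch below shows that computation in plain Python. The three-dimensional vectors in `TOY_VECTORS` are made up for the example and are not taken from the FastText embeddings; real vectors have many more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "jutge": [0.9, 0.1, 0.2],
    "jutjat": [0.8, 0.2, 0.3],
    "fetge": [0.1, 0.9, 0.4],
}

if __name__ == "__main__":
    for other in ["jutjat", "fetge"]:
        score = cosine_similarity(TOY_VECTORS["jutge"], TOY_VECTORS[other])
        print("jutge", other, round(score, 3))
```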

External evaluation on test split for ca_base_web_trf:
