projecte-aina / spacy

Pre-production releases for Spacy in Catalan
MIT License

News:

With the release of spaCy version 3.2, which includes the enhancements developed at TEMU-BSC in addition to the latest language model, this repo will be updated shortly with experimental versions, improvements, in-house models, and functionalities added to the platform by the AINA project: https://github.com/projecte-aina/lm-catalan

[CA] spaCy 3.0 model for the Catalan language

spaCy usage manual: https://spacy.io/usage/spacy-101

To use our models after installing them:

import spacy

nlp = spacy.load("ca_core_web_md")

doc = nlp("Setze jutges d'un jutjat mengen fetge d'un penjat.")

for t in doc: print(t.text,"\t",t.lemma_)
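Beyond lemmas, the same pipeline object exposes other annotations such as part-of-speech tags and named entities. The following is a minimal sketch, assuming the ca_core_web_md model from the installation section is installed. The helper name token_table is hypothetical; it is not part of this repo or of spaCy.

```python
def token_table(doc):
    """Return one 'text<TAB>lemma<TAB>pos' row per token.

    Works on a spaCy Doc or on any iterable of objects that expose
    the .text, .lemma_ and .pos_ attributes.
    """
    return ["\t".join((t.text, t.lemma_, t.pos_)) for t in doc]


if __name__ == "__main__":
    try:
        import spacy
        # spacy.load raises OSError if the model package is not installed.
        nlp = spacy.load("ca_core_web_md")
    except (ImportError, OSError) as err:
        print("spaCy or ca_core_web_md is not available:", err)
    else:
        doc = nlp("Setze jutges d'un jutjat mengen fetge d'un penjat.")
        print("\n".join(token_table(doc)))
        # Named entities, if the pipeline includes an NER component.
        for ent in doc.ents:
            print(ent.text, ent.label_)
```

Keeping the formatting in a small function separates it from model loading, so the output layout can be checked without the model being present.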

Official Catalan spaCy models from Explosion are now available, starting with spaCy version 3.1.0.

BSC will continue to generate its own, more experimental models, allowing quicker development cycles and the testing of new capabilities.

Versions

New (and last) version for spaCy 3.0

This is a pre-production release. The official release will follow very soon.

For now, the syntax_iterators used for chunking, the lemmatization dictionaries, and some other components are embedded in the code rather than in the usual spacy/lang and spacy_lookups_data directories. After the official release, we will be able to move each component to its proper directory.

[EN] spaCy 3.0 releases

Public release of the Catalan spaCy 3.0 models

spaCy usage basics: https://spacy.io/usage/spacy-101

Versions

New (and last) version for spaCy 3.0

pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.7/ca_base_bsc_trf-3.2.7-py3-none-any.whl

versions 3.2.6

These are pre-production releases; an "official" spaCy release will be forthcoming. For now, the syntax_iterators (for chunking), the lemmatization dictionaries, and other components and tweaks are embedded in code rather than in the usual spacy/lang directories or the spacy-lookups-data package. They live in config/functions.py and in the lemmas/ directory, which is created when the project is run.

We provide the training code for the "base" release, using spaCy's project structure and facilities (https://spacy.io/usage/projects). You can clone it directly from this GitHub repo. The training data is downloaded when you initialize the project with:

python -m spacy project assets

Installation:

definitive bsc 3.0 model:

pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.7/ca_base_bsc_trf-3.2.7-py3-none-any.whl

base model without word vectors:

pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.6.2/ca_base_web_trf-3.2.6-py3-none-any.whl

core model with word embeddings for lexical similarity:

pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.6.2/ca_core_web_trf-3.2.6-py3-none-any.whl

core model without the BERTa transformer, but with FastText embeddings:

pip install https://github.com/TeMU-BSC/spacy/releases/download/3.2.6.2/ca_core_web_md-3.2.6-py3-none-any.whl

Non-wheel, gzipped versions are also available at https://github.com/TeMU-BSC/spacy/releases/tag/3.2.4gz
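The wheel links above share one pattern: a release tag, then a file named after the model and its version. As a rough sketch, the pattern can be captured in a small helper. The function name wheel_url and its parameters are hypothetical, and the pattern is inferred only from the four URLs listed here, so it may not hold for other releases.

```python
BASE_URL = "https://github.com/TeMU-BSC/spacy/releases/download"


def wheel_url(model, version, release_tag=None):
    """Build a wheel download URL following the pattern of the links above.

    release_tag defaults to the model version; some releases use a
    different tag (for example tag 3.2.6.2 for version 3.2.6 wheels).
    """
    tag = release_tag if release_tag is not None else version
    return f"{BASE_URL}/{tag}/{model}-{version}-py3-none-any.whl"


# Each printed URL can be passed to `pip install`.
print(wheel_url("ca_base_bsc_trf", "3.2.7"))
print(wheel_url("ca_core_web_md", "3.2.6", release_tag="3.2.6.2"))
```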

Sources

Based on the BERTa transformer, AnCora corpus annotations, and UDEP treebanks, all merged into single training/dev corpora to enable simultaneous multi-task training: https://github.com/TeMU-BSC/spacy/releases/download/3.2.6/ANCORA_ca.zip

Transformer:

bsc/roberta-base-ca-cased @ Hugging Face, a RoBERTa transformer pretrained on the 1,760-million-token Catalan Text Corpus. BibTeX citation:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Dependency Treebank, XPOS, sentence segmentation

From version 3.6 of the Catalan Universal Dependencies (https://universaldependencies.org/ca/) project treebank, with changes to pronouns and multi-word tokenization.

Lemmatization

Adaptation of the French lemmatizer, using word lists and corpus frequencies developed in-house.

https://github.com/TeMU-BSC/spacy/releases/download/v3.2.4lemmas/lemmas.zip

Named Entity Recognition

From the original AnCora corpus (https://doi.org/10.5281/zenodo.4529299)

Word vectors ("core" model only)

From FastText word embeddings: https://doi.org/10.5281/zenodo.4522040

External evaluation on test split for ca_base_web_trf:

  "token_acc":0.9996689501,
  "tag_acc":0.9866830883,
  "pos_acc":0.9864785119,
  "morph_acc":0.9722713864,
  "lemma_acc":0.9679711664,
  "dep_uas":0.9409872785,
  "dep_las":0.9182501866,
  "ents_p":0.9153339605,
  "ents_r":0.9136150235,
  "ents_f":0.9144736842,
  "sents_p":0.9861538462,
  "sents_r":0.9922600619,
  "sents_f":0.9891975309,
  "speed":4129.6607627658
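In the figures above, each *_f score should be the harmonic mean (F1) of the matching *_p (precision) and *_r (recall) values, assuming the usual scoring convention. The sketch below recomputes the entity and sentence F-scores from the reported precision and recall as a sanity check; the function name f1 is only illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the evaluation figures above.
ents_f = f1(0.9153339605, 0.9136150235)
sents_f = f1(0.9861538462, 0.9922600619)

print(f"ents_f  = {ents_f:.10f}")
print(f"sents_f = {sents_f:.10f}")
```

Both recomputed values agree with the reported ents_f and sents_f up to rounding of the published digits.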