ljvmiranda921 / calamanCy

NLP pipelines for Tagalog using spaCy
MIT License

Add training directory for v0.1.0 #17

Closed ljvmiranda921 closed 1 year ago

ljvmiranda921 commented 1 year ago

Description

Closes #16

This PR adds the initial models for calamanCy. For the first few versions, I think it's better to just have the `_md` to `_trf` models. There are also some design decisions I made when training the pipelines. Here are the notable ones:

Models and pipelines

| Model | Pipelines | Description |
|---|---|---|
| tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |
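To illustrate the trade-offs in the table above, here's a minimal, hypothetical helper (not part of calamanCy itself) that picks a model name given a disk budget and GPU availability; the sizes and pipeline metadata are taken from the table, and the selection heuristic (largest model that fits) is an assumption for illustration:

```python
# Hypothetical helper for choosing a calamanCy model.
# Metadata mirrors the table above; the heuristic is illustrative only.
MODELS = {
    "tl_calamancy_md": {"size_mb": 73.7, "needs_gpu": False},
    "tl_calamancy_lg": {"size_mb": 431.9, "needs_gpu": False},
    "tl_calamancy_trf": {"size_mb": 775.6, "needs_gpu": True},
}

def pick_model(have_gpu: bool, max_mb: float) -> str:
    """Return the largest model that fits the disk budget and hardware."""
    candidates = [
        name
        for name, meta in MODELS.items()
        if meta["size_mb"] <= max_mb and (have_gpu or not meta["needs_gpu"])
    ]
    # Prefer the biggest model that fits, on the assumption that
    # larger pipelines generally score better.
    return max(candidates, key=lambda n: MODELS[n]["size_mb"])

print(pick_model(have_gpu=False, max_mb=500))  # tl_calamancy_lg
```

Once a model package is installed, it would be loaded like any other spaCy pipeline, e.g. `spacy.load("tl_calamancy_md")`.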

Data sources

| Source | Authors | License |
|---|---|---|
| TLUnified Dataset | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |
| UD_Tagalog-TRG | Stephanie Samson, Daniel Zeman, and Mary Ann Tan | CC BY-SA 3.0 |
| UD_Tagalog-Ugnayan | Angelina Aquino | CC BY-NC-SA 4.0 |