ljvmiranda921 / calamanCy

NLP pipelines for Tagalog using spaCy
MIT License

Add training directory for v0.1.0 #17

Closed ljvmiranda921 closed 1 year ago

ljvmiranda921 commented 1 year ago

Description

Closes #16

This PR adds the initial models for calamanCy. For the first few versions, I think it's better to just have the `_md` to `_trf` models. There are also some design decisions I made when training the pipelines. Here are the notable ones:

Models and pipelines

| Model | Pipelines | Description |
|---|---|---|
| tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |
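To illustrate the trade-offs in the table above, here's a minimal, hypothetical helper (not part of calamanCy itself) that picks a model name given a disk budget and GPU availability; the sizes and pipeline metadata are taken from the table, and the selection heuristic (largest model that fits) is an assumption for illustration:

```python
# Hypothetical helper for choosing a calamanCy model.
# Metadata mirrors the table above; the heuristic is illustrative only.
MODELS = {
    "tl_calamancy_md": {"size_mb": 73.7, "needs_gpu": False},
    "tl_calamancy_lg": {"size_mb": 431.9, "needs_gpu": False},
    "tl_calamancy_trf": {"size_mb": 775.6, "needs_gpu": True},
}

def pick_model(have_gpu: bool, max_mb: float) -> str:
    """Return the largest model that fits the disk budget and hardware."""
    candidates = [
        name
        for name, meta in MODELS.items()
        if meta["size_mb"] <= max_mb and (have_gpu or not meta["needs_gpu"])
    ]
    # Prefer the biggest model that fits, on the assumption that
    # larger pipelines generally score better.
    return max(candidates, key=lambda n: MODELS[n]["size_mb"])

print(pick_model(have_gpu=False, max_mb=500))  # tl_calamancy_lg
```

Once a model package is installed, it would be loaded like any other spaCy pipeline, e.g. `spacy.load("tl_calamancy_md")`.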

Data sources

| Source | Authors | License |
|---|---|---|
| TLUnified Dataset | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |
| UD_Tagalog-TRG | Stephanie Samson, Daniel Zeman, and Mary Ann Tan | CC BY-SA 3.0 |
| UD_Tagalog-Ugnayan | Angelina Aquino | CC BY-NC-SA 4.0 |