shon-otmazgin / fastcoref

MIT License

Trainer #10

Closed shon-otmazgin closed 1 year ago

shon-otmazgin commented 1 year ago

Distil your own coref model

On top of the provided models, the package also provides the ability to train and distill coreference models on your own data, opening the possibility for fast and accurate coreference models for additional languages and domains.

To distil your own model you need:

  1. A large unlabeled dataset, for instance Wikipedia or any other source.

    Guidelines:

    1. Each dataset split (train/dev/test) should be in a separate file.
    2. Each file should be in jsonlines format.
    3. Each json line in the file must include at least one of:
        - `text`: str - a raw text string.
        - `tokens`: List[str] - a list of tokens (tokenized text).
        - `sentences`: List[List[str]] - a list of lists of tokens (tokenized sentences).
    4. Cluster information (see the next step for annotation), given as span start/end indices into the provided field: `text` (char level), `tokens` (word level), or `sentences` (word level).
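    For illustration, a minimal sketch of writing such a jsonlines file using only the standard library (the example text, spans, and filename are hypothetical; here the spans are char-level offsets into `text`):

    ```python
    import json

    # One training example with two coreference clusters:
    # {"Alice", "She"} and {"Bob", "him"}, as [start, end) char spans.
    examples = [
        {
            "text": "Alice met Bob. She greeted him.",
            "clusters": [[[0, 5], [15, 18]], [[10, 13], [27, 30]]],
        },
    ]

    # Write one json object per line (jsonlines format).
    with open("train_file_with_clusters.jsonlines", "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    ```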
  2. A model to annotate the clusters, for instance the package's `LingMessCoref` model.

```python
from fastcoref import LingMessCoref

model = LingMessCoref()
preds = model.predict(texts=texts, output_file='train_file_with_clusters.jsonlines')
```

3. Train and evaluate your own `FCoref`
```python
from fastcoref import TrainingArgs, CorefTrainer

args = TrainingArgs(
    output_dir='test-trainer',
    overwrite_output_dir=True,
    model_name_or_path='distilroberta-base',
    device='cuda:2',
    epochs=129,
    logging_steps=100,
    eval_steps=100
)   # you can control other arguments such as the learning rate and others.

trainer = CorefTrainer(
    args=args,
    train_file='train_file_with_clusters.jsonlines', 
    dev_file='path-to-dev-file',    # optional
    test_file='path-to-test-file'   # optional
)
trainer.train()
trainer.evaluate(test=True)

trainer.push_to_hub('your-fast-coref-model-path')
```

After training finishes, push your model to the Hugging Face Hub (or keep it local), and load it:

```python
from fastcoref import FCoref

model = FCoref(
   model_name_or_path='your-fast-coref-model-path',
   device='cuda:0'
)
```