nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Custom model with pretokenized input including multiword #56

Open ziqianPeng opened 2 years ago

ziqianPeng commented 2 years ago

Hello! I'm trying to train a custom parser with trankit, using pretokenized input extracted from CoNLL-U files.

I may not be doing this the intended way, but with my approach I ran into bugs for French (multiword tokens) and Chinese ("KeyError UD-Japanese-Like" when I parse my test file right after training finishes), so I modified the source code to fix them. I also changed the path of the xlm_roberta model in file_utils.py so that it is downloaded only once when training multiple models of the same type, such as 'customized'. The file train_pred_trainkit.py is an example of how to apply these modifications; see in particular the function pred_trankit.
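To make the multiword case concrete, here is a minimal sketch of extracting pretokenized sentences from CoNLL-U text. It is not the code from train_pred_trainkit.py; the function name `read_pretokenized` and the sample sentence are hypothetical. It relies only on the CoNLL-U convention that multiword token lines carry a range ID such as `3-4` and empty nodes carry a decimal ID such as `5.1`; both are skipped so that only syntactic words are kept. It assumes that one list of word strings per sentence is the shape wanted for pretokenized input.

```python
from typing import Iterable, List


def read_pretokenized(lines: Iterable[str]) -> List[List[str]]:
    """Collect the syntactic words of each sentence from CoNLL-U lines.

    Hypothetical helper for illustration only.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." carry no tokens.
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Range IDs ("3-4") mark multiword tokens; decimal IDs ("5.1")
        # mark empty nodes. Neither is a syntactic word, so skip both.
        if "-" in token_id or "." in token_id:
            continue
        current.append(cols[1])  # FORM column
    if current:
        # Handle input that does not end with a blank line.
        sentences.append(current)
    return sentences


# Made-up French sample: "au" is a multiword token covering "à" + "le".
SAMPLE = (
    "# text = Je vais au marché\n"
    "1\tJe\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tvais\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3-4\tau\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tà\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "4\tle\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "5\tmarché\t_\t_\t_\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    print(read_pretokenized(SAMPLE.splitlines()))
```

On the sample, the range line `3-4 au` is dropped and the two words `à` and `le` are kept, so the sentence has five words rather than four.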

I hope this is helpful, and thanks a lot for developing trankit!