nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

KeyError: 'lemma' #48

Open Bachstelze opened 2 years ago

Bachstelze commented 2 years ago

Following the code from https://trankit.readthedocs.io/en/latest/training.html#training-a-lemmatizer, I get a KeyError: 'lemma':

Setting up training config...
Initialized lemmatizer trainer
Training dictionary-based lemmatizer

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-9-a90867cc5ef3> in <module>()
     11 
     12 # start training
---> 13 trainer.train()

3 frames

/content/trankit/trankit/tpipeline.py in train(self)
    680             self._train_posdep()
    681         elif self._task == 'lemmatize':
--> 682             self._train_lemma()
    683         elif self._task == 'ner':
    684             self._train_ner()

/content/trankit/trankit/tpipeline.py in _train_lemma(self)
    581 
    582     def _train_lemma(self):
--> 583         self._lemma_model.train()
    584 
    585     def _train_ner(self):

/content/trankit/trankit/models/lemma_model.py in train(self)
    379             self.config.logger.info("Training dictionary-based lemmatizer")
    380             self.trainer.train_dict(
--> 381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
    383                          type(token[ID]) == tuple and len(token[ID]) == 2)])

/content/trankit/trankit/models/lemma_model.py in <listcomp>(.0)
    381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
--> 383                          type(token[ID]) == tuple and len(token[ID]) == 2)])
    384             dev_preds = self.trainer.predict_dict(
    385                 [[token[TEXT], token[UPOS]] for sentence in self.dev_batch.doc for token in sentence if

KeyError: 'lemma'

The most recent version of https://github.com/UniversalDependencies/UD_Thai-PUD is used as the training and development data.
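
One way to check whether a treebank carries lemma annotations before training a lemmatizer is to count the LEMMA column directly. The sketch below relies only on the CoNLL-U layout (10 tab-separated columns, LEMMA in the third, `_` for an unspecified value). The function name `lemma_coverage` and the sample sentence are made up for illustration; nothing here comes from trankit.

```python
from collections import Counter


def lemma_coverage(conllu_lines):
    """Count word lines with and without a LEMMA value in CoNLL-U text."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            counts["malformed"] += 1
            continue
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        # "_" means "unspecified", unless the word form itself is "_".
        if lemma == "_" and form != "_":
            counts["missing"] += 1
        else:
            counts["present"] += 1
    return counts


# Hypothetical two-token sentence: one token has a lemma, one does not.
sample = [
    "# sent_id = 1",
    "\t".join(["1", "dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "_", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
]
result = lemma_coverage(sample)
print(dict(result))
```

To run it on a real file, pass an open file handle instead of `sample`, for example `lemma_coverage(open(path, encoding="utf-8"))` with `path` pointing at the training `.conllu` file. A large `missing` count would suggest the file has no lemma annotations to learn from.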

Bachstelze commented 2 years ago

There are no lemmas in the training data. So there can't be a lemmatizer?! Can't I use the other parts of the pipeline? When I run

from trankit import Pipeline
p = Pipeline(lang='customized', cache_dir='./save_dir')

the following error occurs:

BadZipFile: File is not a zip file

gcelano commented 3 months ago

I get the same error when trying to train the lemmatizer:

Setting up training config...
Initialized lemmatizer trainer
Training dictionary-based lemmatizer
Traceback (most recent call last):
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/custom_train00.py", line 15, in <module>
    trainer.train()
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/tpipeline.py", line 683, in train
    self._train_lemma()
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/tpipeline.py", line 584, in _train_lemma
    self._lemma_model.train()
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/models/lemma_model.py", line 381, in train
    [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/models/lemma_model.py", line 381, in <listcomp>
    [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
KeyError: 'lemma'
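
The failure mode in the traceback can be reproduced with plain dictionaries. This is a minimal sketch, assuming tokens are dicts and that `LEMMA` is the string `'lemma'` (inferred from the error message only). The token data and helper names are hypothetical, and skipping tokens without a lemma is shown only to illustrate the lookup, not as a confirmed fix for trankit.

```python
LEMMA = "lemma"  # assumed key name, inferred from "KeyError: 'lemma'"

# Hypothetical tokens: the second one has no lemma entry at all.
tokens = [
    {"id": 1, "text": "dogs", "upos": "NOUN", "lemma": "dog"},
    {"id": 2, "text": "bark", "upos": "VERB"},
]


def rows_strict(tokens):
    """Same lookup shape as the list comprehension in the traceback."""
    return [[t["text"], t["upos"], t[LEMMA]] for t in tokens]


def rows_skipping(tokens):
    """Variant that ignores tokens carrying no lemma key."""
    return [[t["text"], t["upos"], t[LEMMA]] for t in tokens if LEMMA in t]


def missing_key(tokens):
    """Return the key that the strict lookup fails on, or None."""
    try:
        rows_strict(tokens)
    except KeyError as err:
        return err.args[0]
    return None


print(missing_key(tokens))
print(rows_skipping(tokens))
```

If every token lacks the key, the skipping variant returns an empty list, so there would be nothing for a dictionary-based lemmatizer to learn from.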