nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0
724 stars 99 forks

lemmatize_doc ignores POS on pre-tokenized input. #86

Open maayanorner opened 4 months ago

maayanorner commented 4 months ago

https://github.com/nlp-uoregon/trankit/blob/f6b916b39decb3bc8b87b0c4bf100f3de41e3d24/trankit/pipeline.py#L980

The issue is visible in the code below: when `in_doc` is already tokenized (i.e. not a raw string), `_lemmatize_doc` skips `_posdep_doc` entirely, so the lemma model receives input without POS tags. This behavior is a bit unexpected, and I am not sure whether it is a bug or a design decision. The same happens in `_lemmatize_sent`.

    def _lemmatize_doc(self, in_doc, obmit_tag=False):  # assuming input is a document
        if type(in_doc) == str:  # in_doc is a raw string in this case
            in_doc = self._tokenize_doc(in_doc)
            in_doc = self._posdep_doc(in_doc)

        lemmatized_doc = self._lemma_model[self._config.active_lang].predict(in_doc, obmit_tag)

        gc.collect()
        return lemmatized_doc

Possible fix (perhaps guarded by a condition for input that has already been tagged, so existing tags are not overwritten):

    def _lemmatize_doc(self, in_doc, obmit_tag=False):  # assuming input is a document
        if type(in_doc) == str:  # in_doc is a raw string in this case
            in_doc = self._tokenize_doc(in_doc)
        in_doc = self._posdep_doc(in_doc)

        lemmatized_doc = self._lemma_model[self._config.active_lang].predict(in_doc, obmit_tag)

        gc.collect()
        return lemmatized_doc
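For the guard mentioned above, a helper like the following could decide whether `_posdep_doc` still needs to run. This is only a sketch: it assumes trankit's documented I/O shapes (a tagged document is a dict with a `sentences` list whose tokens are dicts carrying a `upos` key, while pre-tokenized input is a list of sentences, each a plain list of token strings); the helper name is hypothetical and not part of trankit.

```python
def doc_has_pos_tags(in_doc):
    """Heuristic: has this document already been POS-tagged?

    Assumed formats (based on trankit's documented output, not verified
    against every version): a tagged document is a dict with a
    'sentences' list whose tokens are dicts with a 'upos' key;
    pre-tokenized input is a list of sentences, each a list of
    plain token strings.
    """
    if isinstance(in_doc, dict):
        in_doc = in_doc.get('sentences', [])
    for sent in in_doc:
        tokens = sent['tokens'] if isinstance(sent, dict) else sent
        for tok in tokens:
            # Decide from the first token encountered.
            return isinstance(tok, dict) and 'upos' in tok
    return False
```

With this in place, the proposed fix could replace the unconditional call with `if not doc_has_pos_tags(in_doc): in_doc = self._posdep_doc(in_doc)`, tagging only when tags are actually missing.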