nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0
735 stars, 102 forks

Is it possible to extend Trankit for target language generation? #72

Open yash-srivastava19 opened 1 year ago

yash-srivastava19 commented 1 year ago

Hi!

I was curious whether Trankit can be extended (by modifying or adding components to the training pipeline) for a target-language generation task, for example morphological generation. I really like the Transformer-based approach, and wanted to ask whether it can be used for target-language (TL) generation. Any help on this would be really appreciated.

Apart from that, regarding the architecture given in the paper:

[Figure: Trankit architecture diagram from the paper]

Is it at all possible not to follow the sequential order shown there? Suppose I want to pass the output of the Lemmatizer to the NER module, or the output of the PosDep module to the NER module. Is it possible to do that without breaking the system? Any pointers would be really helpful.
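For readers with the same question, here is a minimal sketch of one way to chain the components by hand instead of relying on the built-in order. It assumes Trankit's documented `Pipeline.lemmatize` and `Pipeline.ner` calls, that `ner` accepts a pretokenized sentence with `is_sent=True`, and that the output dict has the `sentences` / `tokens` / `lemma` / `expanded` keys shown in the Trankit docs. The helper names `lemmas_per_sentence` and `ner_on_lemmas` are hypothetical and not part of Trankit. Note that the pretrained NER model was presumably trained on surface forms, so feeding it lemmas may hurt accuracy.

```python
from typing import Any, Dict, List


def lemmas_per_sentence(doc: Dict[str, Any]) -> List[List[str]]:
    """Collect one list of lemmas per sentence from a Trankit-style document dict.

    Assumed shape: {'sentences': [{'tokens': [{'text': ..., 'lemma': ...}, ...]}]}.
    Multiword tokens are assumed to carry their words under 'expanded'.
    A token without a 'lemma' key falls back to its surface text.
    """
    sentences = []
    for sent in doc["sentences"]:
        lemmas = []
        for tok in sent["tokens"]:
            # Use the expanded words of a multiword token if present.
            for word in tok.get("expanded", [tok]):
                lemmas.append(word.get("lemma", word["text"]))
        sentences.append(lemmas)
    return sentences


def ner_on_lemmas(pipeline: Any, text: str) -> List[Dict[str, Any]]:
    """Lemmatize raw text, then run NER on each sentence's lemma sequence."""
    lemmatized = pipeline.lemmatize(text)
    results = []
    for lemmas in lemmas_per_sentence(lemmatized):
        # is_sent=True marks the input as a single pretokenized sentence.
        results.append(pipeline.ner(lemmas, is_sent=True))
    return results


# Usage (requires trankit; models are downloaded on first run):
# from trankit import Pipeline
# p = Pipeline("english")
# print(ner_on_lemmas(p, "Cats were running in Paris."))
```

The same pattern would apply to routing PosDep output into NER: call `posdep` first, pull out whatever fields are needed, and pass a token list to `ner`. Anything beyond token strings (for example, feeding POS tags as features into the NER tagger) would need changes to the model code itself, which this sketch does not attempt.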

singhakr commented 1 year ago

I am part of the same team as Yash, who posted this question. We have been browsing through the code and have some clues about how it could be done, but so far we have not been able to put it all together. What we want to do is extend the customized-mwt-ner pipeline so that it can also be trained to do morphological inflection in context. Its input would be the output of the customized-mwt-ner pipeline after lexical transfer has been carried out on it, so that the lemmas are in a language different from the one the pipeline was run on. I guess the key part will be using an adapter.

If we are able to build this pipeline, we would also be happy to share it here.

Or, if that is too much work or not feasible for other reasons, could we use the model from the customized-mwt-ner pipeline for morphological inflection in context, using a fine-tuning or transfer-learning approach?

singhakr commented 1 year ago

We also want to do the second part of what Yash asked, but that is separate from the morphological inflection in context part.

yash-srivastava19 commented 1 year ago

In terms of implementation, this might mean building a custom pipeline that does the opposite of what the conventional pipelines in the toolkit do. Instead of going from a sentence to tokens and tags, it would go from tokens and tags back to a sentence. If this could be added as a feature, it would be really beneficial for machine translation tasks, which many others, like us, are probably planning to use it for.
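To make the reversed data flow discussed in this thread concrete, here is a rough sketch of the preparation step only: it takes analysed source tokens (lemma, UPOS, morphological features), applies lexical transfer on the lemmas, and produces (target lemma, tag string) pairs that an inflection model could consume. It does not use any Trankit generation API, and the inflection model itself is not shown. The function `build_inflection_inputs`, the toy `lexicon`, and the `;`-joined tag format are all assumptions made for illustration. The token keys `lemma`, `upos`, and `feats` follow the Trankit-style output shape.

```python
from typing import Any, Dict, List, Tuple


def build_inflection_inputs(
    sentence: Dict[str, Any], lexicon: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Turn analysed source tokens into (target lemma, tag string) pairs.

    `lexicon` is a hypothetical source-lemma -> target-lemma dictionary that
    stands in for the lexical transfer step; unknown lemmas are copied through.
    """
    examples = []
    for tok in sentence["tokens"]:
        src_lemma = tok.get("lemma", tok["text"])
        tgt_lemma = lexicon.get(src_lemma, src_lemma)
        # Join whichever of UPOS and morphological features are present.
        tags = ";".join(t for t in (tok.get("upos"), tok.get("feats")) if t)
        examples.append((tgt_lemma, tags))
    return examples


# Toy English -> Spanish example with a hand-written lexicon (illustrative only).
sentence = {
    "tokens": [
        {"text": "cats", "lemma": "cat", "upos": "NOUN", "feats": "Number=Plur"},
        {"text": "sleep", "lemma": "sleep", "upos": "VERB"},
    ]
}
lexicon = {"cat": "gato", "sleep": "dormir"}
pairs = build_inflection_inputs(sentence, lexicon)
print(pairs)  # [('gato', 'NOUN;Number=Plur'), ('dormir', 'VERB')]
```

Pairs like these could then serve as training or inference inputs for whatever inflection component is added, whether that is a new adapter-based head or a separately fine-tuned model.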