ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0
11.2k stars 1.19k forks

Refactor NgramTokenizer #4031

Open mhabedank opened 1 month ago

mhabedank commented 1 month ago

The NgramTokenizer uses torchtext. We want to remove torchtext as a dependency, so this tokenizer has to be refactored so that it no longer relies on it.

nqbao commented 1 month ago

If you can provide an example, I can help with the rest.

mhabedank commented 1 month ago

If we decide to replace the dependency, this would be about 5 lines of code: https://pytorch.org/text/stable/_modules/torchtext/data/utils.html#ngrams_iterator

torchtext is used here:

https://github.com/ludwig-ai/ludwig/blob/00c51e0a286c3fa399a07a550e48d0f3deadc57d/ludwig/utils/tokenizers.py#L142C1-L145C60

nqbao commented 4 weeks ago

Can we just copy the code over?

mhabedank commented 4 weeks ago

Yeah, that would probably be the solution for this tokenizer.
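
For reference, a minimal sketch of what a dependency-free replacement could look like. It is written to mirror the documented behavior of torchtext's `ngrams_iterator` (all single tokens first, then each higher-order n-gram joined with a space). The type hints and the standalone placement are assumptions for illustration; where the helper would live in Ludwig and how `NgramTokenizer` would call it are not settled in this thread.

```python
from typing import Iterator, List


def ngrams_iterator(token_list: List[str], ngrams: int) -> Iterator[str]:
    """Yield every token, then every n-gram for n = 2..ngrams, joined by spaces.

    Sketch of a torchtext-free stand-in; not the final Ludwig implementation.
    """
    # Unigrams first, in their original order.
    yield from token_list
    # Then each higher order. Zipping the list against shifted copies of
    # itself produces the sliding windows of length n.
    for n in range(2, ngrams + 1):
        for window in zip(*(token_list[i:] for i in range(n))):
            yield " ".join(window)


tokens = "here we are".split()
print(list(ngrams_iterator(tokens, 2)))
# ['here', 'we', 'are', 'here we', 'we are']
```

With a helper like this in place, the tokenizer would presumably only need to swap the torchtext import for the local function.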