huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Tokenizer for MyT5 Model #31260

Open tomlimi opened 3 months ago

tomlimi commented 3 months ago

Model description

MyT5 is a sister model of ByT5 trained on morphologically-derived byte sequences (MYTE). The model itself shares its implementation with T5ForConditionalGeneration, so only the addition of a custom MyT5Tokenizer is needed to run the model from Hugging Face.
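For context, ByT5-style tokenization operates directly on raw UTF-8 bytes (with a small offset for special tokens); MYTE replaces these raw byte sequences with morphologically derived ones, as described in the linked paper. A minimal sketch of the plain byte-level scheme, assuming ByT5's convention of reserving ids 0-2 for pad/eos/unk:

```python
def utf8_byte_ids(text: str, offset: int = 3) -> list[int]:
    # ByT5 reserves ids 0 (pad), 1 (eos), and 2 (unk), so each raw
    # UTF-8 byte value is shifted by 3 to produce its token id.
    return [b + offset for b in text.encode("utf-8")]

print(utf8_byte_ids("hi"))  # [107, 108]
```

The proposed MyT5Tokenizer would replace this one-byte-per-token mapping with MYTE's morphological byte sequences while keeping the same T5 model architecture.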

Open source status

Provide useful links for the implementation

The tokenizer implementation is available at: https://github.com/tomlimi/MYTE/tree/main/src/myt5

MYTE and MyT5 training are described in a research paper: https://arxiv.org/pdf/2403.10691

Model cards: https://huggingface.co/Tomlim/myt5-large

amyeroberts commented 3 months ago

cc @ArthurZucker

ArthurZucker commented 3 months ago

FYI @itazap