ahgraber closed this issue 6 months ago
Hey 🤗 thanks for opening an issue! We try to keep the github issues for bugs/feature requests. Could you ask your question on the forum instead? I'm sure the community will be of help!
Thanks!
Thanks for the polite redirect :)
I'd like to use portions of the tokenizer pipeline (Normalizer, Pre-tokenizer) separately for some initial preprocessing/cleaning, run some external functions for additional preprocessing, and then hand the result back to (a new?) tokenizer pipeline. Roughly:

- normalizer
- pre-tokenizer
- custom (non-tokenizer pipeline) functions
Is there a way to create a Tokenizer pipeline object that doesn't actually tokenize, i.e. one that only runs normalization and pre-tokenization? Or should I just call those components directly?
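For what it's worth, the `tokenizers` library does let you run the normalizer and pre-tokenizer components on their own, via `normalize_str` and `pre_tokenize_str`, without ever touching a model/vocab. A minimal sketch (the specific normalizers chosen here are just an example):

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace

# Build the pieces standalone -- no Tokenizer object, no vocabulary needed.
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
pre_tokenizer = Whitespace()

text = "Héllò, Wörld!"

# Step 1: normalization returns a plain string you can keep preprocessing.
cleaned = normalizer.normalize_str(text)   # "hello, world!"

# Step 2: pre-tokenization returns (piece, (start, end)) offset pairs.
pieces = pre_tokenizer.pre_tokenize_str(cleaned)
words = [w for w, _ in pieces]             # ["hello", ",", "world", "!"]
```

Since both methods work on plain strings, you can interleave your own custom functions between the normalization and pre-tokenization steps freely.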
Further, if I want to apply these steps to a Hugging Face Dataset, should I just map the function over the dataset?