huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Questions re: Tokenizer pipeline composability #1417

Closed · ahgraber closed this 6 months ago

ahgraber commented 6 months ago

I'd like to use portions of the tokenizer pipeline (Normalizer, PreTokenizer) on their own for some initial preprocessing/cleaning, run some external functions for additional preprocessing, and then hand the result back to (a new?) tokenizer pipeline. The full flow would be:

  1. normalizer
  2. pre-tokenizer
  3. custom (non-tokenizer-pipeline) functions
  4. tokenizer.normalizer
  5. tokenizer.pre-tokenizer
  6. tokenizer.tokenize

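For steps 4-6, the components attached to an existing Tokenizer are exposed as attributes and can be called individually. A minimal sketch (the checkpoint name is just an example; in the tokenizers library the final step is Tokenizer.encode, which runs the whole pipeline):

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-uncased")      # example checkpoint
normalized = tok.normalizer.normalize_str("Héllo world")  # step 4
pieces = tok.pre_tokenizer.pre_tokenize_str(normalized)   # step 5
encoding = tok.encode("Héllo world")                      # step 6, full pipeline
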
Is there a way to create a Tokenizer pipeline object that doesn't tokenize? Or should I just do something like

from tokenizers import normalizers, pre_tokenizers

nzr = normalizers.Sequence(...)
ptok = pre_tokenizers.Sequence(...)

def custom_fn(text: str):
    # custom preprocessing
    ...
    return text

cleaned = custom_fn(
    ptok.pre_tokenize_str(
        nzr.normalize_str(text)
    )
)
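
One caveat with chaining these directly: pre_tokenize_str returns a list of (substring, offsets) pairs rather than a plain string, so custom_fn would need a rejoining step first. A minimal sketch, assuming custom_fn expects plain text (the whitespace join is an assumption and discards the offsets):

# pre_tokenize_str yields e.g. [("Hello", (0, 5)), ("world", (6, 11))]
pieces = ptok.pre_tokenize_str(nzr.normalize_str(text))
cleaned = custom_fn(" ".join(piece for piece, _ in pieces))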

Further, if I want to apply these to a Hugging Face Dataset, should I just map the functions over the dataset?

from datasets import load_dataset
from tokenizers import normalizers, pre_tokenizers

my_ds = load_dataset(...)
nzr = normalizers.Sequence(...)
ptok = pre_tokenizers.Sequence(...)

# Dataset.map passes example dicts rather than raw strings, so wrap the
# calls; the "text" and "pieces" column names are placeholders
my_ds = my_ds.map(lambda ex: {"text": nzr.normalize_str(ex["text"])})
my_ds = my_ds.map(lambda ex: {"pieces": ptok.pre_tokenize_str(ex["text"])})
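
If throughput matters, a batched map may be worthwhile; with batched=True the function receives lists of values rather than single examples (same placeholder "text" column as above):

my_ds = my_ds.map(
    lambda batch: {"text": [nzr.normalize_str(t) for t in batch["text"]]},
    batched=True,
)
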
ArthurZucker commented 6 months ago

Hey 🤗 thanks for opening an issue! We try to keep GitHub issues for bugs and feature requests. Could you ask your question on the forum instead? I'm sure the community will be of help!

Thanks!

ahgraber commented 6 months ago

Thanks for the polite redirect :)