Open · normster opened 2 years ago

Is there a recommended way of using HuggingFace tokenizers inside ffcv pipelines? I realize I could pre-tokenize the text and store the raw ints in the dataset, but I'd like the flexibility of switching between different tokenizers without re-processing the dataset.
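For reference, the pre-tokenization route I'd like to avoid looks roughly like this: a minimal sketch assuming ffcv's documented NDArrayField/DatasetWriter API, where the field names, max length, and "bert-base-uncased" checkpoint are all illustrative. The downside is that the resulting .beton file is tied to one tokenizer and one sequence length.

```python
import numpy as np
from transformers import AutoTokenizer
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField

MAX_LEN = 128  # illustrative fixed sequence length
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

class TokenizedTexts:
    """Indexed dataset yielding one tuple of fixed-size int arrays per sample."""
    def __init__(self, texts):
        self.texts = texts
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, i):
        enc = tokenizer(self.texts[i], padding="max_length",
                        truncation=True, max_length=MAX_LEN)
        return (np.asarray(enc["input_ids"], dtype=np.int64),
                np.asarray(enc["attention_mask"], dtype=np.int64))

texts = ["first example sentence", "second example sentence"]
# Field names are illustrative; order must match the tuples from __getitem__.
writer = DatasetWriter("/tmp/texts.beton", {
    "input_ids": NDArrayField(dtype=np.dtype("int64"), shape=(MAX_LEN,)),
    "attention_mask": NDArrayField(dtype=np.dtype("int64"), shape=(MAX_LEN,)),
})
writer.from_indexed_dataset(TokenizedTexts(texts))
```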
Hello,

I don't have experience with them. Can you provide more information: are they cffi or pure Python modules?

I'm storing the textual metadata in a JSON field. Here is a quick tour of how the tokenizers work: https://huggingface.co/docs/transformers/preprocessing. They expect strings as input and output dictionaries of int arrays, either as PyTorch/TensorFlow tensors or as lists of Python ints. They work on both batched inputs (a list of strings) and single strings. They come in two varieties: a pure-Python version and a faster version that wraps an underlying Rust implementation. They run on the CPU, and I estimate that the Python version of the BERT tokenizer processes a sentence in roughly the time torchvision takes to apply standard ResNet-style augmentations to an image.
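For example, a minimal sketch of the interface ("bert-base-uncased" is just an illustrative checkpoint):

```python
from transformers import AutoTokenizer

# Loads the fast Rust-backed tokenizer by default when one is available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single string in, dict of int lists out.
encoded = tokenizer("Hello, world!")
print(encoded.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# Batched input with fixed-length PyTorch tensors out.
batch = tokenizer(
    ["Hello, world!", "A second, longer sentence."],
    padding="max_length",
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"].shape)       # torch.Size([2, 16]) -- the token ids
print(batch["attention_mask"].shape)  # same shape; 1 for real tokens, 0 for padding
```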
Do you need all three elements of the dict that the tokenizer returns?