Add vocab UDF from TorchText

pytorch / torcharrow

High performance model preprocessing library on PyTorch

https://pytorch.org/torcharrow/beta/index.html

BSD 3-Clause "New" or "Revised" License

Add vocab UDF from TorchText #287

Closed parmeet closed 2 years ago

parmeet commented 2 years ago

This PR adds the Vocab UDF from TorchText to TorchArrow.

Usage example:

import torcharrow as ta
import torcharrow._torcharrow as _ta
from torcharrow import functional as F
tokens = ["<unk>", "Hello", "world", "How", "are", "you!"]
# The second argument (0) is the default index, returned when an out-of-vocabulary (OOV) token is queried
vocab = _ta.Vocab(tokens, 0)
df = ta.dataframe(
    {
        "text": [["Hello", "world"], ["How", "are", "you!", "OOV"]]
    }
)
df["indices"] = F.lookup_indices(vocab, df["text"])
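
To make the intended lookup semantics concrete, here is a minimal pure-Python sketch of the behavior described in the usage example above. It is not TorchArrow's implementation: `SimpleVocab` and its `lookup_indices` method are hypothetical stand-ins, and the sketch assumes each token maps to its position in the token list while any unknown token maps to the default index.

```python
from typing import Dict, List, Sequence


class SimpleVocab:
    """Hypothetical stand-in for a vocab object; not the TorchArrow class."""

    def __init__(self, tokens: Sequence[str], default_index: int) -> None:
        # Assumption: each token's index is its position in the input list.
        self._index: Dict[str, int] = {tok: i for i, tok in enumerate(tokens)}
        self._default_index = default_index

    def lookup_indices(self, tokens: Sequence[str]) -> List[int]:
        # Unknown (OOV) tokens fall back to the default index.
        return [self._index.get(tok, self._default_index) for tok in tokens]


tokens = ["<unk>", "Hello", "world", "How", "are", "you!"]
vocab = SimpleVocab(tokens, 0)

# Same rows as the "text" column in the usage example.
rows = [["Hello", "world"], ["How", "are", "you!", "OOV"]]
indices = [vocab.lookup_indices(row) for row in rows]
print(indices)  # [[1, 2], [3, 4, 5, 0]]
```

Under these assumptions, "OOV" is not in the token list and so resolves to 0, the index of "<unk>".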