rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Byte Pair Encoding Tokenizer #9657

Open VibhuJawa opened 3 years ago

VibhuJawa commented 3 years ago

Is your feature request related to a problem? Please describe.

We should add a byte pair encoding tokenizer to cuDF. Just as our subword tokenizer adds a bridge to BERT-like models, a Byte Pair Encoding tokenizer is used by RoBERTa, GPT-2, and GPT-3 and would give us a bridge to a lot of DL models.

We should focus on porting a pre-trained tokenizer first.

Describe the solution you'd like

The implementation should follow the GPT-2 tokenizer but should be extendable to RoBERTa, GPT-3, Megatron, etc. We should follow the HuggingFace API for this.

Algorithm:

  1. Add an identifier (</w>) at the end of each word to mark word boundaries, then calculate the word frequencies in the text.
  2. Split each word into characters and then calculate the character frequencies.
  3. From the character tokens, for a predefined number of iterations, count the frequency of consecutive byte pairs and merge the most frequently occurring byte pair.
  4. Keep iterating until you have reached the iteration limit (set by you) or the token limit.
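
As a minimal pure-Python sketch of the training loop above (the learn_bpe name and the toy corpus are illustrative only; see the blog post linked below for a fuller treatment):

import re
from collections import Counter

def learn_bpe(corpus, num_merges):
    # Step 1: add the </w> end-of-word marker and count word frequencies.
    vocab = Counter()
    for line in corpus:
        for word in line.split():
            vocab[" ".join(word) + " </w>"] += 1

    merges = []
    for _ in range(num_merges):
        # Step 3: count consecutive symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # Step 4: stop early if nothing is left to merge.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair wherever it appears as adjacent symbols.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pattern.sub("".join(best), w): f for w, f in vocab.items()})
    return merges

print(learn_bpe(["low lower lowest newer newest"], num_merges=10))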

Ref: Link

Additional context

Best Explanation of Algorithm: https://leimao.github.io/blog/Byte-Pair-Encoding/

CC: @randerzander , @beckernick

VibhuJawa commented 2 years ago

We will probably need a libcudf implementation of the following BPE function (see the HF reference implementation).

Here, given the rank of each bigram, we repeatedly merge the best-ranked bigram based on the ranks provided in the merges file. Once we have that, we convert the result into token ids using the provided vocabulary.

Actual Algorithm:


def bpe(token, bpe_ranks):
    # Repeatedly merge the best-ranked adjacent symbol pair until no known merge remains.
    # if token in self.cache:
    #     return self.cache[token]
    word = tuple(token)
    pairs = get_pairs(word)

    if not pairs:
        return token

    while True:
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        #print(bigram)

        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)
    word = " ".join(word)
    #self.cache[token] = word
    return word

def get_pairs(word):
    """
    Return set of symbol pairs in a word.

    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

Example Call

# wget https://huggingface.co/gpt2/raw/main/merges.txt 
# to get this file

merges_file = 'gpt_2_tokenizer/merges.txt'
with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().split("\n")[1:-1]
bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

bpe("Thisisit", bpe_ranks)
'This is it'
VibhuJawa commented 2 years ago

CC: @davidwendt for awareness.

meghmak13 commented 2 years ago

There is a need for the aforementioned feature as we currently only support tokenization for BERT, especially considering that newer architectures like RoBERTa, GPT, and T5 are being adopted.

VibhuJawa commented 2 years ago

Basic Algo:

  1. Basic pre-processing like space cleanup, utf-8 decoding.
  2. Tokenize each sentence based on a delimiter
  3. Call BPE on each token to further tokenize it
  4. Find the numeric representation of each token in the provided vocabulary
  5. Pad according to the provided padding and return the input_ids, which are essentially the key lookups from the vocabulary table.
  6. Also return the attention_mask, which is a binary tensor indicating the positions of the padded indices so that the model does not attend to them.

Extra Notes: We will have to add support for things like padding and strides, similar to what we have for the subword tokenizer.
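
To make steps 2-6 above concrete, here is a rough pure-Python sketch that reuses the bpe function from the earlier comment (the encode_batch name is illustrative only, and the byte-to-unicode Ġ mapping the real GPT-2 tokenizer applies before BPE is skipped here):

def encode_batch(sentences, bpe_ranks, token_to_id, max_length, pad_id=0):
    # Illustrative sketch only; reuses the pure-Python bpe() defined earlier.
    input_ids, attention_masks = [], []
    for sentence in sentences:
        # Steps 2-3: split on whitespace, then run BPE on each token.
        sub_tokens = []
        for word in sentence.split():
            sub_tokens.extend(bpe(word, bpe_ranks).split(" "))
        # Step 4: look up each sub-token id (unknown tokens fall back to pad_id here).
        ids = [token_to_id.get(t, pad_id) for t in sub_tokens][:max_length]
        # Steps 5-6: pad to max_length and mark real-token positions in the attention mask.
        mask = [1] * len(ids) + [0] * (max_length - len(ids))
        ids = ids + [pad_id] * (max_length - len(ids))
        input_ids.append(ids)
        attention_masks.append(mask)
    return input_ids, attention_masks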

Python code to show this in action

from transformers import GPT2Tokenizer
import pandas as pd
import json

# !wget https://huggingface.co/gpt2/raw/main/vocab.json
# !wget https://huggingface.co/gpt2/raw/main/merges.txt
with open('vocab.json') as f:
    token_to_id = json.load(f)
    id_to_token = {v: k for k, v in token_to_id.items()}

text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
tokenizer = GPT2Tokenizer(vocab_file = 'vocab.json', merges_file='merges.txt')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
encoded_batch = tokenizer.batch_encode_plus(text_ser,
                                            return_tensors='np',
                                            truncation=True, 
                                            padding='max_length',
                                            max_length=12)

print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])

print("tokenizer-output-with-not=cleaned-up-special-token ", [id_to_token.get(i, '[PAD]') for i in encoded_batch['input_ids'][0]])
print("tokenizer-output-cleaned-up", [tokenizer.decode(i) for i in encoded_batch['input_ids'][0]])
print("Final Output of tokenizer: ", encoded_batch['input_ids'][0])

print("\n"+"*"*50+"\n")
print("Batched Output")
print("Final Output of tokenizer:\n", encoded_batch['input_ids'])
BPE output ['This', 'is', 'test - sent ence - 1']
tokenizer-output-with-not-cleaned-up-special-token  ['This', 'Ġis', 'Ġtest', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tokenizer-output-cleaned-up ['This', ' is', ' test', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Final Output of tokenizer:  [ 1212   318  1332    12 34086   594    12    16 50257 50257 50257 50257]

**************************************************

Batched Output
Final Output of tokenizer:
 [[ 1212   318  1332    12 34086   594    12    16 50257 50257 50257 50257]
 [ 1212   318  1332  6827    12    17 50257 50257 50257 50257 50257 50257]
 [ 1212    12   271  1332  6827   513 50257 50257 50257 50257 50257 50257]]

CC: @davidwendt

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

teju85 commented 2 years ago

Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.

davidwendt commented 2 years ago

> Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.

I've not worked on it yet but I hope to start on it in 22.04.

davidwendt commented 2 years ago

@VibhuJawa Some questions based on the examples given here. Do you want a BPE function that takes a host string (and the merge/rank table) and returns the BPE as a host string?

This shows passing in a word (a substring of a string) and returning its BPE, and then the Python code builds an array of BPE strings, one per token.

text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
...
print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])

The Thisisit example showed the same thing -- single host string returns a single host string.

I'm trying to understand the inputs and outputs from a cuDF use case. Are you expecting to give the libcudf BPE API a strings column of words and have it return the encoding of each as a strings column?

Or do I have this all wrong and you are expecting a libcudf API that does everything GPT2Tokenizer is doing in the last example above?
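
For reference, the first interpretation (a strings column of words in, their encodings out as a strings column) would behave roughly like this pandas illustration, reusing the pure-Python bpe() and bpe_ranks from the earlier comment (illustrative only, not a proposed libcudf signature):

import pandas as pd

# Assumes bpe() and bpe_ranks from the earlier comment are in scope.
words = pd.Series(["This", "is", "test-sentence-1"])
encoded = words.apply(lambda w: bpe(w, bpe_ranks))
# `encoded` is again a column of strings: one space-separated BPE result per input word.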

davidwendt commented 2 years ago

For reference: https://gist.github.com/VibhuJawa/8df50cd638d3d98f36109d8316dfa4ad

VibhuJawa commented 2 years ago

On the vocab front

I tried to verify whether we can indeed treat the vocab.json files similarly to how we treat the vocab in the subword tokenizer, and I think we can, but there are three main discrepancies I found.

Similarity: The vocab dict is a continuous range of ints mapping to tokens.

I verified that across the commonly used models the token->id dict can be treated as a list, as there are no missing ids (it is a continuous range), like the subword tokenizer vocabulary.

Verification reference: https://gist.github.com/VibhuJawa/1670178d07d9659a084a8fbe7d160d23
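
A minimal standalone check along those lines (file path illustrative) is:

import json

# e.g. from https://huggingface.co/gpt2/raw/main/vocab.json
with open("vocab.json") as f:
    token_to_id = json.load(f)

# The ids form a continuous 0..N-1 range, so the vocab can be stored as a flat
# list indexed by id, just like the subword-tokenizer vocabulary.
assert sorted(token_to_id.values()) == list(range(len(token_to_id)))
id_to_token = [None] * len(token_to_id)
for token, idx in token_to_id.items():
    id_to_token[idx] = token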

Discrepancy:

  1. Special Tokens: Most BPE models have these special tokens
    '<s>', '</s>', '<unk>', '<pad>', '<mask>'

    but can also include something like

    '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>',

while the subword one mostly has these:

[BOS],[EOS],[UNK],[SEP],[PAD],[CLS],[MASK];

I think it might make sense to make this configurable from the Python API, which we will initialize with the right defaults.

  2. Padding Token:

The padding token's id depends on the dictionary (the id of <pad>), so its value can change. We should ensure we handle that correctly.

I think (unsure) we currently just treat it as 0 in the subword tokenizer.
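
For example, the pad id could be resolved from the vocabulary rather than hard-coded (the '<pad>' name here is just one convention; some models add it as a special token instead):

# '<pad>' is the RoBERTa-style name; GPT-2 has no pad token unless one is added.
pad_id = token_to_id.get("<pad>", 0)  # fall back to 0 only if the vocab has no pad token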

  3. Treating space characters.

BPE seems to treat space characters differently. That is, "Hello world" and " Hello world" (with a leading space) get mapped differently.

When there is a space before the word, it gets mapped to ĠHello, and if there is no space, to Hello.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

id_to_token = {v: k for k, v in tokenizer.vocab.items()}

no_space_hello = "Hello world"
no_space_input_ids = tokenizer(no_space_hello, add_special_tokens=False)['input_ids']
print(no_space_input_ids)
print([id_to_token[i] for i in no_space_input_ids])
print("----"*10)
space_hello = " Hello world"
space_input_ids = tokenizer(space_hello, add_special_tokens=False)['input_ids']
print(space_input_ids)
print([id_to_token[i] for i in space_input_ids])
[31414, 232]
['Hello', 'Ġworld']
----------------------------------------
[20920, 232]
['ĠHello', 'Ġworld']

On getting a testable example to you.

Sorry, getting a meaningful end-to-end Python example that works across models is turning out to be tougher than I anticipated, but I will update here once I have it working.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

BartleyR commented 2 years ago

We have a potential Morpheus customer who wants to use the phishing detection pipeline but in a non-English language. So we'd have to replace the BERT model with something else, and it would need a BPE tokenizer. We can do a POC using a CPU-based tokenizer, but it would be good to scope this if we can for an upcoming release. @GregoryKimball for viz

GregoryKimball commented 1 year ago

This request is still relevant. After discussing with @VibhuJawa, the next step is benchmarking a GPT-3 style training workflow and measuring the percentage of time spent in tokenization. If tokenization is 15-30% of the total time (as we see with BERT), then this is worth prioritizing. Otherwise we should recommend tokenization with HuggingFace.
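
A rough way to measure that percentage (illustrative only; a real benchmark would time the actual GPT-3 style training loop) is to time the HuggingFace tokenizer separately from the training step that consumes the same batch:

import time
from transformers import GPT2Tokenizer

# Illustrative; reuses the vocab.json / merges.txt files from the earlier wget commands.
tokenizer = GPT2Tokenizer(vocab_file="vocab.json", merges_file="merges.txt")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

texts = ["This is test-sentence-1"] * 10_000  # stand-in for a shard of training text

start = time.perf_counter()
tokenizer.batch_encode_plus(texts, truncation=True, padding="max_length", max_length=128)
tokenize_time = time.perf_counter() - start

# Compare against the measured wall-clock time of the training step(s) that consume
# this batch; tokenize_time / total_time gives the share spent in tokenization.
print(f"tokenization time: {tokenize_time:.2f}s")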