VibhuJawa opened this issue 3 years ago
We will probably need a libcudf implementation of the following BPE function (see the HF reference implementation).
Here, given the rank of each bigram, we repeatedly merge the best-ranked bigram according to the ranks provided in the merges file. Once we have the merged pieces, we convert them into token ids using the vocabulary provided.
```python
def bpe(token, bpe_ranks):
    """
    Merge the characters of `token` using the given bigram ranks.

    Repeatedly merges the adjacent pair with the best (lowest) rank in
    `bpe_ranks` until no rankable pair remains, then returns the merged
    symbols joined by spaces.
    """
    # if token in self.cache:
    #     return self.cache[token]
    word = tuple(token)
    pairs = get_pairs(word)
    if not pairs:
        return token
    while True:
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        # print(bigram)
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)
            # print(pairs)
    word = " ".join(word)
    # self.cache[token] = word
    return word


def get_pairs(word):
    """
    Return set of symbol pairs in a word.

    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs
```
```python
# wget https://huggingface.co/gpt2/raw/main/merges.txt to get this file
merges_file = 'gpt_2_tokenizer/merges.txt'
with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().split("\n")[1:-1]
bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

bpe("Thisisit", bpe_ranks)
```
```
'This is it'
```
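To cover the second half of the description above (converting the merged pieces to token ids), here is a small sketch using the matching GPT-2 vocab.json; the file path is an assumption:

```python
import json

# wget https://huggingface.co/gpt2/raw/main/vocab.json
with open('gpt_2_tokenizer/vocab.json', encoding="utf-8") as vocab_handle:
    token_to_id = json.load(vocab_handle)

# bpe() returns a space-separated string of merged symbols;
# each symbol is then looked up in the vocabulary to get its id.
merged = bpe("Thisisit", bpe_ranks)            # 'This is it'
input_ids = [token_to_id[piece] for piece in merged.split(" ")]
print(input_ids)
```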
CC: @davidwendt for awareness.
There is a need for this feature because we currently only support tokenization for BERT, especially as newer architectures like RoBERTa, GPT, and T5 are being adopted.

Basic Algo:

The tokenizer should return:
- input_ids, which are essentially the key lookup from the vocabulary table
- attention_masks, which are a binary tensor indicating the position of the padded indices so that the model does not attend to them.

Extra Notes: We will have to add stuff like padding and strides similar to what we have for the subword tokenizer. A small sketch of the expected shape of these two outputs follows below.
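As a rough illustration (a sketch, not part of the original request; the token ids are made up and the pad id matches the [PAD] token added in the GPT-2 example below), this is the kind of input_ids / attention_mask pair the tokenizer would be expected to produce for a padded batch:

```python
import numpy as np

# Hypothetical ids produced by BPE + vocab lookup for two sequences (made-up values).
sequences = [[1212, 318, 1332, 12, 16], [1212, 318, 1332]]
pad_id = 50257          # id of the [PAD] token added in the GPT-2 example below
max_length = 8

input_ids = np.full((len(sequences), max_length), pad_id, dtype=np.int64)
attention_mask = np.zeros((len(sequences), max_length), dtype=np.int64)
for row, seq in enumerate(sequences):
    seq = seq[:max_length]              # truncation
    input_ids[row, :len(seq)] = seq     # left-aligned tokens, padded on the right
    attention_mask[row, :len(seq)] = 1  # 1 = real token, 0 = padding

print(input_ids)
print(attention_mask)
```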
Python code to show this in action:

```python
from transformers import GPT2Tokenizer
import pandas as pd
import json

# !wget https://huggingface.co/gpt2/raw/main/vocab.json
# !wget https://huggingface.co/gpt2/raw/main/merges.txt

with open('vocab.json') as f:
    token_to_id = json.load(f)
id_to_token = {v: k for k, v in token_to_id.items()}

text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]

tokenizer = GPT2Tokenizer(vocab_file='vocab.json', merges_file='merges.txt')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

encoded_batch = tokenizer.batch_encode_plus(
    text_ser,
    return_tensors='np',
    truncation=True,
    padding='max_length',
    max_length=12,
)

print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])
print("tokenizer-output-with-not=cleaned-up-special-token ",
      [id_to_token.get(i, '[PAD]') for i in encoded_batch['input_ids'][0]])
print("tokenizer-output-cleaned-up", [tokenizer.decode(i) for i in encoded_batch['input_ids'][0]])
print("Final Output of tokenizer: ", encoded_batch['input_ids'][0])
print("\n" + "*" * 50 + "\n")
print("Batched Output")
print("Final Output of tokenizer:\n", encoded_batch['input_ids'])
```
```
BPE output ['This', 'is', 'test - sent ence - 1']
tokenizer-output-with-not=cleaned-up-special-token  ['This', 'Ġis', 'Ġtest', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tokenizer-output-cleaned-up ['This', ' is', ' test', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Final Output of tokenizer:  [ 1212 318 1332 12 34086 594 12 16 50257 50257 50257 50257]

**************************************************

Batched Output
Final Output of tokenizer:
 [[ 1212 318 1332 12 34086 594 12 16 50257 50257 50257 50257]
  [ 1212 318 1332 6827 12 17 50257 50257 50257 50257 50257 50257]
  [ 1212 12 271 1332 6827 513 50257 50257 50257 50257 50257 50257]]
```
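Not shown in the output above, but the same encoded_batch also carries the attention_masks mentioned earlier:

```python
# Binary mask: 1 where input_ids holds a real token, 0 where it holds the [PAD] id.
print(encoded_batch['attention_mask'])
```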
CC: @davidwendt
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.
> Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.
I've not worked on it yet but I hope to start on it in 22.04.
@VibhuJawa Some questions based on the examples given here. Do you want a BPE function that takes a host string (and the merge/rank table) and returns the BPE as a host string?

This shows passing in a word (a substring of a string) and returning the BPE, and then the Python code builds an array of BPE strings from each token:

```python
text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
...
print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])
```

The Thisisit example showed the same thing: a single host string in, a single host string out.

I'm trying to understand the inputs and outputs from a cudf use case. Are you expecting to give the libcudf BPE API a strings column of words and return the encoding of each as a strings column? Or do I have this all wrong, and are you expecting a libcudf API that does everything GPT2Tokenizer is doing in the last example above?
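To make the column-in/column-out reading concrete, a small illustration (a sketch only, reusing the bpe/bpe_ranks objects from the first comment, with a pandas Series standing in for a strings column):

```python
import pandas as pd

# A strings column of words (pandas is only a stand-in for a cudf/libcudf strings column).
words = pd.Series(["This", "is", "test-sentence-1"])

# Element-wise BPE: each word comes back as its space-separated merge pieces.
encoded = words.apply(lambda w: bpe(w, bpe_ranks))
print(encoded.tolist())
```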
On the vocab front

I tried to verify whether we can treat the vocab.json files similar to how we treat vocab in the subword tokenizer, and I think we can, but there are three main discrepancies I found.

Similarity: The vocab dict maps tokens to a continuous range of ints. I verified that across the commonly used models the token->id dict can be treated as a list, as there are no missing ids (it is a continuous range), just like the subword tokenizer vocabulary. See this gist for the verification reference: https://gist.github.com/VibhuJawa/1670178d07d9659a084a8fbe7d160d23
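A minimal version of that check (a sketch; the gist above has the full verification across models):

```python
import json

with open('vocab.json') as f:           # e.g. the GPT-2 vocab.json downloaded earlier
    token_to_id = json.load(f)

ids = sorted(token_to_id.values())
assert ids == list(range(len(ids)))     # no gaps: ids are exactly 0 .. len(vocab)-1

# Because the range is contiguous, the dict can be flattened into a plain list
# indexed by id, the same way the subword tokenizer treats its vocabulary.
id_to_token_list = [None] * len(token_to_id)
for token, idx in token_to_id.items():
    id_to_token_list[idx] = token
```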
Discrepancy:

1. Special tokens: These usually include '<s>', '</s>', '<unk>', '<pad>', '<mask>', but can also include something like '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', while the subword one mostly has these: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], [MASK]. I think it might make sense to make this configurable from the Python API, which we will initialize with the right defaults.
2. Padding token: The padding token's id depends on the dictionary (it is the id of <pad>), so its value can change from model to model. We should ensure we handle that correctly. I think (unsure) we just treat it as 0 currently in the subword tokenizer.
3. Space handling: BPE seems to treat space characters differently. That is, "Hello world" and " Hello world" get mapped differently. When there is a space before the word it gets mapped to ĠHello, and if there is no space, to Hello.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
id_to_token = {v: k for k, v in tokenizer.vocab.items()}

no_space_hello = "Hello world"
no_space_input_ids = tokenizer(no_space_hello, add_special_tokens=False)['input_ids']
print(no_space_input_ids)
print([id_to_token[i] for i in no_space_input_ids])

print("----" * 10)

space_hello = " Hello world"
space_input_ids = tokenizer(space_hello, add_special_tokens=False)['input_ids']
print(space_input_ids)
print([id_to_token[i] for i in space_input_ids])
```
```
[31414, 232]
['Hello', 'Ġworld']
----------------------------------------
[20920, 232]
['ĠHello', 'Ġworld']
```
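For context on where the Ġ prefix comes from: GPT-2/RoBERTa byte-level BPE first remaps every byte to a printable unicode character (the bytes_to_unicode helper in the HF/GPT-2 reference code), and the space byte lands on Ġ. A minimal sketch of that mapping:

```python
# Bytes that are not in the "printable" set are shifted up by 256, in order.
# Bytes 0x00-0x20 are the first 33 such bytes, so the space byte (0x20) maps to
# chr(256 + 32), which is why a leading space shows up as the Ġ prefix.
print(chr(256 + 32))        # Ġ
```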
On getting a testable example to you: sorry for the delay on a meaningful end-to-end Python example that works across models. It is turning out to be tougher than I anticipated, but I will update here once I have it working.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
We have a potential Morpheus customer who wants to use the phishing detection pipeline but in a non-English language. So we'd have to replace the BERT model with something else, and it would need a BPE tokenizer. We can do a POC using a CPU-based tokenizer, but would be good to scope this if we can for an upcoming release. @GregoryKimball for viz
This request is still relevant. After discussing with @VibhuJawa, the next step is benchmarking a GPT-3 style training workflow and measuring the percentage of time spent in tokenization. If tokenization is 15-30% of the total time (as we see in bert), then this is worth prioritizing. Otherwise we should recommend tokenization with HuggingFace.
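A rough way to get the tokenization-side number (a sketch only; the corpus, sequence length, and tokenizer settings are placeholders, and the denominator has to come from the real training loop):

```python
import time
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 ships without a pad token

texts = ["This is test sentence %d" % i for i in range(10_000)]   # placeholder corpus

start = time.perf_counter()
tokenizer(texts, truncation=True, padding="max_length", max_length=128)
tokenize_seconds = time.perf_counter() - start
print(f"tokenization: {tokenize_seconds:.2f}s")

# Compare tokenize_seconds against the end-to-end step time of the GPT-3 style
# training workflow; if it is in the 15-30% range seen for BERT, prioritize this issue.
```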
Is your feature request related to a problem? Please describe.
We should add a byte pair encoding (BPE) tokenizer to cuDF. Just like our subword tokenizer adds a bridge to BERT models, a byte pair encoding tokenizer is used by roberta, gpt-2, gpt-3 and will give us a bridge to a lot of DL models. We should focus on porting a pre-trained tokenizer first.
Describe the solution you'd like

The implementation should follow the GPT-2 tokenizer but should be extendable to roberta, gpt-3, megatron, etc. We should follow the HuggingFace API for this.

Algorithm: append a stop token (</w>) at the end of each word to identify the end of a word, and then calculate the word frequency in the text. Ref: Link
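To make the referenced training procedure concrete, here is a minimal sketch of the merge-learning loop (a condensed version of the standard algorithm described in the reference below, not code from this issue):

```python
import re
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Split every word into characters plus a </w> stop symbol; counts are word frequencies.
    vocab = Counter(" ".join(list(word) + ["</w>"]) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the best pair everywhere it appears as two adjacent symbols.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        new_vocab = Counter()
        for word, freq in vocab.items():
            new_vocab[pattern.sub("".join(best), word)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_merges(["low", "lower", "lowest", "low"], num_merges=5))
```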
Additional context
Best Explanation of Algorithm: https://leimao.github.io/blog/Byte-Pair-Encoding/
CC: @randerzander, @beckernick