rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support [CLS] [SEP] for subword_tokenize to handle correctly #6937

Closed shangw-nvidia closed 3 years ago

shangw-nvidia commented 3 years ago

Hi @VibhuJawa ,

As we discussed in the chat, it seems there is a problem where subword_tokenize does not handle special tokens (e.g., [CLS], [SEP]) correctly, and I'm creating this GitHub issue to track it.

Thanks! Shang

davidwendt commented 3 years ago

Is this different from #5765? Maybe that one could be updated instead?

VibhuJawa commented 3 years ago

@davidwendt , could we close that one and use this one instead? That one was based on the hashing logic (where the problem might lie), which has since been upstreamed. Anyway, here is a minimal example of the issue:

Minimal Example

Helper function to create vocab + text

import cudf

# write a tiny test vocabulary, one token per line
with open('test_vocab.txt', 'w') as f:
    string = '[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nclschar\nsepchar\nmsk_char\ni\nate\ndinner\nit\nwas\nyummy\n.'
    f.write(string)

import numpy as np

def create_vocab_table(vocabpath):
    """
    Create vocabulary tables from the vocab.txt file

    Parameters
    ----------
    vocabpath: path of the vocabulary file

    Returns
    -------
    id2vocab: np.array mapping integer ids to token strings
    vocab2id: dict mapping token strings to integer ids
    """
    id2vocab = []
    vocab2id = {}
    with open(vocabpath) as f:
        for index, line in enumerate(f):
            token = line.split()[0]
            id2vocab.append(token)
            vocab2id[token] = index
    return np.array(id2vocab), vocab2id

id2vocab, vocab2int = create_vocab_table('test_vocab.txt')

from cudf.utils.hash_vocab_utils import hash_vocab
hash_vocab('test_vocab.txt', 'vocab-hash.txt')
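
For reference, this is the id each vocab entry gets (handy for following the remapping in the work-around further down; the listing below is just read off the vocab file above):

for token, idx in vocab2int.items():
    print(idx, token)
# 0 [PAD]   1 [UNK]   2 [CLS]    3 [SEP]  4 [MASK]
# 5 clschar 6 sepchar 7 msk_char
# 8 i  9 ate  10 dinner  11 it  12 was  13 yummy  14 .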

Minimal Example: (The encoding of [CLS], [SEP] is off)

text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
[ 1  1  1  8  9 10 14  1  1  1 11 12 13 14  1  1  1]
['[UNK]' '[UNK]' '[UNK]' 'i' 'ate' 'dinner' '.' '[UNK]' '[UNK]' '[UNK]'
 'it' 'was' 'yummy' '.' '[UNK]' '[UNK]' '[UNK]']

Expected output

If we switch the special symbols to regular (non-special) vocabulary words, the problem goes away. Below is a work-around for the current issue.

text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
cudf_ser = cudf_ser.str.replace(['[CLS]', '[SEP]'], ['clschar ', ' sepchar '], regex=False)
cudf_ser = cudf_ser.str.normalize_spaces()
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
### replace all occurrences of clschar (id 5) with the true [CLS] id (2)
tokens[tokens==5]=2
### replace all occurrences of sepchar (id 6) with the true [SEP] id (3)
tokens[tokens==6]=3
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])

[ 2  8  9 10 14  3 11 12 13 14  3  0  0  0  0  0  0]
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' 'it' 'was' 'yummy' '.' '[SEP]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']
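
For anyone who needs this in the meantime, the work-around can be wrapped into a small helper. This is only a rough sketch around the calls shown above; the function name and the placeholder/id maps are illustrative, not part of cuDF:

def subword_tokenize_with_specials(ser, hash_file, placeholder_map, id_map, **kwargs):
    # placeholder_map: special token -> ordinary placeholder word,
    #                  e.g. {'[CLS]': 'clschar ', '[SEP]': ' sepchar '}
    # id_map: placeholder token id -> true special token id, e.g. {5: 2, 6: 3}
    ser = ser.str.replace(list(placeholder_map.keys()),
                          list(placeholder_map.values()), regex=False)
    ser = ser.str.normalize_spaces()
    tokens, masks, metadata = ser.str.subword_tokenize(hash_file, **kwargs)
    for placeholder_id, true_id in id_map.items():
        tokens[tokens == placeholder_id] = true_id
    return tokens, masks, metadata

tokens, masks, metadata = subword_tokenize_with_specials(
    cudf.Series([text]), 'vocab-hash.txt',
    {'[CLS]': 'clschar ', '[SEP]': ' sepchar '}, {5: 2, 6: 3},
    do_lower=True, do_truncate=False)
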
davidwendt commented 3 years ago

There is no code in the subword tokenizer implementation that looks for these special tokens. So this would be a feature request.

[CLS]I ate dinner.[SEP]It was yummy.[SEP]
is tokenized into (after lower-casing):
[ cls ] i ate dinner .  [ sep ] it was yummy .
1 1   1 8 9   10     14 1 1   1 11 12  13    14

The bracket characters '[' and ']' are categorized as pad-with-space, probably so that words inside brackets are properly parsed/tokenized. What are the rules here?
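
For example, feeding just the special token text through the tokenizer with the test vocab above shows this (a quick check using the earlier setup; the expected output simply mirrors the trace above):

cls_only = cudf.Series(['[CLS]'])
tokens, _, _ = cls_only.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
print(id2vocab[tokens[0:3].get()])
# '[', 'cls' and ']' are looked up separately; none are in the vocab, so:
# ['[UNK]' '[UNK]' '[UNK]']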

VibhuJawa commented 3 years ago
  • Are the special tokens always 3 upper-case characters in brackets [XYZ]?

No, I don't think that is a safe assumption. This can be configured per vocabulary, but the convention is to use them like that.

  • Is there a finite set of special tokens?

In most cases we have a finite set (see below, from link), but this can be configurable. See the additional_special_tokens argument.

bos_token (str or tokenizers.AddedToken, optional) – A special token representing the beginning of a sentence. 

eos_token (str or tokenizers.AddedToken, optional) – A special token representing the end of a sentence. 

unk_token (str or tokenizers.AddedToken, optional) – A special token representing an out-of-vocabulary token.

sep_token (str or tokenizers.AddedToken, optional) – A special token separating two different sentences in the same input (used by BERT for instance). 

pad_token (str or tokenizers.AddedToken, optional) – A special token used to make arrays of tokens the same size for batching purpose.

cls_token (str or tokenizers.AddedToken, optional) – A special token representing the class of the input (used by BERT for instance). 

mask_token (str or tokenizers.AddedToken, optional) – A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT).
  • Should the code just always treat text [*] as a single token? This seems like it would be a significant change if anyone is relying on the current behavior.

No, it really should not.

So this would be a feature request.

Gotcha, thanks for explaining that. Yes, then this will be a feature request.

Behaviour for these special tokens:

I believe the requested behavior is that we don't tokenize/lowercase these special tokens and skip any pre-processing on them, so that they pick up the right token_ids. This, I believe, follows what Hugging Face does.

See link and link.
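
For comparison, here is roughly what the Hugging Face tokenizer does with the same input (a sketch assuming the transformers package and the stock bert-base-uncased vocab; exact ids depend on the vocabulary used):

from transformers import BertTokenizer

hf_tok = BertTokenizer.from_pretrained('bert-base-uncased')
text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
ids = hf_tok.encode(text, add_special_tokens=False)
print(hf_tok.convert_ids_to_tokens(ids))
# expected: ['[CLS]', 'i', 'ate', 'dinner', '.', '[SEP]', 'it', 'was', 'yummy', '.', '[SEP]']
# the special tokens are matched verbatim instead of being lower-cased or split on the brackets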

Initial Solution:

I think just providing support for the above-mentioned 7 tokens with appropriate defaults will cover most use cases, so if handling an arbitrary list of special tokens is extra work we can probably skip it for now.

CC: @raykallen and @BartleyR, in case they have any use cases that need more than the above 7 special tokens.

davidwendt commented 3 years ago

The solution I chose for this in #7254 was to hardcode recognizing the following 7 special tokens:

[BOS] [EOS] [UNK] [SEP] [PAD] [CLS] [MASK]

These can appear anywhere in the string and may be upper or lower case. If the provided vocab hash includes these tokens, they will be assigned their ids appropriately; otherwise they will be assigned the [UNK] token value.
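
In rough Python pseudocode, the lookup rule described above amounts to something like this (just a sketch of the behaviour, not the actual implementation in #7254):

SPECIAL_TOKENS = {'[BOS]', '[EOS]', '[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'}

def resolve_special_token(token_text, vocab2id, unk_id):
    # match the 7 hardcoded special tokens case-insensitively
    if token_text.upper() in SPECIAL_TOKENS:
        # use the id from the provided vocab hash if present, otherwise fall back to [UNK]
        return vocab2id.get(token_text.upper(), unk_id)
    return None  # not a special token; tokenized normally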

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

davidwendt commented 3 years ago

@VibhuJawa Can we close this? We can reopen it if the solution in #7254 mentioned above is not adequate.

VibhuJawa commented 3 years ago

@davidwendt, Yup this is good to close. Thanks for your work on this.