microsoft / CodeXGLUE

MIT License

How does codegpt's BPE tokenizer process whitespaces in code completion task? #42

Closed HaoboGu closed 3 years ago

HaoboGu commented 3 years ago

Hi there,

I'm trying to use CodeGPT for the code completion task. I tried the BPE tokenizer, but I found that BPE separates the raw source code by whitespace only: https://github.com/rsennrich/subword-nmt/blob/823c880e4bfc4fce5359b8ea87cc14fcf8a60dc7/subword_nmt/get_vocab.py#L40

In source code, there are more separators, such as ., ;, etc.

So my question is: did you treat whitespace as the only separator in CodeGPT? If not, is whitespace regarded as a single token, just like <s> and <EOL>?

celbree commented 3 years ago

For the code completion task, we tokenize the code before feeding it to the BPE tokenizer, which means, e.g., a=func(b); will be tokenized into the token sequence a = func ( b ) ;. In this format, whitespace is the only separator.
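As an illustration of that preprocessing step (a minimal sketch, not the repository's actual script), Python's built-in tokenize module can turn source code into the space-separated form, so that whitespace becomes the only separator before BPE:

```python
import io
import tokenize

def space_separate(code):
    """Sketch: split a line of Python source into lexical tokens and
    rejoin them with single spaces, e.g. 'a=func(b);' -> 'a = func ( b ) ;'.
    Layout tokens (newlines, end marker) are dropped."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

print(space_separate("a=func(b);"))  # → a = func ( b ) ;
```

This is only meant to show the idea; the real preprocessing in CodeXGLUE also handles literals, indentation, and multi-line code.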

HaoboGu commented 3 years ago

Thanks for the reply! In your example the token sequence is ['a', '=', 'func', '(', 'b', ')', ';'], so if I understand correctly, the sequence fed to BPE contains no whitespace. But in the completion task, the generated code should look like a = func(b);, which does contain whitespace. How do you deal with that?

celbree commented 3 years ago

Actually, when we use a BPE tokenizer, e.g. a huggingface-style tokenizer, we feed it the whole code string rather than one token at a time. For example, tokenizer.tokenize("a = func ( b ) ;") returns ['a', 'Ġ=', 'Ġfunc', 'Ġ(', 'Ġb', 'Ġ)', 'Ġ;']. For another example, tokenizer.tokenize("a = func(b);") returns ['a', 'Ġ=', 'Ġfunc', '(', 'b', ');']. You may notice the special sub-token Ġ, which represents the whitespace (the separator).
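The Ġ convention above can be sketched without the full tokenizer. In GPT-2-style byte-level BPE, pre-tokenization keeps each leading space attached to the piece that follows it, and that space is rendered as the visible marker Ġ. A minimal illustration (this only mimics the space handling; the real tokenizer's regex and BPE merges also split pieces like func(b); further):

```python
import re

def pretokenize(text):
    """Illustrative GPT-2-style space handling: split on whitespace
    boundaries, keeping each leading space attached to the following
    piece, then render that space as the marker 'Ġ'."""
    pieces = re.findall(r" ?\S+", text)
    return [p.replace(" ", "Ġ", 1) for p in pieces]

print(pretokenize("a = func ( b ) ;"))
# ['a', 'Ġ=', 'Ġfunc', 'Ġ(', 'Ġb', 'Ġ)', 'Ġ;']
```

Because the marker is preserved in the sub-tokens, the original spacing can be recovered at detokenization time by mapping Ġ back to a space.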

BTW, in our code completion task we tokenize the code first, so don't worry that a = func ( b ) ; and a = func(b); are the same thing in different formats: we have already tokenized everything into the first format during preprocessing.

HaoboGu commented 3 years ago

@celbree Thanks for the explanation