Closed · HaoboGu closed this issue 3 years ago
Hi there,
I'm trying to use CodeGPT for the code completion task. I tried a BPE tokenizer, but found that BPE separates the raw source code by whitespace: https://github.com/rsennrich/subword-nmt/blob/823c880e4bfc4fce5359b8ea87cc14fcf8a60dc7/subword_nmt/get_vocab.py#L40
In source code there are more separators, such as `.`, `,`, `;`, etc. So my question is: did you consider whitespace as the only separator in CodeGPT? If not, is whitespace regarded as a single token, just like `<s>` and `<EOL>`?
For the code completion task, we tokenize the code first, before feeding it to the BPE tokenizer. This means that, e.g., `a=func(b);` will be turned into the token sequence `a = func ( b ) ;`. In this format, whitespace is the only separator.
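For illustration, this kind of pre-tokenization can be sketched with Python's built-in `tokenize` module, assuming the source code is Python (the actual preprocessing scripts in the repo may differ in detail):

```python
import io
import tokenize

def pretokenize(code: str) -> str:
    """Split Python source into language-level tokens joined by single spaces."""
    # Structural tokens carry no text we want in the output sequence.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = [tok.string
              for tok in tokenize.generate_tokens(io.StringIO(code).readline)
              if tok.type not in skip]
    return " ".join(tokens)

print(pretokenize("a=func(b);"))  # -> a = func ( b ) ;
```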
Thanks for the reply! In your example, the token sequence is `['a', '=', 'func', '(', 'b', ')', ';']`; if I understand correctly, the token sequence fed to BPE contains no whitespace. But in the completion task, the generated code should look like `a = func(b);`, which does contain whitespace. How do you deal with that?
Actually, when we use a BPE tokenizer, e.g., a Hugging Face-style tokenizer, we feed it the whole code rather than one token at a time. For example,

`tokenizer.tokenize("a = func ( b ) ;")`

returns:

`['a', 'Ġ=', 'Ġfunc', 'Ġ(', 'Ġb', 'Ġ)', 'Ġ;']`

For another example,

`tokenizer.tokenize("a = func(b);")`

returns:

`['a', 'Ġ=', 'Ġfunc', '(', 'b', ');']`

You may notice the special sub-token `Ġ`, which represents the whitespace (the separator).
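A minimal runnable version of the above, assuming the `transformers` library and the stock `gpt2` vocabulary (CodeGPT trains its own BPE vocabulary, so the exact sub-tokens may differ):

```python
from transformers import GPT2Tokenizer

# Stock GPT-2 vocabulary, used here for illustration only.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Pre-tokenized input: every code token is separated by a single space,
# so each sub-token after the first carries a leading 'Ġ'.
print(tokenizer.tokenize("a = func ( b ) ;"))
# e.g. ['a', 'Ġ=', 'Ġfunc', 'Ġ(', 'Ġb', 'Ġ)', 'Ġ;']

# Raw input: 'func(b);' has no internal spaces, so no 'Ġ' on those sub-tokens.
print(tokenizer.tokenize("a = func(b);"))
# e.g. ['a', 'Ġ=', 'Ġfunc', '(', 'b', ');']
```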
BTW, in our code completion task, we tokenize the code first, so don't worry that `a = func ( b ) ;` and `a = func(b);` are the same thing in different formats: we have already tokenized the code into the first format during preprocessing.
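As for recovering whitespace in generated code: decoding simply reverses the mapping, turning each `Ġ` back into a space. A sketch, again assuming the stock `gpt2` vocabulary:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Suppose the model produced these sub-tokens during completion.
generated = ['a', 'Ġ=', 'Ġfunc', 'Ġ(', 'Ġb', 'Ġ)', 'Ġ;']

# convert_tokens_to_string maps each 'Ġ' back to a literal space.
print(tokenizer.convert_tokens_to_string(generated))
# -> 'a = func ( b ) ;'
```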
@celbree Thanks for the explanation!