
[Code Completion - Token level] About accuracy calculation #141

Closed: iCSawyer closed this issue 1 year ago

iCSawyer commented 1 year ago

I have two questions about accuracy calculation in token-level code completion:

  1. I want to confirm whether special tokens like `<STR_LIT>` and `<CHAR_LIT:E>` are included (or excluded) in the accuracy calculation. I have read the `eval_acc()` function in `run_lm.py`, and it seems they are all included (except for `<NUM_LIT>`).
  2. After substituting a string with `<STR_LIT>` or `<STR_LIT:EE>`, the preprocessing adds `"` to the start and the end of `<STR_LIT>`, which differs from `<CHAR_LIT>` and `<NUM_LIT>`. When we use `transformers.GPT2Tokenizer` to tokenize it, `"<STR_LIT>"` is split into three sub-tokens (`"`, `<STR_LIT>`, and `"`), even though it corresponds to a single token, the original string (see the sketch after this list). This may harm both the accuracy calculation and inference.
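
For reference, a minimal sketch of the tokenization behavior described in point 2, assuming the literal placeholders are registered as additional special tokens (the registration call here is an illustrative guess, not necessarily how the CodeXGLUE scripts set it up):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Assumption for illustration: the normalized literal placeholders are
# added to the vocabulary as special tokens so they never get BPE-split.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<STR_LIT>", "<NUM_LIT>", "<CHAR_LIT>"]}
)

# The preprocessed source contains the placeholder wrapped in quotes.
print(tokenizer.tokenize('"<STR_LIT>"'))
# Roughly: ['"', '<STR_LIT>', '"'] -- three sub-tokens for what was
# originally a single string-literal token in the source code.
```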

Thank you for your reply!

celbree commented 1 year ago
  1. Yes, they are included in the evaluation. Since the special tokens appear in the training set, models fine-tuned on our training set are expected to predict such normalized tokens.
  2. Since a string may carry one of several prefixes or quote styles, such as `"`, `'`, `"""`, `'''`, `r'`, `b'`, etc., we preserve those characters. Besides, our evaluation metric is token-level, not subtoken-level: `"<STR_LIT>"` is treated as one token during evaluation, even though the model has to predict three subtokens (see the sketch below).
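
To make the token-level vs. subtoken-level distinction concrete, here is a hedged sketch of the comparison (not the exact `eval_acc()` code from `run_lm.py`; the boundary rule based on GPT-2's `Ġ` word-start marker is an assumption for illustration):

```python
def token_level_accuracy(pred_subtokens, gold_subtokens):
    """Token-level accuracy over aligned subtoken sequences.

    Illustrative boundary rule: a gold subtoken starting with GPT-2's
    byte-level word-start marker 'Ġ' opens a new token. Under it,
    ['Ġ"', '<STR_LIT>', '"'] is scored as ONE token, '"<STR_LIT>"'.
    A token is counted correct only if ALL of its subtokens match.
    """
    correct = total = 0
    token_ok = True
    for i, (p, g) in enumerate(zip(pred_subtokens, gold_subtokens)):
        if i > 0 and g.startswith("Ġ"):   # gold token boundary
            total += 1
            correct += token_ok
            token_ok = True
        token_ok = token_ok and (p == g)
    if gold_subtokens:                     # flush the last token
        total += 1
        correct += token_ok
    return correct / max(total, 1)


# The whole string literal is one unit: a single wrong quote subtoken
# would make the entire '"<STR_LIT>"' token count as incorrect.
gold = ["Ġx", "Ġ=", 'Ġ"', "<STR_LIT>", '"']
pred = ["Ġx", "Ġ=", 'Ġ"', "<STR_LIT>", '"']
print(token_level_accuracy(pred, gold))  # 1.0 over 3 whole tokens
```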
St3p99 commented 1 year ago

I have some doubts about the token-level accuracy calculation.

The inputs to the model are N (= batch size) encoded strings, each of length L (= block size). Are the outputs N predictions of the next token, or are the input strings concatenated with the outputs?

In either case, I don't understand how the evaluation works.
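
For context, a minimal sketch of the standard causal-LM setup this question is about (an assumption based on how GPT-2-style evaluation usually works, not a confirmed description of `run_lm.py`): the model does not return the input concatenated with a completion. One forward pass over a block of length L returns, at every position i, a distribution over the token at position i + 1, so N blocks yield N x (L - 1) next-token predictions that can be scored in parallel.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# A batch of N encoded blocks, each of length L (here N = 1 for brevity).
ids = tokenizer("x = 1\ny =", return_tensors="pt").input_ids    # (N, L)

with torch.no_grad():
    logits = model(ids).logits                                  # (N, L, vocab)

# The logits at position i score the token at position i + 1, so the
# predictions and labels are shifted by one before comparison.
preds = logits[:, :-1].argmax(dim=-1)                           # (N, L-1)
labels = ids[:, 1:]                                             # (N, L-1)
print((preds == labels).float().mean().item())  # subtoken-level accuracy
```

Token-level accuracy would then be obtained by regrouping these per-position subtoken predictions into whole tokens, as in the sketch earlier in this thread.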