Questions about dataset

microsoft / CodeXGLUE

MIT License

Questions about dataset #132

Closed: karlie38 closed this issue 2 years ago

karlie38 commented 2 years ago

I'd like to use the /CodeXGLUE/Code-Code/CodeCompletion-token/dataset/javaCorpus/token_completion dataset, and I have the following question.

When I download the dataset, it is already split by ' . '. The original data was presumably "package org.vaadin.teemu.clara.demo". Could you explain why it is split this way?

celbree commented 2 years ago

To better evaluate token-level accuracy, we tokenize the original code. Take "package org.vaadin.teemu.clara.demo" as an example: if it isn't split by ".", then "org.vaadin.teemu.clara.demo" will be seen as a single token. In that case, prediction A, "package org.vaadin.teemu.clara", gets the same accuracy score as prediction B, "package net", because both get only the "package" token right. But prediction A is obviously better than B, and splitting on "." lets the score reflect that.
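
The effect can be illustrated with a small sketch. This is not the CodeXGLUE tokenizer or evaluation script: `split_tokens` and `token_accuracy` are hypothetical helpers, the regex is only an approximation of splitting on punctuation, and accuracy here is assumed to be a simple position-wise match rate over the reference tokens.

```python
import re

def split_tokens(code: str) -> list[str]:
    # Hypothetical tokenizer: each run of word characters is one token,
    # and every other non-space character (such as ".") is its own token.
    return re.findall(r"\w+|[^\w\s]", code)

def token_accuracy(reference: list[str], prediction: list[str]) -> float:
    # Assumed metric: fraction of reference positions where the predicted
    # token is identical. Missing predicted positions count as wrong.
    matches = sum(r == p for r, p in zip(reference, prediction))
    return matches / len(reference)

truth = "package org.vaadin.teemu.clara.demo"
pred_a = "package org.vaadin.teemu.clara"
pred_b = "package net"

# Whitespace-only tokens: the whole dotted name is a single token.
coarse_a = token_accuracy(truth.split(), pred_a.split())
coarse_b = token_accuracy(truth.split(), pred_b.split())

# Tokens split on ".": each name segment is scored separately.
fine_a = token_accuracy(split_tokens(truth), split_tokens(pred_a))
fine_b = token_accuracy(split_tokens(truth), split_tokens(pred_b))

print(coarse_a, coarse_b)  # 0.5 0.5
print(fine_a, fine_b)      # 0.8 0.1
```

With whitespace-only tokens both predictions score 0.5, since each matches only "package". After splitting on ".", prediction A matches 8 of the 10 reference tokens while prediction B matches 1, so the score separates them.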