Hi @JiyangZhang, thank you for your interest in code2seq.
I can't tell what the problem is; the processing looks OK. Such very high results (80 F1) might hint that the problem is too easy for a learning model. Maybe there are duplicates between the training set and the test set? Do all examples come from a narrow domain, i.e., the same project or the same "kind" of methods? You could check for exact duplicates with something like the sketch below.
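A minimal sketch of such a check, assuming the preprocessed data lives in line-per-example files (the file names are placeholders):

```python
# Minimal duplicate check between training and test splits.
# Assumes code2seq-style files where each line is one example;
# the file names below are placeholders.
def load_examples(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

train = load_examples("data/my_dataset.train.c2s")
test = load_examples("data/my_dataset.test.c2s")

dups = train & test
print(f"{len(dups)} of {len(test)} test examples also appear in training "
      f"({100.0 * len(dups) / max(len(test), 1):.1f}%)")
```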
Cheers, Uri
Thanks, I will check soon.
Hi, thank you for proposing this model! I am running some experiments on my dataset to compare a BiLSTM with code2seq. However, the BiLSTM achieves a much higher score (80 F1) than code2seq (67 F1), so I suspect the problem might lie in my data processing. Below is one of my examples:
Output after the modified JavaExtractor:
index of unicode char|private static int (String input) { for (int i = 0; i < input.length(); i++) { int c = input.charAt(i); if (c > 0x7F) { return i; } } return -1; }
Output after the tokenization script:
int ( String input ) { for ( int i = 0 ; i < input . length ( ) ; i ++ ) { int c = input . char At ( i ) ; if ( c > 0 x 7 F ) { return i ; } } return - 1 ; }
The above sequence becomes the input to the BiLSTM, implemented with OpenNMT.
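For concreteness, the splitting shown above roughly corresponds to the following sketch (an approximation, not the exact script: it splits identifiers on camelCase and letter/digit boundaries and separates punctuation, but unlike the output above it would also split ++ into + +):

```python
import re

# Approximate sketch of the subtokenization shown above: split identifiers
# on camelCase and letter/digit boundaries, and emit punctuation one
# character at a time (so, unlike the output above, "++" becomes "+ +").
TOKEN = re.compile(r"[A-Za-z]+|\d+|\S")
CAMEL = re.compile(r".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)")

def subtokenize(code):
    out = []
    for tok in TOKEN.findall(code):
        out.extend(CAMEL.findall(tok) if tok.isalpha() else [tok])
    return " ".join(out)

print(subtokenize("if (c > input.charAt(i)) { return 0x7F; }"))
# if ( c > input . char At ( i ) ) { return 0 x 7 F ; }
```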
Is there any problem here? From your point of view, what could be causing this gap?
Thank you very much. Best regards
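For reference, code2seq reports precision, recall, and F1 over the subtokens of predicted method names, so it is worth verifying that the BiLSTM's 80 F1 is computed the same way. A minimal sketch of that metric (the example inputs are placeholders):

```python
# Micro-averaged subtoken-level F1, roughly as code2seq evaluates
# predictions; the example prediction/reference below are placeholders.
def subtoken_f1(predictions, references):
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred, ref_left = pred.split(), ref.split()
        for t in pred:
            if t in ref_left:
                tp += 1
                ref_left.remove(t)
            else:
                fp += 1
        fn += len(ref_left)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(subtoken_f1(["index of char"], ["index of unicode char"]))  # ~0.857
```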