tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Abnormal high results of BiLSTM baseline #75

Closed · JiyangZhang closed this issue 3 years ago

JiyangZhang commented 3 years ago

Hi, thank you for proposing this model! I am running some experiments on my dataset to compare BiLSTM and Code2Seq. However, the BiLSTM gets a much higher score (80 F1) than Code2Seq (67 F1), so I suspect the problem might lie in the data processing. Below is one of my examples:

Output of the modified JavaExtractor:

```
index of unicode char|private static int (String input) { for (int i = 0; i < input.length(); i++) { int c = input.charAt(i); if (c > 0x7F) { return i; } } return -1; }
```
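For context, this pair presumably comes from a method like the one below: the label "index of unicode char" is the subtokenized method name, and the name itself is masked in the extracted body. The name `indexOfUnicodeChar` here is inferred from the label, not taken from the original source:

```java
// Presumed original method (name inferred from the label; the extractor
// replaces the method name with a placeholder in the body).
private static int indexOfUnicodeChar(String input) {
    for (int i = 0; i < input.length(); i++) {
        int c = input.charAt(i);
        if (c > 0x7F) { // first char outside the 7-bit ASCII range
            return i;
        }
    }
    return -1;
}
```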

Output of the tokenization script:

```
int ( String input ) { for ( int i = 0 ; i < input . length ( ) ; i ++ ) { int c = input . char At ( i ) ; if ( c > 0 x 7 F ) { return i ; } } return - 1 ; }
```

The above sequence becomes the input to the BiLSTM implemented in OpenNMT.
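The splitting is roughly what this sketch does (my own approximation, not the actual tokenization script): punctuation becomes separate tokens, camelCase identifiers are split, and letter/digit runs are split, so `charAt` becomes `char At` and `0x7F` becomes `0 x 7 F`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Subtokenizer {
    // Coarse token pattern: letter runs, digit runs, "++"/"--", or any
    // other single non-space character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+|\\d+|\\+\\+|--|\\S");
    // camelCase splitter: "charAt" -> "char", "At".
    private static final Pattern SUBWORD =
            Pattern.compile("[A-Z]?[a-z]+|[A-Z]+(?![a-z])");

    static String subtokenize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String tok = m.group();
            if (tok.chars().allMatch(Character::isLetter)) {
                Matcher sub = SUBWORD.matcher(tok);
                while (sub.find()) {
                    out.add(sub.group());
                }
            } else {
                out.add(tok);
            }
        }
        return String.join(" ", out);
    }
}
```

Note that the actual script evidently also drops modifiers (`private static` does not appear in the tokenized sequence), which this sketch does not do.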

Is there any problem here? What do you think could be the cause?

Thank you very much. Best regards

urialon commented 3 years ago

Hi @JiyangZhang , Thank you for your interest in code2seq.

I can't spot a problem; the processing looks OK. Such high results (80 F1) might hint that the task is too easy for a learning model. Maybe there are duplicates between the training and test sets? Do all examples come from a narrow domain, e.g., the same project or the same "kind" of methods?
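A quick way to check for exact duplicates is something like the following sketch (file names are placeholders; it assumes one example per line):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class LeakCheck {
    static Set<String> load(String path) throws IOException {
        // Normalize whitespace so formatting differences don't hide duplicates.
        return Files.lines(Paths.get(path))
                .map(l -> l.trim().replaceAll("\\s+", " "))
                .filter(l -> !l.isEmpty())
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) throws IOException {
        Set<String> train = load("train.txt"); // placeholder file names
        Set<String> test = load("test.txt");
        Set<String> overlap = new HashSet<>(test);
        overlap.retainAll(train);
        System.out.println(overlap.size() + " of " + test.size()
                + " test examples also appear in training");
    }
}
```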

Cheers, Uri

JiyangZhang commented 3 years ago

Thanks, I will check it soon.