tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org

Reproducing numbers from the paper on java-med dataset #18

Closed: claudiosv closed this issue 5 years ago

claudiosv commented 5 years ago

Hi,

Thanks for this great work. I'm trying to reproduce the results from the paper for java-med, and I was wondering what values of config.SUBTOKENS_VOCAB_MAX_SIZE and config.TARGET_VOCAB_MAX_SIZE were used? I couldn't find them in the paper or in any existing issue.

Thank you in advance.

Best, Claudio

urialon commented 5 years ago

Hi @claudiosv, for Java-med I used:

SUBTOKENS_VOCAB_MAX_SIZE = 184379
TARGET_VOCAB_MAX_SIZE = 10903
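For anyone plugging these into the open-source code, here is a minimal sketch of where the values would live, assuming they sit on the Config object that the question's config.SUBTOKENS_VOCAB_MAX_SIZE / config.TARGET_VOCAB_MAX_SIZE names refer to (the class below is illustrative only; the real config has many more fields):

```python
class Config:
    # Java-med vocabulary caps reported in this thread; every other
    # hyperparameter of the real Config class is omitted here.
    SUBTOKENS_VOCAB_MAX_SIZE = 184379  # max subtoken vocabulary size
    TARGET_VOCAB_MAX_SIZE = 10903      # max target (method-name) vocabulary size

config = Config()
```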

The reason the numbers are not round is that in the original implementation I limited the vocabulary by keeping only tokens/targets that appear at least X times; those vocab sizes are whatever that threshold happened to produce. In the open-source version, I changed the vocabulary to be limited by a maximum size instead, keeping the most frequently occurring tokens/targets.
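To make the difference between the two strategies concrete, here is a minimal sketch (illustrative only, not the repo's actual preprocessing code; the function names are made up):

```python
from collections import Counter

def vocab_by_min_count(counts: Counter, min_count: int) -> list:
    # Original-paper style: keep every token that appears at least
    # min_count times. The resulting vocab size is data-dependent,
    # which is why 184379 and 10903 are not round numbers.
    return [tok for tok, c in counts.items() if c >= min_count]

def vocab_by_max_size(counts: Counter, max_size: int) -> list:
    # Open-source style: cap the vocabulary at max_size entries,
    # keeping the most frequently occurring tokens.
    return [tok for tok, _ in counts.most_common(max_size)]

counts = Counter(["set", "set", "set", "get", "get", "value"])
print(vocab_by_min_count(counts, 2))  # ['set', 'get']
print(vocab_by_max_size(counts, 2))   # ['set', 'get']
```

With a min-count threshold you control frequency and the vocab size falls out of the data; with a max size you control the vocab size and the effective frequency cutoff falls out of the data.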

Let me know if you have any more questions.

claudiosv commented 5 years ago

Hi @urialon, thanks for the details! Very much appreciated.