Contains the code for our ICSE 2020 paper "Big Code != Big Vocabulary: Open-Vocabulary Language Models for Source Code" and for its earlier preprint, "Maybe Deep Neural Networks are the Best Choice for Modeling Source Code" (https://arxiv.org/abs/1903.05734). This is the first open-vocabulary language model for code; it uses the byte pair encoding (BPE) algorithm to learn a segmentation of code tokens into subword units.
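For illustration, here is a minimal sketch of what that segmentation looks like when applying a learned BPE codes file with subword-nmt's Python API; the codes-file path and the example identifier are placeholders, not artifacts shipped with this repository:

```python
# Minimal sketch: applying learned BPE merges to a code token with
# subword-nmt. The codes file path and the example identifier are
# placeholders, not files from this repository.
import codecs
from subword_nmt.apply_bpe import BPE

# Load previously learned merge operations (the "codes" file).
with codecs.open("bpe_codes.txt", encoding="utf-8") as codes:
    bpe = BPE(codes)

# A rare identifier is split into frequent subword units, joined with
# the default '@@' continuation marker, e.g.
# 'getFileName' -> 'get@@ File@@ Name' (the exact split depends on the codes).
print(bpe.process_line("getFileName"))
```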
Hello,
During BPE encoding, subword-nmt generates the codes file ({codes_file}). Could you please share this file? If that is not possible, could you share the training set used to obtain {codes_file}? (My current setup for regenerating it is sketched below.)
I would like to use OpenVocabNLM on a different dataset and compare my results with those reported in your paper.
One more question: do you run create_subtoken_data.py and non-ascii_sequences_to_unk.py before BPE encoding?
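For reference, this is roughly how I am currently learning the codes file with subword-nmt's Python API; the file names and the number of merge operations are placeholders from my setup, not values from your paper:

```python
# Rough sketch of learning a BPE codes file with subword-nmt.
# File names and the merge count (10000) are assumptions from my own
# setup, not the settings used in the paper.
import codecs
from subword_nmt.learn_bpe import learn_bpe

with codecs.open("train_tokens.txt", encoding="utf-8") as infile, \
     codecs.open("codes_file.txt", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)
```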
Thank you.