mast-group / OpenVocabCodeNLM

Contains the code for our ICSE 2020 paper "Big Code != Big Vocabulary: Open-Vocabulary Language Models for Source Code" and for its earlier preprint "Maybe Deep Neural Networks are the Best Choice for Modeling Source Code" (https://arxiv.org/abs/1903.05734). This is the first open-vocabulary language model for code; it uses the byte-pair encoding (BPE) algorithm to learn a segmentation of code tokens into subword units.
Apache License 2.0

BPE Encoding files #11

Open fr4nc3sc4 opened 3 years ago

fr4nc3sc4 commented 3 years ago

Hello, during the BPE encoding, subword-nmt generates the file {codes_file}. Can you please share this file? If that is not possible, can you share the training set used to obtain the {codes_file}? I would like to use OpenVocabCodeNLM with a certain dataset and compare my results with the ones obtained in your research.
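
(For reference, a minimal sketch of how subword-nmt typically produces such a codes file via its Python API; the file names and the merge count of 10000 below are placeholders, not the settings used in the paper:)

```python
# Sketch: learn BPE merge operations from a pre-tokenized training corpus
# and apply them to segment another file into subword units.
# File names and num_symbols=10000 are hypothetical placeholders.
import codecs

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn the merge operations; this writes the "codes file".
with codecs.open("train_tokens.txt", encoding="utf-8") as train, \
     codecs.open("codes_file.txt", "w", encoding="utf-8") as codes:
    learn_bpe(train, codes, num_symbols=10000)

# Load the learned merges and segment a new file with them.
with codecs.open("codes_file.txt", encoding="utf-8") as codes:
    bpe = BPE(codes)

with codecs.open("test_tokens.txt", encoding="utf-8") as fin, \
     codecs.open("test_tokens.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```

(The equivalent CLI calls are `subword-nmt learn-bpe -s 10000 < train_tokens.txt > codes_file.txt` and `subword-nmt apply-bpe -c codes_file.txt < test_tokens.txt`.)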

Furthermore, I have another question: do you run create_subtoken_data.py and non-ascii_sequences_to_unk.py before the BPE encoding?

Thank you.

lapplislazuli commented 2 years ago

@fr4nc3sc4

Not an author, but this might be what you are looking for: https://github.com/giganticode/icse-2020

There are also Zenodo artifacts for the separate datasets: https://zenodo.org/record/3628636