wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Can other programming languages be used? #39

Closed shaoyangxu closed 2 years ago

shaoyangxu commented 2 years ago

Hello author, I have two questions:

  1. Why did you use only Java and Python? https://github.com/facebookresearch/TransCoder provides a pre-processing pipeline for three languages (C++, Java, and Python), so I would like to ask why you chose just two of them.
  2. Can other programming languages be used? In this bigmodel work, they use C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript as programming languages. However, different programming languages need very different preprocessing pipelines, and writing language-specific preprocessing code for each one is too difficult for me. Could you recommend some related resources for this problem? Thanks!

wasiahmad commented 2 years ago

  1. Because Java and Python are the two most popular languages and they are quite different in terms of syntax.
  2. You may consider using tree-sitter to build a universal tokenizer to support all your target languages.
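
A minimal sketch of the tree-sitter idea, offered as an illustration and not as the preprocessing used in this repository. It assumes that a language-agnostic tokenizer can be obtained by reading the text of every leaf node of a parse tree in source order. The function only reads `children`, `start_byte`, and `end_byte`, which tree-sitter's Python nodes expose, so in real use `root` would come from `parser.parse(source).root_node`. Building a parser for a specific grammar differs between versions of the Python bindings and is left out here. `StubNode` and `leaf_tokens` are hypothetical names; the stub only keeps the example runnable without any grammar installed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubNode:
    """Hypothetical stand-in exposing the attributes read from a tree-sitter node."""
    type: str
    start_byte: int
    end_byte: int
    children: List["StubNode"] = field(default_factory=list)


def leaf_tokens(root, source: bytes) -> List[str]:
    """Return the text of every non-empty leaf node, in source order."""
    tokens = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Push children reversed so the leftmost child is popped first.
            stack.extend(reversed(node.children))
        elif node.end_byte > node.start_byte:
            # A leaf covers one lexical unit; slice its bytes out of the source.
            tokens.append(source[node.start_byte:node.end_byte].decode("utf8"))
    return tokens


# Demo on a hand-built tree for the source "x = 1".
source = b"x = 1"
tree = StubNode("assignment", 0, 5, [
    StubNode("identifier", 0, 1),
    StubNode("=", 2, 3),
    StubNode("integer", 4, 5),
])
print(leaf_tokens(tree, source))  # prints ['x', '=', '1']
```

Because the traversal never inspects node types, the same function applies to any language for which a grammar is available. Language-specific choices, such as whether to keep comments or how to split string literals, would still need to be handled per grammar.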