wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

translation with indentation #13

Closed saichandrapandraju closed 3 years ago

saichandrapandraju commented 3 years ago

Hi,

I want to try translation between Java and Python. Because the current CodeXGLUE datasets represent each function on a single line, it is easy to fine-tune and test on them. But what if I want to do this for Python, where indentation is significant? That is, how will the tokenization handle it? I saw an example in TransCoder that encodes indentation with INDENT, DEDENT, and NEWLINE tokens:

def rm_file ( path ) : NEWLINE try : NEWLINE INDENT os . remove (path) NEWLINE print ( " Deleted " )
DEDENT except : NEWLINE INDENT print ( " Error _ while _ deleting _ file " , path ) DEDENT

Can you suggest how to handle indentation with PLBART?

saichandrapandraju commented 3 years ago

I figured out that the above is possible using code_tokenizer.py.

wasiahmad commented 3 years ago

Yes, we follow TransCoder, so indentation is taken care of during preprocessing.
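
For readers who want to see the general idea, here is a minimal sketch of how Python source can be flattened into a single line with explicit NEWLINE, INDENT, and DEDENT markers, and then restored. It uses only Python's standard `tokenize` module. It is not the repository's code_tokenizer.py and does not reproduce TransCoder's exact output (for example, it leaves string literals as single tokens and drops comments). The function names `tokenize_python` and `detokenize_python` are hypothetical, chosen for illustration.

```python
import io
import tokenize
from typing import List


def tokenize_python(source: str) -> List[str]:
    """Flatten Python source into tokens with explicit layout markers.

    Hypothetical helper: an illustration of the INDENT/DEDENT/NEWLINE
    scheme, not the tokenizer used by PLBART or TransCoder.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("INDENT")
        elif tok.type == tokenize.DEDENT:
            out.append("DEDENT")
        elif tok.type == tokenize.NEWLINE:
            # Logical end of a statement.
            out.append("NEWLINE")
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER):
            # Blank lines, comments, and the end marker carry no structure here.
            continue
        else:
            out.append(tok.string)
    return out


def detokenize_python(tokens: List[str], indent: str = "    ") -> str:
    """Rebuild indented source from the flat token list (tokens space-joined)."""
    lines, current, depth = [], [], 0
    for tok in tokens:
        if tok == "INDENT":
            depth += 1
        elif tok == "DEDENT":
            depth -= 1
        elif tok == "NEWLINE":
            lines.append(indent * depth + " ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        # Flush a final statement that had no trailing NEWLINE marker.
        lines.append(indent * depth + " ".join(current))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    code = "def f(x):\n    if x:\n        return 1\n    return 0\n"
    flat = tokenize_python(code)
    print(" ".join(flat))           # single-line form suitable for a seq2seq model
    print(detokenize_python(flat))  # indented form recovered from the markers
```

The single-line form is what a translation model would be trained on. After generation, the markers let you recover valid indentation, although the original spacing within each line is not preserved.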