salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Questions about Masked Identifier Prediction #89

Closed eureka336 closed 1 year ago

eureka336 commented 1 year ago

How do you identify the identifiers in code for the Masked Identifier Prediction task?

yuewang-cuhk commented 1 year ago

Hi there, we use an AST parser (tree-sitter) to parse the code into AST tokens and then pick out the identifier tokens. Remember to filter out the reserved keywords of each specific PL (such as return in Python).
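For readers who want to try the idea from the reply above, here is a minimal sketch for Python source only. It is not the CodeT5 preprocessing code: it swaps tree-sitter for the standard-library `tokenize` module so that it runs without extra dependencies, and the helper name `find_identifiers` is hypothetical. The principle is the same: collect name tokens, then drop the language's reserved keywords.

```python
import io
import keyword
import tokenize
from typing import List


def find_identifiers(source: str) -> List[str]:
    """Return the unique identifiers in `source`, in order of first appearance.

    Assumption: stdlib `tokenize` stands in for the tree-sitter parser
    mentioned in the thread; this only handles Python code.
    """
    seen = {}  # dict keeps insertion order, so it doubles as an ordered set
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover both identifiers and keywords,
        # so reserved keywords (def, return, ...) are filtered out here.
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            seen.setdefault(tok.string, None)
    return list(seen)


code = "def add(a, b):\n    return a + b\n"
print(find_identifiers(code))  # ['add', 'a', 'b']
```

Note that built-in names such as print are not reserved keywords, so this sketch would still report them as identifiers. Whether to mask those is a separate choice that the thread does not address.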