Use Pygments to lex training data

hsezhiyan / CodePatching289G

Transformer Models for Code Patching


Use Pygments to lex training data #4

Closed VHellendoorn closed 5 years ago

VHellendoorn commented 5 years ago

See the second comment on #1: ideally, while extracting git diff information, we should also store the full lexed file before the bug fix and the lexed line(s) involved in the fix (for which we need to lex the full fixed file as well). Pygments is just a suggestion, but it has worked well for me in the past.

Note: I do remember Pygments mishandling Java fully qualified names (e.g. in import statements), which it treats as one giant token, so some heuristic splitting will be necessary at times.
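A minimal sketch of one such splitting heuristic, assuming the lexer output is available as `(token_type, value)` pairs (the shape that `pygments.lex` yields). The function name `split_qualified` and the regular expression are illustrative choices, not code from this repository:

```python
import re

# A dotted Java name such as "java.util.List" or "java.util.*".
# Each segment must start like an identifier, so numeric literals
# such as "3.14" are left alone.
QUALIFIED_NAME = re.compile(r"[A-Za-z_$][\w$]*(\.([A-Za-z_$][\w$]*|\*))+")


def split_qualified(tokens):
    """Yield tokens, splitting dotted names into their parts and dots."""
    for ttype, value in tokens:
        if QUALIFIED_NAME.fullmatch(value):
            # The capturing group keeps the "." separators as their own tokens.
            for part in re.split(r"(\.)", value):
                yield ttype, part
        else:
            yield ttype, value
```

Every sub-token keeps the type of the original token here; whether the dots should be re-labelled as punctuation is a separate design choice.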

FernandoPieressa commented 5 years ago

@VHellendoorn for the lexed line involved in the fix, why would we need to lex the full fixed file as well? The problem with lexing the whole file is that, after lexing, it's impossible to know which line of code was actually modified.

Shouldn't we extract the line first and lex just that?

VHellendoorn commented 5 years ago

The problem is that you won't know if the line is inside e.g. a comment based on just the line itself. Lexing the whole file is the only fool-proof way. We do need to make sure, when lexing the full file, that all the original line indices are preserved. I'm pretty sure the Pygments code I included preserves all newlines for that purpose.
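A rough sketch of how line indices could be recovered from a full-file token stream, assuming `(token_type, value)` pairs whose values concatenate back to the original source. `tokens_by_line` is a hypothetical helper, not the script referred to in this thread:

```python
def tokens_by_line(tokens):
    """Group tokens by the 1-based source line on which each one starts."""
    lines = {}
    line_no = 1
    for ttype, value in tokens:
        if value.strip():
            # Record only tokens with visible content.
            lines.setdefault(line_no, []).append((ttype, value))
        # Count newlines inside every token, including whitespace and
        # multi-line comments, so later tokens keep their true line number.
        line_no += value.count("\n")
    return lines
```

With Pygments, the token stream would come from something like `pygments.lex(source, JavaLexer(stripnl=False))`. Passing `stripnl=False` matters because the lexer otherwise strips leading and trailing newlines from the input, which would shift the line numbers.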

FernandoPieressa commented 5 years ago

It doesn't preserve them; I was just testing it. But I will try to edit it.

VHellendoorn commented 5 years ago

That's weird, line 66 should print a '\n' every time it sees one in the input. Not sure why it wouldn't...

FernandoPieressa commented 5 years ago

The problem is that the '\n' check on line 66 didn't cover all scenarios, for example multi-line comments. But now it's fixed.
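One way to guard against this kind of regression is a simple sanity check that the lexed output still contains every newline of the input. This is only a sketch, assuming `(token_type, value)` pairs; `newlines_preserved` is a hypothetical name:

```python
def newlines_preserved(source, tokens):
    """Return True if the token values contain as many newlines as the source."""
    lexed_newlines = sum(value.count("\n") for _, value in tokens)
    return lexed_newlines == source.count("\n")
```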

VHellendoorn commented 5 years ago

Great, thanks