The model is currently poor at decoding the position of edits, as opposed to the edit type and character: e.g. edit = insert 'c' at 5, with signature type * char * pos.
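For concreteness, a minimal sketch of that representation in Python (the class and field names are mine, not the codebase's):

    from typing import NamedTuple

    class Edit(NamedTuple):
        typ: str   # edit type, e.g. "insert" or "delete"
        char: str  # the character being inserted or deleted
        pos: int   # index into the program text

    e = Edit("insert", "c", 5)  # insert 'c' at 5

This may be because: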
1. There is a bug in the Python model.
2. Absolute position encoding is bad, and the model should use something "more advanced" like rotary encoding instead (a sketch follows this list).
3. There is a bug in the OCaml batch generation.
4. Training isn't long enough, or the model is too small.
5. Programs have inherent invariances, which lead to ambiguity and training noise: e.g. the order of operands in addition doesn't matter.
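On 2, for reference: rotary encoding rotates each (even, odd) feature pair of the query/key vectors by an angle proportional to token position, so attention scores depend on relative offsets rather than absolute indices. A minimal sketch, assuming PyTorch:

    import torch

    def rope(x: torch.Tensor) -> torch.Tensor:
        # x: (seq, dim) query or key matrix, with dim even.
        seq, dim = x.shape
        pos = torch.arange(seq, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
        freq = 10000 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        angle = pos * freq               # (seq, dim/2): one angle per pair
        cos, sin = angle.cos(), angle.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin   # 2D rotation of each pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out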
To which I think:
1. A definite possibility; do we need a positive control dataset?
2. Also probably true, but I want to punt on this.
3. Unlikely, as inspected via plot_mmap.py.
4. Also unlikely: the model has in the past memorized the datasets. See 'positive control' above.
5. Super curious to hear others' thoughts on this. My instinct is to turn the AST (or any graph) into a list of addresses, then use a transformer to encode these into positions to be fed to a larger, orthogonal transformer (see the sketch at the end of this note).
Basically: programs are graphs (or at minimum trees), so operating on them as lists is dumb, and I think we're already running into these limits.
I imagine this has been described in the literature, but I'm not aware of anything.
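To make that instinct concrete, here is a minimal sketch, assuming PyTorch; the names, depth/fan-out caps, and pooling are all hypothetical, not existing code. Each node's address is its path of child indices from the root, and a small transformer encodes each address into a position vector for the larger model:

    import torch
    import torch.nn as nn

    MAX_DEPTH = 8      # assumed cap on AST depth
    MAX_CHILDREN = 16  # assumed cap on fan-out; index MAX_CHILDREN = padding

    def addresses(tree, prefix=()):
        # Yield (node, address) pairs. An address is the path of child
        # indices from the root: () for the root, (0, 2) for the third
        # child of the first child, etc. Trees here are plain dicts with
        # an optional "children" list, a stand-in for the real AST type.
        yield tree, prefix
        for i, child in enumerate(tree.get("children", [])):
            yield from addresses(child, prefix + (i,))

    class AddressEncoder(nn.Module):
        # Small transformer that turns a padded batch of addresses into
        # one position vector per node. d_model must be divisible by nhead.
        def __init__(self, d_model: int):
            super().__init__()
            self.embed = nn.Embedding(MAX_CHILDREN + 1, d_model,
                                      padding_idx=MAX_CHILDREN)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, addr: torch.Tensor) -> torch.Tensor:
            # addr: (n_nodes, MAX_DEPTH) ints, padded with MAX_CHILDREN.
            h = self.encoder(self.embed(addr))
            # Mean-pool over address tokens (padding included; fine for
            # a sketch) -> (n_nodes, d_model) position vectors.
            return h.mean(dim=1)

The output vectors would stand in for the absolute position embeddings of the main transformer, so a node's position reflects its place in the tree rather than its offset in the flattened token list.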