tlh24 / cortex


Position encoding of edits #3

Open tlh24 opened 1 year ago

tlh24 commented 1 year ago

The model is currently poor at decoding the position of edits, versus their type and character (e.g. edit = insert 'c' at 5, with signature type * char * pos). This may be because:

  1. There is a bug in the python model.
  2. Absolute position encoding is bad; a more advanced scheme, e.g. rotary position encoding, should be used instead.
  3. There is a bug in ocaml batch generation.
  4. Training isn't long enough, or the model is too small.
  5. Programs have inherent invariances, which lead to ambiguity and training noise: e.g. operand order in addition doesn't matter.

To which I think:

  1. Definite possibility; need a positive-control dataset? (A minimal sketch follows this list.)
  2. Also probably true, but I want to punt on this for now.
  3. Unlikely, based on inspection via plot_mmap.py.
  4. Also unlikely: the model has memorized the datasets in the past. See 'positive control' above.
  5. Very likely the culprit; I suggest http://arxiv.org/abs/1802.03685 as a demo of how to deal with intrinsic invariances (also pertinent: https://arxiv.org/abs/1711.08028).
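For (1) and (4), one concrete positive control would be a synthetic dataset where the position is unambiguous by construction: random strings of distinct characters plus a single fresh-character insert, so none of the invariances from (5) apply. A minimal sketch (hypothetical code, not from this repo):

```python
import random
import string

ALPHABET = string.ascii_lowercase

def make_example(max_len=20):
    """One (before, edit, after) triple with an unambiguous position.

    Characters in `before` are distinct and the inserted character is
    fresh, so exactly one (type, char, pos) maps `before` to `after`;
    none of the program-level invariances from point 5 apply.
    """
    n = random.randint(1, max_len)
    before = random.sample(ALPHABET, n)                  # distinct chars
    ch = random.choice([c for c in ALPHABET if c not in before])
    pos = random.randint(0, n)                           # 0..n inclusive
    after = before[:pos] + [ch] + before[pos:]
    # target mirrors the ocaml signature: type * char * pos
    return "".join(before), ("insert", ch, pos), "".join(after)

if __name__ == "__main__":
    for _ in range(3):
        print(make_example())
```

If the model decodes pos reliably here but not on the program data, that implicates invariance noise (5) rather than a model or batching bug (1)/(3).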

Super curious about others' thoughts on this. My instinct is to turn the AST (or any graph) into a list of addresses, then use a transformer to encode these addresses into positions to be fed to a larger, orthogonal transformer.

Basically: programs are graphs (or at minimum trees), so operating on them as lists is dumb, and I think we're already running into those limits. A sketch of the address idea follows.
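A minimal sketch of that idea, assuming a toy (label, children) AST and with all names hypothetical: each node gets its root-to-node path of child indices as an address, and a small transformer encodes those paths into per-node position vectors that the larger model could consume instead of absolute positions.

```python
import torch
import torch.nn as nn

# Toy AST: (label, [children]). A node's address is its root-to-node
# path of child indices, e.g. () for the root, (1, 0) for the first
# child of the root's second child.
def addresses(node, prefix=()):
    """Yield (label, address) for every node, depth-first."""
    label, children = node
    yield label, prefix
    for i, child in enumerate(children):
        yield from addresses(child, prefix + (i,))

class AddressEncoder(nn.Module):
    """Encode variable-length child-index paths into fixed position vectors."""
    PAD, ROOT = 0, 1  # token 0 pads, token 1 marks the path start

    def __init__(self, max_arity=16, max_depth=32, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(max_arity + 2, d_model, padding_idx=self.PAD)
        self.depth = nn.Embedding(max_depth, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.max_depth = max_depth

    def forward(self, addrs):
        """addrs: list of child-index tuples -> (num_nodes, d_model)."""
        # prefix every path with ROOT so even the root node has a token
        seqs = [[self.ROOT] + [i + 2 for i in a] for a in addrs]
        L = max(len(s) for s in seqs)
        tok = torch.zeros(len(seqs), L, dtype=torch.long)
        for r, s in enumerate(seqs):
            tok[r, : len(s)] = torch.tensor(s)
        pad = tok.eq(self.PAD)
        depth_ix = torch.arange(L).clamp(max=self.max_depth - 1)
        h = self.encoder(self.embed(tok) + self.depth(depth_ix),
                         src_key_padding_mask=pad)
        h = h.masked_fill(pad.unsqueeze(-1), 0.0)
        # mean-pool the unpadded path tokens to one vector per node
        return h.sum(1) / (~pad).sum(1, keepdim=True)

if __name__ == "__main__":
    tree = ("add", [("x", []), ("mul", [("y", []), ("z", [])])])
    labels, addrs = zip(*addresses(tree))
    pos = AddressEncoder()(list(addrs))
    print(list(labels), pos.shape)  # 5 nodes -> torch.Size([5, 64])
```

The point is that an address is a property of the tree, not of a flattened token index, so it survives serialization choices that a flat position does not.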

I imagine this has been described in the literature, but I'm not aware of anything?