Functionally analyze the heads that receive many of the top composition weights in Neo125M
Generalize parameter matrix functions to directly work on different devices (e.g., GPUs)
Finish running composition values for larger models on larger machines
Functionally analyze as many interesting heads as possible before conference
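For the device-generalization item above, one minimal sketch (assuming PyTorch; the function name and the Frobenius-norm composition measure are illustrative, not necessarily the project's actual API) is to write the matrix functions with no hard-coded `.cpu()`/`.cuda()` calls, so they inherit whatever device the weights already live on:

```python
import torch

def composition_weight(w_out: torch.Tensor, w_in: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm composition score between an output matrix and an
    input matrix, computed on whatever device the inputs live on.
    (Hypothetical helper; the project's real function may differ.)"""
    # No explicit device transfers: torch ops run on the inputs' device,
    # so moving the weights once is enough to run everything on GPU.
    prod = w_out @ w_in
    return prod.norm() / (w_out.norm() * w_in.norm())

# Usage: move the data, not the function.
device = "cuda" if torch.cuda.is_available() else "cpu"
w_out = torch.randn(768, 64, device=device)
w_in = torch.randn(64, 768, device=device)
score = composition_weight(w_out, w_in)  # result stays on `device`
```

The design choice is that device placement becomes a property of the tensors alone, so the same code path covers CPU and GPU runs without branching.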
OUTLINE
Figure 1: Contributions to the residual stream and the datasets
Architecture schematic with the residual stream
Subspace vector diagram (contributions to the stream add within any given unit subspace)
Table of the number of input/output pairs of different types across models
Basic value distributions
Figure: Breakdown by type (QKV) and layer
Percentile vs type histogram (w/ and w/o baselines)
Percentile vs layer histogram (<- this needs to be normalized)
Where do the top values point?
Biases vs. attention-head layer distances
Figure: Can we extract interesting functional properties?
Can we extract induction heads?
What do the heads with many top values do?
Figure: Higher order terms
Input path complexity cartoon (w/ and w/o baselines)
Input path complexity plots vs. random
Supplementary Figures
Basic value distributions with the original denominator
Head-by-head term value plots (one w/ high values and one without)
Scatterplot of old denominator vs. new denominator
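The percentile-vs-layer histogram above is flagged as needing normalization. A minimal sketch of one way to do it (numpy; the `counts` array and its shape are hypothetical stand-ins for the real per-layer bin counts) is to rescale each layer's row to a distribution, so layers with different numbers of heads or terms become comparable:

```python
import numpy as np

# Hypothetical counts: rows = layers, columns = percentile bins.
# Raw counts are not comparable across rows if layers contribute
# different numbers of terms.
rng = np.random.default_rng(0)
counts = rng.integers(0, 50, size=(12, 10)).astype(float)

# Normalize each layer's row to sum to 1, so the histogram compares
# the *shape* of each layer's percentile distribution, not its size.
row_sums = counts.sum(axis=1, keepdims=True)
normalized = np.divide(counts, row_sums,
                       out=np.zeros_like(counts),
                       where=row_sums > 0)
```

The `where` guard keeps empty layers at zero instead of producing NaNs.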
NICE TO HAVE
Measuring composition with the embedding & unembedding weights
Think about fusing LayerNorm weights with the input/output matrices to speed things up
Include MLPs in path analysis
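On the LayerNorm-fusing item: since LayerNorm ends with an elementwise scale, that scale can be folded into any matrix that reads the normalized stream, turning a per-call elementwise multiply into a one-time precomputation. A minimal numpy sketch, with illustrative names and shapes (not the project's actual API):

```python
import numpy as np

def fold_ln_into_reader(w_read: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Fold a LayerNorm scale gamma (d_model,) into a reading matrix
    w_read (d_head, d_model), using W @ (gamma * x) == (W * gamma) @ x.
    Hypothetical helper for illustration."""
    return w_read * gamma  # broadcasts gamma across w_read's columns

# Check the identity on random data.
d_model, d_head = 768, 64
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)      # a (post-normalization) stream vector
gamma = rng.standard_normal(d_model)  # LN scale parameters
w = rng.standard_normal((d_head, d_model))

fused = fold_ln_into_reader(w, gamma)
# fused @ x equals w @ (gamma * x), so the fused matrix can be
# precomputed once per head instead of scaling x on every call.
```

The same trick applies symmetrically on the output side by folding the scale into the rows of a writing matrix.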
FUTURE WORK
See if paths functionally maintain signals by injecting noise at an early point and measuring downstream effects
Look at the maximum value of the reverse edges and see where it pops up
Measure network performance before & after knocking out low composition reads and high composition reads
More meta: do a deeper dive into orthogonal vectors, bases, and subspaces; investigate which mental tools might be useful (almost-orthogonal vectors, etc.)
Reserve some portion of the residual stream for embeddings, and then measure how "inflated" the remaining subspaces of the attention head are
Major singular values and where they point
More work on baselines
Rank of input and output weights (by head and layer)
Think about the baseline more
Figure: The baseline
Computing reverse edges cartoon
"95% confidence" and "non-random" thresholds
Heatmap of the number of sent edges (by attention head)
Heatmap of the number of received edges (by attention head)
Figure: Individual singular values
Do large bandwidth terms mean one large value? Or several?
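For the "95% confidence" / "non-random" thresholds above, one common construction (sketched here under assumptions; the score formula and shapes are illustrative, not necessarily the project's actual baseline) is to take the 95th percentile of composition scores between random Gaussian matrices of matching shape:

```python
import numpy as np

def composition_score(w1: np.ndarray, w2: np.ndarray) -> float:
    # ||W1 W2||_F / (||W1||_F * ||W2||_F), a standard composition measure.
    return np.linalg.norm(w1 @ w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

def random_threshold(shape1, shape2, n_samples=200, pct=95, seed=0):
    """pct-th percentile of composition scores between random Gaussian
    matrices -- a candidate 'non-random' cutoff. Illustrative only."""
    rng = np.random.default_rng(seed)
    samples = [
        composition_score(rng.standard_normal(shape1),
                          rng.standard_normal(shape2))
        for _ in range(n_samples)
    ]
    return float(np.percentile(samples, pct))

# Example: threshold for a (d_model, d_head) output composing with a
# (d_head, d_model) input, at toy sizes for speed.
thresh = random_threshold((128, 32), (32, 128))
```

An observed composition value above `thresh` would then sit outside the bulk of what random matrices of the same shape produce, which is one concrete reading of "non-random".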