For the base-sized experiments, I accidentally regularized the distance from initialization for the heads as well as the body. I changed it to regularize only the body before running the large experiments.
Since merging only acts on the body, this might not be a big issue when looking at the relative performance of merging. It would probably have a bigger impact on the absolute performance of the original and merged models.
Also, the non-regularized model is unaffected by this bug.
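To make the difference between the two settings concrete, here is a minimal sketch of the penalty. It is not the actual experiment code: it assumes a squared L2 distance from initialization, parameters stored as plain name-to-values dicts, and a hypothetical "body." / "head." naming convention. The function name `distance_penalty` and the `strength` parameter are also made up for illustration.

```python
# Hypothetical sketch: parameters are stored as {name: list of floats}.
# The "body." / "head." name prefixes are an assumed naming convention.

def distance_penalty(params, init_params, strength, body_only=True):
    """Return strength * squared L2 distance from the initialization.

    With body_only=True, only parameters whose name starts with "body."
    contribute. With body_only=False, the heads are penalized too, which
    corresponds to the accidental setting in the base-sized runs.
    """
    total = 0.0
    for name, values in params.items():
        if body_only and not name.startswith("body."):
            continue  # skip head parameters
        init_values = init_params[name]
        total += sum((v - v0) ** 2 for v, v0 in zip(values, init_values))
    return strength * total


# Toy values, chosen only to show that the two settings differ.
init = {"body.w": [0.0, 0.0], "head.w": [1.0]}
current = {"body.w": [1.0, 2.0], "head.w": [4.0]}

buggy = distance_penalty(current, init, strength=0.5, body_only=False)
fixed = distance_penalty(current, init, strength=0.5, body_only=True)
```

With `strength=0` the penalty is zero in both settings, which matches the non-regularized model being unaffected.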
Note: A lot of this will be applicable to other experiments, but I'll focus on the BERT iso preliminary experiments for now.
Potential view of the data
When I also have BERT large results