NickyDark1 opened 5 months ago
Yes, this is the expected graph for the first token.
model.set_metric(get_angular_distance_ith_token(i=0))
Here `i` is the token position in the sequence.
If you don't set any metric, the default metric is here:
https://github.com/melisa-writer/short-transformers/blob/ccb9f70312502b7bb28404942a3de9f4c2680b4f/short_transformers/short_transformer.py#L23
which is equivalent to `get_angular_distance_ith_token(i=-1)`.
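For intuition, the per-token metric can be sketched as an angular distance, i.e. a normalized arccos of cosine similarity between a block's input and output hidden states at one token position. This is my paraphrase of the linked default, not the library's exact code; see the linked `short_transformer.py` for the real implementation:

```python
import math

def angular_distance(u, v):
    """Angular distance between two hidden-state vectors, normalized to [0, 1].

    Sketch only: in the library this would be computed between a block's
    input and output activations at the chosen token position i.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp for float safety
    return math.acos(cos) / math.pi  # 0 = same direction, 1 = opposite direction

angular_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors -> 0.5
```

A block whose output barely rotates its input (distance near 0) is a cheap candidate for removal.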
The statistics for the initial tokens are indeed wild (the first token is usually also a special token). This observation agrees with https://arxiv.org/abs/2403.04652, where they measured the cosine similarity of the first 5 tokens.
Regarding `block_size` and `threshold`: a good `block_size` will strongly depend on the difficulty of the downstream task.
In this paper: https://arxiv.org/abs/2403.17887, the authors removed up to ~25% of the layers of Mistral 7B without a significant MMLU drop (even without continued training).
My observations with pruned models show that their reasoning ability tends to degrade faster than their knowledge.
The `threshold` setting is used only to limit the rows printed in the .md table:
https://github.com/melisa-writer/short-transformers/blob/ccb9f70312502b7bb28404942a3de9f4c2680b4f/short_transformers/utils/utils.py#L16
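The effect of `threshold` can be sketched as a simple row filter over the scored blocks. The scores below are illustrative values (taken from the shape of the tables later in this thread), and the real logic lives in the linked `utils.py`:

```python
# Hypothetical block scores: {block_size: avg distance}, illustrative values only
block_scores = {1: 0.003, 15: 0.048, 16: 0.064, 17: 0.487}

def rows_to_print(scores, threshold):
    # only blocks whose score stays below the threshold make it into the table
    return {size: s for size, s in scores.items() if s < threshold}

rows_to_print(block_scores, threshold=0.1)  # drops block 17 (score 0.487)
```

This is why raising `threshold` only adds rows to the printed table and never changes the scores themselves.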
I believe a good pruning pipeline looks like:
prune block_size=N -> (optionally: retrain) -> evaluate on task T -> prune block_size=N+1 -> ...
stopping whenever the model's evaluation results drop below a certain acceptance level.
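That loop can be sketched as follows; `prune` and `evaluate` are hypothetical placeholders for your own pruning call and downstream-task benchmark, not short-transformers functions:

```python
def find_largest_safe_block(model, prune, evaluate, acceptance_level, max_block_size):
    """Grow the pruned block until task performance drops below acceptance_level.

    `prune(model, block_size)` and `evaluate(model)` are placeholders; an
    optional retraining step on the candidate would go before `evaluate`.
    """
    best_block_size = 0
    for block_size in range(1, max_block_size + 1):
        candidate = prune(model, block_size)
        score = evaluate(candidate)  # e.g. accuracy on task T
        if score < acceptance_level:
            break  # stop once quality drops below the acceptance level
        best_block_size = block_size
    return best_block_size
```

The returned value is the largest `block_size` whose pruned model still passed the evaluation.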
Thank you very much for the quick answer. Now I understand better but I still have some doubts.
I wrote this code and its results are below, but I don't really know which result would be the correct one. Based on this output, how could I identify it?
Colab code:

```python
# search
def get_best_pruning_start_score(results, block_size=1, threshold=0.3):
    # find the optimal block of size `block_size` to prune
    start_layer = get_best_pruning_start(results, block_size=block_size)
    # evaluate all possible block sizes to prune;
    # for each block, returns a score 0-1, which is the sample-averaged
    # distance between the input to and output from the block
    block_score = get_scored_blocks(results, return_md=True, threshold=threshold)
    print(block_score)
    print("start_layer:", start_layer)

# sweep block_size and threshold
for i in range(1, 10):
    threshold = 0.1
    for j in range(1, 10):
        print(f"threshold: {threshold}")
        get_best_pruning_start_score(results, block_size=i, threshold=threshold)
        threshold += 0.1
        print("=" * 30)
```
output:
threshold: 0.1
| Block_size | Removed_layers | Score (avg dist)|
| -------- | ------- | -------- |
| 1 | 5-5 | 0.003|
| 2 | 10-11 | 0.004|
| 3 | 9-11 | 0.005|
| 4 | 9-12 | 0.007|
| 5 | 2-6 | 0.01|
| 6 | 7-12 | 0.012|
| 7 | 5-11 | 0.013|
| 8 | 5-12 | 0.014|
| 9 | 4-12 | 0.015|
| 10 | 3-12 | 0.016|
| 11 | 2-12 | 0.017|
| 12 | 2-13 | 0.019|
| 13 | 1-13 | 0.022|
| 14 | 1-14 | 0.029|
| 15 | 0-14 | 0.048|
| 16 | 0-15 | 0.064|
start_layer: 5
==============================
thresholds 0.2, 0.3, 0.4: identical output to threshold 0.1 (start_layer: 5)
==============================
threshold: 0.5
| Block_size | Removed_layers | Score (avg dist)|
| -------- | ------- | -------- |
| 1 | 5-5 | 0.003|
| 2 | 10-11 | 0.004|
| 3 | 9-11 | 0.005|
| 4 | 9-12 | 0.007|
| 5 | 2-6 | 0.01|
| 6 | 7-12 | 0.012|
| 7 | 5-11 | 0.013|
| 8 | 5-12 | 0.014|
| 9 | 4-12 | 0.015|
| 10 | 3-12 | 0.016|
| 11 | 2-12 | 0.017|
| 12 | 2-13 | 0.019|
| 13 | 1-13 | 0.022|
| 14 | 1-14 | 0.029|
| 15 | 0-14 | 0.048|
| 16 | 0-15 | 0.064|
| 17 | 0-16 | 0.487|
start_layer: 5
==============================
thresholds 0.6, 0.7, 0.8: identical output to threshold 0.5 (start_layer: 5)
I also forgot to ask you: how do you generate this image?
Example:

```python
draw_layers_heatmap(results, "your_metric_name", "layerwise_distances", "output_path.png")
```
Regarding good values for `block_size` and `threshold`, based on the referenced papers:
- (mmlu) `block_size = layer_count * 0.25` should be fine
- (gsm8k) I would experiment with `block_size = layer_count * 0.1`

In any case, the decision about the `block_size` should be evaluated after finetuning on a downstream task, to find the sweet spot between model size and performance.
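As a concrete illustration of those heuristics (the 32-layer count is an assumption for a Mistral-7B-class model, not something fixed by the library):

```python
layer_count = 32  # assumed, e.g. a Mistral-7B-class model

# knowledge-heavy benchmark (mmlu): ~25% of layers
block_size_mmlu = int(layer_count * 0.25)  # 8

# reasoning-heavy benchmark (gsm8k): start smaller, ~10% of layers
block_size_gsm8k = int(layer_count * 0.1)  # 3
```

Treat these as starting points for the prune-and-evaluate loop, not final answers.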
Let me know how it works for you - we can add the notebook to the repository as an example.
I would like to know if I am doing something wrong, or what I am missing, or how to interpret it. I am getting the result in the image below when I add this code:

```python
# choose metric:
# calculate distances based on the angular distance of the i=0 token
model.set_metric(get_angular_distance_ith_token(i=0))
```

[image 1]

But if I remove the `model.set_metric` call (no metric chosen), it gives me this result instead.

I would also like to know how to select good values for `block_size` and `threshold`.

I also made some code in Google Colab: https://colab.research.google.com/drive/1Rdgep1aObqZfBEK4aM8rfp6eStEOoO6w?usp=sharing