melisa-writer / short-transformers

Prune transformer layers

Thank you very much for this library and I have questions #2

NickyDark1 opened this issue 5 months ago

NickyDark1 commented 5 months ago

I would like to know whether I am doing something wrong, or how I should interpret this. I get the result shown in the image below when I add this code:

```python
# choose metric:
# calculate distances based on the angular distance of the i=0 token
model.set_metric(get_angular_distance_ith_token(i=0))
```

[image 1: distance heatmap with the i=0 metric]

Without choosing a metric: if I remove the `model.set_metric` call, it gives me this result instead:

[image 2: distance heatmap with the default metric]

Selecting good values: I would also like to know how to select good values for `block_size` and `threshold`.

I also made some code in Google Colab: https://colab.research.google.com/drive/1Rdgep1aObqZfBEK4aM8rfp6eStEOoO6w?usp=sharing

melisa-writer commented 5 months ago
1. Metric choice:

Yes, this is the expected graph for the first token.

```python
model.set_metric(get_angular_distance_ith_token(i=0))
```

Here, `i` is the token position in the sequence. If you don't set any metric, the default metric is here: https://github.com/melisa-writer/short-transformers/blob/ccb9f70312502b7bb28404942a3de9f4c2680b4f/short_transformers/short_transformer.py#L23

which is equivalent to `get_angular_distance_ith_token(i=-1)`.
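
For intuition, the angular distance between two hidden states is the arccosine of their cosine similarity, normalized to [0, 1]. Here is a minimal NumPy sketch of that formula (my own rendering of the standard definition, not the library's internal code):

```python
import numpy as np

def angular_distance(x: np.ndarray, y: np.ndarray) -> float:
    # arccos of cosine similarity, normalized by pi -> a distance in [0, 1]
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi)
```

Applied per layer block, `x` would be the hidden state of token `i` at the block input and `y` the hidden state at the block output.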

The statistics for the initial tokens are indeed wild (the first token is usually also a special token). This observation agrees with https://arxiv.org/abs/2403.04652, where they measured the cosine similarity of the first 5 tokens:

[image: cosine-similarity figure from the paper]

2. Good values for `block_size` and `threshold`:

A good `block_size` will strongly depend on the difficulty of the downstream task. In this paper: https://arxiv.org/abs/2403.17887, the authors removed up to ~25% of the layers of Mistral 7B without a significant MMLU drop (even without continued training).

[image: MMLU score vs. fraction of layers removed, from the paper]

My observations with pruned models show that their reasoning ability tends to degrade faster than their knowledge.

The `threshold` setting is used only to limit the rows printed in the .md table: https://github.com/melisa-writer/short-transformers/blob/ccb9f70312502b7bb28404942a3de9f4c2680b4f/short_transformers/utils/utils.py#L16

I believe a good pruning pipeline looks like: prune with block_size=N -> (optionally: retrain) -> evaluate on task T -> prune with block_size=N+1 -> ..., stopping whenever the model's evaluation results drop below a certain acceptance level. A sketch of this loop follows.
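
A minimal sketch of that loop. `prune_block` and `evaluate_on_task` are hypothetical placeholders for your own pruning call and evaluation harness (neither is part of this library); `get_best_pruning_start` is the function used later in this thread, and the import path is assumed from the utils module linked above:

```python
from short_transformers.utils import get_best_pruning_start

# `model` and `results` (the layerwise distance matrix) are assumed
# to exist already, as in the Colab notebook below.
ACCEPTANCE = 0.95  # assumption: accept while >= 95% of the baseline score

baseline = evaluate_on_task(model)
best = model
block_size = 1
while True:
    # each iteration prunes the *original* model with a progressively larger block
    start = get_best_pruning_start(results, block_size=block_size)
    candidate = prune_block(model, start_layer=start, n_layers=block_size)
    # optionally: retrain / finetune `candidate` here
    if evaluate_on_task(candidate) < ACCEPTANCE * baseline:
        break  # the last acceptable model remains in `best`
    best = candidate
    block_size += 1
```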

NickyDark1 commented 5 months ago

Thank you very much for the quick answer. Now I understand better but I still have some doubts.

I wrote the code below and its results follow, but I really wouldn't know which value would be the correct one. Based on this output, how could I identify it?

Colab code:

```python
from short_transformers.utils import get_best_pruning_start, get_scored_blocks

# search
def get_best_pruning_start_score(results, block_size=1, threshold=0.3):
    # find the optimal block of size `block_size` to prune
    start_layer = get_best_pruning_start(results, block_size=block_size)

    # evaluate all possible block sizes to prune;
    # for each block, returns a score in [0, 1]: the distance between
    # the input to and output from the block, averaged over samples
    block_score = get_scored_blocks(results, return_md=True, threshold=threshold)
    print(block_score)
    print("start_layer:", start_layer)

# sweep block_size = 1..9 and threshold = 0.1..0.9
# (`results` is computed earlier in the notebook)
for i in range(1, 10):
    threshold = 0.1
    for j in range(1, 10):
        print(f"threshold: {threshold}")
        get_best_pruning_start_score(results, block_size=i, threshold=threshold)
        threshold += 0.1  # float accumulation explains "0.30000000000000004" below
        print("=" * 30)
```

Output:

threshold: 0.1
| Block_size | Removed_layers | Score (avg dist)|
| -------- | ------- | -------- |
| 1 | 5-5 | 0.003|
| 2 | 10-11 | 0.004|
| 3 | 9-11 | 0.005|
| 4 | 9-12 | 0.007|
| 5 | 2-6 | 0.01|
| 6 | 7-12 | 0.012|
| 7 | 5-11 | 0.013|
| 8 | 5-12 | 0.014|
| 9 | 4-12 | 0.015|
| 10 | 3-12 | 0.016|
| 11 | 2-12 | 0.017|
| 12 | 2-13 | 0.019|
| 13 | 1-13 | 0.022|
| 14 | 1-14 | 0.029|
| 15 | 0-14 | 0.048|
| 16 | 0-15 | 0.064|

start_layer: 5
==============================
[thresholds 0.2, 0.3 (printed as 0.30000000000000004 due to float accumulation), and 0.4 print the identical table and start_layer: 5]
==============================
threshold: 0.5
| Block_size | Removed_layers | Score (avg dist)|
| -------- | ------- | -------- |
| 1 | 5-5 | 0.003|
| 2 | 10-11 | 0.004|
| 3 | 9-11 | 0.005|
| 4 | 9-12 | 0.007|
| 5 | 2-6 | 0.01|
| 6 | 7-12 | 0.012|
| 7 | 5-11 | 0.013|
| 8 | 5-12 | 0.014|
| 9 | 4-12 | 0.015|
| 10 | 3-12 | 0.016|
| 11 | 2-12 | 0.017|
| 12 | 2-13 | 0.019|
| 13 | 1-13 | 0.022|
| 14 | 1-14 | 0.029|
| 15 | 0-14 | 0.048|
| 16 | 0-15 | 0.064|
| 17 | 0-16 | 0.487|

start_layer: 5
==============================
[thresholds 0.6, 0.7, and 0.8 (printed as 0.7999999999999999) repeat the same 17-row table and start_layer: 5; the pasted output is truncated here]

NickyDark1 commented 5 months ago

I also forgot to ask: how do you generate this image?

[image: layerwise distances heatmap]

melisa-writer commented 5 months ago
1. Here is the function generating those images: https://github.com/melisa-writer/short-transformers/blob/9ae33fc160cb88a89b316f93a01ad8e89408af53/short_transformers/utils/plot.py#L37

Example:

```python
draw_layers_heatmap(results, "your_metric_name", "layerwise_distances", "output_path.png")
```

2. Best pruning `block_size` and `threshold`:

Based on the referenced papers, pruning up to ~25% of layers can work. In any case, the decision about the `block_size` should be evaluated after finetuning on a downstream task, to find the sweet spot between model size and performance.

Let me know how it works for you - we can add the notebook to the repository as an example.