songmzhang / DSKD

Repo for Paper "Dual-Space Knowledge Distillation for Large Language Models".

Quantify difference in vocabulary #19

Open srikhetramohanty opened 2 days ago

srikhetramohanty commented 2 days ago
  1. Did you quantify the difference between vocabularies before addressing it in the paper? For example, when we say Mistral 7B has a different vocabulary from Llama or TinyLlama, how large is that difference?
  2. Would comparing similarity metrics between the tokenizer outputs of both models on the same set of texts provide such a baseline?
songmzhang commented 2 days ago
  1. We did not quantitatively measure the difference between the vocabularies of different LLMs, but you can find related analyses in the following two papers (a rough sketch of one way to measure it is shown after this list):
    • Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
    • Bridging the Gap between Different Vocabularies for LLM Ensemble
  2. Yes, I think this would be a promising approach, and we did indeed compare the tokenized outputs of two different LLMs while developing our method.
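
As a minimal sketch of such a vocabulary-level measurement (assuming Hugging Face `transformers` is available; the checkpoint names below are just examples from this thread, not necessarily the pairs used in the paper), one could compute the Jaccard similarity between the two token vocabularies:

```python
from transformers import AutoTokenizer

# Example checkpoints mentioned in this thread; any two HF tokenizers can be swapped in.
tok_a = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tok_b = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Compare the raw token vocabularies (surface forms, ignoring token ids).
vocab_a = set(tok_a.get_vocab())
vocab_b = set(tok_b.get_vocab())
shared = vocab_a & vocab_b

print(f"|V_A| = {len(vocab_a)}, |V_B| = {len(vocab_b)}, shared tokens = {len(shared)}")
print(f"Jaccard similarity of vocabularies: {len(shared) / len(vocab_a | vocab_b):.2%}")
```

Note that token surface forms from different tokenizer families (e.g., SentencePiece "▁" vs. byte-level "Ġ" space markers) may need normalization before comparison, so treat this as a rough baseline rather than a precise metric.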
srikhetramohanty commented 2 days ago

Great, thanks. While I go through the two papers you referenced, what were your observations when you compared the tokenized outputs?

songmzhang commented 2 days ago

I mainly focused on comparing Qwen1.5 and GPT2, where I found that nearly 70% of the tokens overlapped between the two tokenized sequences, and these overlapping tokens could be aligned more easily by our cross-model attention.
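
For illustration, here is a minimal sketch of one way to run this kind of sequence-level comparison with Hugging Face tokenizers. The matching rule (multiset intersection of token surface forms) and the sample text are assumptions for the example, not necessarily the exact alignment used in DSKD:

```python
from collections import Counter
from transformers import AutoTokenizer

# Example checkpoints from the comment above; swap in any tokenizer pair of interest.
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")
student_tok = AutoTokenizer.from_pretrained("gpt2")

text = "Knowledge distillation transfers knowledge from a large teacher model to a smaller student."

# Tokenize the same text with both tokenizers. Both are byte-level BPE here,
# so token strings are roughly comparable; other tokenizer families may need
# normalization of space markers before matching.
teacher_seq = teacher_tok.tokenize(text)
student_seq = student_tok.tokenize(text)

# Multiset intersection: tokens that appear in both tokenized sequences.
shared = sum((Counter(teacher_seq) & Counter(student_seq)).values())
print(f"teacher tokens: {len(teacher_seq)}, student tokens: {len(student_seq)}, shared: {shared}")
print(f"overlap w.r.t. teacher sequence: {shared / len(teacher_seq):.1%}")
print(f"overlap w.r.t. student sequence: {shared / len(student_seq):.1%}")
```

Averaging such overlap ratios over a corpus would give a rough, reproducible estimate of how compatible two tokenizations are for the same text.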