voidism / DoLa

Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
https://arxiv.org/abs/2309.03883

Support for LLaMA-2 #4

Closed ayyyq closed 8 months ago

ayyyq commented 1 year ago

Hi, nice work! I would like to know which parts of the code you have modified in transformers-4.28.1, and how I can add support for LLaMA-2.

ecoli-hit commented 1 year ago

I am wondering about that too.

garyfanhku commented 1 year ago

A non-exhaustive list of the changes (a rough sketch of the contrast they implement follows the list):

  1. https://github.com/voidism/DoLa/blob/dc88907406f9744f748f3c779f2353efd5bdc824/transformers-4.28.1/src/transformers/generation/stopping_criteria.py#L40
  2. https://github.com/voidism/DoLa/blob/dc88907406f9744f748f3c779f2353efd5bdc824/transformers-4.28.1/src/transformers/generation/utils.py#L255
  3. https://github.com/voidism/DoLa/blob/dc88907406f9744f748f3c779f2353efd5bdc824/transformers-4.28.1/src/transformers/generation/utils.py#L2456
  4. https://github.com/voidism/DoLa/blob/dc88907406f9744f748f3c779f2353efd5bdc824/transformers-4.28.1/src/transformers/generation/utils.py#L2467 (dola_greedy_decode, which replaces sample())
  5. https://github.com/voidism/DoLa/blob/dc88907406f9744f748f3c779f2353efd5bdc824/transformers-4.28.1/src/transformers/models/llama/modeling_llama.py#L641 (forward)
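Roughly, the contrast that dola_greedy_decode builds on top of the per-layer logits exposed by the modified forward pass looks like the sketch below. This is my paraphrase of the paper's method, not the actual code at those links; the function name, shapes, and the alpha default are illustrative.

  import torch
  import torch.nn.functional as F

  def dola_contrast(final_logits: torch.Tensor,
                    premature_logits: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
      """Contrast the mature (final) layer against a dynamically chosen premature layer.

      final_logits:     (vocab,) logits from the final layer
      premature_logits: (num_candidate_layers, vocab) logits from the candidate early layers
      """
      log_p = F.log_softmax(final_logits, dim=-1)        # mature distribution (log)
      log_q = F.log_softmax(premature_logits, dim=-1)    # premature distributions (log)

      # Pick the premature layer whose distribution has maximal Jensen-Shannon
      # divergence from the mature distribution.
      m = 0.5 * (log_p.exp().unsqueeze(0) + log_q.exp())
      kl_pm = F.kl_div(m.log(), log_p.exp().unsqueeze(0).expand_as(m), reduction="none").sum(-1)
      kl_qm = F.kl_div(m.log(), log_q.exp(), reduction="none").sum(-1)
      chosen = (0.5 * (kl_pm + kl_qm)).argmax()

      # Contrast log-probabilities, keeping only tokens that are plausible under
      # the mature layer (adaptive plausibility constraint with threshold alpha).
      scores = log_p - log_q[chosen]
      plausible = log_p >= log_p.max() + torch.log(torch.tensor(alpha))
      return scores.masked_fill(~plausible, float("-inf"))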
voidism commented 8 months ago

Hi,

Sorry for the late reply. The changed files are the three files mentioned above by @garyfanhku. We are currently trying to include DoLa in the latest Hugging Face transformers package. Please stay tuned!

voidism commented 7 months ago

I have merged DoLa decoding into the new version (4.39.0.dev0) of the transformers package. Install it from here: https://github.com/voidism/transformers-dola

Follow the instructions here for decoding: https://github.com/voidism/transformers-dola/blob/main/docs/source/en/generation_strategies.md#dola-decoding

This should support LLaMA-2 and newer models, including Mistral and Gemma.
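For quick reference, a minimal usage sketch following the linked doc (the checkpoint and prompt are just examples; dola_layers accepts "high", "low", or an explicit list of layer indices, and adding a repetition penalty is an optional assumption here):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint; Mistral/Gemma should work too
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

  inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda")

  # DoLa decoding: contrast the final layer against the higher part of the layers.
  out = model.generate(**inputs, do_sample=False, max_new_tokens=50,
                       dola_layers="high", repetition_penalty=1.2)
  print(tokenizer.batch_decode(out[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True))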

naveenjafer commented 7 months ago

Hi @voidism, thank you for the pull request and the plan to support LLaMA-2. I have set things up and tried the default model provided, and it works as expected. However, when I switch to LLaMA-2-70B (in a multi-GPU setting), I run into this error:

  src/transformers/generation/utils.py", line 2066, in dola_decoding
      softmax_mature_layer[None, :, :] + softmax_premature_layers
  RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7!

Given that my layers are being split across GPUs (which is an expected use case), would you have a suggestion on what I could do to fix this? Thank you

voidism commented 7 months ago

Hi @naveenjafer

It's weird because I didn't have this error for LLaMA-1 65B. I think you can simply move softmax_mature_layer to the same device as softmax_premature_layers. I will try to fix this issue later.
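Something along these lines (a hypothetical helper sketch, not the actual utils.py code) is what I mean:

  import torch

  def contrast_on_same_device(softmax_mature_layer: torch.Tensor,
                              softmax_premature_layers: torch.Tensor) -> torch.Tensor:
      # Workaround sketch: when layers are sharded across GPUs, the final-layer
      # probabilities can live on a different device than the premature-layer
      # probabilities. Move the mature tensor over before the addition that
      # raises the RuntimeError in dola_decoding.
      softmax_mature_layer = softmax_mature_layer.to(softmax_premature_layers.device)
      return softmax_mature_layer[None, :, :] + softmax_premature_layers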

naveenjafer commented 7 months ago

Hey @voidism, thank you for getting back! I assume the LLaMA-1 model would have split its layers across GPU nodes too, given how similar the memory requirements are. I will look into this later today, thank you!

voidism commented 7 months ago

Hi @naveenjafer Yes! For the experiments in my paper, I ran the LLaMA-1 65B model on 8 V100 GPUs and it worked well. I'm not sure whether this issue is due to some difference between LLaMA-1 and LLaMA-2.

wj210 commented 6 months ago

Hi, I have tried installing the above transformers package and following the example in https://github.com/voidism/transformers-dola/blob/main/docs/source/en/generation_strategies.md#dola-decoding for mistralai/Mistral-7B-v0.1, but I don't see any difference in outputs between greedy decoding and setting dola_layers=high at all.

voidism commented 6 months ago

Hi @wj210 Can you also try dola_layers=low for me? I haven't tested Mistral models intensively, but different models may have different properties across their layers, so maybe dola_layers=high does not contrast that much in Mistral for this example. You can also try something like dola_layers=[6,8,10] and see whether the output changes.

I will examine whether there are any issues with Mistral models if none of the dola_layers settings work! Just let me know!

wj210 commented 6 months ago

Hi,

Here's the code I tried:

  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

  model_name = "mistralai/Mistral-7B-Instruct-v0.2"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
  device = 'cuda'
  model.to(device)
  set_seed(42)

  text = "On what date was the Declaration of Independence officially signed?"
  inputs = tokenizer(text, return_tensors="pt").to(device)

  # Vanilla greedy decoding
  vanilla_output = model.generate(**inputs, do_sample=False, max_new_tokens=50)
  vanilla_output = tokenizer.batch_decode(vanilla_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
  print(vanilla_output)

  # DoLa decoding, contrasting a custom set of lower layers (6, 8, 10)
  dola_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers=[6, 8, 10])
  dola_output = tokenizer.batch_decode(dola_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
  print(dola_output)

and I got ["\n\nThe Declaration of Independence was officially signed on August 2, 1776. However, it's important to note that not all the delegates signed it on that date. The signing of the Declaration of Independ"] for both.

I tried both high and low, as well as different explicit layers, and it still yields the same results.

I even tried TinyLlama (TinyLlama/TinyLlama-1.1B-Chat-v1.0), a 1.1B version of LLaMA, and there were still no changes.

Also, as a side note: after installing from https://github.com/voidism/transformers-dola, my transformers version is 4.40.0.dev0. Could this be due to the package version difference?

wj210 commented 6 months ago

OK, it seems that using dola_layers above layer 18 (i.e. two layers above the middle layer) produces different generations. Surprisingly, adding layer 16, which is equivalent to setting the layers to 'high', yields the same result as greedy decoding.

Using lower layers does not work either. Is there any reason for this behavior?

voidism commented 6 months ago

Hi @wj210

Thanks for testing this! As different models have different distributions of knowledge stored in their layers, it is reasonable to adjust the selected layer range for new models.

Also, this "Declaration of Independence" example is picked from TruthfulQA, which mainly contains short-sentence answers with dense factual knowledge. In my experiments, TruthfulQA tends to require contrasting with the higher layers to get improvements. However, for most other tasks with longer reasoning responses, e.g. GSM8K and StrategyQA, contrasting with the lower layers helps more.
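For example (a rough sketch reusing `model`, `tokenizer`, and `device` from the snippet above; the prompts are just placeholders), you can compare the two settings on both kinds of questions:

  # Compare dola_layers="high" vs. "low" on a short factual question and a
  # longer reasoning-style question (assumes model/tokenizer/device from above).
  prompts = {
      "factual": "On what date was the Declaration of Independence officially signed?",
      "reasoning": "Roger has 5 tennis balls and buys 2 more cans of 3 balls each. How many balls does he have now?",
  }
  for kind, text in prompts.items():
      inputs = tokenizer(text, return_tensors="pt").to(device)
      for layers in ("high", "low"):
          out = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers=layers)
          answer = tokenizer.batch_decode(out[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
          print(kind, layers, answer)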

pkulium commented 5 months ago

It seems that the original code from the main branch works with LLaMA-2? I am running LLaMA-2 with transformers-4.28.1 from the main branch.

harisethuram commented 4 months ago

Hey, I am trying to run the code on Mistral, but it isn't supported in transformers-4.28.1. Is there any way to use this code base with Mistral, in particular for the TruthfulQA evaluation (tfqa_mc_eval.py)?