huggingface / nanotron

Minimalistic large language model 3D-parallelism training

Continued Pretraining on Llama 7b. #79

Open wiseyy opened 8 months ago

wiseyy commented 8 months ago

Following up on https://github.com/huggingface/nanotron/issues/78#issue-2147747937:

I converted the weights as you described, but unfortunately I cannot get the same sane outputs from the pretrained Llama weights in nanotron that I get when using the HF API, and I am trying to figure out why. The conversion is straightforward except for nanotron's gate_up and qkv weights, since their layout is not documented. I assume that concatenating the HF weights along dimension 0 in the order (gate, up) and (q, k, v) should reproduce the nanotron weights.
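
Concretely, this is the mapping I assumed; the HF key names are the standard transformers Llama ones, while the fused layout on the nanotron side is my guess rather than something taken from the docs:

```python
# Sketch of the assumed HF -> nanotron fusion for one decoder layer.
# Shapes are for Llama-7B (hidden 4096, intermediate 11008); the concatenation
# order and the absence of a transpose are exactly the assumptions above.
import torch

def fuse_layer(hf_state: dict, i: int):
    p = f"model.layers.{i}."
    qkv = torch.cat(
        [
            hf_state[p + "self_attn.q_proj.weight"],  # (4096, 4096)
            hf_state[p + "self_attn.k_proj.weight"],  # (4096, 4096)
            hf_state[p + "self_attn.v_proj.weight"],  # (4096, 4096)
        ],
        dim=0,
    )  # (12288, 4096) -> nanotron's fused qkv weight
    gate_up = torch.cat(
        [
            hf_state[p + "mlp.gate_proj.weight"],  # (11008, 4096)
            hf_state[p + "mlp.up_proj.weight"],    # (11008, 4096)
        ],
        dim=0,
    )  # (22016, 4096) -> nanotron's fused gate_up weight
    return qkv, gate_up
```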

The possible sources of error I could think of (assuming there is no bug in run_generate.py) are:

  1. Order of qkv matrices in the nanotron format.
  2. Storing the transpose of qkv matrices?
  3. Difference in rotary embeddings as compared to HF API.

Could you please help me out?

Update: The outputs look somewhat sane, but they are still far from acceptable.

[screenshot: nanotron generation output]

Here, for example, the model starts out coherent but then degenerates into gibberish. This leads me to believe that the weight mapping is correct and that the error is somewhere in the generation code.

I also want to point out that the sampler arguments are not being passed to the sampler in the decode_text function in generation/decode.py.

[screenshot]

The above outputs were generated using decode_tokenized(), which does pass them. The GenerationArgs were as follows:

[screenshot: GenerationArgs used]

The output that HF API generates for the same weights and input tokens is as follows:

[screenshot: HF API output]

The quality is a lot better than the text generated by nanotron.

Also, when I prompt the 7b-chat version with a system prompt and user input (the default chat format), the nanotron output breaks down altogether.
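
For reference, by "the default way" I mean the standard Llama-2 chat template; the system and user strings below are placeholders, not the actual prompt from the screenshots:

```python
# Standard Llama-2 chat prompt layout (whitespace follows the reference
# implementation; treat the exact spacing as an approximation if you copy it).
system_prompt = "You are a helpful assistant."  # placeholder
user_message = "Tell me about nanotron."        # placeholder

prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)
```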

This is HF->

[screenshot: HF output]

This is nanotron->

[screenshot: nanotron output]
  1. Can you suggest reasonable values for GenerationArgs that would reproduce similar-quality text generation?
  2. Is the generation code doing what it is supposed to do?
xrsrke commented 8 months ago

@NouamaneTazi do we have a conversion script from transformers to nanotron checkpoint?

wiseyy commented 8 months ago

Any updates? @xrsrke

yardenas commented 8 months ago

@wiseyy I'm facing a similar challenge. Any way we can join forces on this and try to make it work? :)

wiseyy commented 8 months ago

Glad to know I'm not alone :)

I already took the easier route of using Megatron-LLM and Meditron. The training throughput, however, is only about 2/3 of what nanotron provides. Also, you have to convert the weights back to HF format after training and run inference with HF/vLLM.

I hope that helps you.

yardenas commented 8 months ago

@wiseyy Unfortunately I can't go the Megatron route (I'm part of a group and we've already committed to nanotron).

> Conversion is straightforward

Can you help me get started with this? Maybe if I can reproduce your errors, I'll be able to dig deeper into the issue.

SulRash commented 3 months ago

Hey all, I was wondering if there are any conversion scripts yet?

xrsrke commented 3 months ago

Hello, you could use this: https://github.com/huggingface/nanotron/tree/main/examples/llama

wilzh40 commented 3 months ago

I've noticed that the test for consistent logits is commented out in the above conversion scripts: https://github.com/huggingface/nanotron/blob/03d67f2103d5be0dc15ea6022a6cf16d6a633064/examples/llama/tests/test_conversion.py#L223

I'm also running into this problem of differing logits; any potential solutions?
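
To narrow down where they diverge, my plan is to dump the logits from both implementations for the same input ids and compare them directly; a minimal sketch in plain PyTorch (the dump file names are just placeholders):

```python
# Compare logits dumped from transformers and from nanotron for the same
# tokenized input; a large max-abs-diff or argmax disagreement points to a
# conversion/architecture mismatch rather than a sampling issue.
import torch

hf_logits = torch.load("hf_logits.pt")              # dumped from the HF forward pass
nanotron_logits = torch.load("nanotron_logits.pt")  # dumped from nanotron's forward pass

diff = (hf_logits.float() - nanotron_logits.float()).abs()
print("max abs diff :", diff.max().item())
print("mean abs diff:", diff.mean().item())
# If greedy decoding would already diverge, the per-position argmax differs:
print("argmax agreement:",
      (hf_logits.argmax(-1) == nanotron_logits.argmax(-1)).float().mean().item())
```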

Also, given a nanotron checkpoint, how do we continue training on it? The above examples only show how to load a model for inference, not for continued training. DistributedTrainer only takes config files in its __init__ function; how do we use it with an in-memory model (ideally the converted Llama model) and tokenizer?
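
To make the question concrete, here is roughly what I imagine the config-driven route looks like; ExistingCheckpointInit and the other names below are my guesses from skimming the config dataclasses, so treat this as a sketch rather than verified code:

```python
# Sketch only: continue pretraining from a converted checkpoint by pointing
# the model init at it in the config, instead of handing DistributedTrainer
# an in-memory model. Import paths and field names are assumptions.
from nanotron.config import ExistingCheckpointInit, LlamaConfig, ModelArgs  # assumed imports

llama_config = LlamaConfig()  # fill in fields to match the converted 7B checkpoint
model_args = ModelArgs(
    model_config=llama_config,
    init_method=ExistingCheckpointInit(path="checkpoints/llama-7b-nanotron"),  # hypothetical path
)
# ...build the rest of the Config (optimizer, data, parallelism) as in the
# existing examples, dump it to YAML, then something like:
#   trainer = DistributedTrainer("continue_pretraining.yaml")
#   trainer.train(dataloader)
# What I still can't tell is whether there is a supported way to hand the
# trainer an in-memory model/tokenizer instead of going through the config.
```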