huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

nanotron <-> HF conversion for Llama, resolve #124 #125

Closed. yardenas closed this 6 months ago

yardenas commented 7 months ago

A similar idea to the one in https://github.com/huggingface/nanotron/pull/103, but for a Llama model.

I'd be happy to implement this.

I need it for another project that uses nanotron and was wondering whether it's something you'd want in this repository. If so, I'll start working on an implementation here.

Aside from the contribution guide, are there any other guidelines for this task? For example:

  • Where should the conversion script be located?
  • Any gotchas I should be aware of?
  • The best way to validate this would be to write a test showing that the converted models return the same results as the non-converted ones (a rough sketch follows below). Do you think a rather small model (one I can quickly iterate on while running locally) would be sufficient?

Thanks!
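For concreteness, here is a rough sketch of the kind of round-trip test described above (compare the original HF checkpoint against the one obtained by converting HF -> nanotron -> HF); the paths, prompt, and tolerance are placeholders, not values from this PR:

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

ORIGINAL_PATH = "path/to/tiny-llama"           # reference HF checkpoint (small, for fast local iteration)
ROUNDTRIP_PATH = "path/to/roundtripped-llama"  # output of the HF -> nanotron -> HF conversion


def test_roundtrip_logits_match():
    tokenizer = AutoTokenizer.from_pretrained(ORIGINAL_PATH)
    inputs = tokenizer("The quick brown fox", return_tensors="pt")

    reference = LlamaForCausalLM.from_pretrained(ORIGINAL_PATH).eval()
    converted = LlamaForCausalLM.from_pretrained(ROUNDTRIP_PATH).eval()

    with torch.no_grad():
        ref_logits = reference(**inputs).logits
        new_logits = converted(**inputs).logits

    # The converted checkpoint should reproduce the reference logits
    # within a small numerical tolerance.
    torch.testing.assert_close(new_logits, ref_logits, atol=2e-2, rtol=0)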

Lauler commented 7 months ago

As an outsider, I'm also highly interested in seeing this conversion available in the library before I commit to training anything with Nanotron. Just here to cheer you on!

xrsrke commented 7 months ago

> I need it for another project that uses nanotron and was wondering whether it's something you'd want in this repository

Yes.

> Do you think a rather small model (one I can quickly iterate on while running locally) would be sufficient?

Yup. That would be great!!

> Where should the conversion script be located?

/tools? Feel free to place it wherever you like... we could change it later on.

xrsrke commented 7 months ago

Looks very nice. Please ping me once it's ready!!

yardenas commented 7 months ago

@xrsrke I think we're getting there. @AleHD made significant progress getting the tests to actually pass -- we currently get ~0.02 absolute error on the logits in both directions.

We copied examples/doremi/tests/utils.py to the llama folder and made some modifications. It contains testing utils for llama, so I wouldn't put it inside the nanotron library. That said, it's not very DRY, so I'm happy to make changes if you think there's a better way.

What do you think?
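For readers skimming the thread, a small illustration of what the quoted absolute-error figure refers to; this helper is illustrative, not part of the PR's test utilities:

import torch


def max_abs_logits_error(reference_logits: torch.Tensor, converted_logits: torch.Tensor) -> float:
    """Largest elementwise absolute difference between the two models' logits."""
    return (reference_logits.float() - converted_logits.float()).abs().max().item()

With bfloat16 weights, differences of this magnitude can come from reordered computations alone, so a tolerance-based comparison (rather than exact equality) is the appropriate check.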

xrsrke commented 7 months ago

@yardenas very cool. feel free to ping me if you need any pointers 🤗

xrsrke commented 7 months ago

> We copied examples/doremi/tests/utils.py to the llama folder and made some modifications. It contains testing utils for llama, so I wouldn't put it inside the nanotron library. That said, it's not very DRY, so I'm happy to make changes if you think there's a better way.

The test looks good 🙌

yardenas commented 7 months ago

@xrsrke -- seems to be ready on our side :)

3outeille commented 7 months ago

@yardenas Given a pretrained HF model (huggyllama/llama-7b), I tested the following:

  • Convert HF to nanotron
  • Convert back Nanotron to HF
  • Run generate in HF (through check_converted_model_generation() in convert_nanotron_to_hf.py)
    • [x] with cache
    • [ ] no cache (<== doesn't yield proper results)
  • Run generate in Nanotron
    • [ ] with cache
    • [x] no cache

yardenas commented 7 months ago

> @yardenas Given a pretrained HF model (huggyllama/llama-7b), I tested the following:
>
>   • Convert HF to nanotron
>   • Convert back Nanotron to HF
>   • Run generate in HF (through check_converted_model_generation() in convert_nanotron_to_hf.py)
>     • [x] with cache
>     • [ ] no cache (<== doesn't yield proper results)
>   • Run generate in Nanotron
>     • [ ] with cache
>     • [x] no cache

Got it. Any ideas why it could fail for the no-cache option? I'll take a look into it.

3outeille commented 7 months ago

> > @yardenas Given a pretrained HF model (huggyllama/llama-7b), I tested the following:
> >
> >   • Convert HF to nanotron
> >   • Convert back Nanotron to HF
> >   • Run generate in HF (through check_converted_model_generation() in convert_nanotron_to_hf.py)
> >     • [x] with cache
> >     • [ ] no cache (<== doesn't yield proper results)
> >   • Run generate in Nanotron
> >     • [ ] with cache
> >     • [x] no cache
>
> Got it. Any ideas why it could fail for the no-cache option? I'll take a look into it.

Mhmm, if I test it this way, the generation is the same:

from pathlib import Path

from transformers import AutoTokenizer, LlamaForCausalLM


def check_converted_model_generation(save_path: Path):
    """Loads a huggingface model and tokenizer from `save_path` and
    performs a dummy text generation."""

    # TEST_PROMPT is the module-level prompt string used by the conversion script.
    tokenizer = AutoTokenizer.from_pretrained(save_path)
    input_ids = tokenizer(TEST_PROMPT, return_tensors="pt")["input_ids"].cuda()
    print("Inputs:", tokenizer.batch_decode(input_ids))

    # Default generation: KV cache enabled.
    model = LlamaForCausalLM.from_pretrained(save_path).cuda().bfloat16()
    out = model.generate(input_ids, max_new_tokens=100)
    print("Generation (converted): ", tokenizer.batch_decode(out))

    # Same checkpoint, but with the KV cache disabled in the model config.
    model_nocache = LlamaForCausalLM.from_pretrained(save_path).cuda().bfloat16()
    model_nocache.config.use_cache = False
    out_nocache = model_nocache.generate(input_ids, max_new_tokens=100)
    print("Generation (converted, no cache): ", tokenizer.batch_decode(out_nocache))

However, if after the Convert back Nanotron to HF step I manually set use_cache=False in ckpt/model_config.json before the Run generate in HF step (through check_converted_model_generation() in convert_nanotron_to_hf.py), then the generation is not good (maybe I am not supposed to do that).
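For reference, the manual edit described above could be scripted as below; the path is the one mentioned in the comment, and whether flipping the flag this way is supported is exactly the open question:

import json
from pathlib import Path

config_path = Path("ckpt/model_config.json")  # config written by the conversion step
config = json.loads(config_path.read_text())
config["use_cache"] = False                   # disable the KV cache before reloading the model
config_path.write_text(json.dumps(config, indent=2))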

3outeille commented 7 months ago

@yardenas Also, have you tried training a Llama in Nanotron with DP=PP=1 & TP=2 and then running convert_nanotron_to_hf.py?
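For background, the main subtlety when converting a TP=2 checkpoint is merging each tensor-parallel shard along the dimension it was split on. The function below is a minimal illustration of that Megatron-style merge under assumed layer roles; it is not nanotron's actual conversion code:

import torch


def merge_tp_shards(shards: list[torch.Tensor], parallel_mode: str) -> torch.Tensor:
    """Concatenate per-rank weight shards back into one full weight matrix."""
    if parallel_mode == "column":  # output features split across ranks (e.g. q/k/v and MLP up/gate projections)
        return torch.cat(shards, dim=0)
    if parallel_mode == "row":     # input features split across ranks (e.g. attention output and MLP down projections)
        return torch.cat(shards, dim=1)
    raise ValueError(f"unknown parallel mode: {parallel_mode}")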

yardenas commented 7 months ago

@3outeille I'll add a test for this case now.

yardenas commented 6 months ago

@3outeille we added a fix for the tp=2 case. :innocent:

yardenas commented 6 months ago

@3outeille any updates? :)

yardenas commented 6 months ago

@xrsrke and @3outeille I just committed the changes requested by @xrsrke. Anything else? :)