linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training
https://arxiv.org/pdf/2410.10989
BSD 2-Clause "Simplified" License
3.48k stars 208 forks source link

byron's issue tracker #355

Closed ByronHsu closed 1 week ago

ByronHsu commented 2 weeks ago

🐛 Describe the bug

https://github.com/linkedin/Liger-Kernel/pull/354

  1. revert patching causes flce not taking effect (comment out revert patching for now, and only test float32)
  2. qwen2 vl flce is broken. we should fix later
  3. we should provide a real "on-instance" patch that does not use any monkey patch. now the on-instance patch still relies on monkey patch

Reproduce

No response

Versions

na

tyler-romero commented 2 weeks ago

Can you elaborate on

qwen2 vl flce is broken. we should fix later

What specifically is broken?

ByronHsu commented 2 weeks ago

Sorry the note is quite scrappy. If you uncomment https://github.com/linkedin/Liger-Kernel/blob/85d34efbd423cd97d3e97525af419193fbb07354/test/convergence/test_mini_models.py#L525, and run the conv test. You will see the errors. The root cause is that transformers keeps updating qwen2_vl forward and the existing one is outdated now. Would you help fixing? Thanks in advance

ByronHsu commented 2 weeks ago

Also do you know why qwen2 vl and mllama have to exist in both LM and multi modal convergence test?

tyler-romero commented 2 weeks ago

Yeah I'll look into fixing!

tyler-romero commented 2 weeks ago

And my knowledge is a bit out of date here, but there used to be the two different convergence test files for LLMs: the one that tested with FLCE and the one that tested with just CE.

So in order to get complete coverage of the patches, I implemented the multimodal convergence tests to mirror the ones that tested with just CE, and also added these VLMs to the convergence test file for FLCE (but only tested with text inputs there).

It looks like we've changed to only testing FLCE for the plain LLMs. So in order to mirror that change, the VLMs could be removed from the LM convergence test file and the multimodal test file should be changed to test FLCE.

Do you remember why the convergence tests covering CE were removed?

ByronHsu commented 2 weeks ago

@tyler-romero i removed the CE conv test because

  1. To accelerate the CI to save GPU cost
  2. CE and FLCE conv test are very overlapped and if we have FLCE conv test + ensuring CE unit test correct, then CE conv is not necessary
  3. I observed the majority of users are using FLCE

By the way, i have some thoughts for improvement of monkey patch + conv test. We can schedule a meeting to discuss offline. I will ping you on discord.