huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Updated LoKr implementation with lycoris support #2133

Open yaswanth19 opened 1 month ago

yaswanth19 commented 1 month ago

For context, refer to #1935.

@BenjaminBossan Here is an early draft PR. Is this what you had envisioned when you suggested a separate implementation of LoKr?

BenjaminBossan commented 1 month ago

Thanks a lot for this PR. I haven't done an in-depth review yet, but from a first skim, this looks good already. A few points:

  1. Let's ensure that we don't add any new required dependencies. The lycoris dependency should be optional at runtime, but it can be added as a dev dependency (see the sketch after this list).
  2. About the name, I wonder if we should go with LycorisLoKr rather than LoKrv2. It's more typing, true, but "v2" is just not very informative. If we stick with "v2", I'd go with LoKrV2Config rather than LoKrConfigv2, etc. I'm also open to other suggestions. WDYT?
  3. Did you already have time to test how the two implementations compare to one another?
  4. Please fix the merge conflict.
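
A minimal sketch of the optional-import guard that point 1 refers to (the helper name is an assumption, not code from this PR):

```python
import importlib.util

def is_lycoris_available() -> bool:
    # Check for the package without importing it at module load time,
    # so PEFT keeps working when lycoris isn't installed.
    return importlib.util.find_spec("lycoris") is not None
```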
yaswanth19 commented 1 month ago

About the name, I wonder if we should go with LycorisLoKr rather than LoKrv2. It's more typing, true, but "v2" is just not very informative. If we stick with "v2", I'd go with LoKrV2Config rather than LoKrConfigv2, etc. I'm also open to other suggestions. WDYT?

@BenjaminBossan I feel we should go with LoKrV2, as this is the latest version and we are providing support to use lycoris. If we rename it to LycorisLoKr, it kind of implies we are implementing an API for lycoris, even though that is only one major part of the functionality.

Did you already have time to test how the two implementations compare to one another?

I am unsure how to test this. The problem is that the lycoris version has too many flags for weight initialization, and I couldn't find a combination that gives a similar weight initialization for both PEFT LoKr and lycoris LoKr. So I tried training a simple MLP on MNIST, but the weights differ, so we get different performance.

Since I haven't implemented all the features in LoKrV2 yet, I haven't performed the sanity check of comparing results between upstream and LoKrV2.

BenjaminBossan commented 1 month ago

I feel we should go with LoKrV2, as this is the latest version and we are providing support to use lycoris. If we rename it to LycorisLoKr, it kind of implies we are implementing an API for lycoris, even though that is only one major part of the functionality.

Well, we can decide on the naming later, but I think LycorisLoKr would still be a more descriptive name.

I am unsure how to test this. The problem is that the lycoris version has too many flags for weight initialization, and I couldn't find a combination that gives a similar weight initialization for both PEFT LoKr and lycoris LoKr. So I tried training a simple MLP on MNIST, but the weights differ, so we get different performance.

Okay, we probably won't manage to get 100% identical results. It would still be worth having a script that compares the performance of the two and checks that they roughly match. It could be one of the existing examples in PEFT or the example you came up with.

yaswanth19 commented 4 weeks ago

@BenjaminBossan Please do a first review of this PR. I have trained a simple MLP on the MNIST dataset under the same config and attached the loss curve plots below to compare the different implementations. [Attached: loss curve plots comparing the implementations.]

A few things from my side:

  • The training loss curve is a bit concerning and I need to investigate it.

BenjaminBossan commented 3 weeks ago

I have trained a simple MLP on the MNIST dataset under the same config

  • The training loss curve is a bit concerning and I need to investigate it.

Let's try to figure this out first. Did you make some progress on investigating this? If you could share your script, I can also take a look.

yaswanth19 commented 3 weeks ago

Let's try to figure this out first. Did you make some progress on investigating this? If you could share your script, I can also take a look.

This is the notebook I am using for testing: https://colab.research.google.com/drive/12T9CZvSAPcVPi5usXtkbY5G1pvNgDjH7?usp=sharing

EDIT: @BenjaminBossan The loss curves are looking much better now. Please have a look in the notebook.

Query: In the PEFT LoKr implementation we have init_weights, which initializes the LoKr w1 matrix with zeros. In the LYCORIS implementation, all w1 matrices are initialized randomly. The reason I'm asking is that when I make the code changes for the use_scalar parameter, I get a delta of 0, because with the PEFT initialization both w1 and w2 would then be zero matrices.
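
A quick sketch of the problem described above (PyTorch, shapes arbitrary): the LoKr delta is the Kronecker product of the two factors, so if both w1 and w2 start as zero matrices, not only is the delta zero, but the gradient of each factor is scaled by the other factor and is therefore zero as well, so training cannot move.

```python
import torch

w1 = torch.zeros(4, 4, requires_grad=True)  # PEFT-style zero init of w1
w2 = torch.zeros(8, 8, requires_grad=True)  # with use_scalar, w2 is zero too

delta = torch.kron(w1, w2)  # the (32, 32) LoKr weight update
delta.sum().backward()

print(delta.abs().max())    # tensor(0.) -> no update is applied
print(w1.grad.abs().max())  # tensor(0.) -> d(delta)/d(w1) is scaled by w2
print(w2.grad.abs().max())  # tensor(0.) -> d(delta)/d(w2) is scaled by w1
```

This is why identity-style initialization typically zero-initializes only one factor: the delta still starts at zero, but gradients can flow through the randomly initialized factor.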

yaswanth19 commented 3 weeks ago

Could you expand on that a little, just so I'm clear on what is changed where?

BenjaminBossan commented 3 weeks ago

Thanks for explaining. So this is really an issue in the PEFT code, and not something that is resolved by using lycoris, right? In that case, we could make the same fixes to the existing LoKr implementation too (in a separate PR, of course).

Did you check my notebook about reproducibility? I think we should adjust the initialization of parameters to be in line with lycoris and then hopefully see the same results.

yaswanth19 commented 3 weeks ago

So this is really an issue in the PEFT code, and not something that is resolved by using lycoris, right? In that case, we could make the same fixes to the existing LoKr implementation too (in a separate PR, of course).

Yes @BenjaminBossan, this could be done. But now I am a bit hesitant about a separate implementation. We mainly use three functional APIs: weight_gen, which initializes the weights and could be optimized further in lycoris in the future, and make_kron and diff_weight, which will mostly not change since they do the fundamental work. I couldn't work out the bypass_forward_diff method precisely, but I think it mostly calculates w * delta.
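
For reference, a conceptual sketch of the math these functional pieces revolve around (these are not the actual lycoris signatures, just an illustration of the Kronecker-product core):

```python
import torch

def make_kron_sketch(w1: torch.Tensor, w2: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # The LoKr delta: a scaled Kronecker product of the two factors,
    # (a, b) kron (c, d) -> (a*c, b*d).
    return torch.kron(w1, w2) * scale

def diff_weight_sketch(w1: torch.Tensor, w2: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # diff_weight: the additive update merged into the frozen base weight.
    return make_kron_sketch(w1, w2, scale)

def bypass_forward_diff_sketch(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor,
                               scale: float = 1.0) -> torch.Tensor:
    # One plausible reading of bypass_forward_diff for a linear layer:
    # compute the adapter's extra output x @ delta^T on the fly,
    # without merging delta into the base weight.
    return x @ make_kron_sketch(w1, w2, scale).T
```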

Most of the optimizations/changes are done at the module level, e.g. weight_decompose, rs_lora, unbalanced_factorization. Even though some of these are minor, we can't make use of them with the current functional-API approach. WDYT?

In this case, wouldn't it be better to rewrite the original LoKr and periodically (every year or two) update the component? For the use_upstream feature to be effective, the author would need full feature parity between the functional and module APIs, which is currently not the case.

Did you check my notebook about reproducibility? I think we should adjust the initialization of parameters to be in line with lycoris and then hopefully see the same results.

Yes, I have made the required changes and now the initialization is similar to lycoris, but not the same random weights 😿 and I am having a hard time reproducing the same weights both locally and on Colab, even after setting seeds extensively.

Colab link: https://colab.research.google.com/drive/1YxlaT9G_jotkoTrEMQIQwNX87lehcMHu?usp=sharing

BenjaminBossan commented 3 weeks ago

Thanks a lot @yaswanth19 for sharing your thoughts on this. When reviewing the PR, I was a bit astonished at how little of the lycoris package we're actually using.

As to the question of whether it's worth it to have this separate LoKr implementation or if we should rather fix the existing one: Depending on lycoris has the advantage that each time there is a fix or improvement there, we automatically benefit from it, which is nice. However, checking the lycoris/functional/lokr.py file, there was no change in the past 4 months (besides the docs), so I'm not sure how it plays out in practice. And as mentioned, we're not using a lot of it in the first place.

The disadvantage in my eyes is that we create more confusion for users, have a higher complexity in our code base, and to avoid breaking backwards compatibility, we would need to maintain the existing LoKr implementation anyway.

If your work on this PR has helped you identify the issues with the existing LoKr implementation, I would be super happy if you could work on fixing those. If you do that, there is even less necessity for the alternative implementation. But it could be that I'm missing some arguments, so if @bghira or @sayakpaul have different opinions, please let me know.

In case we decide to go with fixing the existing implementation instead of adding the new one, I would propose that we expand the test notebook, e.g. into a standalone script. Then we can use it to compare the results from the lycoris implementation to PEFT and ensure that they're close. This could be run daily or weekly on CI. Of course, we would need to extend the script to cover a few more cases, like conv layers and some of the edge cases discussed in the initial issue.
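
A hedged sketch of what such a standalone comparison script could look like (the model, hyperparameters, and tolerance are placeholders, and the entry point for the lycoris-backed implementation is left abstract since its name is still undecided):

```python
import torch
import torch.nn as nn
from peft import LoKrConfig, get_peft_model

def make_mlp(seed: int = 0) -> nn.Module:
    # Same toy MLP for both runs, so only the adapter implementation differs.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

def final_train_loss(model: nn.Module, loader, steps: int = 500) -> float:
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    batches = iter(loader)
    for _ in range(steps):
        x, y = next(batches)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return sum(losses[-50:]) / 50  # smoothed loss over the last 50 steps

# loader = ...  # an MNIST DataLoader, as in the notebook above
# cfg = LoKrConfig(target_modules=["1", "3"])  # existing PEFT LoKr
# loss_peft = final_train_loss(get_peft_model(make_mlp(), cfg), loader)
# loss_lyco = ...  # same, using the lycoris-backed implementation from this PR
# assert abs(loss_peft - loss_lyco) < 0.05, "implementations diverged"
```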

Yes, I have made the required changes and now the initialization is similar to lycoris, but not the same random weights 😿 and I am having a hard time reproducing the same weights both locally and on Colab, even after setting seeds extensively.

Don't worry too much about that. It would be exceedingly hard to get the exact same outcome using PEFT and lycoris, because we would need to ensure that each call involving randomness is executed exactly the same and in the same order.
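
A tiny illustration of the ordering point: even with identical seeds, consuming the global RNG in a different order yields different "reproducible" values.

```python
import torch

torch.manual_seed(0)
a1 = torch.randn(2)  # draws values 1-2 of the stream
b1 = torch.randn(3)

torch.manual_seed(0)
b2 = torch.randn(3)  # now the first draw consumes values 1-3
a2 = torch.randn(2)

print(torch.equal(a1, a2))  # False: same seed, different call order
```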

yaswanth19 commented 3 weeks ago

@bghira @sayakpaul A gentle ping for your thoughts on the above discussion.

sayakpaul commented 3 weeks ago

If your work on this PR has helped you identify the issues with the existing LoKr implementation, I would be super happy if you could work on fixing those. If you do that, there is even less necessity for the alternative implementation.

I am not opposed to the idea, but I guess it depends on the amount of work and the effectiveness of the implementation. From what I understand, we're already having to touch many lines of code in the existing implementation, so we might as well fix things at that level instead of maintaining a separate implementation?