logix-project / logix

AI Logging for Interpretability and Explainability 🔬

Adaptive LoRA #66

Open hage1005 opened 8 months ago

hage1005 commented 8 months ago

Add an `adaptive_threshold` option to dynamically determine the rank. Experiment results with MNIST are attached below. To use this feature, set `compression_ratio_by_covariance` or `compression_ratio_by_memory` in `config.yaml`:

lora:
  init: pca
  compression_ratio_by_covariance: 0.8

This determines the rank needed for PCA to explain 80% of the covariance. Alternatively:

lora:
  init: pca
  compression_ratio_by_memory: 0.8

This determines the rank that compresses the gradient memory to 80% of its original size.
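For illustration, here is a minimal sketch of how the covariance-based rule could map a layer's singular values to a rank. `rank_by_covariance` is a hypothetical helper written for this comment, not the actual logix implementation:

```python
import torch

def rank_by_covariance(singular_values: torch.Tensor, ratio: float) -> int:
    """Smallest rank whose top directions explain `ratio` of the covariance.

    Assumes `singular_values` is sorted in descending order, as returned
    by torch.linalg.svd.
    """
    # The covariance explained by each direction is proportional to the
    # squared singular value.
    energy = singular_values.pow(2)
    cumulative = torch.cumsum(energy, dim=0) / energy.sum()
    # First index where the cumulative explained covariance reaches `ratio`.
    idx = int(torch.searchsorted(cumulative, ratio).item())
    return min(idx + 1, singular_values.numel())
```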

[Three attached images: MNIST experiment results.]
sangkeun00 commented 8 months ago

Thanks for this feature and your analysis! This is a great initial effort. However, there are quite a few things to figure out before I can merge this PR; for example, the points quoted in the reply below.

hage1005 commented 8 months ago
> • Budget-based vs coverage-based: In your implementation, you fix the covariance coverage and determine the rank for each layer accordingly.

Yeah, this indeed might lead to a low compression ratio, though for the MNIST case the compression ratio seems plausible. But regarding "fix the number of parameters": how do we determine this number from the percentage covariance threshold? Should we put all the singular values across layers together and sort them? (See the sketch below.)
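One possible reading of that question, sketched under the simplifying assumption that the budget is a fraction of the total number of directions rather than of actual memory; `ranks_from_global_budget` is hypothetical and not part of this PR:

```python
import torch

def ranks_from_global_budget(layer_svs: dict[str, torch.Tensor],
                             budget: float) -> dict[str, int]:
    """Pool singular values across layers and keep the globally largest ones.

    `budget` is interpreted as the fraction of all directions to keep;
    a real budget would be stated in memory and weight each layer by its
    dimensions.
    """
    # Pool (layer name, squared singular value) pairs across all layers.
    pooled = [(name, sv.item() ** 2)
              for name, svs in layer_svs.items() for sv in svs]
    pooled.sort(key=lambda pair: pair[1], reverse=True)

    total = sum(svs.numel() for svs in layer_svs.values())
    keep = int(budget * total)

    # Each kept direction raises the rank of the layer it came from.
    ranks = {name: 0 for name in layer_svs}
    for name, _ in pooled[:keep]:
        ranks[name] += 1
    return ranks
```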

> • Compatibility with analog.initialize_from_log(): If the rank is determined adaptively for each layer, we have to save this rank structure somewhere so that we can recover the exact LoRA structure when initializing from the log.

Thanks for catching this!

sangkeun00 commented 8 months ago

I don't have a concrete answer to the first question at the moment, and I believe this is largely a research question (which is exciting). There is a sizable literature on communication-efficient distributed training, which also performs gradient compression and often applies different compression ratios across layers. You can probably review it, try or develop new ideas, and find the best-working one. Once we have this, we can merge this PR!

sangkeun00 commented 8 months ago

Also, we can think about using different ranks for the forward and backward passes. From the implementation perspective, you could allow users to pass a tuple of (rank_fwd, rank_bwd) for this. If a user passes an integer value, we can use it to set both rank_fwd and rank_bwd. This is somewhat similar to setting kernel_size or stride in nn.Conv2d in PyTorch; see the sketch below.
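A minimal sketch of that int-or-tuple convention, mirroring how PyTorch broadcasts an integer `kernel_size` in `nn.Conv2d`; `normalize_rank` is a hypothetical helper, not an existing logix API:

```python
from typing import Tuple, Union

def normalize_rank(rank: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    """Accept a single rank or a (rank_fwd, rank_bwd) pair.

    An integer is broadcast to both the forward and backward ranks,
    like an int kernel_size in nn.Conv2d.
    """
    if isinstance(rank, int):
        return (rank, rank)
    rank_fwd, rank_bwd = rank
    return (rank_fwd, rank_bwd)
```

So `normalize_rank(64)` would give `(64, 64)`, while `normalize_rank((64, 16))` passes the pair through unchanged.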

hage1005 commented 8 months ago

Instead of explicitly saving the rank information, I recover it from the shape of the LoRA weight matrices. Let me know if this seems okay!
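A sketch of that idea, assuming the LoRA projection matrices are stored with the rank as their trailing dimension; the `"lora"` name filter is an assumption about the checkpoint layout, not the actual logix format:

```python
import torch

def recover_lora_ranks(state_dict: dict[str, torch.Tensor]) -> dict[str, int]:
    """Infer per-layer LoRA ranks from saved weight shapes.

    If a projection matrix is stored as (dim, rank), the rank can be
    read back from the tensor shape with no extra metadata.
    """
    return {
        name: weight.shape[-1]
        for name, weight in state_dict.items()
        if "lora" in name  # assumed naming convention for LoRA factors
    }
```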