Open hage1005 opened 8 months ago
Thanks for this feature and your analysis! This is a great initial effort. However, there are quite a few things to be figured out before I can merge this PR. For example,

- Budget-based vs coverage-based: In your implementation, you fix the covariance coverage and determine the rank for each layer accordingly.
- Compatibility with `analog.initialize_from_log()`: If the rank is determined adaptively for each layer, we have to save this rank structure somewhere so that we can recover the exact LoRA structure when initializing from log.
> - Budget-based vs coverage-based: In your implementation, you fix the covariance coverage and determine the rank for each layer accordingly.
Yeah, this indeed might lead to a low compression ratio. For the MNIST case the compression ratio seems plausible. But regarding "fix the number of parameters": how do we determine this number from the percentage covariance threshold? Should we put all singular values across layers together and sort them?
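One possible answer to the budget question, as a sketch (the function name and the per-component budget unit are my assumptions, not this PR's API): pool the eigenvalues from every layer's covariance, sort them globally, and keep the largest ones until the total budget is exhausted; each layer's rank is then the number of components it kept. A real implementation would likely weight each component by its per-layer parameter cost rather than counting components uniformly.

```python
import numpy as np

def allocate_ranks_by_budget(eigvals_per_layer, budget):
    """Hypothetical budget-based allocation: pool eigenvalues across
    layers, sort them globally, and keep the largest `budget` of them.

    eigvals_per_layer: dict mapping layer name -> 1-D array of eigenvalues.
    Returns: dict mapping layer name -> rank (number of kept components).
    """
    # Tag each eigenvalue with its layer so we can count per-layer picks.
    pooled = [
        (val, name)
        for name, vals in eigvals_per_layer.items()
        for val in np.asarray(vals)
    ]
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    ranks = {name: 0 for name in eigvals_per_layer}
    for _, name in pooled[:budget]:
        ranks[name] += 1
    return ranks

# Toy example: layers with different spectra compete for 3 components,
# so fc1 (two large eigenvalues) gets rank 2 and fc2 gets rank 1.
ranks = allocate_ranks_by_budget(
    {"fc1": [9.0, 4.0, 0.1], "fc2": [5.0, 0.5, 0.2]},
    budget=3,
)
```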
> - Compatibility with `analog.initialize_from_log()`: If the rank is determined adaptively for each layer, we have to save this rank structure somewhere so that we can recover the exact LoRA structure when initializing from log.
Thanks for catching this!
I don't have a concrete answer to the first question at the moment, and I believe this is largely a research question (which is exciting). Much of the literature on communication-efficient distributed training, which also performs gradient compression, applies different compression ratios across layers. You can probably review that work, try or develop new ideas, and find the best-working one. Once we have this, we can merge this PR!
Also, we can think about using different ranks for forward and backward. From the implementation perspective, you may allow users to pass a tuple of `(rank_fwd, rank_bwd)` for this. If a user passes an integer value, we can use this value to set both `rank_fwd` and `rank_bwd`. This is somewhat similar to setting `kernel_size` or `stride` in `nn.Conv` in PyTorch.
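The int-or-tuple convention could look like the following sketch (the helper name is mine, not the PR's API), mirroring how PyTorch expands an int `kernel_size` or `stride` into a pair for its conv modules:

```python
def _resolve_ranks(rank):
    """Hypothetical helper: accept either a single int or a
    (rank_fwd, rank_bwd) tuple, analogous to how PyTorch conv layers
    accept an int or a tuple for kernel_size/stride."""
    if isinstance(rank, int):
        return rank, rank  # one int sets both the forward and backward rank
    rank_fwd, rank_bwd = rank  # unpacking raises if the tuple is malformed
    return rank_fwd, rank_bwd

fwd, bwd = _resolve_ranks(16)         # -> (16, 16)
fwd2, bwd2 = _resolve_ranks((16, 8))  # -> (16, 8)
```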
Instead of explicitly saving the rank information, I recover it from the shape of the LoRA weight matrix. Let me know if this seems okay!
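That shape-based recovery might look like this sketch (the state-dict key layout is an assumption, not necessarily this PR's exact naming): in a rank-r factorization W ≈ B @ A, the factor A has shape (r, in_features), so the rank is just its leading dimension.

```python
import numpy as np

def infer_lora_rank(state_dict, layer_name):
    """Illustrative sketch: recover a layer's LoRA rank from the shape of
    its saved down-projection factor instead of storing the rank
    separately. Assumes the factor is saved under a key like
    `<layer>.lora_A.weight` with shape (rank, in_features)."""
    weight = state_dict[f"{layer_name}.lora_A.weight"]
    rank = weight.shape[0]  # leading dimension of A is the rank
    return rank

# Toy example: a layer whose gradients were compressed to rank 4.
sd = {"fc1.lora_A.weight": np.zeros((4, 128))}
rank = infer_lora_rank(sd, "fc1")  # -> 4
```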
Add `adaptive_threshold` option to dynamically determine the rank

Experiment result with MNIST

To use this feature, set `compression_ratio_by_covariance` or `compression_ratio_by_memory` in `config.yaml`. The former determines the rank needed for PCA to explain 80% of the covariance; the latter determines the rank that compresses gradient memory to 80%.
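The coverage-based rule can be sketched as follows (the function and variable names are illustrative, not the PR's actual API): sort the covariance eigenvalues in descending order and take the smallest rank whose cumulative sum reaches the requested fraction.

```python
import numpy as np

def rank_for_coverage(eigvals, coverage=0.8):
    """Smallest rank r such that the top-r eigenvalues explain at least
    `coverage` of the total covariance (illustrative sketch)."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    # searchsorted finds the first index where the coverage is reached;
    # clamp to len(vals) to guard against floating-point shortfall at 1.0.
    return int(min(np.searchsorted(cumulative, coverage) + 1, len(vals)))

# Toy spectrum: the top two components explain 9/10 of the variance,
# so a 0.8 coverage threshold yields rank 2.
r = rank_for_coverage([6.0, 3.0, 0.5, 0.5], coverage=0.8)
```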