Hello @hamishivi
Thanks for your interest in our work! I'll try my best to address your questions below.
`add_lora()` vs `LogIXScheduler`: Firstly, you are correct that we used both options in different examples, mostly due to some complications when distributed training is enabled. When you enable (data-parallel) distributed training, you want to ensure that all workers hold the same model weights, and this is typically done when you wrap the model (e.g. with `accelerate.prepare()`). However, if you run `add_lora()` after you wrap the model, it's possible that the LoRA weights on different ranks differ, and if you use `LogIXScheduler`, you are more susceptible to this issue. So, tl;dr: if you use distributed training, manually adding LoRA before wrapping the model is the safest option!
`hessian=none`: Computing the Hessian with LoGra should be very cheap, so I would suggest setting it to `raw` as in my language modeling experiments. If it's set to `none`, then it just computes the gradient dot product as in LESS (another point is that LESS rescales gradients with Adam states, so it would likely work better than a naive gradient dot product); see the sketch below.

Hope my answers were helpful! If you have any further concerns or questions, feel free to let me know!
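For reference, the difference shows up when computing influence, roughly like this (a sketch: `compute_influence_all` with `mode="cosine"` appears in the traceback later in this thread, while passing `hessian` as a keyword here is an assumption; `test_log` and `log_loader` are placeholders for the query gradients and the saved per-train-example logs):

```python
# LoGra-preconditioned scores (covariance inverse applied to gradients).
scores_raw = run.influence.compute_influence_all(
    test_log, log_loader, mode="cosine", hessian="raw",
)

# Plain gradient dot product, i.e. LESS-style but without Adam rescaling.
scores_dot = run.influence.compute_influence_all(
    test_log, log_loader, mode="cosine", hessian="none",
)
```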
@hamishivi If you have any further questions, I am happy to address them! Otherwise, I will close this issue soon!
Hi @sangkeun00! Sorry for not responding, I've been travelling recently. I'm trying out using Hessians and will try using larger LoGra ranks soon. One thing I found out of the box: `trace` isn't supported at non-float32 dtypes, so I can't use float16 for log saving and do Hessian calculations at the same time without adding some casting code (see the sketch at the end of this comment). But then I get issues with inversion (`self.covariance_inverse_state[module_name][mode] = torch.inverse( ... The diagonal element 1 is zero, the inversion could not be completed because the input matrix is singular.`).

I made some little fixes in a fork, but I'm still in the process of working these out. I'm actually travelling this and next week, so progress is a little slow. Thanks for checking in, feel free to close if you want! If I have particularly burning issues I'm happy to re-open this issue or open new ones.
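The casting workaround I mean is roughly this (a sketch of the idea, not the actual code in my fork):

```python
import torch

def inverse_fp32(cov: torch.Tensor, damping: float = 1e-5) -> torch.Tensor:
    """Upcast a (possibly float16) covariance to float32 before inverting."""
    cov32 = cov.to(torch.float32)
    n = cov32.shape[0]
    # Damping keeps near-singular covariances invertible.
    cov32 = cov32 + damping * torch.trace(cov32) / n * torch.eye(n, device=cov32.device)
    return torch.inverse(cov32).to(cov.dtype)
```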
A small update: I still seem to get the inversion issue, even using float32 consistently, which blocks me from setting `hessian=raw`. It happens inconsistently, so unfortunately I can't provide an easy reproducible example right now. The error looks like:
```
Traceback (most recent call last):
  File "/opt/conda/envs/venv/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/venv/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/gantry-runtime/minimal_multitask/compute_influence_logix.py", line 180, in <module>
    results = run.influence.compute_influence_all(merged_test_log, log_loader, mode="cosine")
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/logix/analysis/influence_function.py", line 268, in compute_influence_all
    src_log = self.precondition(src_log, hessian=hessian, damping=damping)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/logix/analysis/influence_function.py", line 74, in precondition
    preconditioned_grad = precondition_fn(
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/logix/analysis/influence_function_utils.py", line 77, in precondition_raw
    cov_inverse = state.get_covariance_inverse_state(damping=damping)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/logix/state.py", line 148, in get_covariance_inverse_state
    self.covariance_inverse(damping=damping)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/logix/state.py", line 130, in covariance_inverse
    self.covariance_inverse_state[module_name][mode] = torch.inverse(
torch._C._LinAlgError: linalg.inv: The diagonal element 1 is zero, the inversion could not be completed because the input matrix is singular.
```
Sorry for the delayed response! I've been busy with personal stuff recently.

I haven't personally encountered this issue myself. Looking at your error message, the issue seems to be that the Hessian is singular. I am a bit surprised, as a damping term is added automatically when you set `damping="none"` (the default), and this most likely ensures invertibility. The only potential issue I see is `log_dtype` being set to `float16`. If you compute gradients after fine-tuning your model, it is possible that your gradient norms get very small, beyond the `float16` limit. This may lead to a large number of zero components in the Hessian, which then prevents matrix inversion.
To debug, I suggest going to your log directory, opening `state/covariance_state.pt`, and manually looking into the matrices, especially the one that causes this inversion error; a quick way to do this is sketched at the end of this comment. Another thing you can do is set `log_dtype` to `float32` (this doesn't necessarily mean that you should also use fp32 for your model and training code, as `log_dtype` is decoupled from the dtype you use for training). If you want to set up a meeting, I am also open to it. Let me know if you have any other questions.
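For example, something like this (a sketch: the nested `{module_name: {mode: matrix}}` layout is inferred from the traceback above, and the path is a placeholder for your log directory):

```python
import torch

state = torch.load("log_dir/state/covariance_state.pt", map_location="cpu")
for module_name, modes in state.items():
    for mode, cov in modes.items():
        cov = cov.float()
        # A (near-)zero smallest eigenvalue or all-zero rows flags the
        # matrices behind the torch.inverse failure.
        min_eig = torch.linalg.eigvalsh(cov).min().item()
        zero_rows = int((cov.abs().sum(dim=1) == 0).sum().item())
        print(f"{module_name}/{mode}: min eig {min_eig:.3e}, zero rows {zero_rows}")
```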
Thanks for the response! I tried setting `log_dtype` to `float32` without luck (still got the same error), but I'll look into the covariance state to see what's going on, thanks. I'll close the issue for now.

I'm travelling a bit for personal stuff right now, so I'm otherwise occupied, but if I'm still having issues next week I might reach out. Thanks so much for all your help! I think this is a really cool project :)
FYI, I at least worked out why I was getting the inversion issues: I had a few samples I hadn't filtered out that had no labels due to length issues, so of course their covariances were 0! The fix is sketched below.
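For anyone hitting the same thing, the fix is just to drop such examples before logging (a sketch, assuming HF-style `-100` label masking and a `datasets`-style `filter`):

```python
IGNORE_INDEX = -100  # HF convention for masked label positions

def has_supervision(example) -> bool:
    # Examples truncated down to zero supervised tokens contribute zero
    # gradients, and hence zero covariance blocks.
    return any(label != IGNORE_INDEX for label in example["labels"])

train_dataset = train_dataset.filter(has_supervision)
```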
Glad to hear that you worked out the issue. If you want to use cosine similarity, I should also warn you that it may sometimes give NaN due to division by 0. You may want to add a small value in this line (https://github.com/logix-project/logix/blob/main/logix/analysis/influence_function.py#L151) if you encounter this issue, along the lines of the sketch below. Let me know if you face other issues anytime!
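For example (the variable names here are hypothetical; the actual expression at that line may differ):

```python
eps = 1e-8  # small constant to avoid division by zero
cosine_scores = dot_products / (src_norm * tgt_norm + eps)
```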
Hi, thanks for this library and awesome paper! I think the LoGra idea is really cool, and I've been trying to use it in my own research. I've been running into some issues/poor results, though, and it would be useful to get your feedback if possible. For some context, I'm trying to apply this library to instruction tuning, ideally while reducing the storage/compute costs somewhat (small LoGra ranks, etc.). Unfortunately, I've been getting results on par with random selection right now, although I'm not entirely sure I'm using the library correctly, so any advice would be really useful.
Is calling `run.add_lora()` or setting `LogIXScheduler(..., lora="random")` preferable? I see both are done in different examples. I've been trying to follow the language modelling example, which uses `add_lora` but then sets `lora="None"` in the scheduler. This seems fine looking at the logs/code, but I just wanted to check I wasn't missing something.

I'm happy to share code/email etc.! Currently my code looks like the following (simplified somewhat):
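(The original snippet was not preserved in this extract; the following is a hedged reconstruction of a typical LogIX logging loop in the style of the language modeling example, with assumed call signatures and placeholder names.)

```python
import logix
from logix import LogIXScheduler  # import path assumed

run = logix.init(project="instruction-tuning", config="config.yaml")  # placeholders
scheduler = LogIXScheduler(run, lora="none", hessian="raw", save="grad")  # kwargs assumed

model = ...       # fine-tuned instruction-following model (placeholder)
dataloader = ...  # training data to log gradients for (placeholder)
run.watch(model)
run.add_lora()

for _ in scheduler:          # one pass per logging stage (e.g. covariance, then gradients)
    for batch in dataloader:
        with run(data_id=batch["input_ids"]):  # data_id identifies each example
            loss = model(**batch).loss
            loss.backward()
run.finalize()
```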