Closed: sangkeun00 closed this pull request 8 months ago.
Overall, it looks good to me and almost ready to be merged. One minor comment is that we may want to add `log` to our Trainer, which simply executes `train`, just for better name-functionality alignment. Let me know what you think!
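For illustration, a minimal sketch of what such an alias could look like; the class name and method bodies here are assumptions made for the example, not the actual AnaLog/PR code:

```python
# Hypothetical sketch -- the real class and method names in AnaLog may differ.
class AnaLogTrainer:
    """Thin Trainer wrapper whose training loop is only used to extract AnaLog logs."""

    def train(self):
        # existing behavior: iterate over the data, running forward/backward
        # passes so that AnaLog can record the requested statistics
        ...

    def log(self):
        # proposed alias: identical behavior to `train`, but named after what
        # the pass is actually used for (log extraction, not model training)
        return self.train()
```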
Okay, before I merge this I wanted to share: the results are off by a little. For example:
```
ipdb> orig_lora[k]
tensor([[-0.0055, 0.0621, -0.0242, ..., -0.0130, 0.0450, -0.0370],
        [ 0.0271, -0.0362, 0.0057, ..., 0.0216, -0.0192, 0.0687],
        [-0.0077, 0.0054, -0.0051, ..., -0.0497, -0.0534, -0.0277],
        ...,
        [-0.0213, 0.0489, 0.0201, ..., 0.0055, 0.0137, 0.0037],
        [ 0.0032, 0.0318, -0.0068, ..., 0.0211, -0.0679, 0.0054],
        [-0.0075, 0.0348, 0.0266, ..., 0.0060, -0.0353, -0.0068]],
       device='cuda:0')
ipdb> trainer_lora[trainer_k]
tensor([[-0.0055, 0.0621, -0.0242, ..., -0.0129, 0.0451, -0.0369],
        [ 0.0271, -0.0362, 0.0057, ..., 0.0220, -0.0190, 0.0690],
        [-0.0077, 0.0054, -0.0051, ..., -0.0492, -0.0533, -0.0274],
        ...,
        [-0.0213, 0.0489, 0.0201, ..., 0.0055, 0.0137, 0.0037],
        [ 0.0032, 0.0318, -0.0068, ..., 0.0211, -0.0679, 0.0054],
        [-0.0075, 0.0348, 0.0266, ..., 0.0059, -0.0353, -0.0069]],
```
As you can see, there is a noticeable difference at the 4th decimal place. This does not pass `torch.allclose` with reasonable `atol` and `rtol` like 1e-4~1e-6.
The grads are a little bit different too:
```
ipdb> o_g
tensor([[ 8.0321e-03, -6.7063e-04, 5.7372e-03, ..., 4.1149e-04,
          3.2684e-03, -7.7792e-03],
        [ 8.8194e-03, -8.1639e-03, 1.0469e-02, ..., -5.8189e-03,
         -5.7563e-04, -9.7524e-03],
        [-9.6131e-03, 1.6930e-02, -7.4397e-03, ..., -4.1782e-03,
          3.6515e-04, 9.0296e-04],
        ...,
        [-3.1319e-07, 2.2102e-07, -1.1042e-07, ..., -4.0674e-08,
         -2.6651e-09, 4.0236e-08],
        [ 1.3198e-07, -8.4140e-07, -1.7608e-08, ..., 4.8742e-08,
          2.3470e-07, 7.1272e-08],
        [-4.4334e-07, 4.0261e-07, -2.6629e-07, ..., -1.3716e-07,
         -4.3607e-08, -1.2816e-08]])
ipdb> t_g
tensor([[ 8.0321e-03, -6.7063e-04, 5.7372e-03, ..., 4.1152e-04,
          3.2684e-03, -7.7795e-03],
        [ 8.8194e-03, -8.1639e-03, 1.0469e-02, ..., -5.8188e-03,
         -5.7563e-04, -9.7527e-03],
        [-9.6131e-03, 1.6930e-02, -7.4397e-03, ..., -4.1781e-03,
          3.6516e-04, 9.0306e-04],
        ...,
        [-3.1240e-07, 2.1939e-07, -1.0945e-07, ..., -4.1158e-08,
         -2.8260e-09, 4.0560e-08],
        [-1.3235e-07, 8.4160e-07, 1.7651e-08, ..., -4.8943e-08,
         -2.3484e-07, -7.1542e-08],
        [ 4.4326e-07, -4.0236e-07, 2.6572e-07, ..., 1.3762e-07,
          4.4029e-08, 1.3309e-08]])
```
FYI, this is what I'm using for testing:
```python
import json
import numpy as np
import torch

from analog.logging.log_loader import LogDataset

# Compare the LoRA state dicts saved by the original script and by the
# Trainer integration (keys only differ by a "model." prefix).
orig_lora_path = "./bert_influence/analog/sst2/lora/lora_state_dict.pt"
orig_lora = torch.load(orig_lora_path)
trainer_lora_path = "./huggingface/analog/sst2/lora/lora_state_dict.pt"
trainer_lora = torch.load(trainer_lora_path)

for k in orig_lora.keys():
    trainer_k = k.replace("model.", "")
    # print(k, orig_lora[k].shape, trainer_lora[trainer_k].shape)
    # print(orig_lora[k].norm(), trainer_lora[trainer_k].norm())
    is_close = torch.allclose(orig_lora[k], trainer_lora[trainer_k], atol=1e-1, rtol=1e-1)
    # if not is_close:
    #     import ipdb; ipdb.set_trace(context=10)

# Compare the logged per-sample gradients for a specific data id.
orig_ds = LogDataset("./bert_influence/analog/sst2/")
trainer_ds = LogDataset("./huggingface/analog/sst2/")

data_ids = ["hide new secretions from the parental units"]
for data_id in data_ids:
    orig_grad = [ex for ex in orig_ds if ex[0] == data_id]
    trainer_grad = [ex for ex in trainer_ds if ex[0] == data_id]
    for k in orig_grad[0][1]:
        o_g = orig_grad[0][1][k]["grad"]
        trainer_k = k.replace("model.", "")
        t_g = trainer_grad[0][1][trainer_k]["grad"]
        # print(o_g.norm(), t_g.norm())
        is_close = torch.allclose(o_g, t_g, atol=1e-5, rtol=1e-5)
        if not is_close:
            import ipdb; ipdb.set_trace(context=10)
```
I'm not sure whether this is an acceptable difference. Do you have any opinion?
Thanks for the analysis. However, we probably need to do more debugging before merging.
First of all, it's more stable to compare `covariance_state` than `lora_state_dict`, as SVD often introduces some precision issues. More importantly, the obtained gradients differ in a very weird way. To me, `o_g[:-2]` and `t_g[:-2]` look fine (except for minor precision issues), but somehow the signs of the last two rows are flipped. Can you guess why this happens?
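For context on the SVD remark above: singular/eigen decompositions are only defined up to the sign of each vector, so two nearly identical covariance matrices can produce factors whose corresponding rows differ by a sign flip. A small, self-contained illustration (synthetic data, not AnaLog code) of comparing decompositions modulo that sign ambiguity:

```python
import torch

torch.manual_seed(0)
A = torch.randn(8, 8)
cov = A @ A.T  # a symmetric PSD "covariance" matrix

noise = 1e-6 * torch.randn(8, 8)
cov_noisy = cov + (noise + noise.T) / 2  # tiny symmetric perturbation

# Eigenvectors (and SVD factors) are only determined up to a per-vector sign,
# so a naive element-wise comparison of two numerically equivalent
# factorizations can fail.
_, vecs_a = torch.linalg.eigh(cov)
_, vecs_b = torch.linalg.eigh(cov_noisy)

# Align signs column by column before comparing.
signs = torch.sign((vecs_a * vecs_b).sum(dim=0))
print(torch.allclose(vecs_a, vecs_b, atol=1e-3))          # may be False
print(torch.allclose(vecs_a, vecs_b * signs, atol=1e-3))  # True once signs are aligned
```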
Merging, as this numerical difference does not result in a significant difference in terms of the IF score.
This PR is an initial attempt to integrate AnaLog with the Huggingface Trainer. It's still a work in progress, so I expect several commits will be required before proceeding with the merge. To summarize, I propose two approaches:

1. `TrainerCallback`: This can be particularly useful if users want to extract logs automatically after training. However, I haven't fully figured out how to implement a `TrainerCallback` at the moment, as it requires a backward pass, which seems pretty uncommon.
2. Separate log extraction: In many cases, I expect users may not want to perform log extraction for every training run. In this situation, users can simply pass their HF Trainer instance along with an AnaLog instance to `extract_log_from_trainer`, which handles post-hoc log extraction while reusing the Trainer configuration (a rough usage sketch follows below).
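A hypothetical usage sketch of the second approach. Apart from `extract_log_from_trainer`, which this PR proposes, the import path, AnaLog constructor arguments, and helper signature shown here are assumptions rather than the finalized API:

```python
from transformers import Trainer, TrainingArguments
from analog import AnaLog
# Hypothetical import path; the PR may place the helper elsewhere.
from analog.huggingface import extract_log_from_trainer

# `model` and `train_dataset` are whatever the user already trains with.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./out", per_device_train_batch_size=32),
    train_dataset=train_dataset,
)

# AnaLog instance that will store the extracted logs (constructor args assumed).
analog = AnaLog(project="sst2")

# Post-hoc log extraction: reuse the Trainer's data loading and configuration,
# but only run forward/backward passes to record per-sample statistics.
extract_log_from_trainer(trainer, analog)
```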