logix-project / logix

AI Logging for Interpretability and Explainability🔬
Apache License 2.0

Integration with Huggingface Trainer #73

Closed sangkeun00 closed 8 months ago

sangkeun00 commented 8 months ago

This PR is an initial attempt to integrate AnaLog with the Huggingface Trainer. It's still a work in progress, so I expect several commits will be required before proceeding with the merge. To summarize, I propose two approaches:

TrainerCallback

This can be particularly useful if users want to extract logs automatically after training. However, I haven't fully figured out how to implement this with TrainerCallback yet, as log extraction requires a backward pass, which seems pretty uncommon for a callback.
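
For reference, a rough sketch of the callback direction is below; the class name and the assumption that AnaLog's watch call is what attaches its hooks to the model are illustrative, not the final design:

from transformers import TrainerCallback

class AnaLogCallback(TrainerCallback):
    def __init__(self, analog):
        self.analog = analog

    def on_train_begin(self, args, state, control, model=None, **kwargs):
        # register AnaLog's forward/backward hooks before any training step runs
        self.analog.watch(model)

    def on_step_end(self, args, state, control, **kwargs):
        # the callback only observes the training loop; it does not own the
        # forward/backward pass, which is what makes log extraction awkward here
        pass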

Separate log extraction

In many cases, I expect users may not want to perform log extraction during every training run. In this situation, users can simply pass their HF Trainer instance along with an AnaLog instance to extract_log_from_trainer, which handles post-hoc log extraction while reusing the Trainer's configuration.

from transformers import Trainer
from analog import AnaLog, AnaLogScheduler  # AnaLogScheduler import path assumed
# extract_log_from_trainer is the helper introduced in this PR

trainer = Trainer(...)                  # the user's existing HF Trainer setup
analog = AnaLog("project")
scheduler = AnaLogScheduler(lora=True)  # optional
extract_log_from_trainer(trainer, analog, scheduler)
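
For context, here is a minimal sketch of what the post-hoc extraction could look like internally, reusing the Trainer's own dataloader and loss computation. The AnaLog calls (watch, the per-batch logging context, finalize), the use of the tokenizer to derive data_ids, and the epoch handling are assumptions for illustration, not the final implementation:

def extract_log_from_trainer(trainer, analog, scheduler=None):
    model = trainer.model
    analog.watch(model)                              # attach AnaLog hooks to the model
    epochs = scheduler if scheduler is not None else range(1)
    for _ in epochs:                                 # the scheduler may change what is logged per epoch
        for batch in trainer.get_train_dataloader():
            batch = trainer._prepare_inputs(batch)   # move tensors to the Trainer's device
            data_id = trainer.tokenizer.batch_decode(  # assumed: decoded inputs serve as data_ids
                batch["input_ids"], skip_special_tokens=True
            )
            with analog(data_id=data_id):            # log this batch's forward/backward pass
                loss = trainer.compute_loss(model, batch)
                loss.backward()
            model.zero_grad()
    analog.finalize()                                # flush buffered logs/statistics to disk
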
sangkeun00 commented 8 months ago

Overall, it looks good to me and is almost ready to be merged. One minor comment: we may want to add a log entry point to our Trainer that simply executes train, just for better name-functionality alignment. Let me know what you think!

hwijeen commented 8 months ago

Okay, before I merge this, I wanted to share a couple of findings:

The results are off by a small amount. For example:

ipdb> orig_lora[k]
tensor([[-0.0055,  0.0621, -0.0242,  ..., -0.0130,  0.0450, -0.0370],
        [ 0.0271, -0.0362,  0.0057,  ...,  0.0216, -0.0192,  0.0687],
        [-0.0077,  0.0054, -0.0051,  ..., -0.0497, -0.0534, -0.0277],
        ...,
        [-0.0213,  0.0489,  0.0201,  ...,  0.0055,  0.0137,  0.0037],
        [ 0.0032,  0.0318, -0.0068,  ...,  0.0211, -0.0679,  0.0054],
        [-0.0075,  0.0348,  0.0266,  ...,  0.0060, -0.0353, -0.0068]],
       device='cuda:0')
ipdb> trainer_lora[trainer_k]
tensor([[-0.0055,  0.0621, -0.0242,  ..., -0.0129,  0.0451, -0.0369],
        [ 0.0271, -0.0362,  0.0057,  ...,  0.0220, -0.0190,  0.0690],
        [-0.0077,  0.0054, -0.0051,  ..., -0.0492, -0.0533, -0.0274],
        ...,
        [-0.0213,  0.0489,  0.0201,  ...,  0.0055,  0.0137,  0.0037],
        [ 0.0032,  0.0318, -0.0068,  ...,  0.0211, -0.0679,  0.0054],
        [-0.0075,  0.0348,  0.0266,  ...,  0.0059, -0.0353, -0.0069]],
       device='cuda:0')

As you can see, there is a noticeable difference at the 4th decimal place. This does not pass torch.allclose with a reasonable atol and rtol like 1e-4 to 1e-6.
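
For reference, the gap can be quantified directly with something like the following (using the same orig_lora / trainer_lora dicts that the test script below loads); this is just a diagnostic snippet, not part of the test:

diff = (orig_lora[k] - trainer_lora[trainer_k]).abs()
print(diff.max().item(), diff.mean().item())          # max / mean absolute difference
print(torch.allclose(orig_lora[k], trainer_lora[trainer_k], atol=1e-4, rtol=1e-4))  # False here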

The grads are a little bit different too.

ipdb> o_g
tensor([[ 8.0321e-03, -6.7063e-04,  5.7372e-03,  ...,  4.1149e-04,
          3.2684e-03, -7.7792e-03],
        [ 8.8194e-03, -8.1639e-03,  1.0469e-02,  ..., -5.8189e-03,
         -5.7563e-04, -9.7524e-03],
        [-9.6131e-03,  1.6930e-02, -7.4397e-03,  ..., -4.1782e-03,
          3.6515e-04,  9.0296e-04],
        ...,
        [-3.1319e-07,  2.2102e-07, -1.1042e-07,  ..., -4.0674e-08,
         -2.6651e-09,  4.0236e-08],
        [ 1.3198e-07, -8.4140e-07, -1.7608e-08,  ...,  4.8742e-08,
          2.3470e-07,  7.1272e-08],
        [-4.4334e-07,  4.0261e-07, -2.6629e-07,  ..., -1.3716e-07,
         -4.3607e-08, -1.2816e-08]])
ipdb> t_g
tensor([[ 8.0321e-03, -6.7063e-04,  5.7372e-03,  ...,  4.1152e-04,
          3.2684e-03, -7.7795e-03],
        [ 8.8194e-03, -8.1639e-03,  1.0469e-02,  ..., -5.8188e-03,
         -5.7563e-04, -9.7527e-03],
        [-9.6131e-03,  1.6930e-02, -7.4397e-03,  ..., -4.1781e-03,
          3.6516e-04,  9.0306e-04],
        ...,
        [-3.1240e-07,  2.1939e-07, -1.0945e-07,  ..., -4.1158e-08,
         -2.8260e-09,  4.0560e-08],
        [-1.3235e-07,  8.4160e-07,  1.7651e-08,  ..., -4.8943e-08,
         -2.3484e-07, -7.1542e-08],
        [ 4.4326e-07, -4.0236e-07,  2.6572e-07,  ...,  1.3762e-07,
          4.4029e-08,  1.3309e-08]])

FYI, this is what I'm using for testing:

import torch

from analog.logging.log_loader import LogDataset

# compare the saved LoRA state dicts from the original pipeline and the Trainer integration
orig_lora_path = "./bert_influence/analog/sst2/lora/lora_state_dict.pt"
orig_lora = torch.load(orig_lora_path)

trainer_lora_path = "./huggingface/analog/sst2/lora/lora_state_dict.pt"
trainer_lora = torch.load(trainer_lora_path)

for k in orig_lora.keys():
    trainer_k = k.replace("model.", "")
    # print(k, orig_lora[k].shape, trainer_lora[trainer_k].shape)
    # print(orig_lora[k].norm(), trainer_lora[trainer_k].norm())
    is_close = torch.allclose(orig_lora[k], trainer_lora[trainer_k], atol=1e-1, rtol=1e-1)
    # if not is_close:
    #     import ipdb; ipdb.set_trace(context=10)

# compare the per-example gradients stored in the two log directories
orig_ds = LogDataset("./bert_influence/analog/sst2/")
trainer_ds = LogDataset("./huggingface/analog/sst2/")
data_ids = ["hide new secretions from the parental units"]
for data_id in data_ids:
    orig_grad = [ex for ex in orig_ds if ex[0] == data_id]
    trainer_grad = [ex for ex in trainer_ds if ex[0] == data_id]
    for k in orig_grad[0][1]:
        o_g = orig_grad[0][1][k]["grad"]
        trainer_k = k.replace("model.", "")
        t_g = trainer_grad[0][1][trainer_k]["grad"]
        # print(o_g.norm(), t_g.norm())
        is_close = torch.allclose(o_g, t_g, atol=1e-5, rtol=1e-5)
        if not is_close:
            import ipdb; ipdb.set_trace(context=10)

I'm not sure whether this is an acceptable difference. Do you have any opinion?

sangkeun00 commented 8 months ago

Thanks for the analysis. However, we probably need to do more debugging before merging.

First of all, it's more stable to compare covariance_state than lora_state_dict, since SVD often introduces precision issues. More importantly, the obtained gradients differ in a very weird way. To me, o_g[:-2] and t_g[:-2] look fine (apart from minor precision issues), but somehow the signs of the last two rows are flipped. Can you guess why this happens?
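
A hedged sketch of that comparison is below; the covariance_state.pt file name, its location, and the possibly nested dict layout are assumptions about the on-disk format, mirroring the lora_state_dict.pt comparison above:

import torch

def compare_states(a, b, prefix=""):
    # walk (possibly nested) dicts of tensors and report entries that disagree
    for k, v in a.items():
        bk = k.replace("model.", "")      # trainer checkpoints drop the "model." prefix
        if isinstance(v, dict):
            compare_states(v, b[bk], prefix + k + ".")
        else:
            ok = torch.allclose(v, b[bk], atol=1e-6, rtol=1e-5)
            print(prefix + k, "close" if ok else "DIFFERENT")

orig_cov = torch.load("./bert_influence/analog/sst2/state/covariance_state.pt")   # assumed path
trainer_cov = torch.load("./huggingface/analog/sst2/state/covariance_state.pt")   # assumed path
compare_states(orig_cov, trainer_cov)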

hwijeen commented 8 months ago

Merging, as this numerical difference does not result in a significant difference in terms of the IF score.