kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Difficulty Reproducing Downstream Task Finetune Results with HF Trainer #37

Closed yangzhao1230 closed 1 month ago

yangzhao1230 commented 1 month ago

Because I am not very familiar with PyTorch Lightning (PL), I tried using the Hugging Face (HF) Trainer for pretraining and for fine-tuning on downstream tasks.

  1. I was able to largely reproduce the pretraining results from your paper. For example, the figure below shows the training curve for reproducing caduceus-ph. (The original learning rate of 8e-3 was less stable, so I opted for a learning rate of 1e-4.) [figure: pretraining loss curve]

  2. However, I ran into significant difficulty when fine-tuning downstream tasks with HF. Fine-tuning the NT benchmarks with HF should be straightforward in principle; although my actual implementation differs, you can refer to my self-contained Colab: Colab Link, which is mainly based on the fine-tuning tutorial provided by NT (a minimal sketch of the setup is included after this list). During downstream fine-tuning the numerics are highly unstable, often producing NaN losses, and in most settings I cannot reproduce the results from the paper. The figure below shows my reproduction of the H4ac results: with warmup_steps=500 and a learning rate of 1e-3, I can mostly reproduce the paper's result (~0.6). [figure: H4ac fine-tuning results]

However, in most cases the results are very unstable. In the training-loss plot below, the curves that sit at exactly 0 correspond to runs that hit NaN. [figure: training loss curves]
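For context, the fine-tuning setup is essentially the following. This is a simplified sketch, not my exact Colab: the dataset and column names follow the NT tutorial, the hyperparameters are the ones mentioned above, and the tokenizer call assumes the checkpoint ships a tokenizer.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "kuleshov-group/caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, trust_remote_code=True)  # H4ac is a binary task

# NT downstream benchmark data (dataset/config names as used in the NT tutorial).
dataset = load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks", "H4ac")

def tokenize(batch):
    # The benchmark stores raw DNA strings in a "sequence" column and labels in "label".
    return tokenizer(batch["sequence"], truncation=True, max_length=1024)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="caduceus-h4ac",
    learning_rate=1e-3,   # the setting that mostly reproduced ~0.6 for me
    warmup_steps=500,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()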

  1. I used AutoModelForSequenceClassification to load the initial weights you provided, without adding any conjoin-related parameters. I wonder whether this is the main factor hurting performance, and I am also curious about the impact of conjoining in general; I could not find a related experimental analysis in the paper.

  2. When running your PL-based code, I can mostly reproduce the results. Since the HF code would be much simpler, though, I would like to understand why my implementation fails to reproduce them. I hope to get your help!

Thank you for your assistance!

yangzhao1230 commented 1 month ago

In conclusion, I have two main questions:

  1. Can the HF Trainer theoretically be used for pretraining and fine-tuning your model? Using the HF Trainer would significantly simplify the code.
  2. Does directly loading the basic model (without any conjoin-related parameters or reverse-complement augmentation) have a significant impact on downstream tasks?
yangzhao1230 commented 1 month ago

Here is a very strange phenomenon. The figure shows runs with different learning rates: 1e-3, 5e-4, 5e-4, and 5e-5. Surprisingly, it is the intermediate value of 5e-5 that shows instability. [figure: loss curves for different learning rates]

yangzhao1230 commented 1 month ago

Sorry, let me restate the numerical-instability issue. I am not sure whether some randomness in my environment is causing the occasional {'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.11} logs; they do not seem tied to the specific learning rate. So what I mainly need help understanding is whether RC augmentation is critical for downstream tasks.

yangzhao1230 commented 1 month ago

I think I have found part of the cause. The initial values of the linear classification layer self.score created by the Hugging Face class are very strange and often contain extreme outliers. I have not yet figured out why the initialization ends up this way. [figure: screenshot of self.score weight values]

yangzhao1230 commented 1 month ago

Here is a simple experiment that illustrates the issue. Each time I instantiate an AutoModelForSequenceClassification and a plain nn.Linear of the same shape, the values in the classification layer of AutoModelForSequenceClassification are very unusual, whereas the values in the nn.Linear are normal. [figure: printed max/min values]

yangzhao1230 commented 1 month ago

Here is my experiment code. I tried changing the hardware device but still observed the same phenomenon.

import torch
import torch.nn as nn

from transformers import AutoModelForSequenceClassification

MAX_TRIALS = 10

for i in range(MAX_TRIALS):
    model = AutoModelForSequenceClassification.from_pretrained(
        "kuleshov-group/caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3", trust_remote_code=True)

    # Print the max and min values of the classification head model.score.
    score_max_value = model.score.weight.max().item()
    score_min_value = model.score.weight.min().item()
    print(f"Trial {i}: Max value: {score_max_value}, Min value: {score_min_value}")

    # Randomly initialize an nn.Linear with the same shape as model.score for comparison.
    linear = nn.Linear(model.score.in_features, model.score.out_features, bias=False)
    # Print the max and min values of the freshly initialized linear layer.
    linear_max_value = linear.weight.max().item()
    linear_min_value = linear.weight.min().item()
    print(f"Trial {i}: Linear Max value: {linear_max_value}, Min value: {linear_min_value}")
yair-schiff commented 1 month ago

Perhaps there is an issue in how the linear layers of CaduceusForSequenceClassification are initialized. Did you try manually initializing them, or not using this class and simply adding nn.Linear layers as you did above?

yangzhao1230 commented 1 month ago

Yes. I tried re-initializing the weights and it worked.
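Concretely, I just overwrote the head after loading, along these lines. This is a sketch; the normal(0, 0.02) scheme is simply a standard choice on my side, not something prescribed by the repo:

import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "kuleshov-group/caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3",
    trust_remote_code=True)

# Overwrite the oddly initialized classification head with a standard init before fine-tuning.
nn.init.normal_(model.score.weight, mean=0.0, std=0.02)
if model.score.bias is not None:  # guard in case the head has a bias
    nn.init.zeros_(model.score.bias)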

yangzhao1230 commented 4 weeks ago

I am still unsure about the exact cause of the bug. Given that the hyenadna codebase does not appear to have such issues, could you provide some suggestions on how to effectively identify and fix initialization bugs?

yair-schiff commented 4 weeks ago

I am also not sure why HF is not initializing the params of the linear prediction head in a standard way. In any case, I pushed a fix for this where I explicitly init these weights. See here, for example. This should hopefully resolve the issue.
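Roughly, the idea is to explicitly initialize the prediction head when the classification model is constructed, instead of relying on the default HF init path. A sketch of the pattern (illustrative only; the names and the 0.02 range here are placeholders, see the linked commit for the actual change):

import torch.nn as nn

def init_prediction_head(score: nn.Linear, initializer_range: float = 0.02) -> None:
    # Explicit, standard init for the classification head's weights.
    nn.init.normal_(score.weight, mean=0.0, std=initializer_range)
    if score.bias is not None:
        nn.init.zeros_(score.bias)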