Closed: Zehui127 closed this issue 1 year ago
@Zehui127 it is difficult because of the batchnorms
however, recently it seems like researchers have been reporting good results with LoRA
i could add this as an option, on the condition that you let me know your results once you try it?
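for reference, a minimal sketch of what a LoRA adapter around a linear layer could look like (the class name and defaults are illustrative, not an API in this repo):

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Hypothetical low-rank adapter wrapped around a frozen pretrained nn.Linear."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.):
        super().__init__()
        self.linear = linear
        self.linear.requires_grad_(False)  # keep the pretrained weight frozen

        # low-rank update: W x + (alpha / rank) * B (A x)
        self.lora_A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.lora_A.t() @ self.lora_B.t())
```

since `lora_B` starts at zero, the wrapped layer initially behaves exactly like the pretrained one, and only the small A / B matrices receive gradients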
@lucidrains Thanks for responding to this. It would be much appreciated if there were an option for fine-tuning without freezing the transformer parameters. I will definitely let you know the result if you make the change.
@Zehui127 oh, it has been a while and i have forgotten what i had built
have you tried this option yet?
i'm not really caught up on the fine tuning literature, but my impression is that you don't really need to unfreeze the entire network
@Zehui127 LoRA should be quite good, i see people doing dreambooth fine tuning in stable diffusion using that technique
@lucidrains I have actually looked at the code for this option; it basically freezes the transformer and convolutional layers and only adds a linear layer on top of the linear heads. We don't want that, because we are trying to do end-to-end training with Enformer and the downstream model. I believe the original TensorFlow Enformer implementation unfreezes the entire network. Is it possible to know how to avoid the issue with the batchnorms? I can also try to implement it if you provide some hints.
that option unfreezes the layernorms of the transformer, a popular technique. there's also another option to unfreeze the penultimate layers of the transformer
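roughly what the layernorm-only option amounts to under the hood (a sketch over a generic nn.Module, not the exact code in the adapters):

```python
from torch import nn

def unfreeze_layernorms_only(model: nn.Module):
    # freeze every parameter first
    model.requires_grad_(False)

    # then re-enable gradients for every LayerNorm's weight and bias
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            module.requires_grad_(True)
```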
oh actually nevermind, I never integrated that last N layers fine-tuning thing
what is your batch size when fine-tuning?
fine-tuning is still more art than science. stuff like catastrophic forgetting happens often when you unfreeze the entire network
For now, I have only done fine-tuning on a single V100 (30 GB) with a batch size of 2 examples; a larger batch size triggers a CUDA out-of-memory error.
yeah, batch norm won't work unless your batch size is at least 32 🤷♂️
batchnorm, why do you have to work so well *shakes fist*
yeah, I think unfreezing the transformer may be your best bet, since it is free of batch normalization (while keeping conv stem frozen)
I'll take a look tomorrow morning and see if I can easily squeeze in the unfreezing of last N layers of transformer setting, so you can play around with that
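something along these lines, sketched against a generic Enformer-like module with a conv stem / transformer split (the attribute names here are assumptions, not necessarily the ones in this repo):

```python
from torch import nn

def unfreeze_last_n_transformer_layers(model: nn.Module, n: int):
    # hypothetical layout: model.conv_stem (with the batchnorms) and
    # model.transformer as an nn.Sequential of attention blocks

    model.requires_grad_(False)           # freeze everything, including the conv stem

    for block in model.transformer[-n:]:  # unfreeze only the last n transformer blocks
        block.requires_grad_(True)
```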
may I ask whose lab you are working for?
So does that mean the following will work?
I'm a PhD student at Imperial College London
well, you forget that catastrophic forgetting can still happen, esp if you unfreeze the whole network at once. recommend you google for literature on that and see what the latest bag of tricks are
higher batch sizes will help with the batch norm yeah
Thanks for your kind help!
ahh ok cool, yeah then by your tomorrow afternoon, I can probably get the unfreezing of entire or last N layers of transformers finished
no problem, I want this repository to be useful 🌝
@Zehui127 Hi Zehui
reviewed the code this morning and noticed that i already do freeze the batchnorms, so if you are using any one of the finetuning adapters, the batchnorm should not be the issue (as long as you are gradient accumulating properly and not training on a batch size of 2). in the case that you did things correctly, the likely issue would be catastrophic forgetting
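for reference, "freezing the batchnorms" amounts to something like this (a sketch, not the exact code in the repo):

```python
from torch import nn

def freeze_batchnorms(model: nn.Module):
    # put every batchnorm into eval mode so its running statistics stay fixed,
    # and stop gradients to its affine parameters
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
            module.requires_grad_(False)

# note: model.train() flips batchnorms back into training mode, so call
# freeze_batchnorms(model) again after switching the model to train mode
```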
i've finished integrating the feature for freezing the entire network except for the last N layers of the transformer. you can use it by setting this keyword argument on the adapter's forward to an integer
@Zehui127 which adapter are you using for the finetuning? the head adapter?
@lucidrains Thanks for helping with this. I was using the HeadAdapterWrapper before. I will try the last-N-layers fine-tuning and let you know the result.
@Zehui127 were you gradient accumulating when training on batch size of 2? wondering if i should offer a training wrapper that takes care of this too (noticed i actually left myself a todo for researchers like you, but never completed it)
@Zehui127 also, have you tried the discrete key / value bottleneck feature? https://arxiv.org/abs/2207.11240 i quite liked the paper, but never tried it out on a task yet
@lucidrains It looks like a very interesting paper; I will try it and let you know the result. As for gradient accumulation, I did use it in some previous runs, but it didn't bring much improvement. I'm refactoring the code at the moment to allow a larger effective batch size. Hopefully, with all these changes, the performance will improve. Thanks again.
@Zehui127 ok, keep me in the know!
@Zehui127 you'll need gradient accumulation for large effective batch size, i'll build it into some training wrapper at some point. it is needlessly confusing for the uninitiated
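a bare-bones sketch of what such a training wrapper would do (the function name is a placeholder, and it assumes the adapter returns a scalar loss when a target is passed):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_with_grad_accum(model: nn.Module, loader: DataLoader, accum_steps: int = 16, lr: float = 1e-4):
    # effective batch size = accum_steps * the loader's (micro) batch size
    optimizer = torch.optim.Adam(model.parameters(), lr = lr)

    model.train()
    optimizer.zero_grad()

    for step, (seq, target) in enumerate(loader):
        loss = model(seq, target = target)   # assumes a scalar loss is returned given a target
        (loss / accum_steps).backward()      # scale so the accumulated gradient is an average

        if (step + 1) % accum_steps == 0:    # only step the optimizer once per accumulation window
            optimizer.step()
            optimizer.zero_grad()
```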
Dear team,
As far as I can tell, the current fine-tuning feature freezes the transformer and only fine-tunes the linear heads. Has anyone tried fine-tuning the whole model without freezing the transformer parameters? I tried, but the performance is not good; for example, when I simply fine-tune the model on the original Basenji training set, the correlation score drops from 0.6 to 0.1. Has anyone encountered a similar issue, and what could be the reason for this?