Closed: Vectorrent closed this issue 11 months ago.
Hi @LuciferianInk 👋
Calling `model.eval()` turns off things like dropout and switches batch normalization to its inference-time statistics and, as you wrote, should improve things at inference time. I'm afraid that without a short reproducer to load the fine-tuned model and reproduce the problem quickly, there is little we can do -- our bandwidth is limited, so we need your help too :)
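To make the train/eval distinction concrete, here is a toy dropout layer in plain Python (a sketch of the behavior, not the actual torch or transformers implementation): in train mode it randomly zeroes activations and rescales the survivors, so repeated forward passes give different outputs; in eval mode it is the identity.

```python
import random

class ToyDropout:
    """Minimal dropout to illustrate train vs. eval behavior (toy, not torch)."""

    def __init__(self, p=0.5):
        self.p = p            # probability of zeroing an activation
        self.training = True  # modules default to train mode

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, xs):
        if not self.training:
            # eval mode: deterministic identity, no noise at inference
            return list(xs)
        # train mode: zero each value with probability p, rescale the rest
        # so the expected value of the output matches the input
        scale = 1.0 / (1.0 - self.p)
        return [0.0 if random.random() < self.p else x * scale for x in xs]
```

This is why a model left in train mode keeps injecting noise during generation, and why `eval()` is normally expected to produce the same or better output, not worse.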
Okay, thanks for the response. I might need a couple of days, but I'll try to put something together for you. I'll probably use Docker. Let me know if that's an issue.
Hi @gante,
As requested, I have published a small project to reproduce this issue. It loads the RWKV-v4 430m model, attaches a LoRA adapter, and quickly runs inference. You may choose to use the provided Docker configs or not; both Docker and vanilla Python should work. Further instructions are in the README file.
I did not recreate the training loop, because you didn't ask for it (nor am I certain that training was the problem). If you'd like to see the training code, I linked to it above.
Thank you for your time and attention to this matter. Please let me know if you need anything else from me.
Well, I've learned a few things, which make me lean towards this being a "quirk in the model," rather than an actual problem with Transformers' inference.
I am able to fine-tune RWKV/rwkv-4-169m-pile using PEFT without running into this issue at all. However, both 430m and 1b5 immediately run into it. I'm not sure if the issue is still worth tracking here, at this point. I really think I'm just fighting with the challenge of training an RNN, versus the ease of a transformer. I'll leave it to the maintainers to decide whether they'd like to close the issue or not.
Thanks for sharing your insights! Might be interesting for @pacman100 who has worked on PEFT (no actionable items right now AFAIU)
Okay! I think we finally landed on a solution. It started with an explanation of the various RWKV modules from Google Bard:
The key module takes the input query and context as input and produces a representation that is used to retrieve the most relevant key-value pairs from the RWKV memory. This is done by transforming the input query and context into a common space, where they can be compared to the keys in the memory. The key module is typically implemented as a neural network, with parameters that are learned during training.
The value module takes the retrieved key-value pairs as input and produces a representation that is used to update the output query. This is done by transforming the key-value pairs into a common space, where they can be combined to produce an update to the output query. The value module is typically implemented as a neural network, with parameters that are learned during training.
The receptance module controls how much of the update produced by the value module is applied to the output query. This is done by multiplying the update by a scalar value, which is called the receptance. The receptance module is typically implemented as a single layer neural network, with parameters that are learned during training.
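The three modules described above can be sketched as a toy, single-channel version of the RWKV-4 channel-mix ("FFN") step. This is a simplified scalar illustration in plain Python: real RWKV uses weight matrices and a token-shift mix with the previous timestep, both omitted here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_mix(x, wk, wv, wr):
    """Toy scalar RWKV-4 channel mix: key -> value, gated by receptance."""
    k = max(0.0, wk * x) ** 2   # key path: squared-ReLU activation
    kv = wv * k                 # value path: projects the key activation
    r = sigmoid(wr * x)         # receptance: a 0..1 gate on the update
    return r * kv               # receptance scales how much of kv passes
```

Seen this way, the receptance acts as a learned gate on the value update, which is why the rank/alpha chosen for each module can matter independently.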
Long story short, I spent some time experimenting with asymmetric ranks and alpha on the different modules, and eventually landed on some settings that work. At this point, I'm tired of fighting with it, and ready to move on.
I'll be sure to close this issue in a few days, after I'm positive the problem was resolved.
Well, there is no doubt that RWKV is more difficult to work with than a transformer, but I've finally landed on some functional settings. At the end of the day, it required a larger training set, less weight decay, SWA, and a lot of other optimizations. But mostly: avoid training the "value", "output", and "head" modules, and you'll have a better time.
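As a sketch, the module-selection rule above might look like this when building a PEFT-style target list. The module names are assumptions based on the RWKV naming discussed in this thread, and the ranks are placeholders, not the exact values that were used.

```python
# Candidate RWKV projection modules discussed in this thread (assumed names)
ALL_MODULES = ["key", "value", "receptance", "output", "head"]

# Finding from this thread: leave "value", "output", and "head" frozen
FROZEN = {"value", "output", "head"}

def lora_targets(modules=ALL_MODULES, frozen=FROZEN):
    """Return the module names that should receive LoRA adapters."""
    return [m for m in modules if m not in frozen]

# Placeholder asymmetric ranks per surviving module (illustrative only)
RANKS = {name: (8 if name == "key" else 4) for name in lora_targets()}
```

A list like `lora_targets()` is the sort of thing that would be passed as `target_modules` when building a LoRA config, leaving the frozen modules at their pretrained weights.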
Going to close this issue now.
Thanks for sharing, very insightful @LuciferianInk :)
System Info

`transformers` version: 4.32.1

Who can help?

@ArthurZucker @gante

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I am not using the Hugging Face Trainer. I am using a very basic training loop, originally forked from AITextGen, which uses Lightning AI's automatic optimization. The bulk of that fine-tuning code can be found here. This code works correctly for fine-tuning other models, like GPT-Neo and Cerebras-GPT. Something about RWKV v4 (169m/430m) is different.
To reproduce, you would have to implement my training logic (which isn't terribly "custom" or "complicated" at all), then toggle between eval/train modes, while performing inference - to see the difference. Alternatively, perhaps you could train in your own way, and toggle between eval/train... just to let me know if the problem is with my training code? I don't think it is.
I have tried both LoRA and traditional fine-tuning. Both have the same results. I have tried all manner of learning-rate adjustments, weight decay, batch sizes... but hyperparameters don't seem to fix this problem. Nor would I really expect them to; if the problem can be fixed by toggling between eval/train modes, then I would expect that the problem lies in the HF implementation. I spoke to BlinkDL (the creator of RWKV) about this, and he said it sounds like a bug in the HF inference code.
Expected behavior
RWKV is unable to produce coherent output after fine-tuning when `self.model.eval()` is enabled. If the model is set to `self.model.train()`, then the output is as expected.

Take this sample data, which I've fine-tuned RWKV v4 430m on:

Within <1000 training steps, a fine-tuned model (with `self.model.train()` enabled) will be capable of producing output like this:

However, that same model, with `self.model.eval()` enabled, will produce gibberish, like this:

I would expect RWKV to perform better in `self.model.eval()` mode, not worse than in `self.model.train()`. Clearly, the model is training correctly, and it is learning; something about eval mode completely breaks generation, though.