Open · PicoCreator opened this issue 1 year ago
It's not complicated. In RWKV-4, the sequence modeling is implemented with a recurrent pattern in the CUDA code, which cannot fully utilize the SMs and Tensor Cores. Even if the community claims it has potential for a parallel scan, there was no such implementation when we published our paper.
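For concreteness, here is a minimal sketch of what "recurrent" means here, assuming a simplified WKV form and omitting the numerical-stability rescaling the real CUDA kernel carries: every step consumes the state produced by the previous step, so the time loop is element-wise work rather than one large matmul.

```python
import torch

def wkv_recurrent(w, u, k, v):
    # Simplified sketch of an RWKV-4 style WKV recurrence (not the real
    # CUDA kernel; the stability rescaling is omitted).
    # w, u: per-channel decay / bonus, shape (C,); k, v: shape (T, C).
    T, C = k.shape
    num = torch.zeros(C)    # running decayed sum of exp(k_i) * v_i
    den = torch.zeros(C)    # running decayed sum of exp(k_i)
    out = torch.empty(T, C)
    decay = torch.exp(-w)   # exponential decay per channel
    for t in range(T):      # step t needs the state left behind by step t-1
        out[t] = (num + torch.exp(u + k[t]) * v[t]) / (den + torch.exp(u + k[t]))
        num = decay * num + torch.exp(k[t]) * v[t]
        den = decay * den + torch.exp(k[t])
    return out
```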
As for the cascading nature, every decoder has this property; the question is whether it is hardware-friendly. If your community accelerates training that way, it would be better to show the code and an evaluation.
Nevertheless, even though RWKV doesn't parallelize thoroughly, this isn't a problem for you: the "Training Parallelization" point is a fact, but it doesn't affect your real training process, which I believe is still not slow.
Thanks for the clarification - are you referring to the RWKV-v4 or the RWKV-v4neo code in BlinkDL's main repo here? https://github.com/BlinkDL/RWKV-LM
The former (v4) is a reference implementation; the latter (v4neo) is the recommended trainer, which has been better optimised to make use of the GPU. There are no architecture changes between the two (they use the same models).
The v4neo trainer has been available since September 2022.
Currently, with the right batch size and a sufficient training context size (e.g. 4k/8k), it is possible to configure RWKV to keep the GPUs under sustained load. The following is an example from a recent training run I have done across 8 GPUs.
(The dips in between are checkpoints. The usage graph should be similar to other transformer trainers.)
Admittedly, getting the configuration right to fully make use of all the available GPUs is really "not obvious" in the default RWKV trainer, and that's our fault in the RWKV community for not having clearer and better documentation for this process. We can also roll the trainer code back to a commit from before the paper, and assist with the required configuration if needed.
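To be explicit about what we mean by multi-GPU training, here is a generic data-parallel sketch (illustrative only, not the actual v4neo trainer, which is built on PyTorch Lightning / DeepSpeed): each GPU trains on its own slice of the data and gradients are all-reduced at every step.

```python
# Minimal data-parallel training sketch (illustrative, not the RWKV trainer).
# Launch with:  torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)      # stand-in for the real model
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device=f"cuda:{rank}")  # each rank gets its own batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                                   # gradients all-reduced across GPUs
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```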
The reason I am seeking clarification is that, since the paper was preprinted on 17 July, we have been getting questions on the RWKV Discord every few days from readers of the paper, asking what it means that RWKV does not support "Training Parallelization".
Example: "While RWKV does not support <___ Training Parallelization definition ___>, it is still able to heavily utilise multiple GPUs in parallel across multiple nodes with the right configuration."
PS: I am not the author of RWKV nor of the RWKV paper; I am simply a volunteer on the Discord who is getting a bit tired of explaining "yes, we support multi-GPU training" and "I do not know what the authors of the paper mean about not supporting Training Parallelization, please ask them instead". There have been readers who have interpreted the paper as "RWKV does not support data parallelism / multi-GPU training".
As such, if there is a clear technical definition of why RWKV does not fit under this term, I would gladly point to it for future inquiries.
We don't mention anything about multi-GPU parallelization... so the misunderstanding on the Discord is not what we hoped for.
The point we emphasize is parallelization along the sequence dimension, which we write about in the paper: "RWKV replaces AFT's position embeddings with exponential decay and runs the models recurrently for training and inference". When the sequence gets longer, GPU utilization may be bounded.
I don't doubt the throughput of RWKV; I have also reproduced it in our pipeline. In the paper, we aim to speak more academically, where the parallelization should be along both the sequence and the channel dimensions. For that reason I read the RWKV-4-neo code: it still appears to be recurrent along the sequence dimension, which is not sequence parallelization and doesn't use more advanced hardware acceleration such as Tensor Cores.
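To illustrate the distinction with a rough sketch (not our actual kernel, and with normalization omitted): a parallel form expresses the interactions of all T positions as a couple of large matmuls, which map onto Tensor Core GEMMs in bf16/fp16, whereas a recurrence advances one position at a time.

```python
import torch

def parallel_form(q, k, v, decay=0.9):
    # Rough sketch of "parallel along the sequence": the whole T x T
    # interaction is expressed as two large matmuls (Tensor Core friendly)
    # instead of a step-by-step state update. Normalization omitted.
    # q, k, v: (T, C); decay: scalar in (0, 1).
    T = q.shape[0]
    idx = torch.arange(T)
    dist = (idx[:, None] - idx[None, :]).clamp(min=0).float()
    D = torch.tril(decay ** dist)   # causal mask with exponential decay
    scores = (q @ k.T) * D          # one (T, T) matmul over the sequence
    return scores @ v               # one (T, C) matmul
```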
I personally recommend considering our paper's claim from a theoretical perspective: the current model runs well, but may be bounded in the future. As for the other misunderstandings, I believe they are not a big problem.
Thanks for clarifying that it is not about throughput, but about the need to compute the previous token's state in order to compute the next token's state (for a given data sample).
In this regard, there is no dispute on my side. You are right in your understanding, and this is a fundamental design characteristic of RWKV. (As much as we have optimised it in various forms, all the upper layers still depend on the first token's layer block within a data sample, so that much I can agree on.)
A few of the original RWKV paper authors had actually pointed out that this could have been the interpretation, but were not certain, due to the lack of a definition of the term in the v1 paper.
Part of the public confusion might simply be due to the fact that a Google search for "Training Parallelization" leads to articles like the following
Ideally, there should be a definition / clarification to avoid further confusion.
But the above explanation is sufficient for me, as I can now form a proper answer accordingly.
I hope you understand that I did not want to mis-explain what the term meant, or misrepresent your claims.
Either way, perhaps we should view the growing interest from very random folks (and their misunderstanding of the paper) as a good sign that there is growing interest in new alternatives to transformers, be it "Retentive Network" or "RWKV".
After all, we need more, not fewer, alternatives 😊
PS: I believe the bf16 matrix multiplications do make use of Tensor Cores via PyTorch, but I don't think that's the main issue here.
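(For reference, a minimal example of what I mean, assuming an Ampere-or-newer GPU: under autocast the bf16 matmuls dispatch to Tensor Core GEMM kernels.)

```python
import torch

# On an Ampere-or-newer GPU, bf16 matmuls dispatch to Tensor Core GEMMs.
x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = x @ w            # executed as a bf16 Tensor Core GEMM
print(y.dtype)           # torch.bfloat16
```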
> We don't mention anything about multi-GPU parallelization... so the misunderstanding on the Discord is not what we hoped for.
I now understand your intention, but was deeply confused by your paper because I've never seen anyone use language like that before. Given that it's already causing widespread confusion and misunderstanding, do you think you could change the language to something else?
Don't worry, you understand that now, and we will give a more detailed explanation in the next version of our paper.
I was wondering when this "next version of our paper" was expected to come out. The most recent version was released the same day as your comment and seems to still have the misleading language. On a practical front, this use of language has continued to cause confusion, and we have even seen people claim exactly what you are professing not to claim, citing your paper as the source.
This is pertaining to the paper published here: https://arxiv.org/abs/2307.08621
Unfortunately, as I could not find contact information directly, we would like to ask for clarification here: what does it mean that RWKV does not support "training parallelization"?
This has caused quite a bit of confusion within the community, as RWKV was designed to take advantage of multiple GPUs and be trained in parallel across multiple data samples concurrently. Additionally, due to the cascading nature of the RWKV state across tokens, it is able to scale up training in a cascading pattern in parallel across multiple tokens (see the diagram here: https://wiki.rwkv.com/advance/architecture.html#how-does-rwkv-differ-from-classic-rnn ).
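A rough sketch of the cascading ("wavefront") dependency pattern shown in that diagram, under the simplifying assumption that each cell is the state update for one (layer, token) pair: cell (l, t) only needs (l, t-1) and (l-1, t), so every cell on the same diagonal can be computed in parallel.

```python
# Illustrative sketch of the cascading / wavefront pattern: the state at
# (layer l, token t) depends only on (l, t-1) and (l-1, t), so all cells
# on the same diagonal l + t = d are mutually independent.
def wavefront_schedule(n_layers, n_tokens):
    for d in range(n_layers + n_tokens - 1):
        yield [(l, d - l) for l in range(n_layers) if 0 <= d - l < n_tokens]

for step, cells in enumerate(wavefront_schedule(n_layers=4, n_tokens=6)):
    print(f"step {step}: compute in parallel -> {cells}")
```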
If this was a mistake in the paper, the RWKV community would like it to be fixed; otherwise, we would like a footnote / clarification on what it means that RWKV does not support "training parallelization".
Hoping that someone from the authoring team can be tagged.