microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Clarification on RWKV not supporting "Training Parallelization" in the arXiv paper #1243

Open PicoCreator opened 1 year ago

PicoCreator commented 1 year ago

This is pertaining to the paper published here: https://arxiv.org/abs/2307.08621

Unfortunately, as I could not find contact information directly, we would like to ask here for clarification: what does it mean that RWKV does not support "training parallelization"?

(Screenshot of the relevant table from page 6 of the paper.)

This has caused quite a lot of confusion within the community, as RWKV was designed to take advantage of multiple GPUs and be trained in parallel across multiple data samples concurrently. Additionally, due to the cascading nature of the RWKV state across tokens, training can scale up in a cascading pattern across multiple tokens in parallel (see the diagram here: https://wiki.rwkv.com/advance/architecture.html#how-does-rwkv-differ-from-classic-rnn ). A rough sketch of what I mean by data-sample parallelism follows below.
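To make the data-sample parallelism concrete, here is a rough, hypothetical sketch (a toy module with toy dimensions, not the actual RWKV-LM trainer code): the batch dimension is processed concurrently, and wrapping the model in DistributedDataParallel additionally splits those batches across GPUs.

```python
# Illustrative sketch only -- a toy module, not the RWKV-LM trainer code.
# It shows that the batch dimension is fully parallel: every sample in the
# batch is processed at the same time, and DistributedDataParallel then
# shards batches across GPUs on top of that.
import torch
import torch.nn as nn

class ToyBatchedModel(nn.Module):
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, idx):                  # idx: (batch, seq_len)
        x = self.emb(idx)                    # (batch, seq_len, d_model)
        # ... time-mixing / channel-mixing blocks would go here ...
        return self.head(x)                  # (batch, seq_len, vocab)

model = ToyBatchedModel()
tokens = torch.randint(0, 1000, (8, 1024))  # 8 samples trained concurrently
logits = model(tokens)
# Wrapping in torch.nn.parallel.DistributedDataParallel(model) is how the
# samples are then spread across multiple GPUs / nodes.
```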

If this was a mistake in the paper, the RWKV community would like it to be fixed; otherwise, we would appreciate a footnote / clarification on what it means that RWKV does not support "training parallelization".

Hoping that someone from the authoring team can be tagged.

sunyt32 commented 1 year ago

It's not complicated. In RWKV-4, sequence modeling is implemented as a recurrent pattern in CUDA code, which can't fully utilize the SMs and Tensor Cores. Even if the community claims it has potential for a parallel scan, there was no such implementation when we published our paper. Roughly, the recurrent pattern has the shape of the sketch below.
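As a simplified PyTorch illustration (not the actual RWKV CUDA kernel, and it omits the WKV normalisation and bonus terms), the recurrent pattern looks like this: each step needs the previous state, so the time dimension is processed serially.

```python
# Simplified sketch of a recurrent (serial-in-time) formulation.
# NOT the actual RWKV CUDA kernel -- only an illustration of why the time
# dimension cannot be expressed as one large matmul in this form.
import torch

def recurrent_mix(k, v, decay):
    # k, v: (batch, seq_len, d); decay: (d,)
    B, T, D = k.shape
    state = torch.zeros(B, D)
    outs = []
    for t in range(T):                            # serial loop over tokens
        state = torch.exp(-decay) * state + torch.exp(k[:, t]) * v[:, t]
        outs.append(state)
    return torch.stack(outs, dim=1)               # (batch, seq_len, d)

out = recurrent_mix(torch.randn(2, 16, 8), torch.randn(2, 16, 8), torch.rand(8))
```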

As for the cascading nature, every decoder has this property; the question is whether it is hardware-friendly. If your community accelerates training that way, it would be better to show the code and an evaluation.

Nevertheless, even though RWKV doesn't exploit parallelism thoroughly, this shouldn't be a problem for you. The "training parallelization" entry states a fact, but it doesn't affect your actual training process, which I believe is still not slow.

PicoCreator commented 1 year ago

Thanks for the clarification - are you referring to the RWKV-v4 or the RWKV-v4neo code in BlinkDL's main repo here? https://github.com/BlinkDL/RWKV-LM

The former (v4) is a reference implementation; the latter (v4neo) is the recommended trainer, which has been optimised to make better use of the GPU. There are no architecture changes between the two (they use the same models).

The v4neo trainer has been available since September 2022.

Currently, with the right batch size and a sufficient training context size (e.g. 4k/8k), it is possible to configure RWKV to keep the GPUs under sustained load - the following is an example from a recent training run I did across 8 GPUs.

(Screenshot: GPU utilisation graph from the 8-GPU training run. The dips in between are checkpoints; the usage graph should be similar to other transformer trainers.)

Admittedly, getting the configuration right to make full use of all the available GPUs is really "not obvious" in the default RWKV trainer, and that's our fault in the RWKV community for not having clearer and better documentation for this process. We can also roll the trainer code back to a point before the paper, and assist with the required configuration, if needed.


The reason I am seeking clarification is that, since the paper was preprinted on 17 July, we have been getting questions on the RWKV Discord every few days from readers of the paper, asking what it means that RWKV does not support "Training Parallelization".

Example: "While RWKV does not support <___ Training Parallelization definition ___>, it is still able to heavily utilise multiple GPUs in parallel across multiple nodes with the right configuration."


PS: I am not the author of RWKV nor of the RWKV paper; I am simply a volunteer on the Discord who is getting a bit tired of explaining "yes, we support multi-GPU training" and "I do not know what the authors of the paper mean by not supporting Training Parallelization, please ask them instead". There have been readers who have interpreted the paper as saying "RWKV does not support data parallelism / multi-GPU training".

As such, if there is a clear technical definition of why RWKV does not fit under this term, I would gladly point to it for future inquiries.

sunyt32 commented 1 year ago

We don't mention anything about multi-GPU parallelization... so the misunderstanding on the Discord is not what we hoped for.

The point we emphasize is parallelization along the sequence, as we write in the paper: "RWKV replaces AFT's position embeddings with exponential decay and runs the models recurrently for training and inference". When the sequence gets longer, GPU utilization may be limited by this.

I don't doubt the throughput of RWKV; I have also reproduced it in our pipeline. In the paper, we aim to speak more academically, where the parallelization should be both along the sequence and along the channel dimension. To check this, I read the RWKV-4-neo code; it still seems recurrent in the sequence dimension, which is not sequence parallelization and doesn't use more advanced hardware acceleration like Tensor Cores. A toy sketch of the contrast is below.
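For contrast, a sequence-parallel formulation computes the whole sequence with dense matmuls, roughly like this toy attention-style sketch (illustrative only, not RetNet or RWKV code):

```python
# Toy sketch of a sequence-parallel formulation: the whole (T x T) token
# interaction is built from batched matmuls, so training needs no
# per-token loop. Illustrative only -- not RetNet or RWKV code.
import torch

def parallel_mix(q, k, v):
    # q, k, v: (batch, seq_len, d)
    B, T, D = q.shape
    scores = q @ k.transpose(-1, -2) / D ** 0.5             # (B, T, T) matmul
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v                # (B, T, D) matmul

out = parallel_mix(torch.randn(2, 16, 8), torch.randn(2, 16, 8), torch.randn(2, 16, 8))
```

Here the interaction over the sequence is two large matmuls, which is the part that maps well onto Tensor Cores during training.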

I personally recommend considering our paper's claim from a theoretical perspective: the current model runs well but may hit this bound in the future. As for the other misunderstandings, I believe they are not a big problem.

PicoCreator commented 1 year ago

Thanks for clarifying that it is not about throughput, but rather about the need to compute the previous token's state in order to compute the next token's state (within a given data sample).

In this regard, there is no dispute on my side. You are right in your understanding, and this is a fundamental design characteristic of RWKV. (As much as we have optimised it in various forms, every later token's layer block still depends on the first token's layer block within a data sample, so that much I can agree on.) In simplified form, the dependency looks like the recurrence below.
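Dropping the normalisation and bonus terms of the actual WKV operator (so treat this as an illustrative simplification, not the exact formula from the RWKV paper), the per-channel state carried along a sample behaves like

$$
s_t = e^{-w} \odot s_{t-1} + e^{k_t} \odot v_t ,
$$

so within one data sample, s_t cannot be computed before s_{t-1} is available, while the batch dimension remains free to parallelise.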

A few of the original RWKV paper authors had actually pointed out that this could have been the intended interpretation, but they were not certain, due to the lack of a definition of the term in the v1 paper.

Part of the confusion among the public might simply be due to the fact that a Google search for "Training Parallelization" leads to articles like the following

Ideally, there should be a definition / clarification to avoid further confusion.

But the above explanation is sufficient for me, as I can now form a proper answer accordingly.

I hope you understand that I did not want to mis-explain what the term meant or misrepresent your claims.

Either way, perhaps we should view the growing interest from random folks (and their misunderstandings of the paper) as a good sign that there is growing interest in new alternatives to transformers, be it "Retentive Network" or "RWKV".

After all, we need more alternatives, not fewer 😊


PS: I believe the bf16 matrix multiplications do make use of Tensor Cores via PyTorch, but I don't think that's the main issue here. For instance:
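A tiny illustration of what I mean (not taken from the RWKV trainer): plain bf16 matmuls in PyTorch are dispatched to Tensor Cores on Ampere-class or newer GPUs.

```python
# Plain bf16 matmuls in PyTorch run on Tensor Cores on Ampere-class+ GPUs.
# These are the channel-dimension matmuls; the time recurrence is separate.
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b
```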

StellaAthena commented 1 year ago

> We don't mention anything about multi-GPU parallelization... so the misunderstanding on the Discord is not what we hoped for.

I now understand your intention, but was deeply confused by your paper because I've never seen anyone use language like that before. Given that it's already causing widespread confusion and misunderstanding, do you think you could change the language to something else?

sunyt32 commented 1 year ago

>> We don't mention anything about multi-GPU parallelization... so the misunderstanding on the Discord is not what we hoped for.
>
> I now understand your intention, but was deeply confused by your paper because I've never seen anyone use language like that before. Given that it's already causing widespread confusion and misunderstanding, do you think you could change the language to something else?

Don't worry, you understand that now, and we will give a more detailed explanation in the next version of our paper.

StellaAthena commented 1 year ago

>>> We don't mention anything about multi-GPU parallelization... so the misunderstanding on the Discord is not what we hoped for.
>>
>> I now understand your intention, but was deeply confused by your paper because I've never seen anyone use language like that before. Given that it's already causing widespread confusion and misunderstanding, do you think you could change the language to something else?
>
> Don't worry, you understand that now, and we will give a more detailed explanation in the next version of our paper.

I was wondering when this "next version of our paper" was expected to come out. The most recent version was released the same day as your comment and seems to still have the misleading language. On a practical front, this use of language has continued to cause confusion and we have even seen people claim exactly what you are professing not to claim and cite your paper as the source.