huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[RFC] adding Tensor and Pipeline Parallelism to transformers #13690

Closed stas00 closed 1 year ago

stas00 commented 3 years ago

Following up on this proposal https://github.com/huggingface/transformers/issues/12772 I just had a discussion with @hyunwoongko (with great help from @JakeTae who patiently translated for us), and we tried to discuss a strategy of how to best integrate Tensor Parallelism (TP) and Pipeline Parallelism (PP) into transformers, making it easy for reviewers and the contributors. Note that parallelformers currently implements only TP.

So here is a great example of how TP can be added, as @hyunwoongko has already implemented it in his fork for GPTNeo https://github.com/tunib-ai/transformers/commit/5bf8655be624b3aeda799b80fddd220213491b04 (he didn't use GPT2 since it already has the naive PP implemented). So you can see exactly what we want to merge. It's a very thin layer on top of the model, and most of the functionality is in the helper parallel utils. At the end of the change there are multiple tests/examples that need to be converted to our test framework.
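For readers who want a concrete picture of what TP does to a layer, here is a rough, illustrative sketch (my own, not the parallelformers or Megatron-LM code) of a Megatron-style split of a transformer MLP: the first linear is sliced column-wise across GPUs, the second row-wise, and a single all-reduce restores the full output.

import torch
import torch.distributed as dist

class TensorParallelMLP(torch.nn.Module):
    # Illustrative only: a hidden -> 4*hidden -> hidden MLP whose weights are sharded
    # across the ranks of tp_group; activations stay replicated on every rank.
    def __init__(self, hidden, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(group=tp_group)
        # column-parallel: each rank owns a slice of the output features
        self.fc_in = torch.nn.Linear(hidden, 4 * hidden // tp_size)
        # row-parallel: each rank owns a slice of the input features; bias is skipped
        # here so the summed partial outputs are not over-counted
        self.fc_out = torch.nn.Linear(4 * hidden // tp_size, hidden, bias=False)

    def forward(self, x):
        y = torch.nn.functional.gelu(self.fc_in(x))
        y = self.fc_out(y)                       # partial result on each rank
        dist.all_reduce(y, group=self.tp_group)  # sum of partials = full MLP output
        return y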

Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models because they include many features that interfere with implementing PP, which requires:

  1. the model to be nn.Sequential, and
  2. inputs/outputs to be simple tensors with the first dimension being the batch size.

So to implement PP we will most likely have to fork each model, strip the features that are unnecessary for scalability, and only then be able to implement PP.
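To make those two requirements concrete, here is a hedged sketch (hypothetical code, not an existing transformers model) of the shape most PP frameworks expect: the model flattened into an nn.Sequential whose stages pass a single batch-first tensor from one to the next.

import torch
from torch import nn

class Block(nn.Module):
    # hypothetical transformer block that takes and returns one plain tensor
    def __init__(self, hidden):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

    def forward(self, hidden_states):  # [batch, seq, hidden]
        attn_out, _ = self.attn(hidden_states, hidden_states, hidden_states)
        return hidden_states + self.mlp(attn_out)

hidden, n_layers = 768, 12
model = nn.Sequential(
    nn.Embedding(50257, hidden),                 # first stage: token ids -> hidden states
    *[Block(hidden) for _ in range(n_layers)],
    nn.LayerNorm(hidden),                        # last stage: final norm (an LM head would follow)
)
# A PP framework (torch's Pipe, DeepSpeed's PipelineModule, ...) can then cut this
# Sequential into balanced stages - which is exactly what current HF models are not.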

So my thinking is that perhaps we do it from the get-go? Instead of integrating TP into the normal model - say GPTNeo - we fork it into, say, GPTNeo3D from the get-go and do all the work, including TP and PP, on that new model. Once everybody is happy we can rinse and repeat for other models.

I added 3D to GPTNeo to make GPTNeo3D - 3D = DP/TP/PP - I'm not exactly sure about this particular name nor attached to it, it's just something to start with.

Also, once TP is implemented in, say, GPTNeo3D, we can start replicating it to other models, because parallelformers has them all covered already. PP will be much harder, and we can do that in parallel.

I wanted to check in with the team to see if this approach resonates better, rather than modifying the existing models.

Thank you!

Also see this blog post explaining parallelformers.


Additionally see the main pytorch Parallelism discussion at https://github.com/pytorch/rfcs/pull/32

@LysandreJik, @sgugger, @patrickvonplaten

siddk commented 3 years ago

@stas00 - I like this a lot! And as we've been dragging our feet with implementing some of the Megatron 3D parallelism into mistral - I think it might be a great way for us to collaborate; we can just start with the base GPT-2 model perhaps?

I think my (and the Mistral team's) addition over the next few weeks will be trying to do some benchmarking of Megatron and the existing gains with various subsets of parallelism (at a very fundamental level - profiling which kernels are being called, etc.) and maybe creating a set of unit tests to verify correctness?

Separately - might be worth keeping logs of how to "3D-ify" new models, and ways we might make that procedure even easier moving forward.

Let me know if this makes sense!

hyunwoongko commented 3 years ago

@stas00 @siddk If we are creating a new class, we do not need to modify the existing parallelize() method, so we do not need to work with GPTNeo. I think GPT2 would be better.

stas00 commented 3 years ago

Thanks for the feedback, Sidd.

The reason @hyunwoongko thought of starting with GPTNeo was that GPT2 already has the naive PP parallelize(). But the problem is that it's not just in the model, it's also in the Trainer. So we probably need to choose some other name for that function altogether, at least for the time being, so that we can move forward.

Note that the intention is to do simple things first and not do too many things at once. So I think starting with GPTNeo on a clean slate is a better idea. Once it's in good shape it'd be trivial to replicate it to GPT2. And it's already done, as you can see from the link in the OP.

Here is my vision of 3Difying transformers:

step 1. implement TP in one model
step 2a. start replicating TP to other models
step 2b. start working on PP in one model
step 3a. start replicating PP to other models.

note how step 2 can be done in parallel by different people.

So I can see that Mistral's team efforts would be parallel work and not sequential. So for example:

step 3b. implement Mistral's GPT2 improvements to GPT2
step 4a. start replicating it to other models.

If we were to start with GPT2 we would interfere with your work, Sidd, so I think it's actually best if we pick 2 different starting models.

But let's stay focused in this discussion on TP+PP, otherwise it'd be too easy to get side-tracked. We already spent too much time talking - let's see some code going into transformers! :)

wrt trainers, it'll be a natural part of the work - I'm not worried too much about it. I don't know much about accelerate yet, but HF Trainer should be relatively easy.

siddk commented 3 years ago

This makes a lot of sense to me - thanks @stas00 and @hyunwoongko for the clarifications! The steps above form a pretty good concrete plan - but if you both are already planning on tackling it, maybe it makes sense for us to tackle some of the other Megatron-LM improvements first, like the custom loss scaling/kernels/etc. (in mistral, so we can break things 😅)? And as y'all build the "main API" for 3D parallelism, we can just drop that in, and train larger models!

The PR with Mistral's first set of GPT-2 improvements is waiting on approval right now - once that's in, we can move a bit faster as well.

stas00 commented 3 years ago

That sounds like a perfect plan to me, Sidd.

hyunwoongko commented 3 years ago

@stas00 I think the following method is not good for megatron-friendly method.

step 1. implement megatron-friendly TP in one model
step 2a. start replicating megatron-friendly TP to other models
step 2b. start working on megatron-friendly PP in one model
step 3a. start replicating megatron-friendly PP to other models.

Ultimately, implementing PP requires rewriting all the modeling code (including GPT2Attention, GPT2MLP, GPT2Model, ...). I wasn't familiar with PP until recently, but now that I've become very familiar with it, I've found that we have to rewrite all of that code (generation_utils.py, used for inference, also has to change). Therefore, I recommend that megatron-friendly TP and PP be implemented together. (I think it's inefficient to implement megatron-friendly TP alone.)

hyunwoongko commented 3 years ago

The transformers-friendly method (=parallelformers) has the advantage of being able to extend to models quickly because it does not need to rewrite the modeling code (it uses the existing transformers code), but it is not compatible with PP. So we would have to remove all the transformers-friendly TP when implementing PP. Which strategy we take is a matter of choice. We can quickly expand coverage in a transformers-friendly way, and then change models one by one to be megatron-friendly, like:

step 1. implement transformers-friendly TP in one model
step 2a. start replicating transformers-friendly TP to other models
step 2b. start working on megatron-friendly TP + PP in one model
step 3a. start replicating megatron-friendly TP + PP to other models.

Or there is the option of not implementing the transformers-friendly method at all, because it will be removed anyway. But since there are thousands of lines of code to write for megatron-friendly and only tens of lines for transformers-friendly, the megatron-friendly approach will scale very slowly.

step 1. start working on megatron-friendly TP + PP in one model
step 2. start replicating megatron-friendly TP + PP to other models.

One thing to note is that the transformers-friendly TP implementation is completely eliminated when implementing the megatron-friendly TP. A megatron-friendly TP is implemented differently from a transformers-friendly TP.
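Roughly speaking, and only as my own illustration rather than the actual parallelformers code, the transformers-friendly approach slices the weights of the existing linear modules in place instead of rewriting the modeling files:

import torch.distributed as dist

def shard_linear_(linear, tp_group, dim):
    # Keep only this rank's slice of an existing nn.Linear.
    # dim=0 slices output features (column-parallel), dim=1 input features (row-parallel).
    rank = dist.get_rank(group=tp_group)
    world = dist.get_world_size(group=tp_group)
    linear.weight.data = linear.weight.data.chunk(world, dim=dim)[rank].contiguous()
    if dim == 0 and linear.bias is not None:
        linear.bias.data = linear.bias.data.chunk(world, dim=0)[rank].contiguous()
    # forward hooks on the module then add the all-reduce / all-gather communication,
    # so the original GPT2Attention / GPT2MLP code never has to change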

sgugger commented 3 years ago

Adding a GPTNeo3D to experiment seems like a good idea to me. At the end of the day, that modeling file can live in the same folder as modeling_gptneo.py.

Note that while you experiment, you can leverage #13467 to share models on the Hub that have no implementation in Transformers and still work with the auto-model API.

stas00 commented 3 years ago

Adding a GPTNeo3D to experiment seems like a good idea to me. At the end of the day, that modeling file can live in the same folder as modeling_gptneo.py.

Great!

Note that while you experiment, you can leverage #13467 to share models on the Hub that have no implementation in Transformers and still work with the auto-model API.

The 3D GPTNeo model's weights are the same as a normal GPTNeo model's - i.e. it can be used w/ or w/o PP/TP - so I'm not sure why we need a special API?

And I guess we won't be able to use AutoModel, because the config.model_type will say 'gpt_neo', but we will want to load it with GPTNeo3D* classes.

stas00 commented 3 years ago

@hyunwoongko, you're bringing up excellent points.

I suppose the main question is how much of a benefit we can give to users by having just TP. My thinking is that if it's easy to add TP to all models and since you have already done this, let's do it.

I'm concerned that adding PP will be a very slow process because as you said it requires massive rewrites to the model's code, and meanwhile those models that are waiting their turn won't be very scalable (except with Deepspeed ZeRO).

Besides we can delegate the TP adding to the rest of the models to others (other developers and even community) since it's mostly just replaying the code you have already written. But it still requires work, at least in adding tests and documentation, and then PRs.

The only concern with adding TP the transformers-friendly way is making sure the external API remains the same when we add PP.

How does that sound?

hyunwoongko commented 3 years ago

@stas00 But anyway, I don't prefer PP. As you know, PP is memory inefficient because it is not compatible with ZeRO 2, 3. In fact, we also decided not to use PP when developing language models. So adding just TP would be helpful for many people. So let's go with the following strategy, but, as you said, the API for both methods should remain the same.

step 1. implement transformers-friendly TP in one model
step 2a. start replicating transformers-friendly TP to other models
step 2b. start working on megatron-friendly TP + PP in one model
step 3a. start replicating megatron-friendly TP + PP to other models.

But the transformers-friendly TP has no reason to rewrite the modeling code. What should we do?

stas00 commented 3 years ago

That's great, @hyunwoongko!

And once we complete GPTNeo3D with TP we can decide whether to fold it back into the normal GPTNeo model or keep it separate. I'm saying that if in the end we do PP only for a few select models (which is a real possibility), then there is absolutely no need to fork 60 models and create a lot more maintenance work for transformers if they will have just TP+DP.

hyunwoongko commented 3 years ago

@stas00

In my opinion, the transformers-friendly TP has no reason to have its own modeling code like GPTNeo3D.

  1. So the transformers-friendly TP will just use the existing model
  2. And let's make a new modeling class such as GPT2For3D when we develop the megatron-friendly TP + PP (GPT2, Bert, T5, etc. - it will probably be some models, not all.)

hyunwoongko commented 3 years ago

I'm thinking of an API like this.

from transformers import GPTNeoModel

model = GPTNeoModel.from_pretrained("elutherai/gpt-neo-1.3B", tensor_model_parallel_size=4)

or

model = GPTNeoModel.from_pretrained("elutherai/gpt-neo-1.3B", tp=4)

I implemented the megatron-friendly model internally like this:


@classmethod
def from_yaml(
    cls,
    cfg_path: str,
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    tp: int = None,
    pp: int = None,
):
    """
    Create model from yaml config file

    Args:
        cfg_path: path of configurations
        tensor_model_parallel_size: tensor model parallel world size
        pipeline_model_parallel_size: pipeline model parallel world size
        tp (int): equivalent to `tensor_model_parallel_size`
        pp (int): equivalent to `pipeline_model_parallel_size`
    """

    if tp is not None:
        assert tensor_model_parallel_size == 1, (
            "you can't use param `tensor_model_parallel_size` and `tp` at the same time. "
            "they are equivalent. so please use one of them."
        )
        tensor_model_parallel_size = tp

    if pp is not None:
        assert pipeline_model_parallel_size == 1, (
            "you can't use param `pipeline_model_parallel_size` and `pp` at the same time. "
            "they are equivalent. so please use one of them."
        )
        pipeline_model_parallel_size = pp

stas00 commented 3 years ago

I totally agree, that this is a much better way to proceed.

@sgugger, is it ok if we change the initial proposal and add TP to the normal model classes? As we continued discussing this, and based on my experience with trying to add PP to transformers, it'll be a huge amount of work to do it for all models, so it's very likely many models will never get it. And since TP requires no changes to the models, there is no reason to make it difficult for users and maintainers by forking the model for that feature to work.

And we believe just having TP+DP will already be a great boon to the scalability of the models (if Deepspeed ZeRO doesn't already address this for whatever reason).

For PP new classes will be needed 100%.

Thank you.

sgugger commented 3 years ago

As long as the changes are minimal, no objection from my side. I agree it makes much more sense to get that out if it's faster and deliver the PP later on.

hyunwoongko commented 3 years ago

The problem is the 'parallelize()' method, the API for layer-wise naive parallelism in GPT2 and T5. Do you agree to remove this method? The megatron-friendly TP + PP cannot handle it that way, because in the case of PP, parallelization occurs at model creation time. That's why I let from_pretrained take the tp and pp sizes as input.

stas00 commented 3 years ago

I'm thinking of an API like this.

from transformers import GPTNeoModel

model = GPTNeoModel.from_pretrained("elutherai/gpt-neo-1.3B", tensor_model_parallel_size=4)

or

model = GPTNeoModel.from_pretrained("elutherai/gpt-neo-1.3B", tp=4)

I think transformers tends to go with more spelled-out args, but not too long, so perhaps tensor_parallel_size=4

the problem is the 'parallelize()' method, the naive parallelism (layer-wise) implementation. Do you agree to remove this method? The megatron-friendly TP + PP cannot handle it that way. This is because in the case of PP, parallelization occurs at the time of model creation. That's why I let from_pretrained take the tp and pp sizes as input.

The naive PP is experimental:

https://github.com/huggingface/transformers/blob/50c746eeb71f7b8f95a264b09249c9555cdd2e17/src/transformers/models/gpt2/modeling_gpt2.py#L527-L529

but we shouldn't remove it until we replace it with real PP, because users actively use the naive PP at the moment.
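For reference, the experimental API in question looks roughly like this (a sketch from memory, so the exact device_map values may differ by model size):

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
# naive, layer-wise model parallelism: device id -> list of transformer block indices
device_map = {0: list(range(0, 24)), 1: list(range(24, 48))}
model.parallelize(device_map)
# ... generate / train as usual, then undo the split:
model.deparallelize()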

That's why we proposed to work on GPTNeo first, so that it's easier to take our time and not have that older code interfere.

hyunwoongko commented 3 years ago

@stas00

I think transformers tends to go with more spelled-out args, but not too long, so perhaps tensor_parallel_size=4

So I made it support both variants (long name and short name). Not good?

but we shouldn't remove it until we replace it with real PP, because users actively use the naive PP at the moment. That's why we proposed to work on GPTNeo first, so that it's easier to take our time and not have that older code interfere.

I totally agree with you. Let's start from GPTNeo.


The second thing to discuss is the embedding layer. When I implemented parallelformers, I didn't actually parallelize the embedding layer; in that case, the embedding layer is copied to all GPUs, which is memory inefficient. But in fact we can apply VocabParallelEmbedding and VocabParallelCrossEntropy (however, we should not use the original CrossEntropy in this case). We also need to decide whether or not to add VocabParallelEmbedding to the transformers-friendly TP.

I didn't tell you guys, but I actually experimented little by little. I already figured out that I can do VocabParallelEmbedding internally with the transformers-friendly TP.
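For readers unfamiliar with it, this is the general idea behind a Megatron-style VocabParallelEmbedding - a simplified sketch of mine, not the Megatron-LM or parallelformers implementation (a real version wraps the all-reduce in an autograd function so gradients flow correctly):

import torch
import torch.distributed as dist

class VocabParallelEmbedding(torch.nn.Module):
    # Each TP rank stores only its contiguous slice of the vocabulary rows.
    def __init__(self, num_embeddings, embedding_dim, tp_group):
        super().__init__()
        self.tp_group = tp_group
        world = dist.get_world_size(group=tp_group)
        rank = dist.get_rank(group=tp_group)
        per_rank = num_embeddings // world
        self.vocab_start = rank * per_rank
        self.vocab_end = self.vocab_start + per_rank
        self.weight = torch.nn.Parameter(torch.empty(per_rank, embedding_dim))

    def forward(self, input_ids):
        # look up only the ids this rank owns, zero out the rest ...
        mask = (input_ids < self.vocab_start) | (input_ids >= self.vocab_end)
        local_ids = (input_ids - self.vocab_start).masked_fill(mask, 0)
        out = torch.nn.functional.embedding(local_ids, self.weight)
        out = out.masked_fill(mask.unsqueeze(-1), 0.0)
        # ... and an all-reduce recovers the full embedding output on every rank
        dist.all_reduce(out, group=self.tp_group)
        return out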

stas00 commented 3 years ago

@stas00

I think transformers tends to go with more spelled-out args, but not too long, so perhaps tensor_parallel_size=4

So I made it support both variants (long name and short name). Not good?

At the moment I don't recall transformers using shortcut aliases for arg names, so probably just having tensor_parallel_size is fine. (no need to repeat "model_" as the shorter name I proposed is not ambiguous)

The second thing to discuss is the embedding layer. When I implemented parallelformers, I didn't actually parallelize the embedding layer. In this case, the embedding layer is copied to all GPUs. Therefore, it is memory inefficient. But in fact we can apply VocabParallelEmbedding and VocabParallelCrossEntropy. (However, we should not use the original CrossEntropy in this case) we also need to decide whether or not to add VocabParallelEmbedding to the transformers-friendly TP.

Was CrossEntropy the reason for not doing it in the first place in parallelformers? I guess the integration will allow us to overcome this then, if I understood your comment correctly.

But otherwise by all means let's make TP as efficient as possible.

hyunwoongko commented 3 years ago
  1. I like the name tensor_parallel_size more, but I named it tensor_model_parallel_size because I wanted to follow the Megatron-LM nomenclature. In fact, if we pass the mpu to DeepSpeed, methods such as mpu.XXX_model_parallel_rank() are called inside it. Therefore, it is better to unify the names (see the interface sketch after this list).

  2. Since parallelformers is an inference-only toolkit, there was no reason to worry about CrossEntropy. The reason I didn't do it at the time was because it was a bit complicated. (But it's not difficult.)
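For context on point 1, this is roughly the kind of interface DeepSpeed expects from the mpu object it is handed (method names assumed from Megatron-LM conventions, so treat this as a sketch rather than the exact required set):

class MPU:
    # minimal accessors a DeepSpeed engine typically calls on the mpu it receives
    def get_model_parallel_rank(self): ...
    def get_model_parallel_world_size(self): ...
    def get_model_parallel_group(self): ...
    def get_data_parallel_rank(self): ...
    def get_data_parallel_world_size(self): ...
    def get_data_parallel_group(self): ...

# hypothetical usage: engine, *_ = deepspeed.initialize(model=model, mpu=MPU(), config=ds_config)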

How about implementing it with options first?

from_pretrained(tensor_model_parallel_size=4, embedding_parallelism=True)

stas00 commented 3 years ago
  1. I like the name tensor_parallel_size more, but I named it tensor_model_parallel_size because I wanted to follow the Megatron-LM nomenclature. In fact, if we input the mpu to DeepSpeed, methods such as mpu.XXX_model_parallel_rank() are called inside it.

Ah, ok, we can use tensor_model_parallel_size then to make things easier to cross-reference. Maybe also add a note on why this particular name was chosen.

  2. Since parallelformers was inference-only in the first place, there was no reason to worry about CrossEntropy. The reason I didn't do it at the time was because it was a bit complicated. (But it's not difficult.)

Ah, right, I forgot that parallelformers was intended for inference only in the first place. Yes, so what you proposed is a good idea.

stas00 commented 3 years ago

How about implementing it with options first?

from_pretrained(tensor_model_parallel_size=4, embedding_parallelism=True)

Is there a technical reason for not always doing the latter?

hyunwoongko commented 3 years ago

Because of VocabParallelCrossEntropy. The user should be able to use a loss function other than CrossEntropy with the Transformers model (RMS, Center Loss, Large-margin softmax, ...). With VocabParallelEmbedding, the loss function should handle this appropriately. You can check this: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/cross_entropy.py

So I thought the default value of embedding_parallelism could be False, turning it on when the user wants to.
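For completeness, the linked Megatron cross_entropy.py boils down to the following idea - a condensed sketch of mine, not a drop-in copy: with the logits sharded over the vocab dimension, each reduction inside the softmax/cross-entropy is replaced by an all-reduce over the TP group.

import torch
import torch.distributed as dist

def vocab_parallel_cross_entropy(logits, target, vocab_start, vocab_end, tp_group):
    # logits: [batch, seq, vocab_shard] on each rank, sharded along the vocab dim
    # 1) stabilize with the global max over the full vocabulary
    logits_max = logits.max(dim=-1).values
    dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=tp_group)
    logits = logits - logits_max.unsqueeze(-1)

    # 2) pick out the target logit; only the rank owning the target id contributes
    mask = (target < vocab_start) | (target >= vocab_end)
    local_target = (target - vocab_start).masked_fill(mask, 0)
    target_logits = logits.gather(-1, local_target.unsqueeze(-1)).squeeze(-1)
    target_logits = target_logits.masked_fill(mask, 0.0)
    dist.all_reduce(target_logits, group=tp_group)  # sum: exactly one rank is non-zero

    # 3) log-sum-exp over the full vocab via an all-reduce of the local sums
    sum_exp = logits.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, group=tp_group)
    return torch.log(sum_exp) - target_logits       # per-token loss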

stas00 commented 3 years ago

Thank you for the explanation, Hyunwoongko.

Then yes we need that arg. Should the default be False then, so the priority is for the user code to work out of the box and we document embedding_parallelism=True as an optimization?

Further, embedding is ambiguous since we have different types, should we say explicitly word_embed_parallelism?

hyunwoongko commented 3 years ago

Oh, I was wrong. Only tying the output embeddings causes problems with the loss function. I checked, and it doesn't matter since neither gpt2 nor gpt neo ties output embeddings.

In most cases, we don't need to worry about the loss function. Therefore, I will implement embedding parallelism so it always works, and this option becomes unnecessary; users do not need to worry about it. If I later find a model that ties input and output embeddings without an lm head, I will think about it then.

hyunwoongko commented 3 years ago

But maybe Meg-DS and GPT-NeoX use embedding tying. So this option will be needed in the future.

stas00 commented 3 years ago

If I'm not mistaken many models have input and output embeddings tied.

deepakn94 commented 3 years ago

Hi all, I helped implement pipeline parallelism in Megatron (and was also one of the lead authors on the PipeDream project). Happy to answer any questions.

I had a question too: what is the current plan for the new PP-friendly model classes? What is going into these, and how will they be different from the vanilla model classes?

Thanks!

stas00 commented 3 years ago

Hi all, I helped implement pipeline parallelism in Megatron (and was also one of the lead authors on the PipeDream project). Happy to answer any questions.

Thank you for joining in and offering to support this endeavour, Deepak!

Have you looked at the recent version of PP in core pytorch? I tried to suggest making the API much more flexible last spring - at least wrt inputs and outputs, which are very limited in most current PP implementations - so the new API is much more encouraging. You can even pass non-tensor inputs/outputs.

They have some other interesting tech in there, for example stashes that you can push to / pop from at different stages, which could help pass around complex structures.

I had a question too: what is the current plan for the new PP-friendly model classes? What is going into these, and how will they be different from the vanilla model classes?

The main obstacles to making HF models PP-friendly are:

  1. the variety of complex inputs: tuples of tuples of tensors, inputs that aren't tensors, inputs that are tensors but whose first dimension isn't the batch size, etc. Most of these variables are optional and are there to support research.
  2. the models weren't written with nn.Sequential in mind and are very difficult to convert into it. For example, some models have conditionals on whether to run the encoder or decoder stage, which is tricky for nn.Sequential.

So based on my earlier attempts to implement PP in transformers, we have to fork the existing models, strip down all the unnecessary features, and convert to nn.Sequential while adjusting the inputs/outputs to work with it.

As I suggested above pytorch core's new PP API should make this work easier as it's more flexible. But of course, there are other options, more on that later.

The other approach is to start from scratch and build the model with PP in mind from the ground up using Megatron and Deepspeed as a reference, and then to try to adjust the outcome to re-use as much as possible from the current transformers model arsenal. The goal is to avoid maintaining 2 separate code bases. I think building from scratch would be preferable to @hyunwoongko - but then we have only a few models w/ PP to borrow from (gpt2, bert - not even t5). Remember in transformers we have some 50+ models. Some are slightly different, others are quite different from each other.

So these are the 2 ways I have in my mind.

I'm very open to hearing any other propositions. The key need is the ability to replicate the solution to several dozen of models.

And of course, the other essential question is which PP API to use:

  1. write our own / borrow from Megatron
  2. Deepspeed
  3. Pytorch core (will probably require pt-1.9 or even pt-1.10 for some of the recent features, but it should be no problem)
  4. I think FairScale has an API as well, but they have been upstreaming it into pytorch core, so it's probably best to rely on the latter.

Besides ease of use, we also want to make sure that the API allows for the most efficient incarnation of the PP tech, since there are quite a few of them. The goal is to minimize the idling bubble.

BTW, for those who perhaps are new to the topic I wrote this doc: https://huggingface.co/transformers/parallelism.html So that you can quickly understand what we are talking about.

hyunwoongko commented 3 years ago

@sgugger @stas00

We need to name the new class. Do you prefer GPT2Parallel or ParallelGPT2 or GPT23D? Or any other good names? I think GPT23D is very weird - it looks like 23-dimensional parallelism - and names like GPT2Parallel are also weird: GPT2ParallelForSequenceClassification or GPT2ParallelWithLMHead reads as if Parallel is for SequenceClassification or Parallel is with LMHead. But putting Parallel in the prefix makes everything good (e.g. ParallelGPT2ForSequenceClassification or ParallelGPT2WithLMHead). If later extended to TF or Flax, naming such as ParallelTFGPT2 is also possible.

stas00 commented 3 years ago

The initially proposed 3D addition is awkward when it becomes GPT23DForSequenceClassification

ParallelGPT2ForSequenceClassification rings nicely to me.

Another alternative is to use a postfix: GPT2ForSequenceClassificationParallel

hyunwoongko commented 3 years ago

The initially proposed 3D addition is awkward when it becomes GPT23DForSequenceClassification

Well... It looks like GPT twenty three lol

Well, the name doesn't really matter, but we have to decide. Would you like to vote? What's the best way?

stas00 commented 3 years ago

Let's wait for @sgugger to follow up.

We can vote for the appendix variations, but the main structure (should it be prefix/postfix/infix) is up to Sylvain as he is overseeing the big structure.

sgugger commented 3 years ago

I like the Parallel prefix for those new models.

deepakn94 commented 3 years ago

Have you looked at the recent version of the PP in the core pytorch? I tried to suggest to make the API much more flexible over the last spring - at least wrt inputs and outputs, which is very limited in most current PP implementations. so the new API is much more encouraging. You can even pass non-tensor inputs/outputs.

I have not, but I will try to soon. I agree that flexibility in terms of the number of input and output tensors (as well as types) is good.

The main obstacles to making HF models PP-friendly are:

This makes sense to me. I agree that having a sanitized model with far fewer optional arguments is nice and easier to support.

I think one way to go is to have the Parallel* classes just inherit (or borrow implementations) from the corresponding existing classes (but with the forward() method not supporting all the optional arguments). This will ensure that the model implementation really only lives in a single place, but the necessary if guards and other code is elsewhere in the Parallel* class implementation. The nice thing about these transformer models is that they are pretty repetitive, so the amount of PP-specific code in the guts of the model implementation is not a lot (e.g., don't run through the embedding layer if the current stage is not the first one, etc.).
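A hypothetical sketch of the kind of guard described above (the class and argument names are made up for illustration): each Parallel* pipeline stage only runs the pieces of the model it owns.

import torch
from torch import nn

class ParallelGPT2Stage(nn.Module):
    # illustrative: one pipeline stage holding a slice of the transformer blocks
    def __init__(self, blocks, embed=None, ln_f=None, is_first=False, is_last=False):
        super().__init__()
        self.embed, self.ln_f = embed, ln_f
        self.blocks = nn.ModuleList(blocks)
        self.is_first, self.is_last = is_first, is_last

    def forward(self, x):
        if self.is_first:            # only the first stage embeds token ids
            x = self.embed(x)
        for block in self.blocks:    # every stage runs its own blocks
            x = block(x)
        if self.is_last:             # only the last stage applies the final norm / LM head
            x = self.ln_f(x)
        return x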

For models like T5 with an encoder and decoder, it is important to think through how tensors should be passed through the different stages (e.g., the encoder_hidden_state is an input to every decoder stage).

Down the road, you might want to also support interleaved schedules (that trade off a smaller pipeline bubble size for more communication). This will require different execution schedules, so it is also perhaps worthwhile thinking through how to specify these in an easy way.

stas00 commented 3 years ago

HF Transformers doesn't use inheritance in models, so that readers of a model can understand it better; we will therefore either copy-and-strip or build from the ground up and then copy what we can to match the original class.

For models like T5 with an encoder and decoder, it is important to think through how tensors should be passed through the different stages (e.g., the encoder_hidden_state is an input to every decoder stage).

How do you propose to deal with conditional runs of encoder (T5)? I see Megatron-LM added T5 but last I checked it didn't support PP.

Down the road, you might want to also support interleaved schedules (that trade off a smaller pipeline bubble size for more communication). This will require different execution schedules, so it is also perhaps worthwhile thinking through how to specify these in an easy way.

I thought that was exactly the question of choosing the right PP framework/API, since some of those support interleaved schedule and others don't. It's best to delegate such things to the external API I'd think.

I realize that Megatron-LM's approach is different since it doesn't use any API but builds its own.

hyunwoongko commented 3 years ago

I made first draft PR for this: https://github.com/huggingface/transformers/pull/13726

deepakn94 commented 3 years ago

HF Transformers doesn't use inheritance in models, so that readers of a model can understand it better.

Fair enough. I guess there will be some code duplication then.

How do you propose to deal with conditional runs of encoder (T5)? I see Megatron-LM added T5 but last I checked it didn't support PP.

What do you mean by "conditional run of the encoder"? I have looked at pipeline parallelism with T5, but we always ran inputs through the encoder (and then as I mentioned above, passed the encoder_hidden_state through the pipeline stages with decoder layers).

It's best to delegate such things to the external API I'd think.

I agree that this would be ideal. But unfortunately, the pipeline-parallelism schedule used is pretty problem- and hardware-dependent, so it might make sense to expose a couple of different options to users (especially since transformers supports so many different models with different computation characteristics). Even better would be a way to allow a user to specify their own new schedule if they wanted, but this can almost definitely be tabled to later.

Additionally, I don't think any pipeline parallelism API out there is robust and has enough features to be truly useful. For example, Torch's PP support seems pretty nice, but only currently supports an all-forward, all-backward schedule that has high memory footprint; 1F1B is strictly better than this (Section 2.2.1 in https://arxiv.org/pdf/2104.04473.pdf). The all-forward, all-backward schedule won't work super well for really large models.

stas00 commented 3 years ago

How do you propose to deal with conditional runs of encoder (T5)? I see Megatron-LM added T5 but last I checked it didn't support PP.

What do you mean by "conditional run of the encoder"? I have looked at pipeline parallelism with T5, but we always ran inputs through the encoder (and then as I mentioned above, passed the encoder_hidden_state through the pipeline stages with decoder layers).

It runs the encoder only on the first pass, and not afterwards. You can see the conditional here: https://github.com/huggingface/transformers/blob/469b80d4e7f9d0ca9411d77845600839e5edf113/src/transformers/models/t5/modeling_t5.py#L1367-L1376

So my question is how to build nn.Sequential when half of it is conditional.

It's best to delegate such things to the external API I'd think.

I agree that this would be ideal. But unfortunately, the pipeline-parallelism schedule used is pretty problem- and hardware-dependent, so it might make sense to expose a couple of different options to users (especially since transformers supports so many different models with different computation characteristics). Even better would be a way to allow a user to specify their own new schedule if they wanted, but this can almost definitely be tabled to later.

Do you mean that once the models are converted to a straightforward nn.Sequential then it could be fed to a variety of PP APIs w/o altering the model itself?

How would that work? E.g. Deepspeed's PP uses all kinds of special APIs for tied layers (e.g. TiedSpec if I remember the name correctly) and other features that the model has to call explicitly. So it's far from being generic plug-n-play.

Perhaps pytorch's PP API is slightly more so.

Additionally, I don't think any pipeline parallelism API out there is robust and has enough features to be truly useful. For example, Torch's PP support seems pretty nice, but only currently supports an all-forward, all-backward schedule that has high memory footprint; 1F1B is strictly better than this (Section 2.2.1 in https://arxiv.org/pdf/2104.04473.pdf). The all-forward, all-backward schedule won't work super well for really large models.

Doesn't Deepspeed PP do 1F1B as well?

You're making an excellent point about torch's PP not supporting progressive PP protocols. Definitely need to inquire about them supporting interleaved PP. I wonder if fairscale has been working on that.

@hyunwoongko, what's your take - use an API (and if so, which one resonates with you the most), develop our own, or borrow an internal implementation from Megatron-LM?

Perhaps we should prepare a table of pros and cons for the different approaches. But somehow I feel you already have something you feel is the best in mind.

stas00 commented 3 years ago

You're making an excellent point about torch's PP not supporting progressive PP protocols. Definitely need to inquire about them supporting interleaved PP. I wonder if fairscale has been working on that.

@pritamdamania, if I may ask - do you have plans to support the interleaved PP protocol in pytorch?

We are having a discussion on which PP framework we should use in transformers. As you know, I favour pytorch core because you made it much more user-friendly than most other PP frameworks I have seen, but as Deepak says above, it may have trouble with huge models because it uses an all-forward, all-backward schedule. And we happen to work a lot with huge models lately (currently using Megatron-Deepspeed for that). Or perhaps my notion is outdated and interleaved PP is already in the works in pytorch?

Thank you!

hyunwoongko commented 3 years ago

@hyunwoongko, what's your take - use an API, and which one you resonate the most with, or develop our own, or borrow an internal implementation from Megatron-LM?

My plan is the DeepSpeed pipeline module, because we should consider ZeRO. Note that my implementation is a variant of Megatron-DeepSpeed.

Plus, DeepSpeed PP is based on 1F1B: https://github.com/microsoft/DeepSpeed/issues/1110

stas00 commented 3 years ago

So Deepspeed PP with ZeRO-1, correct?

From what I understand while ZeRO-2/3 could technically work, they won't give any performance improvements over ZeRO-1.

But the key feature is that we can easily turn off PP and enable Z2/3 + offload for inference, which is why we use Megatron-Deepspeed for BigScience and not just Megatron-LM. That is, we use DS PP + Z1.

hyunwoongko commented 3 years ago

Currently, DeepSpeed PP and ZeRO 2 and 3 are incompatible. If users don't want to use PP, TP + ZeRO DP is enough. However, the reason we want to provide ParallelGPT2 with DeepSpeed PP is that there are many other features (PP, kernel fusion, sparse attention, activation checkpoint offloading, ...). I think DeepSpeed is the best API to provide all these features, and it is recommended to unify on one toolkit as much as possible.

deepakn94 commented 3 years ago

It runs the encoder only on the first pass, and not afterwards. You can see the conditional here.

I believe this is only on the first pass during inference. You would run it every time for training.

Do you mean that once the models are converted to a straightforward nn.Sequential then it could be fed to a variety of PP APIs w/o altering the model itself? How would that work?

The only real things we need to know are: a) What computation should run on each "virtual stage" (and how these stages are mapped to actual GPUs)? b) How should tensors be routed between stages? Using this, you should in theory be able to implement any pipeline schedule that takes care of forward and backward passes.

Things like weight tying can be thought of as postprocessing steps at the end of these forward and backward steps (and before the optimizer step runs). For example, in Megatron, we perform an all-reduce of the gradients of all copies of the embedding layers after completing the forward and backward passes in a given batch (and the logic here is the same regardless of the schedule used for forward and backward passes).
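As a small sketch of that postprocessing step (assumed names, following the Megatron description above rather than any specific codebase): the first and last stages each hold a copy of the tied embedding weight, and their gradients are summed once the backward passes of a batch have finished.

import torch.distributed as dist

def allreduce_tied_embedding_grads(embedding_weight, embed_group):
    # embed_group contains only the ranks that hold a copy of the tied weight
    # (typically the first and last pipeline stages); called after backward,
    # before the optimizer step
    if embedding_weight.grad is not None:
        dist.all_reduce(embedding_weight.grad, group=embed_group)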

hyunwoongko commented 3 years ago

I think names such as GPT2Model3D are better. ParallelGPT2 is also nice, but it seems like an ambiguous name, in that the existing GPT2 is capable of tensor model parallelism.

stas00 commented 3 years ago

Additionally, on slack we have now started discussing a totally new direction for PP and that's where we don't touch the model and get it parallelized automatically.

For example, see SageMaker's PP overview: https://aws.amazon.com/blogs/aws/amazon-sagemaker-simplifies-training-deep-learning-models-with-billions-of-parameters/ Note that it no longer even mentions nn.Sequential - some months back it preferred nn.Sequential but didn't require it - it now totally automates the process. Unfortunately, there is no disclosure on how they do it. I think we should be able to do something similar.

The closest approach to doing it that is publicly disclosed is FlexFlow https://huggingface.co/transformers/parallelism.html#flexflow https://github.com/flexflow/FlexFlow

pritamdamania87 commented 2 years ago

@pritamdamania, if I may ask - do you have plans to support the interleaved PP protocol in pytorch?

@stas00 I'm assuming you are referring to the Interleaved Schedule mentioned here: https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/. As you may be aware, there is a lot of research out there on improving pipeline parallel performance along various axes (memory, speed), and each approach has its tradeoffs. It is not feasible to include all those algorithms in Pytorch, although there are a few options here:

  1. If there is an algorithm that is clearly superior to the rest and there is enough demand for it from the community, we can implement it in Pytorch.
  2. We could evaluate making pipeline parallelism in PyTorch extensible so that the community can quickly try out different algorithms on top of a core pipelining framework without having to reimplement the whole algorithm from scratch themselves.

Or perhaps my notion is outdated and the interleaved PP is in works already in pytorch?

Currently we are not working on interleaved PP, but as I mentioned above we have a couple of options in terms of how we can enable this.

@deepakn94 Would love to get your thoughts on this regarding the two options I mentioned above since a lot of your research work is in this area :) My initial feeling is that there probably isn't one algorithm that would be the best for all use cases and even if that is true today, new research a few months later might make that algorithm obsolete. Do you feel it is valuable to have an extensible pipelining framework in PyTorch where researchers like yourself can quickly try out different algorithms/schedules? :)

stas00 commented 2 years ago

Thank you for your follow up, @pritamdamania87!

Yes, Megatron, Deepspeed and Sagemaker all support the interleaved schedule.

  • If there is an algorithm that is clearly superior than the rest and if there is enough demand for it from the community, we can implement that in Pytorch.

I'd defer to @deepakn94 as he has much more experience with the various schedules.

Perhaps @ShadenSmith has some insights to share as well, as he has built the PP framework in Deepspeed.

(Perhaps we need a PP-creators/users thread where we can share what works the best and how to make the different implementations interchangeable - i.e. creating a standard API).

  • We could evaluate making pipeline parallelism in PyTorch extensible so that the community can quickly try out different algorithms on top of a core pipelining framework without having to reimplement the whole algorithm from scratch themselves.

That is definitely the best approach it seems.

I think for the plethora of HF Transformers users and uses - being able to choose the best schedule would be a great boon to the whole community.

stas00 commented 2 years ago

@pritamdamania87, have you by chance contemplated an approach where any model could be made to support PP w/ only minor mods or none using automatic splitting based on the graph? For context please see 3 comments up: https://github.com/huggingface/transformers/issues/13690#issuecomment-934756008