huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Community Integration: Colossal-AI for Large AI Models #18624

Closed binmakeswell closed 2 years ago

binmakeswell commented 2 years ago

Feature request

Dear Hugging Face Team,

My name is Yongbin Li. I am part of the Colossal-AI team.

Thank you for your earlier invitation for the Colossal-AI org to join Hugging Face. We are happy to share our founder's blog post about Hugging Face.

We are thinking about further collaboration, e.g. integrating Colossal-AI into Hugging Face to help your community members use large AI models more efficiently and easily.

For example, we could democratize access to large models for all your users, in the same way you did with DeepSpeed: https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/deepspeed

Motivation

We believe the democratization of large AI models would also be very helpful for Hugging Face members. We would greatly appreciate the opportunity to build this integration with you to benefit both of our user communities.

Actually, we are working on similar integrations with Meta OPT (done), PyTorch Lightning (in progress), etc.

Your contribution

We can provide any help you need in this cooperation for free. We have already reached a preliminary understanding with your team members omar, lysandre, and julien via email (ybl@hpcaitech.com) and look forward to your further reply.

Feel free to reach out to me on Hugging Face Discord. My username is billy2022. We can discuss more details with other colleagues in a private group.

Thank you very much.

Best regards, Yongbin Li, Chief Marketing Officer, HPC-AI Tech

binmakeswell commented 2 years ago

If you have any difficulties or concerns, please let me know and we can discuss them further, thanks. :-)

flozi00 commented 2 years ago

@stas00 This seems much better than https://github.com/huggingface/transformers/issues/17392

stas00 commented 2 years ago

I haven't had a chance to read up on Colossal-AI yet; why do you believe it's much better, based on your research, @flozi00? I did notice that it appears to have integrated PatrickStar's functionality.

CAI appears to be its own ecosystem, so I'm not sure how easy it would be to integrate with ours.

flozi00 commented 2 years ago

https://github.com/hpcaitech/ColossalAI-Examples/blob/757514d2b1501d3530777cdf567f0a18063acf2d/image/resnet/train.py#L82-L111

In terms of code, it looks very similar to a normal PyTorch training loop. I didn't take a deep look into the CAI code itself; I focused on how compatible the integration would be with existing code. To me it looks like you don't have to deal with integrating PatrickStar yourself, since everything is handled by CAI, and the dependencies are also manageable.
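For reference, the linked example roughly boils down to the following shape. This is a sketch from memory of the 2022 Colossal-AI API (the `launch_from_torch`/`initialize` calls and the `engine` object are my recollection and may not match the current release), with a dummy model and dataset so it is self-contained:

```python
import colossalai
import torch
from torch.utils.data import DataLoader, TensorDataset

# A perfectly ordinary PyTorch model/optimizer/criterion/dataloader (dummy data).
model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 10, (64,))), batch_size=8
)

# Colossal-AI wraps them into an "engine"; this assumes the script is started
# with torchrun so the distributed env vars are already set.
colossalai.launch_from_torch(config=dict())
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader
)

# The loop itself still reads like plain PyTorch.
for inputs, labels in train_dataloader:
    engine.zero_grad()
    outputs = engine(inputs)
    loss = engine.criterion(outputs, labels)
    engine.backward(loss)  # instead of loss.backward()
    engine.step()          # instead of optimizer.step()
```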

I also noticed some time ago that it was trending on Papers with Code for a while.

The benchmarks look pretty nice at first glance, but they are also a bit confusing (https://github.com/hpcaitech/ColossalAI#gpt-2): the RAM, model-size, and throughput comparisons each use a different baseline (PyTorch, DeepSpeed, Megatron), and I did not check whether that is cherry-picking or whether the choice of baseline really doesn't matter.

In any case, I think it's not bad to test alternatives to DeepSpeed. At first glance, the integration into existing PyTorch code looks feasible without major problems. Also, with the expertise of both organizations, the integration could be done without placing too much burden on either one, especially since CAI is offering to help: "We would greatly appreciate the opportunity to build this integration with you to benefit both of our user communities".

stas00 commented 2 years ago

Thank you for sharing your insights, @flozi00!

I read their paper and I'm not quite sure what type of integration is being proposed here. Unlike Deepspeed, which is meant to be integrated with the user's code, CAI seems to be a standalone solution.

One of the biggest issues with any parallelism proposal (other than DDP) is that they all require rewriting the model's code, which, with 100+ models in our arsenal and growing, would be prohibitively expensive. Therefore we always welcome automated solutions like Deepspeed, which require no changes whatsoever to most models and at most a small tweak for some peculiar models.
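To make that bar concrete, here is a minimal sketch of what "no changes to the model" looks like with the existing DeepSpeed integration. The `ds_config.json` path is a hypothetical placeholder (e.g. a ZeRO stage-3 config) and the dataset is dummy data just to keep the sketch self-contained:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# The modeling code is untouched; ZeRO is switched on purely through the
# Trainer arguments plus an external DeepSpeed JSON config file.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tiny dummy dataset so the example is self-contained.
train_dataset = Dataset.from_dict(
    {"input_ids": [[0, 1, 2, 3]] * 8, "labels": [[0, 1, 2, 3]] * 8}
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed="ds_config.json",  # hypothetical path, e.g. a ZeRO stage-3 config
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```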

It's definitely worth exploring all the different versions of TP (2/2.5/3D) mentioned in the paper, but we need this to be automated, not manually rewritten.

The paper briefly mentions PP, but as we all know this one definitely requires a complete rewrite of the model for most frameworks.

So again, let's ask a very concrete question: other than being part of the HF ecosystem, what is the vision for the proposed integration?

We already have 2 trainer loop systems (HF Trainer and Accelerate) and we don't want to maintain a 3rd one.

Do you need to inject something into the modeling_utils.py to better support CAI?

Do you propose to rewrite the models to support it?

Perhaps let's take one HF Transformers model of your choice, and you can tell us what you would like to do with it to have it run on CAI? That would be more practical.

and specifically to your interest, @flozi00 - yes, I hear you like the advanced memory utilization proposed in PatrickStar, and CAI appears to have integrated that functionality.

I hope my commentary was constructive; we are definitely open to good improvements to our tools. It's just that I'm wary of adding yet another tool unless a clear advantage and ease of integration can be shown.

stas00 commented 2 years ago

Also, let's ping @hyunwoongko - Kevin, I know you have studied many frameworks while building https://github.com/tunib-ai/oslo - have you by chance researched Colossal-AI along the way? If so, would you kindly share a few insights? I know you were cherry-picking the best parts from many systems in addition to your own innovations.

flozi00 commented 2 years ago

I'm sorry to admit that I didn't think of backwards compatibility; I totally forgot about that point.

I focused mainly on the integration into the Trainer and did not consider the now very many architectures and weights.

Maybe CAI has an idea for automating that? What about the Lightning integration, did they discuss that point too?

I have some ideas in mind about finding JIT methods to convert the required model parts, but that would be more a part of CAI itself or of third-party tools than of the HF integration.

stas00 commented 2 years ago

I'm sorry to admit that I didn't think of backwards compatibility; I totally forgot about that point.

I focused mainly on the integration into the Trainer and did not consider the now very many architectures and weights.

No harm done. This is totally understandable: the HF transformers ecosystem has become more and more complex, so it's often far from trivial to add yet another component to it.

We very much welcome solutions that can automate performance enhancements (like torchdynamo, see below).

Maybe CAI has an idea for automating that? What about the Lightning integration, did they discuss that point too?

PL is a training framework/loop; last I looked they didn't have a model library of their own and were using transformers, so they don't need to deal with modeling code.

I have some ideas in mind about finding JIT methods to convert the required model parts, but that would be more a part of CAI itself or of third-party tools than of the HF integration.

There is already work being done on that with torchdynamo/nvfuser. It's not fully stable yet, but it shows some impressive speed-ups (and lower memory usage) from converting normal PyTorch code to fused kernels. However, this is a different dimension from parallelism and advanced memory management systems: it's definitely not a replacement for parallelism, since saving 2x memory or gaining a 2x speed-up is far from enough for 100B+ models.

Please see the HF integration details here: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#inference-with-torchdynamo
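As a minimal illustration of the idea (not the exact Trainer integration described in those docs): TorchDynamo captures the model's ordinary PyTorch code and hands it to a fusing backend, so no model rewrite is needed. The sketch below uses `torch.compile`, which is the current entry point for TorchDynamo and postdates this thread; the model choice is arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# torch.compile runs TorchDynamo under the hood: it traces the unmodified
# model and compiles it to fused kernels where possible.
compiled = torch.compile(model)

inputs = tokenizer("Hello, world", return_tensors="pt")
with torch.no_grad():
    out = compiled(**inputs)
print(out.logits.shape)
```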

ver217 commented 2 years ago

Hi, we drafted a pull request which integrates ColossalAI into Lightning; examples and a benchmark are at https://github.com/hpcaitech/ColossalAI-Pytorch-lightning. We have implemented ZeRO-DP with chunk-based memory management and heterogeneous memory management. I think this would not be hard to integrate into HF. Besides, we are working on auto parallelism; I believe we will be able to use TP/PP without modifying the model in the future.

stas00 commented 2 years ago

OK, so at the moment you're proposing to integrate CAI for:

  1. its ZeRO-DP with chunk-based memory management and heterogeneous memory management. This is something that Deepspeed is lacking at the moment (and if I understand correctly the technology comes from PatrickStar)
  2. down the road for auto-parallelism

@sgugger, should this perhaps go straight into accelerate?

(Sylvain is on vacation, so please let's wait a bit for him to be back and advise on how best to proceed.)

sgugger commented 2 years ago

We'll probably need to duplicate the integration in the Trainer and Accelerate for now, since the Trainer does not depend on Accelerate.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.