lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in PyTorch

PyTorch Lightning #73

Closed: lucidrains closed this issue 2 years ago

lucidrains commented 2 years ago

what has people's experience been with pytorch lightning for distributed training? i tried it a while ago for GANs and it was not flexible enough, but perhaps it can do well here (if it can handle updating only part of the network - training one unet at a time)
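
(for context, a minimal sketch of what "training one unet at a time" amounts to in plain pytorch; the modules below are stand-ins, not the actual imagen-pytorch API: each unet in the cascade gets its own optimizer, and a step only touches the selected unet)

```python
import torch
from torch import nn

# stand-in unets for the cascade (not the real imagen-pytorch Unet class)
unets = nn.ModuleList([nn.Conv2d(3, 3, 3, padding=1) for _ in range(3)])
optimizers = [torch.optim.Adam(u.parameters(), lr=1e-4) for u in unets]

def train_step(images, unet_number):
    unet, opt = unets[unet_number], optimizers[unet_number]
    loss = unet(images).pow(2).mean()  # dummy loss, just to show the flow
    loss.backward()                    # gradients only touch this unet
    opt.step()
    opt.zero_grad()
    return loss.item()
```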

lucidrains commented 2 years ago

perhaps it would be interesting to build out both a huggingface accelerate trainer and a pytorch lightning one, and see which one researchers like more

jacobwjs commented 2 years ago

what has people's experience been with pytorch lightning for distributed training? i tried it a while ago for GANs and it was not flexible enough, but perhaps it can do well here (if it can handle updating only part of the network - training one unet at a time)

I've done a lot of GAN training with PyTorch Lightning, and have been for quite some time; honestly, I've found it great. You just need to rethink your structure and embrace Lightning's interface/API.

And honestly, for any large model that will 100% need distributed, large-scale training, not using one of these frameworks from the beginning is probably a bad idea (no knock on you, Phil ;)).

I've created an Accelerate-friendly version of imagen-pytorch, and to be honest PyTorch Lightning is just much more mature and robust. Currently there are a lot of unsupported/experimental (and broken) features in Accelerate. I'm sure that will change considerably over the next year, but as of now anything outside of HuggingFace's tried-and-true Accelerate pipeline is suspect. If you're heading down the path of abstracting device placement and are considering something other than DDP, I'd recommend PyTorch Lightning as the way to go for the near future.

Also worth noting: there is a gentler first step into PyTorch Lightning that considerably reduces refactoring. https://pytorch-lightning.readthedocs.io/en/stable/starter/lightning_lite.html

@lucidrains lightning lite is a good first step if you head down that path.
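
Roughly, the Lite workflow looks like this (a sketch based on the LightningLite API of that era; the model and dataloader are placeholders, and constructor arguments may differ across versions):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from pytorch_lightning.lite import LightningLite

class LiteTrainer(LightningLite):
    def run(self, model, dataloader, epochs=1):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        model, optimizer = self.setup(model, optimizer)   # wraps for the chosen device/strategy
        dataloader = self.setup_dataloaders(dataloader)   # adds distributed sampler, moves batches
        for _ in range(epochs):
            for batch in dataloader:
                loss = model(batch).pow(2).mean()         # placeholder loss
                self.backward(loss)                       # replaces loss.backward()
                optimizer.step()
                optimizer.zero_grad()

# placeholder model and data, just to show the call pattern
model = nn.Linear(64, 64)
dataloader = DataLoader(torch.randn(256, 64), batch_size=32)
LiteTrainer(accelerator="cpu", devices=1).run(model, dataloader)
```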

lucidrains commented 2 years ago

@jacobwjs thank you for this! i think this may sway me to give lightning lite a try before accelerate!

lucidrains commented 2 years ago

@jacobwjs one last question, have you used any of the deepspeed integrations that come packaged with lightning?

lucidrains commented 2 years ago

@jacobwjs so one of the reasons it may make sense to bet on huggingface accelerate is that they are actually in the process of (successfully) training a 175B parameter language model, and i believe all the training code will eventually be integrated into accelerate. but if pytorch lightning also has solid deepspeed integration, that may encourage me to try them out, since they have been in the pytorch training space for longer. i do think eventually training will be automated away as these two companies compete and ideas cross pollinate

lucidrains commented 2 years ago

that said, the imagen paper shows the unets don't have to be that large, as the pretrained T5 text encoder is doing most of the work. so maybe it is a non-issue

lucidrains commented 2 years ago

if anyone else can put in a good word for pytorch lightning (and especially using it with deepspeed to train anything bigger than 2B parameters), do speak up!

lucidrains commented 2 years ago

hmm, doesn't look like pytorch lightning even supports EMAs https://github.com/Lightning-AI/lightning/issues/10914

nateraw commented 2 years ago

Pinging @SeanNaren - he might be able to provide some info on the deepspeed integrations

lucidrains commented 2 years ago

Oh hey Nate! Speak of the devil lol 👋

pacocp commented 2 years ago

I have been trying, unsuccessfully, to get deepspeed working with the code. It may be down to my own errors, but it looks like it doesn't like the unet structure and the separate optimizers. cast_model_parameters seems to be something that cannot be done with deepspeed, since models cannot be modified after they have been initialized by the library (or that's what I recall the error saying). However, this is the first time I am trying to use it, so there are surely a lot of things I am doing wrong!

nateraw commented 2 years ago

In the case of cast_model_parameters, I think you'd want to handle UNet's init properly in __init__ of the lightning module to avoid having to reinit.
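
Something like this, roughly (a sketch; build_final_unet is a hypothetical helper standing in for however the unet ends up being constructed/cast, not part of imagen-pytorch):

```python
import torch
import pytorch_lightning as pl

class UnetModule(pl.LightningModule):
    def __init__(self, unet_number=0):
        super().__init__()
        # Construct (and cast) the unet fully here, before deepspeed / the strategy
        # wraps the module, so nothing needs to be re-initialized afterwards.
        self.unet = build_final_unet(unet_number)  # hypothetical helper

    def training_step(self, batch, batch_idx):
        return self.unet(batch).pow(2).mean()  # placeholder loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.unet.parameters(), lr=1e-4)
```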

SeanNaren commented 2 years ago

Thanks for the ping @nateraw!

This is epic! I hope Lightning can help out here. I'm definitely able to help integrate if need be, @lucidrains; we could detail things out in this issue and I can help PoC some stuff with Lightning.

Regarding deepspeed, I've been working extensively with it through Lightning and have pretty open communication with the team to make sure it stays up to date with Lightning. We're just organizing code, so the performance should be the same as pure PyTorch + DeepSpeed (I'm currently working on a separate side project with DeepSpeed/Lightning and a transformer to validate this as well; see https://github.com/SeanNaren/SmallScience).

Just to be clear as well, the BigScience project (the 176B-param BLOOM model) is using its own implementation for training, iirc. That repo is also amazing, but it probably won't fit the needs of this library.

TLDR: I'm here to help, and happy to PoC something together with you using Lightning to see if it's the right fit.

jacobwjs commented 2 years ago

@jacobwjs one last question, have you used any of the deepspeed integrations that come packaged with lightning?

With full-blown Lightning (i.e. not the Lite version), yes. Unfortunately I'm not able to link you to that work. Everything operated as expected using ZeRO stage 2 and fp16. No problems for my use case (large GAN training).
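
For reference, Lightning exposes the deepspeed integration as a Trainer strategy flag; a sketch against the 1.6-era API (argument names may differ across versions):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="deepspeed_stage_2",  # DeepSpeed ZeRO stage 2 via Lightning's integration
    precision=16,                  # fp16
)
# trainer.fit(lightning_module, train_dataloader)
```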

jacobwjs commented 2 years ago

@jacobwjs so one of the reasons it may make sense to bet on huggingface accelerate is that they are actually in the process of (successfully) training a 175B parameter language model, and i believe all the training code will eventually be integrated into accelerate. but if pytorch lightning also has solid deepspeed integration, that may encourage me to try them out, since they have been in the pytorch training space for longer. i do think eventually training will be automated away as these two companies compete and ideas cross pollinate

Ya, super exciting stuff, right? I've been following the "bigscience" work closely. Really looking forward to what comes out of that and how it informs Accelerate's roadmap as well.

From my side and experience there really isn't any blocker to moving forward with DeepSpeed + Lightning. It seems to be more a question of whether you want to go through the refactoring. I'd say it's definitely worth it, and I'm happy to support however I can.

And looking forward to the day training is automated away! Until then we grind away...

lucidrains commented 2 years ago

@jacobwjs @SeanNaren awesome! i'll embark on pytorch lightning integration then and ping you for help if i run into any blockers

thank you!

SeanNaren commented 2 years ago

@jacobwjs @SeanNaren awesome! i'll embark on pytorch lightning integration then and ping you for help if i run into any blockers

thank you!

Awesome, that's great! Let me know via this issue, our community slack, or wherever on the internet if you run into issues; happy to assist.

In the meantime, I'll get https://github.com/Lightning-AI/lightning/issues/10914 closed out. It seems everything is mostly already in the issue; it just needs to be put in one place.

lucidrains commented 2 years ago

@SeanNaren ok that sounds good

hopefully there is a way to work around that open issue because i don't plan on waiting that long! basically we need to keep track of exponential moving averages of all the subnetworks separately (there are multiple unets in the cascade)

on evaluation, we need to be able to call the exponential moving averaged unets in succession as well

do you foresee any big issues with that?

edit: the exponential moving average doesn't need to be a feature tightly coupled to lightning, i just need access to the parameters and to be able to update an EMA copy on one machine for evaluation. that should be straightforward?
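
something like this should be all that's needed, framework aside (a sketch with stand-in modules; the real cascade's unets would take the place of the toy ones here):

```python
import copy
import torch
from torch import nn

unets = nn.ModuleList([nn.Conv2d(3, 3, 3, padding=1) for _ in range(3)])  # stand-ins for the cascade

# one EMA copy per unet, kept on a single (e.g. rank-zero) machine
ema_unets = [copy.deepcopy(u).eval().requires_grad_(False) for u in unets]

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    # standard EMA update: ema = decay * ema + (1 - decay) * online
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p.detach(), 1.0 - decay)
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        ema_b.copy_(b)

# after each optimizer step on unet k: update_ema(ema_unets[k], unets[k])
# at evaluation time, run the cascade through ema_unets in succession instead of unets
```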

SeanNaren commented 2 years ago

@lucidrains you should be able to modify this I think: https://github.com/Lightning-AI/lightning-bolts/blob/feat/ema/pl_bolts/callbacks/ema.py

There seems to be some additional memory being allocated (definitely wrong), as described in the issue; I'm going to try to resolve that. It should be fairly straightforward to modify this code to your needs, let me know if you need assistance!

lucidrains commented 2 years ago

@SeanNaren will do, thank you!

lucidrains commented 2 years ago

@SeanNaren just for my information, i was wondering what the largest model trained with pytorch lightning to date is? is there a concerted effort by the lightning team to support LLM training?

lucidrains commented 2 years ago

going to embark on the lightning code later today, and hopefully get it finished before friday

SeanNaren commented 2 years ago

hey @lucidrains, https://github.com/NVIDIA/NeMo has some very large models fine-tuned/pre-trained (and they rely on Lightning), and a lot of FB initiatives are built on Lightning's integrations (in terms of scaling, primarily using FSDP, but we are aware of other internal initiatives). In most cases, billion-parameter models have been trained with pytorch-lightning, but it's hard for me to give exact numbers (I would need to ask the teams!).

going to embark on the lightning code later today, and hopefully get it finished before friday

Let me know if anything arises!

lucidrains commented 2 years ago

@SeanNaren yea, if you could share a paper or the name of one of these billion parameter models trained with lightning, that would help me a lot in my assessment, thanks!

lucidrains commented 2 years ago

https://github.com/Lightning-AI/lightning/issues/10914#issuecomment-1171402690

lucidrains commented 2 years ago

over at http://github.com/lucidrains/dalle2-pytorch, Zion and Aidan have been training DALLE-2 successfully using huggingface accelerate, with EMA and all, just for everyone's information

SeanNaren commented 2 years ago

Lightning-AI/lightning#10914 (comment)

I'll respond on the issue. TLDR: I don't see any blockers (we're just organizing pytorch code; you should be free to do whatever you'd like per step, and it only gets weird if your training loop is something exotic, such as hogwild).

I'll share what I can from my own knowledge.

Here is a repo where I worked with the DeepSpeed team to fit the largest model I could using Lightning/minGPT (we got to 45B parameters): https://github.com/SeanNaren/minGPT

NeMo Megatron, as mentioned before, is built on Lightning + Megatron. I'm sure they've trained billion-param models, albeit I can't see them in the public NGC containers.

Personally, other than the above, I haven't seen many billion-parameter models being trained publicly, but I might be missing some of Meta's work (cc @ananthsub, who may have some further projects to share).

I do know accelerate/lightning both rely on DeepSpeed for large-model scaling (optionally FSDP, but it doesn't have as much adoption), so really it's about picking the best API/feature set for your use case (which I leave in your expert hands!).

over at http://github.com/lucidrains/dalle2-pytorch, Zion and Aidan have been training DALLE-2 successfully using huggingface accelerate, with EMA and all, just for everyone's information

Setting aside the choice of trainer library, it seems this already achieves a lot of what you'd otherwise have to implement all over again. Maybe it's worth assessing the level of effort (IMO, the quicker you can get to experiments/training a scaled-up imagen model, the better).

lucidrains commented 2 years ago

hi Sean, unfortunately i talked with someone in the city who was using pytorch lightning for work and decided against using it after hearing about his experience

i'll probably be betting on accelerate instead, or simply writing light wrappers around deepspeed or fairscale

thank you for all your time responding to my questions in this issue

SeanNaren commented 2 years ago

hi Sean, unfortunately i talked with someone in the city who was using pytorch lightning for work and decided against using it after hearing about his experience

i'll probably be betting on accelerate instead, or simply writing light wrappers around deepspeed or fairscale

thank you for all your time responding to my questions in this issue

Totally fine, I trust you to make the right decision.

Would you be able to summarize the feedback you received? Mainly out of curiosity (and in the hope of making changes to the library in the future!)

AlvL1225 commented 2 years ago

hi @lucidrains, how is your accelerate integration going?

lucidrains commented 2 years ago

@SeanNaren what i'm hearing over and over again is that it is great when it works, but when you run into a bug or an edge case that doesn't fit the mold, the internals are too complicated to figure out. not necessarily saddled with technical debt, but just too convoluted, with too many layers of abstraction. in a lot of these cases, the researcher or research engineer gets stuck for weeks if not a month trying to figure it out with little support, and that's not a position i want to be in

meanwhile, word is that accelerate is already functioning for dalle2-pytorch, with zero issues. so it is kind of a sign for me to just go with that, since the two repositories are largely similar

thanks Sean for helping me up to this point! :pray:

lucidrains commented 2 years ago

it'll be done this week!

nateraw commented 2 years ago

@lucidrains are you tracking the accelerate integration with an issue anywhere? 😄 would be great to open one up so I can tag any relevant folks if you run into issues.

lucidrains commented 2 years ago

@nateraw i just talked to Romain over at Laion this morning, and he ran into zero issues scaling accelerate up to 800 GPUs for DALLE2-pytorch on regular DDP
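
for anyone following along, the accelerate pattern in question is roughly this (a sketch with placeholder model/data, not the actual dalle2-pytorch training code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()            # picks up the DDP config from `accelerate launch`
model = nn.Linear(512, 512)            # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
dataloader = DataLoader(torch.randn(256, 512), batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    loss = model(batch).pow(2).mean()  # placeholder loss
    accelerator.backward(loss)         # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# launched with e.g.: accelerate launch train.py
```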

PiotrDabkowski commented 2 years ago

@lucidrains @SeanNaren Exactly my experience with PyTorch Lightning: great when it works, but too unreliable and convoluted. Just looking at the code (e.g. the log method) reminds me of the TensorFlow codebase mess, and it looks like it cannot work reliably... The last thing you want is to be stuck because of some weird bug in an overly complex and poorly tested training framework. Keep it simple. Switching to accelerate as well.

robflynnyh commented 2 years ago

I had a bit of memory overhead when using pytorch-lightning with DDP about 6 months ago. I was fine-tuning a billion-param hubert (ASR) model and trying to squeeze it onto 4 V100s, and kept getting OOM errors. I changed the code to vanilla pytorch and was able to fit it, with otherwise the same configuration, so I'm not sure where the overhead was coming from, but switching to vanilla pytorch gave me the extra ~100mb I needed to run the model, so I've avoided it since lol

williamFalcon commented 2 years ago

hi Sean, unfortunately i talked with someone in the city who was using pytorch lightning for work and decided against using it after hearing about his experience

i'll probably be betting on accelerate instead, or simply writing light wrappers around deepspeed or fairscale

thank you for all your time responding to my questions in this issue

hey @lucidrains! Lightning founder here!

First off, no doubt you're considering some of the best training abstractions. I'll add a few more insights about Lightning that aren't really discussed here.

  1. the person you spoke with is very likely using LightningModule, not LightningLite. LightningLite uses fewer abstractions, which makes debugging a lot easier. The experience is full, raw PyTorch control.
  2. LightningLite is being recommended without disclosing the fact that you will end up needing something like LightningModule in 6 months to a year… (the same applies to Accelerate). It's deceptively easy to get started, but if you want to do training at scale, there's no free lunch 😊. This is also why Facebook, NVIDIA and other massive companies use Lightning. LightningLite at least gives you the ability to delay adopting the full complexity until later. In fact, Lightning has its own accelerator abstraction, introduced in the summer of 2020, that has already been battle-tested at the 1T-parameter scale and has been powering Lightning since day one (i.e. what FAIR and I were using to train on the Facebook cluster daily, consistently across 1024 GPUs, with apex + multi-node + all the other bells and whistles).
  3. Lightning has trained 1T+ parameter models before, as was mentioned here. BLOOM and such didn't even actually use the HF library… whereas I wrote the core distributed logic of Lightning at Facebook AI Research to do exactly this kind of SSL and massive model training (at Facebook, Lightning runs on 1024+ GPUs daily for enterprise workloads), alongside the large-scale public examples @SeanNaren already pointed out.

In terms of support, for key projects, we actually have dedicated channels that are always being watched (your friend likely is not part of one of those channels).

There’s zero reason you wouldn’t be able to get almost instant help from our team @lucidrains. Again, Lightning is used by 10,000+ companies to do massive, large-scale training and deployment.

My suggestion is you give Lightning a try with our help and decide for yourself later on.

Join our slack here https://join.slack.com/t/pytorch-lightning/shared_invite/zt-1bqiy6kpt-x~2PBicDp~z_rF8r8l3vcg

@lantiga and @awaelchli can help you here.