ZhiyuanChen opened 1 month ago
The whole team appreciates your feedback, as we understand it comes from someone who has worked closely with us in the past and has contributed to accelerate quite a bit. As for adding "bloat": feature creep is a thing that happens as time goes by. It can be frustrating, especially when our focus isn't what you want it to be, and I can completely respect and understand that.
I'd like to address what you've written here (and somewhat attempt to rewrite it differently) in three different points:
1. `Accelerator` is a navigator for your code; we don't handle things "automatically" in this regard. You can completely ignore this side of the API and everything runs as normal.
2. `unwrap`? Just call `save_state`/`load_state`? This API has three APIs to do one thing... which is not good. There's a lack of a clear path/direction there, which I understand.
3. `PartialState`. If you don't want this bloat, just do `state = PartialState()` and the distributed-process portion is done. If you want automatic mixed precision, then you can use the `Accelerator` itself and call `prepare()`.

Steps forward:
With 1.0 and its roadmap, there's an attempt to bring more of a focus back into the training backend rather than everything else happening or as you call it "bloat". The attempt is to consolidate, simplify, and make downstream implementations easier and less... frankly weird.
With the `Trainer`... that's a whole other problem. For a while I've wanted to gut 99% of that thing and just make a "mini" version that doesn't have a million `TrainingArguments`. This might live in Accelerate, or this might be something we try to do on the `transformers` side, because it is truly a confusing class.
I appreciate all of your comments here, and while we hope that you'll stick it out a little longer, or someday return to Accelerate, we also understand if you choose not to. Hopefully, if you run into accelerate in the future, we just might look more like what you'd hoped we'd be.
As long as this discussion/issue remains civil, I'm very open to listening to constructive criticism on where we can improve, and working towards this in the very near future/immediately.
Thank you for your prompt response. And I do apologise if this is hard to read, as there were too many conflicts when I was refactoring my codebase and I got very upset.
For tracking: I think what users need is not an accelerator with tracking capabilities, but a standalone metric library that computes metrics in a distributed context. When dealing with a large training cluster, the last thing I want is unnecessary synchronization.
If you are interested, I wrote a Metrics library before. It can compute metrics at different sync levels (GPU level and node level for intermediate logging, and cluster level for evaluation).
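To make the sync-level idea concrete, here is a minimal stdlib sketch of a metric that can be computed at GPU (process), node, or cluster level. This is a hypothetical illustration, not the actual library: real code would run a collective such as `torch.distributed.all_gather` over the appropriate process group, whereas here a plain in-process registry simulates the cluster so the example is self-contained.

```python
"""Sketch: a mean metric with explicit sync levels. All names are
hypothetical; the registry stands in for real distributed collectives."""

from collections import defaultdict

# Simulated cluster state: node id -> list of published (total, count) pairs.
_REGISTRY = defaultdict(list)

class MeanMetric:
    def __init__(self, rank, node):
        self.rank = rank
        self.node = node
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # Local accumulation: no communication at all.
        self.total += value
        self.count += 1

    def publish(self):
        # Make local state visible to the simulated collectives.
        _REGISTRY[self.node].append((self.total, self.count))

    def compute(self, level="gpu"):
        if level == "gpu":        # intermediate logging: local state only
            states = [(self.total, self.count)]
        elif level == "node":     # intermediate logging: this node's processes
            states = _REGISTRY[self.node]
        else:                     # "cluster": full sync, e.g. for evaluation
            states = [s for node in _REGISTRY.values() for s in node]
        total = sum(t for t, _ in states)
        count = sum(c for _, c in states)
        return total / count
```

In a real implementation the `level` argument would select which process group the all-reduce runs over, so cheap per-GPU logging never forces a cluster-wide synchronization.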
I used to hope to merge this into torcheval
but it seems Facebook is not actively working on auxiliary libraries like this.
If you think it can be a good addition to any HuggingFace library, I would be very happy to contribute.
For checkpointing: I do not mean it is a bad idea to provide APIs for saving checkpoints. What I meant is that as accelerate bloats, the only initialization is still `Accelerator()`. DeepSpeed, on the other hand, provides a separate `init_distributed` API that allows users to initialize the process group first and do other setup before initializing the engine.
`PartialState` is an excellent idea, but its API is less intuitive. From its name, I can tell that `deepspeed.init_distributed` will initialize the distributed environment for me. But `PartialState` is really confusing. What should I pass to make it set up the distributed environment? What will happen if I create an `Accelerator` later? I have to look into the implementation details to find the answers, and this increases the learning cost.
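The two-phase pattern being asked for here can be sketched in plain Python. Everything below is hypothetical (`init_distributed` and `Engine` are illustrative names, not real accelerate or DeepSpeed APIs, and the "process group" is just a module-level singleton); the point is the design: explicit distributed init first, engine construction later, with room for user setup in between.

```python
"""Sketch: two-phase initialization (distributed init, then engine)."""

_process_group = None

def init_distributed(world_size=1, rank=0):
    """Phase 1: set up the (simulated) process group, and nothing else.
    Calling it twice is a no-op, mirroring idempotent init APIs."""
    global _process_group
    if _process_group is None:
        _process_group = {"world_size": world_size, "rank": rank}
    return _process_group

class Engine:
    """Phase 2: the training engine assumes the group already exists,
    so user code can run broadcasts/setup between the two phases."""
    def __init__(self, model):
        if _process_group is None:
            raise RuntimeError("call init_distributed() before Engine()")
        self.model = model
        self.rank = _process_group["rank"]
```

With this split, experiment naming, seed broadcasting, or any other setup that needs the process group can run after `init_distributed()` but before the engine ever exists.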
For the trainer: please, please make it not a class in transformers.
I have a simple runner based on accelerator, and training a ResNet on MNIST is as simple as this demo.
I agree the training hyperparameters are overwhelming, so I wrote a config library to automatically parse parameters. See more here.
After all, I think the reason why I'm not a huge fan (I'm still a fan) of accelerate is its lack of modularized APIs. Everything is put into `Accelerator` calls, and this confuses me: what should I do if I just want to do one thing?
I agree with many of the points @ZhiyuanChen is making here; I personally like accelerate just to manage the distributed stuff and device management, and have no interest in other stuff like the trackers.
@muellerzr If you are looking for inspiration for a stripped back trainer that can be easily extended, take a look at this: accelerate based trainer. If this is the direction that you want to go, I'd be happy to help you integrate and deprecate mine.
To be frank, accelerate is designed to have as many utilities as you may want, or need, for distributed training, but we do not force them upon you. You don't have to use the tracking. You don't even have to use the Accelerator, you can just use the CLI. So I disagree with this argument.
In regard to the Trainer, I'm very familiar with your project, and I still don't believe it should exist here (for now); instead, I am trying to solve this in transformers itself. It will take some time, since I've had my focus solely on Accelerate for quite a while due to limited manpower. However, I will soon be looking into the Trainer, how I can simplify it, and how to abstract it in such a way that it doesn't take 1000 LOC to work with.
> To be frank, accelerate is designed to have as many utilities as you may want, or need, for distributed training, but we do not force them upon you. You don't have to use the tracking. You don't even have to use the Accelerator, you can just use the CLI. So I disagree with this argument.
I believe the reason I like Accelerate is that it's handy and that "we do not force them upon you".
I think my main disappointment is that the `Accelerator` is fed too many things.
The philosophy (I assume) behind accelerate is Composition Over Inheritance. Yet the `Accelerator` is neither composition nor inheritance.
I think tracking is also super important, but I want it to be a separate standalone module, so that I don't have to dig through complicated cross-module calls to find its implementation details.
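What "composition over inheritance" could look like here can be sketched in a few lines. All class names below are hypothetical illustrations of the design being asked for, not existing accelerate APIs: tracking and checkpointing are standalone objects a runner merely holds, so each can be used, inspected, or replaced on its own.

```python
"""Sketch: a runner composed of optional, standalone components."""

class Tracker:
    """Standalone: usable and testable without any runner."""
    def __init__(self):
        self.history = []
    def log(self, **metrics):
        self.history.append(metrics)

class Checkpointer:
    """Standalone: just knows how to snapshot a state dict."""
    def __init__(self):
        self.snapshots = []
    def save(self, state):
        self.snapshots.append(dict(state))

class Runner:
    """Composes the pieces it is given; no component is mandatory."""
    def __init__(self, tracker=None, checkpointer=None):
        self.tracker = tracker
        self.checkpointer = checkpointer
        self.state = {"step": 0}
    def step(self, loss):
        self.state["step"] += 1
        if self.tracker:               # optional, not baked into the runner
            self.tracker.log(step=self.state["step"], loss=loss)
    def checkpoint(self):
        if self.checkpointer:
            self.checkpointer.save(self.state)
```

Because the components are plain objects, finding out how tracking works means reading one small class, not following calls across modules.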
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
from accelerate import Accelerator
Expected behavior
I have been a huge fan of accelerate, and have used it for almost every project I own.
And it's probably the right time to say good bye.
I believe a PyTorch Trainer should do two jobs:
Accelerate was first developed to solve the first problem. Its performance is really bad, but Usability over Performance is the most important principle of PyTorch, and I agree with it. Then, at some point, accelerate started trying to do the second job, and that's when the nightmare started.
As a researcher, I start dozens of experiments each day. They can be classified into 3 categories:
I designed an experiment-name generation pipeline to help me identify the purpose of each experiment.
The first thing it needs is the initialisation of the distributed environment: I use a random string and a timestamp as part of the name, and since each process has its own random state and timestamp, I need to perform a broadcast before I can tell the name of an experiment.
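The naming problem above can be sketched with the stdlib. Each simulated rank draws its own random suffix, so without a broadcast every process disagrees on the run name; `broadcast` here is a trivial stand-in for a real collective such as `torch.distributed.broadcast_object_list`, and all names are illustrative.

```python
"""Sketch: why experiment-name generation needs a broadcast."""

import random
import string
import time

def make_local_name(rng):
    # Random suffix + timestamp: differs per process without a broadcast.
    suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
    return f"run-{int(time.time())}-{suffix}"

def broadcast(values, src=0):
    """Simulated broadcast: every rank adopts the source rank's value."""
    return [values[src]] * len(values)

# Each rank has its own RNG state, so local names all differ.
world_size = 4
local_names = [make_local_name(random.Random(rank)) for rank in range(world_size)]

# After the broadcast, all ranks agree on rank 0's name.
agreed_names = broadcast(local_names, src=0)
```

This is exactly why the distributed environment must exist before the trainer can even name the experiment it is about to run.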
This is not a unicorn use case. Another example is the random seed.
I wish to auto-generate a random seed if one is not explicitly given, for reproducibility. This must happen after the initialisation of accelerate, because again it needs a broadcast. But it is part of the trainer initialisation, which should happen before initialising `Accelerator`, as the trainer provides the necessary information (building all the kwargs handlers) for `Accelerator` to initialise.
Huggingface seems to have a thing with functional programming. I am not saying object-oriented is better than functional. But look at Accelerate.
You do not want a Runner class, but `Accelerator` stores more information than a normal Runner. It is a de facto Runner class that just cannot be initialised like a Runner.
You could write a runner that automatically adds components to internal states, like PyTorch's `nn.Module`, and prepare them in a `__post_init__` function. But to make it functional, you have to write a super complicated `prepare` function that complains when the prepare order is incorrect in certain cases.
You could write a runner that maintains a link between `model`, `gradient_state`, `optimiser` and `scheduler`. But to make it functional, you have to wrap everything so that it can access "global" information. What if there are multiple trainers for multiple models?
I used to love Accelerate for its simplicity. But it's becoming a Frankenstein.
I don't know if you can recall it, but the first philosophy of Unix is: "Make each program do one thing well."
Not only is my trainer a shit mountain because of accelerator; the Transformers trainer is also becoming a shit mountain.
We are trained machine learning engineers and researchers. That's why we have access to hundreds, or even thousands of GPUs, and why we need Accelerate. We know how to write Python. Simplicity should not sacrifice flexibility.
The simplicity is also sacrificed for functionality. I have an `LRScheduler` that scales the lr given `_step_count` and `total_steps`. With gradient accumulation, should `total_steps` be the actual `total_steps`? Or should it be `trainable_steps`? I recall one version of Accelerate just directly calls `next` multiple times to get the correct lr.
I have no idea why, instead of focusing on improving the performance of Accelerate, you chose to add tons of useless features like logging. I just double-checked the About of Accelerate, and it says:
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
Is logging even a part of this statement?
I have no idea why people are obsessed with Frankensteins nowadays.
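Returning to the `LRScheduler` question above: a small stdlib sketch of why `total_steps` is ambiguous under gradient accumulation. The scheduler (`linear_lr`) and all variable names are hypothetical; the point is that scaling against forward steps versus optimizer steps yields different schedules.

```python
"""Sketch: lr schedule ambiguity under gradient accumulation."""

def linear_lr(base_lr, step_count, total_steps):
    """Simple linear decay from base_lr toward 0 over total_steps."""
    return base_lr * (1 - step_count / total_steps)

base_lr = 0.1
forward_steps = 100          # batches seen by the model
accumulation = 4             # gradients accumulated per optimizer step
optimizer_steps = forward_steps // accumulation  # 25 actual updates

# Halfway through training, at forward step 50:
# (a) scale against forward steps: progress 50/100
lr_vs_forward = linear_lr(base_lr, 50, forward_steps)
# (b) scale against optimizer steps: progress 12/25 (50 // 4 = 12)
lr_vs_optim = linear_lr(base_lr, 50 // accumulation, optimizer_steps)
```

The two readings disagree slightly at most points of training, which is presumably why an implementation might resort to calling `next` on the scheduler multiple times per update rather than defining which step count it means.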
If I were too lazy to write anything, I could have used PyTorch Lightning. Accelerate was targeting people who write their own training code. Who is your target customer now?