huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Time to move on #2995

Open ZhiyuanChen opened 1 month ago

ZhiyuanChen commented 1 month ago

System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-4.19.90-24.4.v2101.ky10.aarch64-aarch64-with-glibc2.28
- Python version: 3.10.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 254.07 GB
- GPU type: NVIDIA A100-PCIE-40GB
- `Accelerate` default config:
        Not found

Reproduction

from accelerate import Accelerator

Expected behavior

I have been a huge fan of Accelerate and have used it for almost every project I own.

But it is probably the right time to say goodbye.

I consider that a PyTorch Trainer should do two jobs:

  1. set up the training environment: distributed setup, mixed precision, etc.
  2. manage experiments: logging, saving checkpoints, etc.

Accelerate was first developed to solve the first problem. Its performance is really bad, but usability over performance is the most important principle of PyTorch, and I agree with it. Then, at some point I cannot pinpoint, Accelerate started trying to do the second job as well, and that is when the nightmare started.

As a researcher, I start dozens of experiments each day. They can be classified into three categories:

  1. Modification: new model structure, new pipeline
  2. Tuning: new hyper-parameter
  3. Validation: new dataset / new random seed / new precision

I designed an experiment-name generation pipeline to help me identify the purpose of each experiment.

The first thing it needs is the initialisation of the distributed environment: since I use a random string and a timestamp as part of the name, and each process has its own random state and timestamp, I need to perform a broadcast before I can determine the name of an experiment.

This is not a unicorn use case. Another example is the random seed.

I wish to auto-generate a random seed for reproducibility when one is not explicitly given. This must happen after the initialisation of Accelerate, because again it needs a broadcast. But it is also part of the trainer initialisation, which should happen before initialising the Accelerator, since the trainer provides the necessary information (it builds all the kwargs handlers) for the Accelerator to initialise.
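Roughly what I mean, sketched with invented names (the name format, the payload layout, and the use of broadcast_object_list are just for illustration, not my actual pipeline): the process group has to exist before any rank can know the agreed-upon name or seed.

```python
import secrets
import time

from accelerate import Accelerator
from accelerate.utils import broadcast_object_list

accelerator = Accelerator()  # distributed init has to happen first

# Generate the run name and seed on the main process only, then broadcast so
# that every rank ends up with identical values.
payload = [None, None]
if accelerator.is_main_process:
    payload = [
        f"{time.strftime('%Y%m%d-%H%M%S')}-{secrets.token_hex(4)}",
        secrets.randbits(32),
    ]
payload = broadcast_object_list(payload, from_process=0)
experiment_name, seed = payload  # identical on every process
```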

Hugging Face seems to have a thing for functional programming. I am not saying object-oriented is better than functional, but look at Accelerate.

You do not want a Runner class, yet the Accelerator stores more information than a normal Runner. It is a de facto Runner class that just cannot be initialised like one.

You could write a runner that automatically registers components in its internal state, like PyTorch's nn.Module does, and prepares them in a __post_init__ function. But to keep it functional, you instead have to write a very complicated prepare function that complains when the prepare order is incorrect in certain cases.
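For reference, the functional style I am contrasting against looks roughly like this; the toy model, optimiser, and data are made up purely for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator

# Toy components so the snippet is self-contained.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(
    TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,))), batch_size=4
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

accelerator = Accelerator()
# Everything is funnelled through a single prepare() call, which wraps each
# object and returns the wrapped versions in the same order they were passed.
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)
```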

You could write a runner that maintains the link between model, gradient_state, optimiser, and scheduler. But to keep it functional, you have to wrap everything so that it can access "global" information. What if there are multiple trainers for multiple models?

I used to love Accelerate for its simplicity. But it is becoming a Frankenstein.

I don't know if you recall that the first philosophy of Unix is to:

Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".

Not only is my trainer a shit mountain because of the Accelerator; the Transformers Trainer is also becoming a shit mountain.

We are trained machine learning engineers and researchers. That's why we have access to hundreds, or even thousands of GPUs, and why we need Accelerate. We know how to write Python. Simplicity should not sacrifice flexibility.

Simplicity is also sacrificed for functionality. I have an LRScheduler that scales the lr given _step_count and total_steps. With gradient accumulation, should total_steps be the actual number of steps, or should it be the number of trainable (optimizer) steps? I recall one version of Accelerate just directly called next multiple times to get the correct lr.
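A small sketch of the ambiguity (the numbers and the linear-decay rule are invented): with gradient accumulation, a schedule counted in batches and one counted in optimizer updates give different curves, so total_steps has to match whichever unit the scheduler is actually stepped on.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

num_batches = 1000        # batches seen per epoch
grad_accum = 4            # gradient accumulation steps
optimizer_updates = num_batches // grad_accum  # 250 real optimizer steps

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# If scheduler.step() is called once per optimizer update, total_steps must be
# optimizer_updates (250); if it is called once per batch, it must be
# num_batches (1000). Mixing the two silently stretches or truncates the decay.
total_steps = optimizer_updates
scheduler = LambdaLR(optimizer, lambda step: max(0.0, 1 - step / total_steps))
```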

I have no idea why, instead of focusing on improving the performance of Accelerate, you chose to add tons of useless features like logging. I just double-checked the About of Accelerate, and it says: "🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support". Is logging even a part of this statement?

I have no idea why people are obsessed with Frankenstein nowadays.

If I were too lazy to write anything myself, I could have used PyTorch Lightning. Accelerate was targeted at people who write their own training code. What is your target customer now?

muellerzr commented 1 month ago

The whole team appreciates your feedback, as we understand it comes from someone who has worked closely with us in the past and has contributed to Accelerate quite a bit. As for the "bloat": feature creep is something that happens as time goes by. It can be frustrating, especially when our focus isn't what you want it to be, and I can completely respect and understand that.

I'd like to address what you've written here (and somewhat attempt to rewrite it differently) in three different points:

  1. Why tracking? We saw that users need to deal with trackers and want an API for doing so in distributed contexts, because at the time it required many manual checks for which process you were on. At the end of the day it's a navigator for handling calls/logging, but we do no more than that. The Accelerator is a navigator for your code; we don't handle things "automatically" in this regard. You can completely ignore this side of the API and everything runs as normal (see the sketch after this list).
  2. Saving checkpoints... you've lost me there. Checkpointing was already around when you were first contributing to Accelerate, as there are numerous behaviors involved with checkpointing when it comes to FSDP/DeepSpeed and the like. I don't disagree that right now figuring out how to checkpoint is a bit of a nightmare: should you call unwrap? Just call save_state/load_state? There are three APIs to do one thing... which is not good. There's a lack of a clear path/direction there, which I understand.
  3. Re: 1, this is the exact purpose of our refactor and the introduction of PartialState. If you don't want this bloat, just do state = PartialState() and the distributed-process portion is done. If you want automatic mixed precision, then you can use the Accelerator itself and call prepare().
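Here is a quick sketch of how opt-in the tracking side from point 1 is (the run name, the logged value, and the choice of tensorboard are placeholders, and it assumes tensorboard is installed); none of it runs unless you ask for it:

```python
from accelerate import Accelerator

# Tracking only activates when log_with is passed and init_trackers is called.
accelerator = Accelerator(log_with="tensorboard", project_dir="runs")
accelerator.init_trackers("my-run")
accelerator.log({"train/loss": 0.123}, step=1)
accelerator.end_training()
```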

Steps forward:

With 1.0 and its roadmap, there's an attempt to bring the focus back onto the training backend rather than everything else, or as you call it, "bloat". The attempt is to consolidate, simplify, and make downstream implementations easier and less... frankly, weird.

With the Trainer... that's a whole other problem. For a while I've wanted to gut 99% of that thing and just make a "mini" version that doesn't have a million TrainingArguments. This might live in Accelerate, or it might be something we try to do on the transformers side, because it is truly a confusing class.

I appreciate all of your comments here, and while we hope that you'll stick it out a little longer, or someday return to Accelerate, we also understand if you choose not to. Hopefully, if you run into Accelerate in the future, we just might look more like what you'd hoped we'd be.

As long as this discussion/issue remains civil, I'm very open to listening to constructive criticism on where we can improve, and working towards this in the very near future/immediately.

ZhiyuanChen commented 1 month ago

Thank you for your prompt response. I do apologise if this was hard to read; there were too many conflicts when I was refactoring my codebase, and I got very upset.

For tracking: I think what users need is not an accelerator with tracking capabilities, but a standalone metric library that computes metrics in a distributed context. When dealing with a large training cluster, the last thing I want is extra synchronisation.

If you are interested, I wrote a Metrics library before. It can compute metrics at different sync levels (GPU level and node level for intermediate logging, and cluster level for evaluation); a rough sketch of the idea follows.
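This is not that library, just a minimal sketch of the sync-level idea using plain torch.distributed (the class and method names are invented for illustration):

```python
import torch
import torch.distributed as dist

class MeanMetric:
    """Running mean that can be reduced over different process groups."""

    def __init__(self, device="cpu"):
        self.total = torch.zeros(1, device=device)
        self.count = torch.zeros(1, device=device)

    def update(self, values: torch.Tensor) -> None:
        self.total += values.detach().float().sum().to(self.total.device)
        self.count += values.numel()

    def compute(self, group=None) -> float:
        # group=None reduces over the whole cluster (evaluation); passing a
        # node-local group created with dist.new_group() gives node-level
        # numbers for intermediate logging.
        total, count = self.total.clone(), self.count.clone()
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(total, group=group)
            dist.all_reduce(count, group=group)
        return (total / count).item()
```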

I had hoped to merge this into torcheval, but it seems Facebook is not actively working on auxiliary libraries like this. If you think it could be a good addition to any Hugging Face library, I would be very happy to contribute.

For checkpointing, I do not mean it is a bad idea to provide APIs for saving checkpoints. What I meant is that as Accelerate bloats, the only initialisation entry point is still Accelerator(). DeepSpeed, on the other hand, provides a separate init_distributed API that allows users to initialise the process group first and do other setup before initialising the engine.

PartialState is an excellent idea, but its API is less intuitive. From its name, I can tell that deepspeed.init_distributed will initialise the distributed environment for me, but PartialState is really confusing. What should I pass to make it set up the distributed environment? What happens if I create an Accelerator later? I have to look into the implementation details to find the answers, and this increases the learning cost.
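A side-by-side sketch of the two initialisation styles I am comparing (the DeepSpeed half assumes deepspeed is installed; both halves are illustrative only):

```python
# DeepSpeed: the two phases have explicit, separately named entry points.
import deepspeed

deepspeed.init_distributed()   # just the process group
# ... run-name/seed setup here, then deepspeed.initialize(...) builds the engine

# Accelerate: the same split exists, but under a less obvious name.
from accelerate import Accelerator, PartialState

state = PartialState()         # also just the process group
# ... run-name/seed setup here; a later Accelerator() reuses this shared state
accelerator = Accelerator()
```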

For the Trainer: please, please do not make it a class inside transformers.

I have a simple runner based on the Accelerator, and training a ReNet on MNIST is as simple as this demo.

I agree that the training hyper-parameters are overwhelming, so I wrote a config library to automatically parse parameters. See more here.

After all, I think the reason I'm not a huge fan (I'm still a fan) of Accelerate is its lack of modularised APIs. Everything is put into Accelerator calls, and this confuses me: what should I do if I just want to do one thing?

Chris-hughes10 commented 1 week ago

I agree with many of the points @ZhiyuanChen is making here; I personally like accelerate just for managing the distributed setup and device placement, and have no interest in other things like the trackers.

@muellerzr If you are looking for inspiration for a stripped-back trainer that can be easily extended, take a look at this: accelerate based trainer. If this is the direction you want to go, I'd be happy to help you integrate it and deprecate mine.

muellerzr commented 1 week ago

To be frank, accelerate is designed to have as many utilities as you may want, or need, for distributed training, but we do not force them upon you. You don't have to use the tracking. You don't even have to use the Accelerator; you can just use the CLI. So I disagree with this argument.

In regard to the Trainer, I'm very familiar with your project, and I still don't believe it should exist here (for now); instead, I am trying to solve this in transformers itself. It will take some time, since my focus has been solely on Accelerate for quite a while due to manpower, but I will soon be looking into the Trainer, how I can simplify it, and how to abstract it in such a way that it isn't 1000 LOC to work with.

ZhiyuanChen commented 1 week ago

> To be frank, accelerate is designed to have as many utilities as you may want, or need, for distributed training, but we do not force them upon you. You don't have to use the tracking. You don't even have to use the Accelerator; you can just use the CLI. So I disagree with this argument.

I believe the reason I like Accelerate is that it is handy and that "we do not force them upon you".

I think my main disappointment is that the Accelerator is fed too many things. The philosophy behind Accelerate is (I assume) composition over inheritance, yet the Accelerator is neither composition nor inheritance. I think tracking is also super important, but I want it to be a separate, standalone module, so that I don't have to dig through complicated cross-module calls to find out its implementation details.