Closed ver217 closed 1 year ago
@ver217 There are some suggestions regarding the API design:
- `PrecisionBolt`, as `bolt` is rather unclear
- `ParallelismPlugin`, as we can extend to other features such as quantization. I think it is enough to simply name it as `Plugin`.

This issue is migrated to #3046, thus I will close it for now and all discussions will take place in #3046.
Proposal
Motivation
- `if-else`, which is hard to read and modify.
- `Engine` is hard to use. Its usage is very different from native torch, and users may need some effort to learn it before starting their first applications.
- `Engine` is not flexible. It relies on a configuration file or dict and a global context. If we want to run two models with different parallelism methods, that is hard to implement now. It also only supports single-model training, so it cannot support some well-known RL algorithms like PPO.
- `Gemini` and auto-parallelism both have other entry points instead of `Engine`.
Design
We keep engine as the main entry point of colossalai training.
Engine has 6 main components:
Engine's features include:
- `no_sync()`
Engine is not a singleton, though in most cases a single engine is enough.
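
To make the two points above concrete, here is a minimal pure-Python sketch of a non-singleton engine with a `no_sync()` context manager. It is not the proposed ColossalAI API. The class body, the `parallelism` argument, the `backward` method, the `accumulate` helper and the counters are all hypothetical names introduced for illustration. Linking `no_sync()` to gradient accumulation is also an assumption, based on how that name is commonly used in torch DDP.

```python
from contextlib import contextmanager


class Engine:
    """Hypothetical stand-in for the proposed Engine; not the ColossalAI API."""

    def __init__(self, name, parallelism="ddp"):
        # Each instance carries its own settings instead of reading a global context.
        self.name = name
        self.parallelism = parallelism
        self.accumulated = 0.0   # locally accumulated "gradient" (toy scalar)
        self.sync_calls = 0      # how many times gradients were "synchronized"
        self._sync_enabled = True

    @contextmanager
    def no_sync(self):
        """Temporarily skip gradient synchronization inside the with-block."""
        previous = self._sync_enabled
        self._sync_enabled = False
        try:
            yield
        finally:
            # Restore the previous state even if the block raises.
            self._sync_enabled = previous

    def backward(self, grad):
        # Accumulate locally; count a sync only when syncing is enabled.
        self.accumulated += grad
        if self._sync_enabled:
            self.sync_calls += 1


def accumulate(engine, grads):
    """Run backward over micro-batches, syncing only on the last one."""
    *head, last = grads
    with engine.no_sync():
        for g in head:
            engine.backward(g)
    engine.backward(last)


# Two independent engines, e.g. an actor and a critic in an RL setup,
# each with its own (hypothetical) parallelism setting.
actor = Engine("actor", parallelism="ddp")
critic = Engine("critic", parallelism="tensor")

accumulate(actor, [1.0, 2.0, 3.0, 4.0])
accumulate(critic, [0.5, 0.5])
```

Because nothing is stored globally, the actor and critic engines keep separate state. Each one synchronizes once per accumulation window, however many micro-batches the window contains.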
Possible sample code (pseudo-code)
Single-model supervised learning train loop without pipeline
Single-model supervised learning train loop with pipeline
Multi-model RL train loop without pipeline
Possible class definition (pseudo-code)
Further work
Huggingface/accelerate and Lightning/fabric may have a similar design.
We may provide a colossalai plugin / strategy to these libs.