Open teja-rao opened 5 months ago
@LucasLLC any comment?
This was a point of significant contention when we were reviewing the design for async_save. Essentially we went back and forth on this issue, and ended up leaning towards separate functions as a safeguard against additional parameters which may be async-relevant only.
Fwiw, I agree with @kirtiteja
🚀 The feature, motivation and pitch
Most PyTorch methods use an `async_op` parameter to perform operations asynchronously, returning a union of a Future and the regular return type. E.g.:
```python
def broadcast(tensor, src, group=None, async_op=False):
    ...
```
But for checkpointing, we use two separate methods, async_save and save, whose signatures are exactly the same except for the return type. Having one method with an `async_op` parameter would make the API consistent with the rest of PyTorch, reduce cognitive overhead for developers, make the async save capability easy to discover, and cut boilerplate duplication both in PyTorch and in user code.
Today, most users have to write something like:
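(The original snippet appears to have been lost; this is a minimal sketch of the duplication, using stand-in `save`/`async_save` helpers rather than the actual torch.distributed.checkpoint implementations.)

```python
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def save(state_dict, checkpoint_id):
    # Stand-in for a synchronous save: blocks until the checkpoint is written.
    return checkpoint_id

def async_save(state_dict, checkpoint_id) -> Future:
    # Stand-in for an asynchronous save: returns a Future the caller must track.
    return _executor.submit(save, state_dict, checkpoint_id)

def checkpoint(state_dict, checkpoint_id, use_async: bool):
    # With two separate methods, every caller carries an if/else and two code paths.
    if use_async:
        return async_save(state_dict, checkpoint_id)  # caller must hold the Future
    save(state_dict, checkpoint_id)
    return None
```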
Using one method would allow more flexible and simpler code:
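(The original snippet appears to have been lost; below is a hedged sketch of what a single entry point with an `async_op` flag could look like. The names and signature are illustrative, not the actual torch.distributed.checkpoint API.)

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional

_executor = ThreadPoolExecutor(max_workers=1)

def _write_checkpoint(state_dict, checkpoint_id):
    # Placeholder for the actual storage write.
    return checkpoint_id

def save(state_dict, checkpoint_id, async_op: bool = False) -> Optional[Future]:
    # Mirrors the broadcast-style convention: returns a Future when
    # async_op=True, otherwise blocks and returns None.
    if async_op:
        return _executor.submit(_write_checkpoint, state_dict, checkpoint_id)
    _write_checkpoint(state_dict, checkpoint_id)
    return None

# Callers toggle behavior with one flag instead of choosing between two APIs:
fut = save({"w": 1}, "ckpt-0", async_op=True)
fut.result()               # wait only when needed
save({"w": 1}, "ckpt-1")   # synchronous path, same function
```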
I believe checkpointing should adopt the goal of supporting training at scale, as this is critical for internal consolidation of checkpointing and for building a community around it. In large-scale training, async checkpointing is the more widely used variant and is arguably the main checkpointing API, so consolidating these methods would greatly simplify integration.
Alternatives
No response
Additional context
No response
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k