Inconsistent OOMs occur during long-running jobs

knagrecha / hydra

Execution framework for multi-task model parallelism. Enables the training of arbitrarily large models with a single GPU, with linear speedups for multi-gpu multi-task execution.

Apache License 2.0

Inconsistent OOMs occur during long-running jobs #2

Open knagrecha opened 2 years ago

knagrecha commented 2 years ago

Problem:

Because the Pilot partitioner's memory estimation is inexact, it often underestimates the memory cost of a minibatch pass. During training, the model then exceeds its allocated memory bounds and the job fails with an out-of-memory (OOM) error. This typically happens during the backward pass.

Quick fix: Increase the double-buffer space, which reduces shard sizes and leaves more free memory as headroom.

Longer-term fix: Replace the Pilot partitioner with a more exact algorithm, or with one that does not push right up against the memory bounds.
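To illustrate why reserving more buffer space produces smaller shards, here is a minimal sketch. It is not Hydra's partitioner: `partition_layers`, `buffer_fraction`, and the per-layer costs are hypothetical names and numbers chosen for the example, and the greedy packing is only a stand-in for whatever the Pilot partitioner actually does.

```python
from typing import List


def partition_layers(
    layer_costs_mb: List[float],
    gpu_memory_mb: float,
    buffer_fraction: float,
) -> List[List[int]]:
    """Greedily pack consecutive layers into shards (hypothetical helper).

    A fraction of GPU memory is held back as buffer space, so each shard
    must fit within the remaining budget.
    """
    if not 0.0 <= buffer_fraction < 1.0:
        raise ValueError("buffer_fraction must be in [0, 1)")

    # Memory left for a shard after reserving the buffer.
    budget_mb = gpu_memory_mb * (1.0 - buffer_fraction)

    shards: List[List[int]] = []
    current: List[int] = []
    used_mb = 0.0
    for index, cost_mb in enumerate(layer_costs_mb):
        if cost_mb > budget_mb:
            raise ValueError(f"layer {index} alone exceeds the shard budget")
        # Close the current shard when the next layer would not fit.
        if current and used_mb + cost_mb > budget_mb:
            shards.append(current)
            current, used_mb = [], 0.0
        current.append(index)
        used_mb += cost_mb
    if current:
        shards.append(current)
    return shards


# Made-up estimates: six layers of 300 MB each on a 1000 MB device.
layer_costs_mb = [300.0] * 6
small_reserve = partition_layers(layer_costs_mb, 1000.0, buffer_fraction=0.05)
large_reserve = partition_layers(layer_costs_mb, 1000.0, buffer_fraction=0.35)
print(small_reserve)  # [[0, 1, 2], [3, 4, 5]]
print(large_reserve)  # [[0, 1], [2, 3], [4, 5]]
```

With the larger reserve, each shard is estimated at 600 MB instead of 900 MB, so an underestimate of the real cost has more room before it hits the device limit. The trade-off is more shards and therefore more swapping between them.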

csci-acct commented 1 year ago

This is affecting a recent job. Is there a fix?