Many AI developers worry about their training jobs stopping unexpectedly, for example because the instance runs out of disk space or memory.

Feature adding

Can we expand the initial disk to 200 GB, instead of the current 100 GB? In the era of LLMs, model weights take up a lot of disk space before training even starts.

For example, here is the detailed disk consumption from my first training experience on the BitDeer platform: Anaconda takes 30 GB, the weights of two llama-7b models take 26 GB, and each checkpoint takes 10 GB when it keeps some intermediate training state.
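To make the arithmetic concrete, here is a rough sketch of how many checkpoints fit on each disk size. It uses only the figures reported above; the function name `max_checkpoints` and the `reserve_gb` parameter are hypothetical, and the estimate ignores the OS image, datasets, and logs, so real headroom will be smaller.

```python
# Figures taken from the report above, in GB.
ANACONDA_GB = 30
MODEL_WEIGHTS_GB = 26  # weights of two llama-7b models
CHECKPOINT_GB = 10     # one checkpoint with intermediate training state


def max_checkpoints(disk_gb, reserve_gb=0):
    """Return how many checkpoints fit after the fixed costs are paid.

    reserve_gb is an illustrative safety margin, not a platform setting.
    """
    fixed = ANACONDA_GB + MODEL_WEIGHTS_GB
    free = disk_gb - fixed - reserve_gb
    return max(free // CHECKPOINT_GB, 0)


for disk_gb in (100, 200):
    print(f"{disk_gb} GB disk: room for {max_checkpoints(disk_gb)} checkpoints")
```

With a 100 GB disk, the 56 GB of fixed usage leaves room for only 4 checkpoints; a 200 GB disk leaves room for 14.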

After the job starts, it is difficult to mount or add newly purchased disks without stopping it.
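Until a larger default is available, one possible workaround is to check free space before each checkpoint is written, so the job can skip or prune checkpoints instead of crashing. The sketch below is only an illustration: `has_room_for_checkpoint`, `margin_gb`, and the `free_bytes` override are hypothetical names, and the 10 GB checkpoint size is the figure from my own run, treated here as GiB.

```python
import shutil

GIB = 1024 ** 3


def has_room_for_checkpoint(path, checkpoint_gb=10, margin_gb=2, free_bytes=None):
    """Return True if the disk holding `path` can take one more checkpoint.

    free_bytes can be passed in directly (for example in tests); otherwise
    it is read from the filesystem with shutil.disk_usage.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= (checkpoint_gb + margin_gb) * GIB


if __name__ == "__main__":
    if has_room_for_checkpoint("."):
        print("Enough space for the next checkpoint")
    else:
        print("Low disk space: prune old checkpoints before saving")
```

A training loop could call this check right before saving and delete the oldest checkpoint when it returns False.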

So I would suggest initializing instances with a 200 GB disk.

hejing / instance_containize

[Feature] Give more guidance to users when they want to attach new disks #2

Feature adding