Open xuanhua opened 3 days ago
And here is the core logic of model initialization and training:
# ChatGLMForConditionalGeneration / ChatGLMTokenizer come from the ChatGLM model repo
from peft import LoraConfig, get_peft_model
import deepspeed

# Load the pretrained model into CPU memory only
model = ChatGLMForConditionalGeneration.from_pretrained(
    args.model_dir,
    device_map="cpu")
tokenizer = ChatGLMTokenizer.from_pretrained(args.model_dir)

config = LoraConfig(r=args.lora_r,
                    lora_alpha=32,
                    target_modules=["query_key_value"],
                    lora_dropout=0.1,
                    bias="none",
                    task_type="CAUSAL_LM",
                    inference_mode=False,
                    )

# Create the LoRA model and cast it to fp16
model = get_peft_model(model, config).half()

# Dataset and dataloader setup omitted here
...

# Initialize the DeepSpeed engine
model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                     model=model,
                                                     model_parameters=model.parameters())

# Forward and backward passes omitted; they are something like
# `outputs = model_engine.forward(input_ids=input_ids, labels=labels)` and `model_engine.backward(loss)`
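For context, the elided training loop would look roughly like the sketch below. The loader name `train_loader` and the batch keys are assumptions for illustration, not taken from the original post.

# Minimal sketch of the elided training loop (train_loader and batch keys are assumed)
for step, batch in enumerate(train_loader):
    input_ids = batch["input_ids"].to(model_engine.device)
    labels = batch["labels"].to(model_engine.device)

    outputs = model_engine(input_ids=input_ids, labels=labels)
    loss = outputs.loss

    model_engine.backward(loss)   # DeepSpeed handles loss scaling / gradient accumulation
    model_engine.step()           # optimizer step + gradient zeroing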
96GB of total VRAM should be plenty for a 6B model. I would start with a simpler ds_config (no offload, and communication values left at their defaults) and see the result. Use the DeepSpeed function see_memory_usage to monitor GPU memory usage in code, or watch nvidia-smi manually to observe GPU memory usage.
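For illustration, a minimal sketch of what that simpler configuration and the memory check could look like; the batch sizes, learning rate, and other values here are assumptions, not taken from the original report:

from deepspeed.runtime.utils import see_memory_usage

# A deliberately simple DeepSpeed config: fp16, ZeRO stage 2, no offload,
# communication settings left at their defaults (all values are placeholders)
conf = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# After building the engine with this conf, print current/peak GPU memory at
# interesting points, e.g. right after initialization and after the first backward pass
see_memory_usage("after deepspeed.initialize", force=True)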
Hi guys, I'm not reporting a bug here, but asking for your advice.
I have a 6B model (chatglm1-6b) and a machine with 256GB of CPU memory and 4 GPUs (3090s, each with 24GB of GPU memory). I want to fine-tune a LoRA adapter on top of the original model (the trainable parameters are about 0.5% of the total, as reported in the training logs).
With one, two, or four 3090 GPUs, training always reports CUDA out of memory during back-propagation. The full logs (with 4 GPUs) are below.
What I want to know is whether there is any DeepSpeed configuration that could support fine-tuning this 6B model on the current hardware (256GB of CPU memory and four 3090 GPUs with 24GB of VRAM each), or whether this is simply not achievable with such hardware.
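For illustration, the kind of configuration being asked about would typically enable ZeRO stage 3 with CPU offload to make use of the large CPU memory. The sketch below uses placeholder values and is not the configuration actually used in this report:

# Sketch of a ZeRO stage 3 config with CPU offload of optimizer state and parameters;
# all numbers are placeholders, not the configuration used in this report
conf = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}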
Below is the DeepSpeed configuration I used and the logs reported by DeepSpeed.
Full logs from DeepSpeed: