Open Zarc98 opened 1 year ago
You can get parallelism by placing the model and the data on different devices; you can refer to this implementation: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel
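For reference, here is a minimal sketch of the same idea using transformers + accelerate device_map sharding instead of the manual split in the repo linked above; the model name, GPU count and memory budget below are illustrative assumptions:

```python
# Minimal sketch: let `accelerate` shard ChatGLM-6B across the visible GPUs so
# each GPU only holds part of the model (naive model parallelism).
from transformers import AutoModel, AutoTokenizer

model_name = "THUDM/chatglm-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",                    # requires `accelerate`; splits layers across GPUs
    max_memory={0: "10GiB", 1: "10GiB"},  # optional per-GPU budget (illustrative)
)

# Inputs go to the device holding the first shard; accelerate moves the
# intermediate activations between GPUs during the forward pass.
inputs = tokenizer("你好", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```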
My GPU is a model on which half precision is much faster than single precision, but training with fp16=True doesn't seem to get any faster. Do I need to set other parameters as well?
training_chatglm_csc_demo.py:102: model.train_model(args.train_file, args={'fp16': True})
I'm still working on this. At the moment fp16 training only reduces GPU memory usage; it does not speed training up.
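For comparison, the pattern that actually yields a speedup is mixed-precision training with torch.cuda.amp; below is a minimal hedged sketch with a placeholder model, not the current textgen implementation:

```python
# Illustrative torch.cuda.amp mixed-precision loop: autocast runs the forward
# pass in fp16 so tensor cores can be used, GradScaler rescales the loss so the
# fp16 gradients don't underflow.
import torch

model = torch.nn.Linear(1024, 1024).cuda()      # placeholder for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")
    with torch.cuda.amp.autocast():             # fp16 forward and loss
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()               # backward on the scaled loss
    scaler.step(optimizer)                      # unscales grads, then optimizer step
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```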
Nice. I also tried to get int8 working; after a few tweaks I'm still stuck at AttributeError: 'CastOutputToFloat' object has no attribute 'weight'
mambaforge/lib/python3.10/site-packages/peft/utils/other.py:75 in prepare_model_for_int8_training

   72     if hasattr(model, output_embedding_layer_name):
   73         output_embedding_layer = getattr(model, output_embedding_layer_name)
   74         print(f"debug: {output_embedding_layer}")
❱  75         input_dtype = output_embedding_layer.weight.dtype
   76
   77         class CastOutputToFloat(torch.nn.Sequential):
   78             r"""
Printing output_embedding_layer gives:
debug: CastOutputToFloat(
(0): Linear(in_features=4096, out_features=130528, bias=False)
)
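The printout shows that output_embedding_layer is already the CastOutputToFloat wrapper (a torch.nn.Sequential), which has no .weight of its own; the Linear layer is its first child. One hedged guess at a workaround, assuming the wrapping happened before this call (for example because prepare_model_for_int8_training ran twice):

```python
# Assumption, not verified: lm_head was already wrapped in peft's
# CastOutputToFloat(nn.Sequential), so the wrapper has no .weight attribute;
# the original Linear sits at index 0. Unwrap it before calling
# prepare_model_for_int8_training again.
import torch.nn as nn

lm_head = getattr(model, "lm_head", None)
if isinstance(lm_head, nn.Sequential) and not hasattr(lm_head, "weight"):
    model.lm_head = lm_head[0]   # restore the bare Linear so .weight.dtype is readable
```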
Below are a few small changes; only after applying them could I even get as far as the error above.
diff --git a/textgen/chatglm/chatglm_model.py b/textgen/chatglm/chatglm_model.py
index fab945b..dff559e 100644
--- a/textgen/chatglm/chatglm_model.py
+++ b/textgen/chatglm/chatglm_model.py
@@ -103,11 +103,13 @@ class ChatGlmModel:
model_name,
config=config,
trust_remote_code=True,
+ device_map='auto',
load_in_8bit=self.args.int8,
)
- if self.args.fp16:
- self.model.half()
- self.model.to(self.device)
+ if not self.args.int8:
+ if self.args.fp16:
+ self.model.half()
+ self.model.to(self.device)
if self.args.quantization_bit:
logger.debug(f"Quantized to {self.args.quantization_bit} bit")
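For context, this is how load_in_8bit, device_map='auto' and peft are usually combined for int8 LoRA fine-tuning; a sketch under assumptions (it requires bitsandbytes, accelerate and peft), not the patched textgen code:

```python
# Typical int8 + LoRA setup for ChatGLM-6B (illustrative).
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map="auto",        # needed so bitsandbytes can place the int8 weights
    load_in_8bit=True,
)
# Do NOT call model.half() or model.to(device) after load_in_8bit --
# exactly what the diff above guards against.
model = prepare_model_for_int8_training(model)   # freeze int8 weights, cast norms/output to fp32
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],          # ChatGLM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```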
The transformers docs also seem to say that fp16 mainly saves memory at large batch sizes, and that getting an actual speedup puts strict requirements on the model. https://huggingface.co/docs/transformers/v4.13.0/en/performance "So there is only a real memory saving if we train at a high batch size (and it's not half) and at batch sizes lower than 8, you actually get a bigger memory footprint (because of the overhead mentioned above). The gain for FP16 training is that in each of those cases, the training with the flag --fp16 is twice as fast, which does require every tensor to have every dimension be a multiple of 8 (examples pad the tensors to a sequence length that is a multiple of 8)."
Also, I tried setting the batch size to 4, which gave roughly a 20% speedup (on a 32 GB V100); setting it to 8 gave about another 5%, but then training no longer produced useful results.
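Following the docs' point that every tensor dimension should be a multiple of 8, one hedged way to enforce this is padding at tokenization time; whether ChatGLM's custom tokenizer honors pad_to_multiple_of is an assumption worth checking:

```python
# Pad sequences up to a multiple of 8 so fp16 matmuls can hit tensor cores
# (illustrative; verify the ChatGLM tokenizer respects pad_to_multiple_of).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
batch = tokenizer(
    ["对下面的句子进行拼写纠错:", "少先队员因该为老人让坐。"],
    padding=True,
    pad_to_multiple_of=8,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # sequence length is padded to a multiple of 8
```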
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Closed automatically by the bot due to inactivity; feel free to open a new issue if needed.)