shibing624 / textgen

TextGen: Implementation of text generation models, including LLaMA, ChatGLM, BLOOM, GPT2, Seq2Seq, BART, T5, SongNet, UDA and more, with ready-to-use training and prediction.
Apache License 2.0

When will chatglm with lora support multi-GPU fine-tuning? #15

Open Zarc98 opened 1 year ago

shibing624 commented 1 year ago

Place the model and the data on different devices and you can run in parallel; you can refer to this implementation: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel
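
A minimal sketch of that idea using plain transformers/accelerate (not textgen's own code; the model name and the 'auto' placement are assumptions), which shards the layers across all visible GPUs:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map="auto",   # accelerate spreads the layers over every visible GPU
    torch_dtype="auto",
)
# With device_map set, do not call model.to(device): accelerate has already placed
# each submodule, and its hooks move activations between devices during forward.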

bash99 commented 1 year ago

Place the model and the data on different devices and you can run in parallel; you can refer to this implementation: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel

My GPU is a model on which half precision is much faster than single precision, but with fp16=true the training speed doesn't seem to improve. Do I need to set any other parameters?

training_chatglm_csc_demo.py: 102 model.train_model(args.train_file, args={'fp16': True})

shibing624 commented 1 year ago

I'm still working on this; at the moment fp16 training only reduces GPU memory usage and doesn't give any speedup.
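
For reference, the standard way to turn fp16 into an actual speedup in plain PyTorch is automatic mixed precision (autocast plus a GradScaler) rather than only calling model.half(); a minimal sketch outside the textgen trainer, where model, optimizer and dataloader are assumed to already exist:

import torch

scaler = torch.cuda.amp.GradScaler()        # keeps fp32 master weights numerically stable

for batch in dataloader:                    # model/optimizer/dataloader assumed to exist
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # matmuls run in fp16, reductions stay in fp32
        loss = model(**batch).loss
    scaler.scale(loss).backward()           # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()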

bash99 commented 1 year ago

I'm still working on this; at the moment fp16 training only reduces GPU memory usage and doesn't give any speedup.

Nice. I also tried tinkering with int8; after a few changes I'm still stuck at AttributeError: 'CastOutputToFloat' object has no attribute 'weight'

mambaforge/lib/python3.10/site-packages/peft/utils/other.py:75 in prepare_model_for_int8_training

    72     if hasattr(model, output_embedding_layer_name):
    73         output_embedding_layer = getattr(model, output_embedding_layer_name)
    74         print(f"debug: {output_embedding_layer}")
❱   75         input_dtype = output_embedding_layer.weight.dtype
    76
    77         class CastOutputToFloat(torch.nn.Sequential):
    78             r"""
Printing output_embedding_layer gives this:
debug: CastOutputToFloat(
  (0): Linear(in_features=4096, out_features=130528, bias=False)
)
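
Since CastOutputToFloat is an nn.Sequential whose first child is the real Linear, the error likely means lm_head has already been wrapped (for example prepare_model_for_int8_training running twice). One possible workaround sketch (untested; the attribute name lm_head is an assumption) is to read the dtype one level down before the failing line:

output_embedding_layer = getattr(model, "lm_head")      # assumed output-embedding attribute
if hasattr(output_embedding_layer, "weight"):
    input_dtype = output_embedding_layer.weight.dtype
else:
    # already wrapped in CastOutputToFloat (an nn.Sequential), so read the inner Linear
    input_dtype = output_embedding_layer[0].weight.dtype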

Below are a few small changes; only after applying them do I get as far as the error above.

diff --git a/textgen/chatglm/chatglm_model.py b/textgen/chatglm/chatglm_model.py
index fab945b..dff559e 100644
--- a/textgen/chatglm/chatglm_model.py
+++ b/textgen/chatglm/chatglm_model.py
@@ -103,11 +103,13 @@ class ChatGlmModel:
             model_name,
             config=config,
             trust_remote_code=True,
+            device_map='auto',
             load_in_8bit=self.args.int8,
         )
-        if self.args.fp16:
-            self.model.half()
-        self.model.to(self.device)
+        if not self.args.int8:
+            if self.args.fp16:
+                self.model.half()
+            self.model.to(self.device)

         if self.args.quantization_bit:
             logger.debug(f"Quantized to {self.args.quantization_bit} bit")
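
For completeness, the load path this patch is aiming at usually looks roughly like the following with plain transformers + peft (a sketch using their public APIs rather than textgen's wrapper; the LoRA hyperparameters and target_modules are placeholder assumptions):

from transformers import AutoModel
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map="auto",
    load_in_8bit=True,                           # 8-bit weights via bitsandbytes
)
model = prepare_model_for_int8_training(model)   # casts norms/lm_head, enables input grads
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],          # ChatGLM attention projection; an assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# No model.to(device) here: device_map plus load_in_8bit already decide placement.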
bash99 commented 1 year ago

I'm still working on this; at the moment fp16 training only reduces GPU memory usage and doesn't give any speedup.

The transformers docs also seem to say that fp16 only saves memory at large batch sizes, and that getting a speedup puts strict requirements on the model. https://huggingface.co/docs/transformers/v4.13.0/en/performance "So there is only a real memory saving if we train at a high batch size (and it's not half) and at batch sizes lower than 8, you actually get a bigger memory footprint (because of the overhead mentioned above). The gain for FP16 training is that in each of those cases, the training with the flag --fp16 is twice as fast, which does require every tensor to have every dimension be a multiple of 8 (examples pad the tensors to a sequence length that is a multiple of 8)."

Also, I tried changing the batch size to 4 and got roughly a 20% speedup (V100 with 32 GB of memory); going to 8 gives another ~5% but then training no longer produces useful results.
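
On the "every dimension a multiple of 8" condition quoted above, the usual transformers-side knob is to pad each batch to a multiple of 8, e.g. in the data collator (a sketch, independent of textgen's own data pipeline):

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# Padding every batch to a multiple of 8 lets the fp16 matmuls hit tensor cores.
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, pad_to_multiple_of=8)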

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Due to prolonged inactivity, the bot has automatically closed this issue; feel free to ask again if needed.)