ccp123456789 commented 10 months ago

Reminder

[X] I have read the README and searched the existing issues.

Reproduction

Expected behavior

No response

System Info

No response

Others

No response

hiyouga commented 10 months ago

This is an NCCL issue.

dumpmemory commented 10 months ago

This is an NCCL issue.

I am seeing something similar here. In the same environment, the LLaMA series trains normally.

dumpmemory commented 10 months ago

From what I have seen so far, QLoRA training appears to work, but switching to full-parameter fine-tuning triggers the problem.

ccp123456789 commented 10 months ago

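This is an NCCL issue.

It is an NCCL problem, but fine-tuning Baichuan and Qwen in the same way does not trigger it. Could the Mixtral model itself be the cause?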

ccp123456789 commented 10 months ago

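From what I have seen so far, QLoRA training appears to work, but switching to full-parameter fine-tuning triggers the problem.

By QLoRA, do you mean single-GPU training? My failure occurs with multi-GPU DeepSpeed + LoRA.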

dumpmemory commented 10 months ago

By QLoRA, do you mean single-GPU training? My failure occurs with multi-GPU DeepSpeed + LoRA.

The QLoRA runs I observed were also multi-GPU.

haochen2115 commented 10 months ago

Same here: full-parameter training of Mixtral hits an NCCL error (TnT), while other models all run fine.

dumpmemory commented 10 months ago

I have been struggling with this for 2 days 😭

dumpmemory commented 10 months ago

After updating NCCL to 2.19.3 with the NVIDIA 23.10 container, it has been running for 7 hours!

bao-xiaoyi commented 10 months ago

after update nccl to 2.19.3 with nvidia 23-10, it was running 7 hours !

I don't understand this sentence. What is "nvidia 23-10"? Is 7 hours how long your run has gone successfully?

dumpmemory commented 10 months ago

I don't understand this sentence. What is "nvidia 23-10"? Is 7 hours how long your run has gone successfully?

Previously it hung at around the 1:30 mark. After updating to the 23.10 release of the NVIDIA PyTorch container, which ships NCCL 2.19.3, it runs normally.

hiyouga commented 10 months ago

@dumpmemory Could you share your settings?

dumpmemory commented 10 months ago

@dumpmemory Could you share your settings?

Sure.

docker nvcr.io/nvidia/pytorch:23.10-py3

transformers version: 4.36.1
Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.19.4
Safetensors version: 0.4.1
Accelerate version: 0.23.0
Accelerate config: not found
PyTorch version (GPU?): 2.1.0a0+32f93b1 (True)
deepspeed 0.12.3

bao-xiaoyi commented 10 months ago

Did full SFT succeed with this environment? How many resources did you use?

dumpmemory commented 10 months ago

Did full SFT succeed with this environment? How many resources did you use?

Yes. I can't disclose the resources.

dumpmemory commented 10 months ago

I've run into a problem: I have already updated NCCL to 2.19.3 (the network repository version seems to lag behind, so I downloaded and installed it manually), but when DeepSpeed prints the NCCL version at startup it still shows 2.7.8. After your update, does DeepSpeed print the correct version?

Reinstall it.
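For anyone hitting the same version mismatch, here is a minimal sketch for checking which NCCL version PyTorch itself reports. It rests on an assumption that is not confirmed in this thread: that the version DeepSpeed prints at startup comes from the NCCL library PyTorch was built or packaged with, rather than from a separately installed system NCCL. The helper name `nccl_version_seen_by_torch` is made up for illustration.

```python
def nccl_version_seen_by_torch():
    """Return the NCCL version PyTorch reports as a string, or None if unavailable."""
    try:
        import torch.cuda.nccl as nccl  # only works when PyTorch is installed
    except ImportError:
        return None
    try:
        version = nccl.version()
    except Exception:
        # PyTorch builds without NCCL support cannot report a version.
        return None
    # Recent PyTorch returns a tuple such as (2, 19, 3); older releases returned an int.
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version)
    return str(version)


if __name__ == "__main__":
    print("NCCL version reported by PyTorch:", nccl_version_seen_by_torch())
```

If this still prints the old version after a manual NCCL upgrade, the PyTorch installation is probably not picking up the new library, which would be consistent with the advice to reinstall.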

bao-xiaoyi commented 10 months ago

@dumpmemory I replicated your environment, and it reports: misc/cudawrap.cc:33 NCCL WARN Cuda failure 3 'initialization error'

dumpmemory commented 10 months ago

@dumpmemory I replicated your environment, and it reports: misc/cudawrap.cc:33 NCCL WARN Cuda failure 3 'initialization error'

What GPU are you using? Check whether the host driver supports it.

bao-xiaoyi commented 10 months ago

What GPU are you using? Check whether the host driver supports it.

H800, using the NVIDIA 23.10 image. Could it be caused by NCCL-related environment variables? Could you share yours? Also, are you using NVLink?

dumpmemory commented 10 months ago

H800, using the NVIDIA 23.10 image. Could it be caused by NCCL-related environment variables? Could you share yours? Also, are you using NVLink?

Yes, NVLink is used. As for the environment variables, you could ask whether your engineering team has recommended settings.

hegang1-tal commented 10 months ago

@Hlw20171113 Has this NCCL problem been solved? I've hit the same issue and can't get past it.

dumpmemory commented 10 months ago

H800, using the NVIDIA 23.10 image. Could it be caused by NCCL-related environment variables? Could you share yours? Also, are you using NVLink?

NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2. This is the driver I am using.

bao-xiaoyi commented 10 months ago

@Hlw20171113 Has this NCCL problem been solved? I've hit the same issue and can't get past it.

Does yours hang, or does it report an error?

hegang1-tal commented 10 months ago

Does yours hang, or does it report an error?

Mine hangs for 30 minutes and then fails with an NCCL timeout.

bao-xiaoyi commented 10 months ago

Mine hangs for 30 minutes and then fails with an NCCL timeout.

I haven't solved it yet; I'll let you know when I have a result. Mine used to hang indefinitely, with no error even after a full day, so it may be a somewhat different problem. I'm still debugging.

dumpmemory commented 10 months ago

I haven't solved it yet; I'll let you know when I have a result. Mine used to hang indefinitely, with no error even after a full day, so it may be a somewhat different problem. I'm still debugging.

export NCCL_P2P_DISABLE=0
#export NCCL_P2P_LEVEL=PXB
export NCCL_P2P_LEVEL=NVL
export NCCL_PXN_DISABLE=0
export NCCL_NET_GDR_LEVEL=2
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=<check this yourself for your own machines>
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
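If you launch training from Python rather than a shell script, the environment variables above can be collected in one place. This is only a sketch: the values are copied from the comment above and reflect one user's cluster, not general recommendations. The helper name `build_nccl_env`, the `debug` switch and the HCA name `mlx5_0` in the usage note are assumptions made for illustration. `NCCL_DEBUG=INFO` is a standard NCCL variable that was not part of the original list; it is included as an optional aid for diagnosing hangs.

```python
import os

# Values copied from the comment above (one user's cluster settings).
NCCL_SETTINGS = {
    "NCCL_P2P_DISABLE": "0",
    "NCCL_P2P_LEVEL": "NVL",
    "NCCL_PXN_DISABLE": "0",
    "NCCL_NET_GDR_LEVEL": "2",
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_DISABLE": "0",
    "NCCL_IB_QPS_PER_CONNECTION": "4",
    "NCCL_IB_TC": "160",
    "NCCL_IB_TIMEOUT": "22",
}


def build_nccl_env(ib_hca=None, debug=False, base_env=None):
    """Return a copy of the environment with the NCCL settings applied.

    ib_hca:   InfiniBand device name(s); left unset when None, because the
              right value depends on the machine.
    debug:    when True, also set NCCL_DEBUG=INFO for verbose NCCL logs.
    base_env: mapping to start from; defaults to the current os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(NCCL_SETTINGS)
    if ib_hca:
        env["NCCL_IB_HCA"] = ib_hca
    if debug:
        env["NCCL_DEBUG"] = "INFO"
    return env
```

The resulting dictionary can be passed as `env=build_nccl_env(ib_hca="mlx5_0")` to `subprocess.run` when starting the launcher; `mlx5_0` is a placeholder, so check the actual device names on your hosts.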

bao-xiaoyi commented 10 months ago

Mine hangs for 30 minutes and then fails with an NCCL timeout.

Have you solved it? I tried the environment variables from the comment above and still can't fix the problem. With ZeRO-2, memory usage explodes (even with a small dataset), and with ZeRO-3 it hangs during DDP.

CEDIDataVault commented 10 months ago

Why was this issue closed? Has anyone solved it? I also see the Mixtral model hang: no error is reported and the loss is never computed. Other models all work fine.

bao-xiaoyi commented 10 months ago

The current situation: before the loss is computed, it reports "Invalidate trace cache @ step 738: expected module 752, but got module 784". Then the InfiniBand NIC traffic drops to 0 and training hangs from there.

bao-xiaoyi commented 10 months ago

Regarding the environment variables above: are you using ZeRO-2? With ZeRO-2 my memory usage blows up immediately.

CEDIDataVault commented 10 months ago

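The current situation: before the loss is computed, it reports "Invalidate trace cache @ step 738: expected module 752, but got module 784". Then the InfiniBand NIC traffic drops to 0 and training hangs from there.

I ran into this problem earlier too, so we may be seeing the same thing. How can I add you on WeChat to discuss it in detail?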

bao-xiaoyi commented 10 months ago

I ran into this problem earlier too, so we may be seeing the same thing. How can I add you on WeChat to discuss it in detail?

I'll add you.

hegang1-tal commented 10 months ago

DeepSpeed ZeRO-3 + LoRA fine-tuning hits the same problem: it hangs.

bao-xiaoyi commented 10 months ago

https://github.com/microsoft/DeepSpeed/issues/4864#issuecomment-1869062612

@hiyouga Could you take a look at this?

dumpmemory commented 10 months ago

Are you using ZeRO-2? With ZeRO-2 my memory usage blows up immediately.

ZeRO-3, full fine-tuning.

bao-xiaoyi commented 10 months ago

Invalidate trace cache @ step 738: expected module 752, but got module 784

Don't you get "Invalidate trace cache @ step 738: expected module 752, but got module 784"? According to the DeepSpeed issue, ZeRO-3 does not seem to be supported. How did you get it to run?

bao-xiaoyi commented 9 months ago

Could you also share your DeepSpeed configuration for reference? @dumpmemory

dumpmemory commented 9 months ago

Could you also share your DeepSpeed configuration for reference? @dumpmemory

{
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 1e-6,
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
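Many entries in this configuration are set to "auto". As I understand it, these are placeholders that the Hugging Face Trainer's DeepSpeed integration fills in from its own training arguments, so the file is not complete on its own. The sketch below lists every "auto" entry in a config so you can see which values the launcher has to supply. It uses a trimmed copy of the config above, and the function name `find_auto_keys` is made up for illustration.

```python
import json

# Trimmed copy of the configuration shared above, for illustration only.
CONFIG_TEXT = """
{
  "bf16": {"enabled": true},
  "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
  "zero_optimization": {"stage": 3, "overlap_comm": true, "reduce_bucket_size": "auto"},
  "train_micro_batch_size_per_gpu": "auto"
}
"""


def find_auto_keys(node, prefix=""):
    """Return the dotted path of every entry whose value is the string "auto"."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if value == "auto":
                paths.append(path)
            else:
                # Recurse into nested sections; non-dict values yield nothing.
                paths.extend(find_auto_keys(value, path))
    return paths


config = json.loads(CONFIG_TEXT)
auto_keys = find_auto_keys(config)

if __name__ == "__main__":
    print("ZeRO stage:", config["zero_optimization"]["stage"])
    for key in auto_keys:
        print("resolved by the launcher:", key)
```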

bao-xiaoyi commented 9 months ago

Could you also share your launch command? After switching to the configuration you posted above, the loss can now be computed, but the run still times out after a number of steps.

dumpmemory commented 9 months ago

I launch with torchrun, together with the environment variables I posted above.

awzhgw commented 9 months ago

Have you solved it? I tried the environment variables from the comment above and still can't fix the problem. With ZeRO-2, memory usage explodes (even with a small dataset), and with ZeRO-3 it hangs during DDP.

Same here. ZeRO-2 blows up GPU memory, while ZeRO-3 hangs. Any solution would be appreciated.

awzhgw commented 9 months ago

I launch with torchrun, together with the environment variables I posted above.

I'm on H800 with the NVIDIA 23.12 image. When loading Mixtral 8x7B with DeepSpeed, zero2.json blows up GPU memory and zero3.json still hangs. Any suggestions?

dumpmemory commented 9 months ago

I'm on H800 with the NVIDIA 23.12 image. When loading Mixtral 8x7B with DeepSpeed, zero2.json blows up GPU memory and zero3.json still hangs. Any suggestions?

Try docker nvcr.io/nvidia/pytorch:23.10-py3

awzhgw commented 9 months ago

@dumpmemory

    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 333, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 450, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mixtral/modeling_mixtral.py", line 806, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/export/App/training_platform/PinoModel/omni-llava/llava/train/mixtral_flash_attn_monkey_patch.py", line 85, in forward
    qkv, indices, cu_q_lens, max_s = unpad_input(qkv, key_padding_mask)
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/bert_padding.py", line 118, in unpad_input
    index_first_axis(rearrange(hidden_states, "b s ... -> (b s) ..."), indices),
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/bert_padding.py", line 17, in forward
    return torch.gather(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2296.87 GiB. GPU 0 has a total capacty of 79.11 GiB of which 69.83 GiB is free. Process 3557305 has 9.06 GiB memory in use. Of the allocated memory 6.72 GiB is allocated by PyTorch, and 1.16 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%| | 0/19705 [00:10<?, ?it/s]

After switching to your parameters, why does it need about 2 TB of GPU memory?

awzhgw commented 9 months ago

Try docker nvcr.io/nvidia/pytorch:23.10-py3

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

After I used your parameters, it needs 2 TB of GPU memory. Why is that?

dumpmemory commented 9 months ago

After I used your parameters, it needs 2 TB of GPU memory. Why is that?

I have never run into your situation 😭

ftgreat commented 9 months ago

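The current situation: before the loss is computed, it reports "Invalidate trace cache @ step 738: expected module 752, but got module 784". Then the InfiniBand NIC traffic drops to 0 and training hangs from there.

I ran into this problem earlier too, so we may be seeing the same thing. How can I add you on WeChat to discuss it in detail?

Has this problem been solved? Thanks. @lkemo @bao-xiaoyi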

ftgreat commented 9 months ago

...

bao-xiaoyi commented 9 months ago

...

How much memory does it use?

ftgreat commented 9 months ago

...

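How much memory does it use?

This was purely a test; both ZeRO-2 and ZeRO-3 can run.

However, when num_experts_per_tok is changed to 2, it hangs at the first step.

The cause is unknown. The image is ngc-23.10 with a fairly recent version of DeepSpeed. For reference only.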
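One possible reading of the num_experts_per_tok observation is that top-k routing makes different GPUs invoke different experts in the same step. This is an assumption and is not confirmed anywhere in this thread. The toy sketch below only illustrates that effect in plain Python: the scores are invented, and `experts_used` is a made-up helper, not part of transformers or DeepSpeed. With top-2 routing the two simulated ranks touch disjoint sets of experts, while routing to all 8 experts makes both ranks touch the same modules.

```python
def experts_used(router_scores, top_k):
    """Return the sorted ids of experts that receive at least one token.

    router_scores: one list of per-expert scores for each token.
    top_k:         number of experts each token is routed to.
    """
    used = set()
    for token_scores in router_scores:
        ranked = sorted(
            range(len(token_scores)),
            key=lambda expert: token_scores[expert],
            reverse=True,
        )
        used.update(ranked[:top_k])
    return sorted(used)


# Invented router scores for two tokens on each of two ranks (8 experts).
rank0_scores = [
    [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0],
]
rank1_scores = [
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.8, 0.1, 0.1],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.8],
]

top2_rank0 = experts_used(rank0_scores, top_k=2)
top2_rank1 = experts_used(rank1_scores, top_k=2)
all_rank0 = experts_used(rank0_scores, top_k=8)
all_rank1 = experts_used(rank1_scores, top_k=8)

if __name__ == "__main__":
    print("top-2 experts per rank:", top2_rank0, top2_rank1)
    print("top-8 experts per rank:", all_rank0, all_rank1)
```

If ZeRO-3 expects every rank to gather the same parameters in the same order, ranks that call different experts could end up waiting on each other, which would look like a hang. Treat this as a hypothesis to check against the linked DeepSpeed issue, not as a diagnosis.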

hiyouga / LLaMA-Factory

Error when fine-tuning Mixtral with DeepSpeed #1845
