hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: use gpt2 example for custom model and dataset, report torch.distributed error. #2919

Closed: zixiliuUSC closed this issue 1 year ago

zixiliuUSC commented 1 year ago

🐛 Describe the bug

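I load GLM through the Hugging Face model interface and run the training code the same way as the gpt2 example, using a custom dataset. The dataset is plugged in as follows: I changed only this line, https://github.com/hpcaitech/ColossalAI/blob/dbc01b9c0479a6fd3fb04450b9dc01b5162d8c0d/examples/language/gpt/gemini/train_gpt_demo.py#L342, and the train_step function is unchanged. The launch script is as follows: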

set -x
# distplan in ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]
export DISTPLAN=${DISTPLAN:-"CAI_Gemini"}

# The following options are only valid when DISTPLAN="colossalai"
export GPUNUM=${GPUNUM:-2}
export TPDEGREE=${TPDEGREE:-1}
export PLACEMENT=${PLACEMENT:-"cpu"}
export USE_SHARD_INIT=${USE_SHARD_INIT:-False}
export BATCH_SIZE=${BATCH_SIZE:-4}
export MODEL_TYPE=${MODEL_TYPE:-"gpt2_medium"}
export TRAIN_STEP=${TRAIN_STEP:-10}
# export PYTHONPATH=$PWD:$PYTHONPATH

if [ ${USE_SHARD_INIT} = "True" ]; then
  USE_SHARD_INIT="--shardinit"
else
  USE_SHARD_INIT=""
fi

mkdir -p gemini_logs

torchrun --standalone --nproc_per_node=${GPUNUM} ./train_glm_demo.py \
--tp_degree=${TPDEGREE} \
--model_type=${MODEL_TYPE} \
--batch_size=${BATCH_SIZE} \
--placement=${PLACEMENT} \
${USE_SHARD_INIT} \
--distplan=${DISTPLAN} \
--train_step=${TRAIN_STEP} \
2>&1 | tee ./gemini_logs/${MODEL_TYPE}_${DISTPLAN}_gpu_${GPUNUM}_bs_${BATCH_SIZE}_tp_${TPDEGREE}_${PLACEMENT}.log
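
To make it easier to see which values the script ends up passing, here is a minimal Python sketch of the same default resolution and command assembly. The helper names (resolve, build_command, log_path) are hypothetical and not part of ColossalAI or the example; the defaults and flag order are copied from the script above. With no overrides it should produce the same torchrun command and log path that appear in the `set -x` trace below. Note that MODEL_TYPE stays at gpt2_medium even though the entry point is train_glm_demo.py, so the log file name still says gpt2_medium.

```python
import shlex

# Defaults copied from the `export VAR=${VAR:-default}` lines of the script.
DEFAULTS = {
    "DISTPLAN": "CAI_Gemini",
    "GPUNUM": "2",
    "TPDEGREE": "1",
    "PLACEMENT": "cpu",
    "USE_SHARD_INIT": "False",
    "BATCH_SIZE": "4",
    "MODEL_TYPE": "gpt2_medium",
    "TRAIN_STEP": "10",
}


def resolve(env):
    """Mimic ${VAR:-default}: use the default when VAR is unset or empty."""
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


def build_command(cfg):
    """Assemble the torchrun argument list in the same order as the script."""
    cmd = [
        "torchrun", "--standalone", f"--nproc_per_node={cfg['GPUNUM']}",
        "./train_glm_demo.py",
        f"--tp_degree={cfg['TPDEGREE']}",
        f"--model_type={cfg['MODEL_TYPE']}",
        f"--batch_size={cfg['BATCH_SIZE']}",
        f"--placement={cfg['PLACEMENT']}",
    ]
    # The script passes --shardinit only when USE_SHARD_INIT is exactly "True".
    if cfg["USE_SHARD_INIT"] == "True":
        cmd.append("--shardinit")
    cmd += [f"--distplan={cfg['DISTPLAN']}", f"--train_step={cfg['TRAIN_STEP']}"]
    return cmd


def log_path(cfg):
    """Build the log file name that the script passes to tee."""
    return (
        "./gemini_logs/{MODEL_TYPE}_{DISTPLAN}_gpu_{GPUNUM}"
        "_bs_{BATCH_SIZE}_tp_{TPDEGREE}_{PLACEMENT}.log"
    ).format(**cfg)


if __name__ == "__main__":
    cfg = resolve({})  # empty mapping: no overrides, defaults only
    print(shlex.join(build_command(cfg)))
    print(log_path(cfg))
```

Passing os.environ instead of the empty mapping would pick up any overrides exported in the shell, which is how the script itself behaves.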

The error output is as follows:

+ export DISTPLAN=CAI_Gemini
+ DISTPLAN=CAI_Gemini
+ export GPUNUM=2
+ GPUNUM=2
+ export TPDEGREE=1
+ TPDEGREE=1
+ export PLACEMENT=cpu
+ PLACEMENT=cpu
+ export USE_SHARD_INIT=False
+ USE_SHARD_INIT=False
+ export BATCH_SIZE=4
+ BATCH_SIZE=4
+ export MODEL_TYPE=gpt2_medium
+ MODEL_TYPE=gpt2_medium
+ export TRAIN_STEP=10
+ TRAIN_STEP=10
+ '[' False = True ']'
+ USE_SHARD_INIT=
+ mkdir -p gemini_logs
+ torchrun --standalone --nproc_per_node=2 ./train_glm_demo.py --tp_degree=1 --model_type=gpt2_medium --batch_size=4 --placement=cpu --distplan=CAI_Gemini --train_step=10
+ tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_2_bs_4_tp_1_cpu.log
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
environmental variable OMP_NUM_THREADS is set to 80.
environmental variable OMP_NUM_THREADS is set to 80.
[02/27/23 17:34:40] INFO     colossalai - colossalai - INFO:                                                                                      
                             /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521     
                             set_device                                                                                                           
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                  
[02/27/23 17:34:40] INFO     colossalai - colossalai - INFO:                                                                                      
                             /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521     
                             set_device                                                                                                           
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                  
[02/27/23 17:34:44] INFO     colossalai - colossalai - INFO:                                                                                      
                             /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557     
                             set_seed                                                                                                             
[02/27/23 17:34:44] INFO     colossalai - colossalai - INFO:                                                                                      
                             /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557     
                             set_seed                                                                                                             
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA:     
                             1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.                                      
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA:     
                             1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.                                      
                    INFO     colossalai - colossalai - INFO:                                                                                      
                             /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/initialize.py:116 launch            
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel     
                             size: 1, tensor parallel size: 1                                                                                     
                    INFO     colossalai - colossalai - INFO: /home/liuzixi01/colossal-example/glm/./train_glm_demo.py:213 main                    
                    INFO     colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 4                                                
Emitting ninja build file /home/liuzixi01/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o 
[2/2] c++ cpu_adam.o -shared -L/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 57.027788400650024 seconds
=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzixi01/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF colossal_C_frontend.o.d -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp -o colossal_C_frontend.o 
[2/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -o multi_tensor_sgd_kernel.cuda.o 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 3; T = SGDFunctor<c10::Half, c10::Half>; ArgTypes = {float, float, float, float, bool, bool, bool, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu:139:179:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 3; T = SGDFunctor<float, float>; ArgTypes = {float, float, float, float, bool, bool, bool, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu:147:171:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 3; T = SGDFunctor<c10::Half, float>; ArgTypes = {float, float, float, float, bool, bool, bool, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu:154:175:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
[3/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 4; T = AdamFunctor<float, float>; ArgTypes = {float, float, float, float, float, float, adamMode_t, float, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu:132:392:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 4; T = AdamFunctor<float, c10::Half>; ArgTypes = {float, float, float, float, float, float, adamMode_t, float, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu:132:804:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 4; T = AdamFunctor<c10::Half, float>; ArgTypes = {float, float, float, float, float, float, adamMode_t, float, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu:132:1216:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 4; T = AdamFunctor<c10::Half, c10::Half>; ArgTypes = {float, float, float, float, float, float, adamMode_t, float, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu:132:1635:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
[4/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -o multi_tensor_scale_kernel.cuda.o 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 2; T = ScaleFunctor<float, float>; ArgTypes = {float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu:115:310:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 2; T = ScaleFunctor<float, c10::Half>; ArgTypes = {float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu:115:491:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 2; T = ScaleFunctor<c10::Half, float>; ArgTypes = {float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu:115:285:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 2; T = ScaleFunctor<c10::Half, c10::Half>; ArgTypes = {float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu:115:470:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
[5/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -o multi_tensor_lamb.cuda.o 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu: In function ‘void multi_tensor_lamb_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor> >, float, float, float, float, int, int, float, int, int, at::Tensor, float, c10::optional<bool>)’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:329:329: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                         ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:329:648: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:345:251: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                           ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:345:303: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                               ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:345:559: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:345:611: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 4; T = LAMBStage1Functor<float>; ArgTypes = {float, float, float, float, float, float, adamMode_t, float, float*, float}]’:

The rest of the error output is in the comments below.

Environment

Python 3.9.13, PyTorch 1.13+cu11.6, CUDA 11.6, ColossalAI 0.2.5

zixiliuUSC commented 1 year ago
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:329:345:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 4; T = LAMBStage1Functor<c10::Half>; ArgTypes = {float, float, float, float, float, float, adamMode_t, float, float*, float}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:329:664:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 2; T = LAMBStage2Functor<float>; ArgTypes = {float*, float*, float, float, bool}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:345:334:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 2; T = LAMBStage2Functor<c10::Half>; ArgTypes = {float*, float*, float, float, bool}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu:345:642:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
[6/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/TH -isystem /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/liuzixi01/.conda/envs/torch-cuda116/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu: In function ‘std::tuple<at::Tensor, at::Tensor> multi_tensor_l2norm_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor> >, c10::optional<bool>)’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:288:217: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                         ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:288:265: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                         ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:288:504: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:288:552: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:305:115: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup<<<per_tensor ? ntensors : 1, 512, 0, stream>>>(
                                                                                                                   ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:305:163: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup<<<per_tensor ? ntensors : 1, 512, 0, stream>>>(
                                                                                                                                                                   ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:305:196: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup<<<per_tensor ? ntensors : 1, 512, 0, stream>>>(
                                                                                                                                                                                                    ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:305:241: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup<<<per_tensor ? ntensors : 1, 512, 0, stream>>>(
                                                                                                                                                                                                                                                 ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu: In function ‘void multi_tensor_norm_out_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor> >, at::Tensor, float, float, int)’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:348:218: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                          ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:348:253: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                             ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:348:475: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:348:510: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:355:217: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                         ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:355:252: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                            ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:355:473: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:355:508: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
     DISPATCH_FLOAT_AND_HALF(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:376:101: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup_v2<<<ntensors, 512, 0, stream>>>(
                                                                                                     ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:376:136: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup_v2<<<ntensors, 512, 0, stream>>>(
                                                                                                                                        ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:376:157: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup_v2<<<ntensors, 512, 0, stream>>>(
                                                                                                                                                             ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:376:178: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   cleanup_v2<<<ntensors, 512, 0, stream>>>(
                                                                                                                                                                                  ^
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 1; T = L2NormFunctor<float>; ArgTypes = {float*, float*, bool, int}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:288:313:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 1; T = L2NormFunctor<c10::Half>; ArgTypes = {float*, float*, bool, int}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:288:600:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 1; T = MaxNormFunctor<float>; ArgTypes = {float*, float*, bool, int}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:348:283:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh: In instantiation of ‘void multi_tensor_apply(int, int, const at::Tensor&, const std::vector<std::vector<at::Tensor> >&, T, ArgTypes ...) [with int depth = 1; T = MaxNormFunctor<c10::Half>; ArgTypes = {float*, float*, bool, int}]’:
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:348:540:   required from here
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:104:150: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
                 multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
                                                                                                                                                      ^ 
/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
   T * data() const {
 ^ ~~
[7/7] c++ colossal_C_frontend.o multi_tensor_sgd_kernel.cuda.o multi_tensor_scale_kernel.cuda.o multi_tensor_adam.cuda.o multi_tensor_l2norm_kernel.cuda.o multi_tensor_lamb.cuda.o -shared -L/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_optim.so
Loading extension module fused_optim...
Time to load fused_optim op: 49.78874206542969 seconds
Loading extension module fused_optim...
searching chunk configuration is completed in 3.64 s.
used number: 9421.95 MB, wasted number: 56.38 MB
total wasted percentage is 0.59%
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 317 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 318) of binary: /home/liuzixi01/.conda/envs/torch-cuda116/bin/python3.9
Traceback (most recent call last):
  File "/home/liuzixi01/.conda/envs/torch-cuda116/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuzixi01/.conda/envs/torch-cuda116/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
./train_glm_demo.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-27_17:40:06
  host      : gzailab-liuzixi01-colossalai-0
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 318)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 318
====================================================
Gy-Lu commented 1 year ago

Hi, could you set num_workers=0 in your dataloader and then run it again? If it fails again, then try with a single GPU to locate the error.
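
As a rough sketch of the first suggestion, this is what a dataloader with `num_workers=0` could look like. The `TensorDataset` below is only a hypothetical stand-in for the custom dataset in the issue, and the shapes and batch size are assumptions chosen for illustration; it is not the code from `train_glm_demo.py`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the custom dataset: 16 samples of 8 token ids each.
input_ids = torch.randint(0, 100, (16, 8))
attention_mask = torch.ones_like(input_ids)
dataset = TensorDataset(input_ids, attention_mask)

# num_workers=0 loads every batch in the main process, so no worker
# subprocesses are spawned and dataset errors surface in the main traceback.
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batches = list(loader)
print(len(batches), batches[0][0].shape)
```

For the single-GPU run, the launch script in the issue passes `GPUNUM` to `--nproc_per_node`, so setting `GPUNUM=1` in the environment before calling that script should be enough.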

zixiliuUSC commented 1 year ago

I resolved this problem by reinstalling, so I will close the issue. Thanks.

gg22mm commented 1 year ago

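What is the dataset format?

In other frameworks I have seen, it looks like this:

{ "prompt": "The fan on my Great Wall H3 does not spin. The relay is fine, the fuse is fine, and the sensor and the fan are both new. Why is this? The relay is just missing a signal wire", "response": "Can you read the data stream with a computer? What is the water temperature?", "history": [] }
{ "prompt": "95", "response": "How is the temperature difference between the upper and lower water pipes? Has all the air been bled out?", "history": [ ["The fan on my Great Wall H3 does not spin. The relay is fine, the fuse is fine, and the sensor and the fan are both new. Why is this? The relay is just missing a signal wire", "Can you read the data stream with a computer? What is the water temperature?"] ] }
{ "prompt": "Yes. The upper and lower water pipes are both fine", "response": "Then you need to check the wiring. The fan relay is normally switched by the computer, so if there is an open circuit or the computer is faulty, the relay will not engage!", "history": [ ["The fan on my Great Wall H3 does not spin. The relay is fine, the fuse is fine, and the sensor and the fan are both new. Why is this? The relay is just missing a signal wire", "Can you read the data stream with a computer? What is the water temperature?"], ["95", "How is the temperature difference between the upper and lower water pipes? Has all the air been bled out?"] ] }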
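
A minimal sketch of how records in this shape could be read, assuming they are stored one JSON object per line. The field names `prompt`, `response`, and `history` come from the sample above; the `build_dialogue` helper, the `user`/`assistant` labels, and the shortened records are hypothetical and only for illustration. This is not ColossalAI's own data loader.

```python
import json

# Shortened, made-up records in the quoted format, one JSON object per line.
raw = """\
{"prompt": "What is the water temperature?", "response": "95", "history": []}
{"prompt": "Is the relay fine?", "response": "Yes", "history": [["What is the water temperature?", "95"]]}
"""


def build_dialogue(record):
    """Flatten one record into a list of (speaker, text) turns, oldest first."""
    turns = []
    # Each history entry is a [prompt, response] pair from an earlier round.
    for past_prompt, past_response in record["history"]:
        turns.append(("user", past_prompt))
        turns.append(("assistant", past_response))
    # The current round goes last.
    turns.append(("user", record["prompt"]))
    turns.append(("assistant", record["response"]))
    return turns


records = [json.loads(line) for line in raw.splitlines() if line.strip()]
dialogues = [build_dialogue(r) for r in records]
print(len(dialogues[0]), len(dialogues[1]))
```

Each dialogue could then be tokenized into whatever tensors the training step expects; that part depends on the model and is left out here.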
