k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

A problem installing k2 #569

Closed shanguanma closed 3 years ago

shanguanma commented 3 years ago

I am installing k2 on another server and encountered an error during installation. The install steps are as follows:

$ conda create -n k2-fsa python=3.7
$ conda activate k2-fsa
$ conda install pytorch==1.7.1 cudatoolkit=10.1 -c pytorch
$ conda install -c pytorch torchaudio

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build

$ cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..
$ make _k2
$ cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

The CMake log is as follows:

-- The CUDA compiler identification is NVIDIA 10.1.168
-- The CXX compiler identification is GNU 7.4.0
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: Ubuntu 18.04.2 LTS
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now

-- Could NOT find Valgrind (missing: Valgrind_INCLUDE_DIR Valgrind_EXECUTABLE) 
-- Downloading pybind11
-- pybind11 is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.1
-- Found CUDNN: /usr/local/cuda/cudnn/lib64  
-- Found cuDNN: v7.6.0  (include: /usr/local/cuda/cudnn/include, library: /usr/local/cuda/cudnn/lib64)
-- Autodetected CUDA architecture(s):  7.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
-- Found Torch: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.7.1
-- PyTorch cuda version: 10.1
-- Downloading cub
-- cub is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home/users/ntu/tlvu/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home/users/ntu/tlvu/k2-fsa/k2/build

Then I ran make _k2, and the error is as follows:

[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/utils.cu.o
[ 74%] Building CUDA object k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o
[ 77%] Linking CUDA device code CMakeFiles/context.dir/cmake_device_link.o
[ 77%] Linking CUDA shared library ../../lib/libk2context.so
/usr/bin/ld: cannot find -lCUDA_cublas_LIBRARY-NOTFOUND
/usr/bin/ld: cannot find /usr/local/cuda/cudnn/lib64: File format not recognized
collect2: error: ld returned 1 exit status
k2/csrc/CMakeFiles/context.dir/build.make:525: recipe for target 'lib/libk2context.so' failed
make[3]: *** [lib/libk2context.so] Error 1
CMakeFiles/Makefile2:706: recipe for target 'k2/csrc/CMakeFiles/context.dir/all' failed
make[2]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
CMakeFiles/Makefile2:2210: recipe for target 'k2/python/csrc/CMakeFiles/_k2.dir/rule' failed
make[1]: *** [k2/python/csrc/CMakeFiles/_k2.dir/rule] Error 2
Makefile:727: recipe for target '_k2' failed
make: *** [_k2] Error 2
csukuangfj commented 3 years ago

-D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/"

---> change to

-D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" 
danpovey commented 3 years ago

I don't think it's about CUDNN but about CUBLAS. Don't you have to tell it the root of the whole CUDA toolkit? I forget the variable name.

csukuangfj commented 3 years ago

CUDNN_LIBRARY_PATH expects a .so filename, not a directory. That is why ld complains:

/usr/bin/ld: cannot find /usr/local/cuda/cudnn/lib64: File format not recognized
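
The distinction above (a shared-library file versus the directory that contains it) can be checked before configuring. The following is a minimal sketch in Python that assumes nothing about k2's build scripts; `check_library_path` is a hypothetical helper and is not part of k2 or CMake.

```python
import os
import tempfile


def check_library_path(path):
    """Classify a path intended for a CMake *_LIBRARY variable (hypothetical helper)."""
    if os.path.isdir(path):
        return "directory"  # a directory cannot be linked as a library
    if os.path.isfile(path):
        return "file"  # a regular file, or a symlink that resolves to one
    return "missing"  # nothing usable at this path (includes dangling symlinks)


if __name__ == "__main__":
    # Temporary stand-ins for .../cudnn/lib64 and .../cudnn/lib64/libcudnn.so
    with tempfile.TemporaryDirectory() as lib_dir:
        lib_file = os.path.join(lib_dir, "libcudnn.so")
        open(lib_file, "wb").close()
        print(check_library_path(lib_dir))
        print(check_library_path(lib_file))
```

A result of "directory" for the value passed as CUDNN_LIBRARY_PATH would correspond to the situation that produced the "File format not recognized" message.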

csukuangfj commented 3 years ago

I forget the variable name

Do you mean

-DCUDA_TOOLKIT_ROOT="/usr/local/cuda"
csukuangfj commented 3 years ago

For the second error:

/usr/bin/ld: cannot find -lCUDA_cublas_LIBRARY-NOTFOUND

Please use

-D CUDA_cublas_LIBRARY="/path/to/libcublas.so"

In general, you do not need to specify so many values for cmake. CMake can figure it out.
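
One way to find a candidate for /path/to/libcublas.so is to search by file name rather than by file content. The sketch below is only an illustration: `find_libraries` is a hypothetical helper, and the search root /usr/local is an assumption about where CUDA is installed on a given machine. The library chosen should presumably belong to the same CUDA version as the nvcc used for the build.

```python
import fnmatch
import os


def find_libraries(root, pattern="libcublas.so*"):
    """Return sorted paths under root whose file name matches pattern."""
    matches = []
    # os.walk skips directories it cannot read by default (onerror=None),
    # so "Permission denied" entries are silently ignored.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # "/usr/local" is an assumed search root; adjust it for the machine at hand.
    for path in find_libraries("/usr/local"):
        print(path)
```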

shanguanma commented 3 years ago

For the second error:

/usr/bin/ld: cannot find -lCUDA_cublas_LIBRARY-NOTFOUND

Please use

-D CUDA_cublas_LIBRARY="/path/to/libcublas.so"

In general, you do not need to specify so many values for cmake. CMake can figure it out.

If I don't specify the cuDNN path, CMake can't find it, because cuDNN is not in the default location on the server cluster.

The CUDA and cuDNN paths on the server cluster:

$ ls /usr/local/cuda
LICENSE  README  bin  compat  cudnn  doc  extras  include  lib64  nvml  nvvm  share  src  targets  version.txt
$ ls /usr/local/cuda/cudnn/*       
/usr/local/cuda/cudnn/doc:
libcudnn7  libcudnn7-dev

/usr/local/cuda/cudnn/include:
cudnn.h

/usr/local/cuda/cudnn/lib64:
libcudnn.so  libcudnn.so.7  libcudnn.so.7.6.0  libcudnn_static.a  libcudnn_static_v7.a

I will follow your suggestion and try it.

shanguanma commented 3 years ago

The compile command is as follows:

$ cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

because /usr/local/cuda doesn't contain libcublas.so:

grep -rn "libcublas.so" /usr/local      
grep: /usr/local/libexec/dgx-cgroup/cgroup-classify: Permission denied
grep: /usr/local/libexec/dgx-cgroup/cgroup-remove: Permission denied
grep: /usr/local/libexec/dgx-cgroup/cgroup-create: Permission denied
grep: /usr/local/libexec/dgx-cgroup/cgroup-cleanup: Permission denied
grep: /usr/local/libexec/dgx-cgroup/common: Permission denied
/usr/local/cuda-9.0/doc/EULA.txt:1009:  Linux   : libcublas.so, libcublas_static.a, libcublas_device.a
/usr/local/cuda-9.0/doc/EULA.txt:1010:  Android : libcublas.so, libcublas_static.a, libcublas_device.a
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/libnvgraph.so.9.0.176 matches
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so matches
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcublas.so.9.0.333 matches
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/libnvblas.so.9.0.333 matches
grep: /usr/local/bin/pbs-dgx-cgroup-create: Permission denied
grep: /usr/local/bin/pbs-dgx-cleanup: Permission denied
grep: /usr/local/bin/dgx-cgroup-create: Permission denied
grep: /usr/local/bin/dgx-cgroup-remove: Permission denied
grep: /usr/local/bin/dgx-cgroup-classify: Permission denied
grep: /usr/local/bin/dgx-docker-cleanup: Permission denied
grep: /usr/local/bin/pam-sshd-attach: Permission denied
grep: /usr/local/bin/dgx-cgroup-cleanup: Permission denied
grep: /usr/local/etc/dgx-cgroup: Permission denied
grep: /usr/local/sbin/docker-log: Permission denied
grep: /usr/local/sbin/pbs-move-undelivered: Permission denied
grep: /usr/local/sbin/node-load: Permission denied
grep: /usr/local/sbin/purge-log: Permission denied
grep: /usr/local/sbin/cleanup-tmp: Permission denied
/usr/local/cuda-10.1/doc/EULA.txt:649:libcublas.so, libcublasLt.so, libcublas_static.a,
/usr/local/cuda-10.1/doc/EULA.txt:654:libcublas.so, libcublasLt.so, libcublas_static.a,
/usr/local/cuda-8.0/doc/EULA.txt:535:  Linux   : libcublas.so, libcublas_static.a, libcublas_device.a
/usr/local/cuda-8.0/doc/EULA.txt:536:  Android : libcublas.so, libcublas_static.a, libcublas_device.a
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libnvblas.so.8.0.61 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0.61 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libnvgraph.so.8.0.61 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0.88 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libnvblas.so.8.0.88 matches

When I run make _k2, the error is as follows:

[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/utils.cu.o
[ 74%] Building CUDA object k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o
make[3]: *** No rule to make target '/usr/local/cuda/cudnn/lib64/libcudnn.so', needed by 'k2/csrc/CMakeFiles/context.dir/cmake_device_link.o'.  Stop.
CMakeFiles/Makefile2:706: recipe for target 'k2/csrc/CMakeFiles/context.dir/all' failed
make[2]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
CMakeFiles/Makefile2:2210: recipe for target 'k2/python/csrc/CMakeFiles/_k2.dir/rule' failed
make[1]: *** [k2/python/csrc/CMakeFiles/_k2.dir/rule] Error 2
Makefile:727: recipe for target '_k2' failed
make: *** [_k2] Error 2
danpovey commented 3 years ago

Does /usr/local/cuda/cudnn/lib64/libcudnn.so exist?

shanguanma commented 3 years ago

/usr/local/cuda/cudnn/lib64/libcudnn.so exists?

yes,

ls /usr/local/cuda/cudnn/lib64/libcudnn.so
/usr/local/cuda/cudnn/lib64/libcudnn.so
shanguanma commented 3 years ago
$ ls /usr/local/cuda/cudnn/lib64/*
/usr/local/cuda/cudnn/lib64/libcudnn.so    /usr/local/cuda/cudnn/lib64/libcudnn.so.7.6.0  /usr/local/cuda/cudnn/lib64/libcudnn_static_v7.a
/usr/local/cuda/cudnn/lib64/libcudnn.so.7  /usr/local/cuda/cudnn/lib64/libcudnn_static.a
danpovey commented 3 years ago

Run ls -l; it may be a permission problem or a dangling symlink.
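
The same two possibilities (permissions and a dangling symlink) can also be checked programmatically. This is a rough sketch, not something k2 provides; `diagnose` is a hypothetical helper name.

```python
import os


def diagnose(path):
    """Describe why a path may be unusable as a build input (hypothetical helper)."""
    if os.path.islink(path) and not os.path.exists(path):
        # The link itself exists, but its target does not.
        return "dangling symlink -> " + os.readlink(path)
    if not os.path.lexists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "not readable"
    return "ok -> " + os.path.realpath(path)


if __name__ == "__main__":
    # The path reported in the make error earlier in this thread.
    print(diagnose("/usr/local/cuda/cudnn/lib64/libcudnn.so"))
```

A plain ls can list a dangling symlink without complaint, which is one reason ls -l (showing the link target) is more informative.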

csukuangfj commented 3 years ago

What is the output of

cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

You only posted the compilation log, without the configuration log.

shanguanma commented 3 years ago

What is the output of

cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

You only posted the compilation log, without the configuration log.

yes, it is as follows:

$  cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

-- The CUDA compiler identification is NVIDIA 10.1.168
-- The CXX compiler identification is GNU 7.4.0
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: Ubuntu 18.04.2 LTS
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now

-- Could NOT find Valgrind (missing: Valgrind_INCLUDE_DIR Valgrind_EXECUTABLE) 
-- Downloading pybind11
-- pybind11 is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.1
-- Found CUDNN: /usr/local/cuda/cudnn/lib64/libcudnn.so  
-- Found cuDNN: v7.6.0  (include: /usr/local/cuda/cudnn/include, library: /usr/local/cuda/cudnn/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
-- Found Torch: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.7.1
-- PyTorch cuda version: 10.1
-- Downloading cub
-- cub is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home/users/ntu/tlvu/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home/users/ntu/tlvu/k2-fsa/k2/build
shanguanma commented 3 years ago

Do ls -l, may be permission or dangling soft link problem

Yes, maybe the problem is there. As far as I know, k2 currently supports only CUDA 10.1 and 10.2. Can k2 support more CUDA versions, e.g. CUDA 10.0 or 9.2? I don't know if there is such a plan.

csukuangfj commented 3 years ago

We only check that k2 is compiled with the same CUDA version that PyTorch is using.

You can try k2 with cuda 10.0 or 9.2. It may work but I think it has not been tested.
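
The version match described above can be checked by hand before building. The sketch below is an illustration, not k2's actual check: `parse_nvcc_release` and `same_cuda` are hypothetical helpers, the first argument is assumed to be the text printed by `nvcc --version`, and the second is assumed to be the string PyTorch exposes as `torch.version.cuda`.

```python
import re


def parse_nvcc_release(version_text):
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return f"{match.group(1)}.{match.group(2)}"


def same_cuda(nvcc_text, torch_cuda):
    """True when nvcc and PyTorch report the same CUDA major.minor."""
    return parse_nvcc_release(nvcc_text) == ".".join(torch_cuda.split(".")[:2])


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 10.1, V10.1.168"
    # In a real check the second argument would come from torch.version.cuda.
    print(same_cuda(sample, "10.1"))
    print(same_cuda(sample, "10.0"))
```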

shanguanma commented 3 years ago

We only check that k2 is compiled with the same CUDA version that PyTorch is using.

You can try k2 with cuda 10.0 or 9.2. It may work but I think it has not been tested.

I tried it previously, but it failed. Anyway, the server was shut down just now; once it is working again, I will retry with the newest master branch.

shanguanma commented 3 years ago

I tried to install k2 with CUDA 10.0. Because the newest PyTorch version supported with CUDA 10.0 is 1.4.0, I used the commands below to install k2 step by step:

$ conda create -n k2-fsa python=3.8
$ conda activate k2-fsa
$ conda install pytorch==1.4.0 cudatoolkit=10.0 -c pytorch
$ conda install -c pytorch torchaudio

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build

$ cmake -DCMAKE_BUILD_TYPE=Release ..

This step produces no error; its log is as follows:

-- The CUDA compiler identification is NVIDIA 10.0.130
-- The CXX compiler identification is GNU 7.5.0
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.0/toolkit/10.0.130/bin/nvcc
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.0/toolkit/10.0.130/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: CentOS Linux release 7.1.1503 (Core) 
-- Found Git: /usr/bin/git (found version "1.8.3.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now

-- Found Valgrind: /usr/bin  
-- Found Valgrind: /usr/bin/valgrind
-- To check memory, run `ctest -R <NAME> -D ExperimentalMemCheck`
-- Downloading pybind11
-- pybind11 is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home4/md510/anaconda3/envs/k2-fsa/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home4/md510/anaconda3/envs/k2-fsa/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home4/md510/anaconda3/envs/k2-fsa/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
CMake Warning (dev) at /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /cm/shared/apps/cuda10.0/toolkit/10.0.130

  For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
  /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
  cmake/torch.cmake:11 (find_package)
  CMakeLists.txt:134 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /cm/shared/apps/cuda10.0/toolkit/10.0.130 (found version "10.0") 
-- Caffe2: CUDA detected: 10.0
-- Caffe2: CUDA nvcc is: /cm/shared/apps/cuda10.0/toolkit/10.0.130/bin/nvcc
-- Caffe2: CUDA toolkit directory: /cm/shared/apps/cuda10.0/toolkit/10.0.130
-- Caffe2: Header version is: 10.0
-- Found CUDNN: /cm/shared/apps/cudnn-7.6/cuda/lib64/libcudnn.so  
-- Found cuDNN: v7.6.0  (include: /cm/shared/apps/cudnn-7.6/cuda/include, library: /cm/shared/apps/cudnn-7.6/cuda/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  6.0 6.0 6.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_60,code=sm_60
-- Found torch: /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.4.0
-- PyTorch cuda version: 10.0
-- Downloading cub
-- cub is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.5.0
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home4/md510/w2020/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home4/md510/w2020/k2-fsa/k2/build

Then I ran make _k2, and the error is as follows:

[ 61%] Building CUDA object k2/csrc/CMakeFiles/context.dir/ragged_utils.cu.o
/home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h: In function ‘void k2::CheckLayerEqual(int32_t, int32_t, k2::RaggedShape**)’:
/home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:165:39: warning: ‘row_ids_dim’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (cur_level_ <= level_) printf("%d", i);
                                 ~~~~~~^~~~~~~~ 
/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_utils.cu:33:25: note: ‘row_ids_dim’ was declared here
   int32_t row_splits_dim, row_ids_dim;
                         ^~~~~~~~~~~
/home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:165:39: warning: ‘row_splits_dim’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (cur_level_ <= level_) printf("%d", i);
                                 ~~~~~~^~~~~~~~ 
/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_utils.cu:33:9: note: ‘row_splits_dim’ was declared here
   int32_t row_splits_dim, row_ids_dim;
         ^~~~~~~~~~~~~~
[ 64%] Building CUDA object k2/csrc/CMakeFiles/context.dir/rm_epsilon.cu.o
[ 64%] Building CUDA object k2/csrc/CMakeFiles/context.dir/tensor.cu.o
[ 67%] Building CUDA object k2/csrc/CMakeFiles/context.dir/tensor_ops.cu.o
[ 67%] Building CUDA object k2/csrc/CMakeFiles/context.dir/thread_pool.cu.o
[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/timer.cu.o
[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/utils.cu.o
[ 74%] Building CUDA object k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o
/home4/md510/w2020/k2-fsa/k2/k2/csrc/pytorch_context.cu(196): error: class "c10::Storage" has no member "nbytes"

1 error detected in the compilation of "/tmp/tmpxft_0000524d_00000000-11_pytorch_context.compute_75.cpp1.ii".
make[3]: *** [k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o] Error 1
make[2]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
make[1]: *** [k2/python/csrc/CMakeFiles/_k2.dir/rule] Error 2
make: *** [_k2] Error 2
csukuangfj commented 3 years ago

I try to install k2 with cuda=10.0, because when cuda=10.0, max support pytorch version =1.4.0

I would recommend using CUDA 9.2, as many different PyTorch versions are available for it.

Only PyTorch 1.6.0 and 1.7.0 have been tested and are known to work.
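
A quick way to see whether an installed PyTorch falls inside that tested set is to compare its version string before building. The sketch below assumes the tested set is exactly the two versions named in this thread; `torch_release` and `is_tested` are hypothetical helpers, and the input is assumed to be a release string such as the value of `torch.__version__`.

```python
# Versions named as tested in this thread; treat this set as an assumption.
TESTED_TORCH = {(1, 6, 0), (1, 7, 0)}


def torch_release(version):
    """Turn a release string like '1.7.0+cu101' into (1, 7, 0)."""
    core = version.split("+", 1)[0]  # drop a local build suffix if present
    return tuple(int(part) for part in core.split(".")[:3])


def is_tested(version):
    """True when the release is in the assumed tested set."""
    return torch_release(version) in TESTED_TORCH


if __name__ == "__main__":
    for candidate in ("1.4.0", "1.6.0", "1.7.0+cu101"):
        print(candidate, is_tested(candidate))
```

PyTorch 1.4.0, the version used in the failing build above, is outside this set, which is consistent with the compile error although it does not prove the cause.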

shanguanma commented 3 years ago

I try to install k2 with cuda=10.0, because when cuda=10.0, max support pytorch version =1.4.0

I would recommend to use CUDA 9.2 as there are lots of different PyTorch versions for it.

Only PyTorch 1.6.0 and 1.7.0 have been tested and are known to work.

Sorry, I currently don't have a server with CUDA 9.2, so I can't test it right now.

danpovey commented 3 years ago

We are using cuda=10.1 for at least some development. You have to find the right version of pytorch that's compiled for that though.

shanguanma commented 3 years ago

@danpovey, OK, I see. Thanks for your reply.

shanguanma commented 3 years ago

@danpovey, @csukuangfj, today (2021-01-12) the server was updated to CUDA 10.2 and cuDNN 7.6.5, so I am compiling the latest k2 master branch. The installation details are as follows:

$ conda create -n k2-fsa1 python=3.7
$ conda activate k2-fsa1
$ conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..

-- The CUDA compiler identification is NVIDIA 10.2.89
-- The CXX compiler identification is GNU 7.5.0
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: CentOS Linux release 7.8.2003 (Core)
-- Found Git: /usr/bin/git (found version "1.8.3.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now

-- Found Valgrind: /usr/bin  
-- Found Valgrind: /usr/bin/valgrind
-- To check memory, run `ctest -R <NAME> -D ExperimentalMemCheck`
-- Downloading pybind11
-- pybind11 is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home4/md510/anaconda3/envs/k2-fsa1/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home4/md510/anaconda3/envs/k2-fsa1/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home4/md510/anaconda3/envs/k2-fsa1/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
CMake Warning (dev) at /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /cm/shared/apps/cuda10.2/toolkit/10.2.89

  For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
  /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
  cmake/torch.cmake:11 (find_package)
  CMakeLists.txt:134 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /cm/shared/apps/cuda10.2/toolkit/10.2.89 (found version "10.2") 
-- Caffe2: CUDA detected: 10.2
-- Caffe2: CUDA nvcc is: /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc
-- Caffe2: CUDA toolkit directory: /cm/shared/apps/cuda10.2/toolkit/10.2.89
-- Caffe2: Header version is: 10.2
-- Found CUDNN: /cm/shared/apps/cuda10.2/toolkit/10.2.89/lib64/libcudnn.so  
-- Found cuDNN: v7.6.5  (include: /cm/shared/apps/cuda10.2/toolkit/10.2.89/include, library: /cm/shared/apps/cuda10.2/toolkit/10.2.89/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.5 7.5 7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found Torch: /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.7.1
-- PyTorch cuda version: 10.2
-- Downloading cub
-- cub is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.5.0
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home4/md510/w2020/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home4/md510/w2020/k2-fsa/k2/build

$ make _k2 ## no error
$ python3 -m pip install --no-deps --force-reinstall graphviz ## no error
$ make -j ## no error
$ ctest --parallel 5 ## no error
$ make test ## no error
$ pip3 install wheel twine
$ ./scripts/build_pip.sh
$ python3 -m pip install --no-deps --force-reinstall dist/k2-*.whl

Next, install lhotse:

$ pip install --force-reinstall git+https://github.com/lhotse-speech/lhotse

Next, install snowfall:

$ git clone https://github.com/k2-fsa/snowfall.git
$ cd snowfall
$ vim ../readme.txt 

#k2
kaldialign
#lhotse@git+https://github.com/lhotse-speech/lhotse
tensorboard
#torch>=1.6.0
#torchaudio

$ python3 -m pip install -e .

Run the LibriSpeech recipe:

$ ./run.sh --stage 1 --stop_stage 5 ## no error
$ ./run.sh --stage 6

The error is as follows:

2021-01-12 17:42:56,883 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
[F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 block:[0,0,0], thread: [37,0,0] block:[0,0,0], thread: [38,0,0] block:[0,0,0], thread: [39,0,0] block:[0,0,0], thread: [40,0,0] block:[0,0,0], thread: [41,0,0] 
block:[0,0,0], thread: [42,0,0] block:[0,0,0], thread: [43,0,0] block:[0,0,0], thread: [44,0,0] block:[0,0,0], thread: [45,0,0] block:[0,0,0], thread: [46,0,0] block:[0,0,0], thread: [47,0,0] block:[0,0,0], thread: [49,0,0] block:[0,0,0], thread: [50,0,0] block:[0,0,0], thread: [51,0,0] block:[0,0,0], thread: [52,0,0] block:[0,0,0], thread: [56,0,0] block:[0,0,0], thread: [57,0,0] Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0                 

/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [37,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [38,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [39,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [40,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [41,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [42,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [43,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [44,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [45,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [46,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [47,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [49,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [50,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [51,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [52,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [56,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [57,0,0] Assertion `Some bad things happened` failed.
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]:280 Check failed: ret == cudaSuccess (710 vs. 0)  Error: device-side assert triggered. 

[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x2aaccdcc1904]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x28) [0x2aaccaaf4108]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::Array1<int>::operator[](int) const+0x1929) [0x2aaccaaf5d89]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::Renumbering::ComputeOld2New()+0x13a) [0x2aaccaaf160a]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::Renumbering::ComputeNew2Old()+0x5e0) [0x2aaccaaf2640]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*)+0x13dc) [0x2aaccabf44bc]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x364) [0x2aaccabe6ef4]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x51f23) [0x2aacc742df23]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x1a3a3) [0x2aacc73f63a3]
python3(_PyMethodDef_RawFastCallKeywords+0x316) [0x5555556b99b6]
python3(_PyCFunction_FastCallKeywords+0x21) [0x5555556b9a31]
python3(_PyEval_EvalFrameDefault+0x53e3) [0x555555726483]
python3(_PyFunction_FastCallDict+0x10b) [0x55555566985b]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x93d) [0x2aaab378fa6d]
python3(_PyMethodDef_RawFastCallKeywords+0x1e4) [0x5555556b9884]
python3(_PyCFunction_FastCallKeywords+0x21) [0x5555556b9a31]
python3(_PyEval_EvalFrameDefault+0x4e1d) [0x555555725ebd]
python3(_PyFunction_FastCallKeywords+0xfb) [0x5555556b8e7b]
python3(_PyEval_EvalFrameDefault+0x4a89) [0x555555725b29]
python3(_PyEval_EvalCodeWithName+0xc30) [0x555555669160]
python3(_PyFunction_FastCallKeywords+0x387) [0x5555556b9107]
python3(_PyEval_EvalFrameDefault+0x416) [0x5555557214b6]
python3(_PyEval_EvalCodeWithName+0x2f9) [0x555555668829]
python3(_PyFunction_FastCallKeywords+0x387) [0x5555556b9107]
python3(_PyEval_EvalFrameDefault+0x14e5) [0x555555722585]
python3(_PyFunction_FastCallKeywords+0xfb) [0x5555556b8e7b]
python3(_PyEval_EvalFrameDefault+0x416) [0x5555557214b6]
python3(_PyEval_EvalCodeWithName+0x2f9) [0x555555668829]
python3(PyEval_EvalCodeEx+0x44) [0x555555669714]
python3(PyEval_EvalCode+0x1c) [0x55555566973c]
python3(+0x22cf14) [0x555555780f14]
python3(PyRun_FileExFlags+0xa1) [0x55555578b331]
python3(PyRun_SimpleFileExFlags+0x1c3) [0x55555578b523]
python3(+0x238655) [0x55555578c655]
python3(_Py_UnixMain+0x3c) [0x55555578c77c]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
python3(+0x1dcff0) [0x555555730ff0]

Aborted
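
The check that fails in the log above is `tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0`. The Python sketch below mirrors that expression to show which inputs trip it: a gap of 1.0 or more between the two scores, or a NaN on either side. It is not k2 code, the function name `scores_consistent` is hypothetical, and the log does not say which of these cases occurred here.

```python
import math


def scores_consistent(tot_score_start: float, tot_score_end: float) -> bool:
    """Mirror of the logged check; the function name is hypothetical."""
    return (tot_score_end == tot_score_start
            or math.fabs(tot_score_end - tot_score_start) < 1.0)


# Small numerical differences pass.
print(scores_consistent(-100.0, -100.5))
# Equal infinities pass through the equality branch.
print(scores_consistent(-math.inf, -math.inf))
# A gap of 1.0 or more fails.
print(scores_consistent(-100.0, -102.0))
# NaN never compares equal and NaN < 1.0 is false, so NaN fails.
print(scores_consistent(math.nan, math.nan))
```

The equality branch matters because subtracting two equal infinities yields NaN, which would otherwise fail the `fabs` comparison.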

danpovey commented 3 years ago

Mm. Try running the tests and see if any fail, e.g.

$ cd build
$ ctest
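
With 75 tests, scanning the `ctest` output by eye is error-prone, so a short script can list every test whose status is not `Passed`. This is a minimal sketch with a hypothetical helper name, `non_passing_tests`. The `Passed` line in the sample is copied from this thread; the `***Failed` line is an assumed format for a failing test and is not taken from this thread.

```python
import re
from typing import List

# Matches per-test result lines such as
# " 1/75 Test  #1: Test.Cuda.cu_algorithms_test .......   Passed    6.42 sec"
TEST_LINE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#\d+:\s+(\S+)\s+\.+\s*(.*?)\s+[\d.]+\s+sec")


def non_passing_tests(ctest_output: str) -> List[str]:
    """Return the names of tests whose reported status is not 'Passed'."""
    failed = []
    for line in ctest_output.splitlines():
        match = TEST_LINE.match(line)
        if match and match.group(2) != "Passed":
            failed.append(match.group(1))
    return failed


# The second result line is a made-up failure in the assumed ctest format.
SAMPLE = (
    "      Start  1: Test.Cuda.cu_algorithms_test\n"
    " 1/75 Test  #1: Test.Cuda.cu_algorithms_test .......   Passed    6.42 sec\n"
    " 3/75 Test  #3: Test.Cuda.cu_array_test ............***Failed    0.39 sec\n"
)

print(non_passing_tests(SAMPLE))
```

The "Start" lines are ignored because they do not match the result-line pattern.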

shanguanma commented 3 years ago

Yes, I tried it again and there are no errors:

[md510@node02 k2]$ cd build/
[md510@node02 build]$ ctest 
Test project /home4/md510/w2020/k2-fsa/k2/build
      Start  1: Test.Cuda.cu_algorithms_test
 1/75 Test  #1: Test.Cuda.cu_algorithms_test .......   Passed    6.42 sec
      Start  2: Test.Cuda.cu_array_ops_test
 2/75 Test  #2: Test.Cuda.cu_array_ops_test ........   Passed    8.96 sec
      Start  3: Test.Cuda.cu_array_test
 3/75 Test  #3: Test.Cuda.cu_array_test ............   Passed    6.39 sec
      Start  4: Test.Cuda.cu_fsa_algo_test
 4/75 Test  #4: Test.Cuda.cu_fsa_algo_test .........   Passed    8.75 sec
      Start  5: Test.Cuda.cu_fsa_test
 5/75 Test  #5: Test.Cuda.cu_fsa_test ..............   Passed    6.53 sec
      Start  6: Test.Cuda.cu_fsa_utils_test
 6/75 Test  #6: Test.Cuda.cu_fsa_utils_test ........   Passed    6.84 sec
      Start  7: Test.Cuda.cu_hash_test
 7/75 Test  #7: Test.Cuda.cu_hash_test .............   Passed    6.75 sec
      Start  8: Test.Cuda.cu_host_shim_test
 8/75 Test  #8: Test.Cuda.cu_host_shim_test ........   Passed    0.19 sec
      Start  9: Test.Cuda.cu_intersect_test
 9/75 Test  #9: Test.Cuda.cu_intersect_test ........   Passed    7.11 sec
      Start 10: Test.Cuda.cu_log_test
10/75 Test #10: Test.Cuda.cu_log_test ..............   Passed    6.42 sec
      Start 11: Test.Cuda.cu_macros_test
11/75 Test #11: Test.Cuda.cu_macros_test ...........   Passed    6.32 sec
      Start 12: Test.Cuda.cu_nvtx_test
12/75 Test #12: Test.Cuda.cu_nvtx_test .............   Passed    4.21 sec
      Start 13: Test.Cuda.cu_pinned_context_test
13/75 Test #13: Test.Cuda.cu_pinned_context_test ...   Passed   40.72 sec
      Start 14: Test.Cuda.cu_ragged_shape_test
14/75 Test #14: Test.Cuda.cu_ragged_shape_test .....   Passed    6.40 sec
      Start 15: Test.Cuda.cu_ragged_test
15/75 Test #15: Test.Cuda.cu_ragged_test ...........   Passed    7.07 sec
      Start 16: Test.Cuda.cu_ragged_utils_test
16/75 Test #16: Test.Cuda.cu_ragged_utils_test .....   Passed    6.32 sec
      Start 17: Test.Cuda.cu_rm_epsilon_test
17/75 Test #17: Test.Cuda.cu_rm_epsilon_test .......   Passed    7.27 sec
      Start 18: Test.Cuda.cu_tensor_ops_test
18/75 Test #18: Test.Cuda.cu_tensor_ops_test .......   Passed    6.71 sec
      Start 19: Test.Cuda.cu_tensor_test
19/75 Test #19: Test.Cuda.cu_tensor_test ...........   Passed    0.19 sec
      Start 20: Test.Cuda.cu_thread_pool_test
20/75 Test #20: Test.Cuda.cu_thread_pool_test ......   Passed    0.28 sec
      Start 21: Test.Cuda.cu_top_sort_test
21/75 Test #21: Test.Cuda.cu_top_sort_test .........   Passed    8.10 sec
      Start 22: Test.Cuda.cu_utils_test
22/75 Test #22: Test.Cuda.cu_utils_test ............   Passed    6.78 sec
      Start 23: Test.arcsort_test
23/75 Test #23: Test.arcsort_test ..................   Passed    0.01 sec
      Start 24: Test.array_test
24/75 Test #24: Test.array_test ....................   Passed    0.01 sec
      Start 25: Test.aux_labels_test
25/75 Test #25: Test.aux_labels_test ...............   Passed    0.01 sec
      Start 26: Test.connect_test
26/75 Test #26: Test.connect_test ..................   Passed    0.01 sec
      Start 27: Test.determinize_test
27/75 Test #27: Test.determinize_test ..............   Passed    0.02 sec
      Start 28: Test.fsa_equivalent_test
28/75 Test #28: Test.fsa_equivalent_test ...........   Passed    0.01 sec
      Start 29: Test.fsa_renderer_test
29/75 Test #29: Test.fsa_renderer_test .............   Passed    0.01 sec
      Start 30: Test.fsa_test
30/75 Test #30: Test.fsa_test ......................   Passed    0.01 sec
      Start 31: Test.fsa_util_test
31/75 Test #31: Test.fsa_util_test .................   Passed    0.01 sec
      Start 32: Test.intersect_test
32/75 Test #32: Test.intersect_test ................   Passed    0.01 sec
      Start 33: Test.properties_test
33/75 Test #33: Test.properties_test ...............   Passed    0.01 sec
      Start 34: Test.rmepsilon_test
34/75 Test #34: Test.rmepsilon_test ................   Passed    0.01 sec
      Start 35: Test.topsort_test
35/75 Test #35: Test.topsort_test ..................   Passed    0.01 sec
      Start 36: Test.weights_test
36/75 Test #36: Test.weights_test ..................   Passed    0.01 sec
      Start 37: add_epsilon_self_loops_test_py
37/75 Test #37: add_epsilon_self_loops_test_py .....   Passed    1.07 sec
      Start 38: arc_sort_test_py
38/75 Test #38: arc_sort_test_py ...................   Passed    0.68 sec
      Start 39: closure_test_py
39/75 Test #39: closure_test_py ....................   Passed    7.34 sec
      Start 40: compose_test_py
40/75 Test #40: compose_test_py ....................   Passed    0.74 sec
      Start 41: connect_test_py
41/75 Test #41: connect_test_py ....................   Passed    0.79 sec
      Start 42: ctc_gradients_test_py
42/75 Test #42: ctc_gradients_test_py ..............   Passed    8.10 sec
      Start 43: dense_fsa_vec_test_py
43/75 Test #43: dense_fsa_vec_test_py ..............   Passed    6.63 sec
      Start 44: determinize_test_py
44/75 Test #44: determinize_test_py ................   Passed    0.73 sec
      Start 45: fsa_test_py
45/75 Test #45: fsa_test_py ........................   Passed    7.19 sec
      Start 46: get_tot_scores_test_py
46/75 Test #46: get_tot_scores_test_py .............   Passed    6.39 sec
      Start 47: index_add_test_py
47/75 Test #47: index_add_test_py ..................   Passed    7.25 sec
      Start 48: index_select_test_py
48/75 Test #48: index_select_test_py ...............   Passed    7.22 sec
      Start 49: index_test_py
49/75 Test #49: index_test_py ......................   Passed    7.26 sec
      Start 50: intersect_dense_pruned_test_py
50/75 Test #50: intersect_dense_pruned_test_py .....   Passed    6.69 sec
      Start 51: intersect_dense_test_py
51/75 Test #51: intersect_dense_test_py ............   Passed    6.80 sec
      Start 52: intersect_test_py
52/75 Test #52: intersect_test_py ..................   Passed    0.74 sec
      Start 53: invert_test_py
53/75 Test #53: invert_test_py .....................   Passed    0.67 sec
      Start 54: linear_fsa_test_py
54/75 Test #54: linear_fsa_test_py .................   Passed    0.66 sec
      Start 55: numerical_gradient_check_test_py
55/75 Test #55: numerical_gradient_check_test_py ...   Passed   10.05 sec
      Start 56: ragged_ops_test_py
56/75 Test #56: ragged_ops_test_py .................   Passed    0.79 sec
      Start 57: ragged_shape_test_py
57/75 Test #57: ragged_shape_test_py ...............   Passed    6.92 sec
      Start 58: ragged_test_py
58/75 Test #58: ragged_test_py .....................   Passed    0.66 sec
      Start 59: remove_epsilon_test_py
59/75 Test #59: remove_epsilon_test_py .............   Passed    0.66 sec
      Start 60: shortest_path_test_py
60/75 Test #60: shortest_path_test_py ..............   Passed    0.74 sec
      Start 61: symbol_table_test_py
61/75 Test #61: symbol_table_test_py ...............   Passed    0.73 sec
      Start 62: top_sort_test_py
62/75 Test #62: top_sort_test_py ...................   Passed    0.68 sec
      Start 63: union_test_py
63/75 Test #63: union_test_py ......................   Passed    6.74 sec
      Start 64: host_arcsort_test_py
64/75 Test #64: host_arcsort_test_py ...............   Passed    0.68 sec
      Start 65: host_array_test_py
65/75 Test #65: host_array_test_py .................   Passed    0.70 sec
      Start 66: host_aux_labels_test_py
66/75 Test #66: host_aux_labels_test_py ............   Passed    0.68 sec
      Start 67: host_connect_test_py
67/75 Test #67: host_connect_test_py ...............   Passed    0.67 sec
      Start 68: host_determinize_test_py
68/75 Test #68: host_determinize_test_py ...........   Passed    0.63 sec
      Start 69: host_fsa_equivalent_test_py
69/75 Test #69: host_fsa_equivalent_test_py ........   Passed    0.69 sec
      Start 70: host_fsa_test_py
70/75 Test #70: host_fsa_test_py ...................   Passed    0.68 sec
      Start 71: host_intersect_test_py
71/75 Test #71: host_intersect_test_py .............   Passed    0.65 sec
      Start 72: host_properties_test_py
72/75 Test #72: host_properties_test_py ............   Passed    0.65 sec
      Start 73: host_rmepsilon_test_py
73/75 Test #73: host_rmepsilon_test_py .............   Passed    0.62 sec
      Start 74: host_topsort_test_py
74/75 Test #74: host_topsort_test_py ...............   Passed    0.71 sec
      Start 75: host_weights_test_py
75/75 Test #75: host_weights_test_py ...............   Passed    0.71 sec

100% tests passed, 0 tests failed out of 75

Total Test time (real) = 278.15 sec
danpovey commented 3 years ago

Try pip uninstalling the package and reinstalling it. Otherwise I'm not sure; it would possibly require debugging by modifying the code.


danpovey commented 3 years ago

Also, this could result from over-aggressive compiler optimization. It is probably checking that -inf == -inf. Sometimes comparisons involving infinity can be optimized out, e.g. if the compiler assumes that fabs(a-b) should be zero if a==b. So touching the file and running make again in build/, to see the compilation commands and associated flags, may be useful. Debug vs. release mode may also matter.
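To illustrate why comparisons involving infinity are fragile, here is a minimal Python sketch. It is not the check that k2 performs; `naive_close` and `safe_close` are hypothetical helpers written only for this example. The point is that `-inf == -inf` holds, while `-inf - (-inf)` is NaN, so any code path that replaces the equality test with a difference-based test changes the result.

```python
import math


def naive_close(a: float, b: float, tol: float = 1e-6) -> bool:
    """Difference-based check: fails for equal infinities because inf - inf is NaN."""
    return abs(a - b) <= tol


def safe_close(a: float, b: float, tol: float = 1e-6) -> bool:
    """Test exact equality first, so equal infinities count as close."""
    return a == b or abs(a - b) <= tol


neg_inf = -math.inf
print(neg_inf == neg_inf)              # exact equality holds for -inf
print(math.isnan(neg_inf - neg_inf))   # the difference is NaN
print(naive_close(neg_inf, neg_inf))   # NaN <= tol is False
print(safe_close(neg_inf, neg_inf))    # equality short-circuits the NaN
```

If an optimizer is allowed to assume that no infinities or NaNs occur (as fast-math style flags do), it may rewrite one form into the other, which is why the compilation flags are worth inspecting.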

shanguanma commented 3 years ago

Do you mean that I should pip uninstall pytorch and torchaudio, and then reinstall k2? OK, I will do it again. However, I found in https://github.com/k2-fsa/k2/blob/master/.github/workflows/build.yml#L25 that the k2 build environment is only ubuntu16.04 and ubuntu18.04, but the OS of my server cluster is CentOS 7.

danpovey commented 3 years ago

No, I meant uninstall just k2. The build.yml is just for GitHub Actions.


shanguanma commented 3 years ago

OK, I have reinstalled k2 with the commands below, as you suggested:

$ conda create -n k2-fsa2 python=3.8
$ conda activate k2-fsa2
$ conda install pytorch  torchaudio cudatoolkit=10.2 -c pytorch

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug ..
$ make
$ python3 -m pip install --no-deps --force-reinstall graphviz
$ ctest
$ cd ..
$ pip3 install wheel twine
$ ./scripts/build_pip.sh
$ python3 -m pip install --no-deps --force-reinstall dist/k2-*.whl
# install snowfall
$ git clone https://github.com/k2-fsa/snowfall.git
$ cd snowfall
$ python3 -m pip install -e .

Compilation and installation finished without errors. When I run gdb --args python3 mmi_bigram_train.py, it gives an error that is different from the previous one:

[md510@node02 simple_v1]$ gdb --args  python3 mmi_bigram_train.py 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home4/md510/anaconda3/envs/k2-fsa2/bin/python3.8...done.
(gdb) r
Starting program: /home4/md510/anaconda3/envs/k2-fsa2/bin/python3 mmi_bigram_train.py
warning: Unable to open "librpm.so.3" (/home4/md510/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug
Missing separate debuginfo for /lib64/libutil.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/15/86cefa927d26f144de15389f28c1cbf04c81ef.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug
[Detaching after fork from child process 46736]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/3a587b4d79216ae274467480fa10f2c44ed2d0.debug
[Detaching after fork from child process 46744]
Missing separate debuginfo for /lib64/libsndfile.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/bf/637fda83ef4f46cd3e5c172031e926dac51faa.debug
Missing separate debuginfo for /lib64/libgsm.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/8c2bd826e5837d3cee7c5cee8ed85827a90d5c.debug
Missing separate debuginfo for /lib64/libFLAC.so.8
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/9584153c0799926a60973fb77de214161e7072.debug
Missing separate debuginfo for /lib64/libvorbisenc.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e5/4da1382c034ef216379710265df600eb741e6d.debug
Missing separate debuginfo for /lib64/libvorbis.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/75/48d115412cc33bf67c1598e446c70daa1b7461.debug
Missing separate debuginfo for /lib64/libogg.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/6c/77e88fb8736ffe5770b2e96ee60c8a6460d782.debug
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
[New Thread 0x2aab3309b700 (LWP 46745)]
2021-01-12 22:54:24,746 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-12 22:54:25,032 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-12 22:54:30,810 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-12 22:54:30,903 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-12 22:54:31,388 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-12 22:54:31,409 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-12 22:54:31,409 INFO [mmi_bigram_train.py:343] About to create dev dataloader
[New Thread 0x2aab451f3700 (LWP 46754)]
2021-01-12 22:54:31,441 INFO [mmi_bigram_train.py:350] About to create model
[New Thread 0x2aab453f4700 (LWP 46755)]
[New Thread 0x2aab455f5700 (LWP 46756)]
================================================================================
Model parameters summary:
================================================================================
* P_scores:                                                                 7568
* tdnn.0.weight:                                                           60000
* tdnn.0.bias:                                                               500
* tdnn.3.weight:                                                          750000
* tdnn.3.bias:                                                               500
* tdnn.6.weight:                                                          750000
* tdnn.6.bias:                                                               500
* lstms.0.weight_ih_l0:                                                  1000000
* lstms.0.weight_hh_l0:                                                  1000000
* lstms.0.bias_ih_l0:                                                       2000
* lstms.0.bias_hh_l0:                                                       2000
* lstms.1.weight_ih_l0:                                                  1000000
* lstms.1.weight_hh_l0:                                                  1000000
* lstms.1.bias_ih_l0:                                                       2000
* lstms.1.bias_hh_l0:                                                       2000
* lstms.2.weight_ih_l0:                                                  1000000
* lstms.2.weight_hh_l0:                                                  1000000
* lstms.2.bias_ih_l0:                                                       2000
* lstms.2.bias_hh_l0:                                                       2000
* lstms.3.weight_ih_l0:                                                  1000000
* lstms.3.weight_hh_l0:                                                  1000000
* lstms.3.bias_ih_l0:                                                       2000
* lstms.3.bias_hh_l0:                                                       2000
* lstms.4.weight_ih_l0:                                                  1000000
* lstms.4.weight_hh_l0:                                                  1000000
* lstms.4.bias_ih_l0:                                                       2000
* lstms.4.bias_hh_l0:                                                       2000
* linear.weight:                                                           43500
* linear.bias:                                                                87
================================================================================
Total: 11632655
================================================================================
2021-01-12 22:54:38,940 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
[Detaching after fork from child process 46807]
[Detaching after fork from child process 46808]
[Detaching after fork from child process 46809]
[Detaching after fork from child process 46810]
[New Thread 0x2aab45a08700 (LWP 46811)]
[New Thread 0x2aab45c09700 (LWP 46812)]
[New Thread 0x2aab45e0a700 (LWP 46813)]
[New Thread 0x2aab48200700 (LWP 46814)]
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged.cu:bool k2::RaggedShape::Validate(bool) const:385 Problem validating row-ids: for layers_[0], row_splits = [ 0 1 3 5 9 13 15 17 20 22 25 27 29 34 39 41 43 48 53 58 60 63 65 68 71 73 76 79 81 84 87 89 91 100 102 109 111 113 115 117 119 122 124 126 129 131 134 136 139 141 144 146 149 151 154 156 159 161 164 166 169 172 174 179 181 184 186 189 191 193 196 198 201 204 206 211 .... (I have omitted some numbers here because there are too many)
077 35077 35077 35077 ], see index 96409 of row_ids, whose dim is 101526

[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x2e) [0x2aab2cf365ee]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape::Validate(bool) const+0xe8a) [0x2aab2d083846]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape::Check()+0x1e) [0x2aab2cfdba5e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape::RaggedShape(std::vector<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> > const&, bool)+0x57) [0x2aab2cfdba1b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int)+0x59a) [0x2aab2d08ec52]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int, k2::Array1<int>*, k2::Array1<int>*, int)+0x27a) [0x2aab2d08f86c]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::GetIncomingArcs(k2::Ragged<k2::Arc>&, k2::Array1<int> const&)+0x38b) [0x2aab2cfc7398]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float)+0x551) [0x2aab2d040b2b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x91) [0x2aab2d03b65e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb356e) [0x2aab296be56e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbc772) [0x2aab296c7772]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbb9b0) [0x2aab296c69b0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb99d5) [0x2aab296c49d5]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb9a5f) [0x2aab296c4a5f]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x48c20) [0x2aab29653c20]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0x56) [0x5555556d3f76]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x11d0) [0x555555715b90]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyVectorcall_Call+0x71) [0x555555691041]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x93d) [0x2aaacd9aa98d]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0xdb) [0x5555556d3ffb]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x4596) [0x555555718f56]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10077f) [0x55555565477f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x7df) [0x5555556def9f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10011a) [0x55555565411a]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCodeEx+0x44) [0x5555556df754]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCode+0x1c) [0x55555576dedc]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x219f84) [0x55555576df84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x24c1f4) [0x5555557a01f4]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_FileExFlags+0xa1) [0x5555556686e1]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_SimpleFileExFlags+0x3b4) [0x555555668ac6]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x11598b) [0x55555566998b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(Py_BytesMain+0x39) [0x5555557a2d19]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x1dee93) [0x555555732e93]

Program received signal SIGABRT, Aborted.
0x00002aaaaaf21387 in raise () from /lib64/libc.so.6
(gdb) 
danpovey commented 3 years ago

Make sure your k2 codebase is reasonably up to date and that the file time of /home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so is recent. Also show nvidia-smi output. It may be a build problem.
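One way to do the file-time check from inside the same Python environment is sketched below. It is only an illustration: the helper name `module_file_age_hours` is hypothetical and not part of k2. It asks Python where it would load a module from and how old that file is, which also catches the case where a stale copy elsewhere on the path is being picked up.

```python
import importlib.util
import os
import time


def module_file_age_hours(module_name):
    """Return (path, age in hours) of the file a module would be loaded from.

    Hypothetical helper for illustration; not part of k2.
    """
    spec = importlib.util.find_spec(module_name)
    # Built-in or missing modules have no real file on disk.
    if spec is None or spec.origin is None or not os.path.isfile(spec.origin):
        raise ImportError(f"cannot locate a file for module {module_name!r}")
    age_seconds = time.time() - os.path.getmtime(spec.origin)
    return spec.origin, age_seconds / 3600.0
```

In the environment where k2 is installed, calling `module_file_age_hours("_k2")` should report the path of the `_k2` extension named in the stack trace and how many hours ago it was written.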

shanguanma commented 3 years ago

Yes, the k2 codebase is from the latest master branch. These are the files I just built:

[md510@node02 k2]$ ls dist/ -larth
total 54M
drwxr-xr-x 12 md510 users 4.0K Jan 12 22:47 ..
drwxr-xr-x  2 md510 users 4.0K Jan 12 22:47 .
-rw-r--r--  1 md510 users  54M Jan 12 22:47 k2-0.1.3+cu102.dev20210112-cp38-cp38-linux_x86_64.whl

[md510@node02 k2]$ ls /home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so  -larth
-rwxr-xr-x 1 md510 users 34M Jan 12 22:48 /home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so

Also show nvidia-smi output. May be build problem.


[md510@node02 k2]$ nvidia-smi
Tue Jan 12 23:17:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:1D:00.0 Off |                    0 |
| 33%   49C    P2   158W / 260W |   3835MiB / 45553MiB |     49%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:1E:00.0 Off |                    0 |
| 33%   55C    P2   110W / 260W |   4013MiB / 45553MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     On   | 00000000:20:00.0 Off |                    0 |
| 33%   46C    P2   136W / 260W |   3721MiB / 45553MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     On   | 00000000:21:00.0 Off |                    0 |
| 40%   64C    P2   260W / 260W |  32389MiB / 45553MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 8000     On   | 00000000:24:00.0 Off |                    0 |
| 40%   64C    P2   226W / 260W |  22959MiB / 45553MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    164572      C   ...t/tools/venv/envs/md_espnet/bin/python3  3823MiB |
|    1    164573      C   ...t/tools/venv/envs/md_espnet/bin/python3  4001MiB |
|    2    164574      C   ...t/tools/venv/envs/md_espnet/bin/python3  3709MiB |
|    3     56223      C   nnet3-chain-train                          22947MiB |
|    4     56905      C   nnet3-chain-train                          22947MiB |
+-----------------------------------------------------------------------------+

danpovey commented 3 years ago

Try running it in gdb and showing the whole stack trace with line numbers:

gdb --args python3 train.py
(gdb) r
...
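The same gdb session can also be run non-interactively, which makes the output easier to paste into an issue. The sketch below only assembles the command line; the helper name `gdb_batch_command` is hypothetical, and it relies on gdb's standard `-batch`, `-ex` and `--args` options.

```python
import shlex


def gdb_batch_command(script, *script_args, python="python3"):
    """Build a gdb command that runs a Python script and prints a full backtrace.

    Hypothetical helper for illustration only.
    """
    return [
        "gdb",
        "-batch",          # exit once the listed commands have run
        "-ex", "run",      # start the program
        "-ex", "bt full",  # after it stops, print the backtrace with locals
        "--args", python, script, *script_args,
    ]


# Print a copy-pasteable command line; "train.py" is the script name used above.
print(shlex.join(gdb_batch_command("train.py")))
```

Line numbers appear in the backtrace only for libraries built with debug information, for example a build configured with `-DCMAKE_BUILD_TYPE=Debug`.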

shanguanma commented 3 years ago
[md510@node02 simple_v1]$ gdb --args  python3 mmi_bigram_train.py 
(gdb) r
Starting program: /home4/md510/anaconda3/envs/k2-fsa2/bin/python3 mmi_bigram_train.py
warning: Unable to open "librpm.so.3" (/home4/md510/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug
Missing separate debuginfo for /lib64/libutil.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/15/86cefa927d26f144de15389f28c1cbf04c81ef.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug
[Detaching after fork from child process 66884]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/3a587b4d79216ae274467480fa10f2c44ed2d0.debug
[Detaching after fork from child process 66894]
Missing separate debuginfo for /lib64/libsndfile.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/bf/637fda83ef4f46cd3e5c172031e926dac51faa.debug
Missing separate debuginfo for /lib64/libgsm.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/8c2bd826e5837d3cee7c5cee8ed85827a90d5c.debug
Missing separate debuginfo for /lib64/libFLAC.so.8
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/9584153c0799926a60973fb77de214161e7072.debug
Missing separate debuginfo for /lib64/libvorbisenc.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e5/4da1382c034ef216379710265df600eb741e6d.debug
Missing separate debuginfo for /lib64/libvorbis.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/75/48d115412cc33bf67c1598e446c70daa1b7461.debug
Missing separate debuginfo for /lib64/libogg.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/6c/77e88fb8736ffe5770b2e96ee60c8a6460d782.debug
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
[New Thread 0x2aab3309b700 (LWP 66896)]
2021-01-12 23:40:11,250 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-12 23:40:11,533 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-12 23:40:17,630 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-12 23:40:17,727 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-12 23:40:18,201 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:343] About to create dev dataloader
[New Thread 0x2aab451f3700 (LWP 66931)]
2021-01-12 23:40:18,276 INFO [mmi_bigram_train.py:350] About to create model
[New Thread 0x2aab453f4700 (LWP 66933)]
[New Thread 0x2aab455f5700 (LWP 66934)]
================================================================================
Model parameters summary:
================================================================================
* P_scores:                                                                 7568
* tdnn.0.weight:                                                           60000
* tdnn.0.bias:                                                               500
* tdnn.3.weight:                                                          750000
* tdnn.3.bias:                                                               500
* tdnn.6.weight:                                                          750000
* tdnn.6.bias:                                                               500
* lstms.0.weight_ih_l0:                                                  1000000
* lstms.0.weight_hh_l0:                                                  1000000
* lstms.0.bias_ih_l0:                                                       2000
* lstms.0.bias_hh_l0:                                                       2000
* lstms.1.weight_ih_l0:                                                  1000000
* lstms.1.weight_hh_l0:                                                  1000000
* lstms.1.bias_ih_l0:                                                       2000
* lstms.1.bias_hh_l0:                                                       2000
* lstms.2.weight_ih_l0:                                                  1000000
* lstms.2.weight_hh_l0:                                                  1000000
* lstms.2.bias_ih_l0:                                                       2000
* lstms.2.bias_hh_l0:                                                       2000
* lstms.3.weight_ih_l0:                                                  1000000
* lstms.3.weight_hh_l0:                                                  1000000
* lstms.3.bias_ih_l0:                                                       2000
* lstms.3.bias_hh_l0:                                                       2000
* lstms.4.weight_ih_l0:                                                  1000000
* lstms.4.weight_hh_l0:                                                  1000000
* lstms.4.bias_ih_l0:                                                       2000
* lstms.4.bias_hh_l0:                                                       2000
* linear.weight:                                                           43500
* linear.bias:                                                                87
================================================================================
Total: 11632655
================================================================================
2021-01-12 23:40:21,868 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
[Detaching after fork from child process 66939]
[Detaching after fork from child process 66940]
[Detaching after fork from child process 66941]
[Detaching after fork from child process 66942]
[New Thread 0x2aab45a08700 (LWP 66943)]
[New Thread 0x2aab45c09700 (LWP 66944)]
[New Thread 0x2aab45e0a700 (LWP 66945)]
[New Thread 0x2aab48200700 (LWP 66946)]
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]:280 Check failed: ret == cudaSuccess (700 vs. 0)  Error: an illegal memory access was encountered. 

[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x2e) [0x2aab2cf365ee]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::Array1<int>::operator[](int) const+0x56c) [0x2aab2cf3ad80]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::Array1<int>::Back() const+0x130) [0x2aab2cf385a0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int)+0x27f) [0x2aab2d08e937]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int, k2::Array1<int>*, k2::Array1<int>*, int)+0x70) [0x2aab2d08f662]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::GetIncomingArcs(k2::Ragged<k2::Arc>&, k2::Array1<int> const&)+0x38b) [0x2aab2cfc7398]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float)+0x551) [0x2aab2d040b2b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x91) [0x2aab2d03b65e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb356e) [0x2aab296be56e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbc772) [0x2aab296c7772]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbb9b0) [0x2aab296c69b0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb99d5) [0x2aab296c49d5]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb9a5f) [0x2aab296c4a5f]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x48c20) [0x2aab29653c20]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0x56) [0x5555556d3f76]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x11d0) [0x555555715b90]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyVectorcall_Call+0x71) [0x555555691041]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x93d) [0x2aaacd9aa98d]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0xdb) [0x5555556d3ffb]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x4596) [0x555555718f56]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10077f) [0x55555565477f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x7df) [0x5555556def9f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10011a) [0x55555565411a]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCodeEx+0x44) [0x5555556df754]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCode+0x1c) [0x55555576dedc]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x219f84) [0x55555576df84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x24c1f4) [0x5555557a01f4]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_FileExFlags+0xa1) [0x5555556686e1]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_SimpleFileExFlags+0x3b4) [0x555555668ac6]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x11598b) [0x55555566998b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(Py_BytesMain+0x39) [0x5555557a2d19]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x1dee93) [0x555555732e93]

Program received signal SIGABRT, Aborted.
0x00002aaaaaf21387 in raise () from /lib64/libc.so.6

(gdb) bt full 
#0  0x00002aaaaaf21387 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00002aaaaaf22a78 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00002aab2cf36630 in k2::internal::Logger::~Logger (this=0x7fffffffb340, __in_chrg=<optimized out>) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:149
        stack_trace = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
            _M_p = 0x5555c7e0dee8 "[ Stack-Trace: ]\n/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]\n/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/sit"...}}
#3  0x00002aab2cf3ad80 in k2::Array1<int>::operator[] (this=0x7fffffffb680, i=64) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:280
        ans = 21845
        ret = cudaErrorIllegalAddress
        __PRETTY_FUNCTION__ = "T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]"
        k2_nvtx_6 = {<No data fields>}
        data = 0x2aabaae45100
        type = k2::kCuda
#4  0x00002aab2cf385a0 in k2::Array1<int>::Back (this=0x7fffffffb680) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:289
        __PRETTY_FUNCTION__ = "T k2::Array1<T>::Back() const [with T = int]"
#5  0x00002aab2d08e937 in k2::RaggedShape2 (row_splits=0x7fffffffb680, row_ids=0x7fffffffb6a0, cached_tot_size=35078) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:112
        k2_nvtx_65 = {<No data fields>}
        __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int32_t)"
        ctx = {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
        axes = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
            _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
              _M_start = 0x5555c69c4e38, _M_finish = 0x7fffffffb498, _M_end_of_storage = 0xffffffffffffb460}}, <No data fields>}
#6  0x00002aab2d08f662 in k2::RaggedShape3 (row_splits1=0x7fffffffb680, row_ids1=0x7fffffffb6a0, cached_tot_size1=35078, row_splits2=0x7fffffffb6c0, row_ids2=0x7fffffffb6e0, 
    cached_tot_size2=101526) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:193
        k2_nvtx_68 = {<No data fields>}
        __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int32_t, k2::Array1<int>*, k2::Array1<int>*, int32_t)"
        shape1 = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
              _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
                _M_start = 0x5555c69bd278, _M_finish = 0x7fffffffb5b8, _M_end_of_storage = 0x2aab29689143
     <__gnu_cxx::__atomic_add_dispatch(_Atomic_word*, int)+46>}}, <No data fields>}}
        temp_array = {dim_ = -962881248, byte_offset_ = 140737488337984, 
          region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fffffffb5a0, _M_refcount = {_M_pi = 0x12cf6eaa2}}, <No data fields>}}
#7  0x00002aab2cfc7398 in k2::GetIncomingArcs (fsas=..., dest_states=...) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/fsa_utils.cu:837
        k2_nvtx_76 = {<No data fields>}
        __PRETTY_FUNCTION__ = "k2::Ragged<int> k2::GetIncomingArcs(k2::FsaVec&, const k2::Array1<int>&)"
        c = @0x5555c8017fa0: {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
        dest_states_tensor = {shape = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
                _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
                  _M_start = 0x5555c8014070, _M_finish = 0x5555c8014100, _M_end_of_storage = 0x5555c8014100}}, <No data fields>}}, values = {dim_ = 101526, byte_offset_ = 0, 
            region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c8056db0, _M_refcount = {_M_pi = 0x5555c8056da0}}, <No data fields>}}}
        num_fsas = 64
        num_states = 35078
        num_arcs = 101526
        incoming_arcs_order = {dim_ = 101526, byte_offset_ = 0, 
          region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c7fc3b10, _M_refcount = {_M_pi = 0x5555c7fc3b00}}, <No data fields>}}
        ans_row_ids2 = {dim_ = 101526, byte_offset_ = 0, 
danpovey commented 3 years ago

Please do the same after running export K2_SYNC_KERNELS=1. I want to see whether this error is actually the first one to occur.


danpovey commented 3 years ago

It could be a bug in GetTransposeReordering(), which is called by GetIncomingArcs(). If anyone has time to suggest debug code that would verify its output, that would be helpful. It's getting late for me.
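
One possible shape for such a check, as a minimal sketch rather than k2 code: it assumes the reordering is meant to be a permutation of the arc indices that lists arcs in non-decreasing order of destination state, and that both arrays have been copied to the CPU as plain integer lists first. The function name check_reordering and its arguments are hypothetical and not part of k2.

```python
def check_reordering(order, dest_states):
    """Return a list of problems found in `order`; an empty list means OK.

    Hypothetical helper, not part of k2. Assumptions:
      order[i]       -- index of the arc placed at position i
      dest_states[a] -- destination state of arc a
    """
    problems = []
    n = len(dest_states)
    if len(order) != n:
        problems.append(f"length mismatch: {len(order)} vs {n}")
        return problems

    # 1. Every index must be in range and appear exactly once (a permutation).
    seen = [False] * n
    for pos, idx in enumerate(order):
        if not 0 <= idx < n:
            problems.append(f"order[{pos}] = {idx} is out of range [0, {n})")
        elif seen[idx]:
            problems.append(f"index {idx} appears more than once")
        else:
            seen[idx] = True
    if problems:
        return problems

    # 2. Destination states must be non-decreasing along the reordering.
    for pos in range(1, n):
        if dest_states[order[pos - 1]] > dest_states[order[pos]]:
            problems.append(f"destination states not sorted at position {pos}")
            break
    return problems
```

An out-of-range or repeated index in the first check would be consistent with a later illegal memory access. The arrays in the trace above have 101526 elements, so running this on CPU copies should be cheap.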

    k2_nvtx_6 = {<No data fields>}
    data = 0x2aabaae45100
    type = k2::kCuda

4 0x00002aab2cf385a0 in k2::Array1::Back (this=0x7fffffffb680) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:289

    __PRETTY_FUNCTION__ = "T k2::Array1<T>::Back() const [with T = int]"

5 0x00002aab2d08e937 in k2::RaggedShape2 (row_splits=0x7fffffffb680, row_ids=0x7fffffffb6a0, cached_tot_size=35078) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:112

    k2_nvtx_65 = {<No data fields>}
    __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int32_t)"
    ctx = {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
    axes = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
        _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>},
          _M_start = 0x5555c69c4e38, _M_finish = 0x7fffffffb498, _M_end_of_storage = 0xffffffffffffb460}}, <No data fields>}

6 0x00002aab2d08f662 in k2::RaggedShape3 (row_splits1=0x7fffffffb680, row_ids1=0x7fffffffb6a0, cached_tot_size1=35078, row_splits2=0x7fffffffb6c0, row_ids2=0x7fffffffb6e0, cached_tot_size2=101526) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:193

    k2_nvtx_68 = {<No data fields>}
    __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int32_t, k2::Array1<int>*, k2::Array1<int>*, int32_t)"
    shape1 = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
          _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>},
            _M_start = 0x5555c69bd278, _M_finish = 0x7fffffffb5b8, _M_end_of_storage = 0x2aab29689143
 <__gnu_cxx::__atomic_add_dispatch(_Atomic_word*, int)+46>}}, <No data fields>}}
    temp_array = {dim_ = -962881248, byte_offset_ = 140737488337984,
      region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fffffffb5a0, _M_refcount = {_M_pi = 0x12cf6eaa2}}, <No data fields>}}

7 0x00002aab2cfc7398 in k2::GetIncomingArcs (fsas=..., dest_states=...) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/fsa_utils.cu:837

    k2_nvtx_76 = {<No data fields>}
    __PRETTY_FUNCTION__ = "k2::Ragged<int> k2::GetIncomingArcs(k2::FsaVec&, const k2::Array1<int>&)"
    c = @0x5555c8017fa0: {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
    dest_states_tensor = {shape = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
            _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>},
              _M_start = 0x5555c8014070, _M_finish = 0x5555c8014100, _M_end_of_storage = 0x5555c8014100}}, <No data fields>}}, values = {dim_ = 101526, byte_offset_ = 0,
        region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c8056db0, _M_refcount = {_M_pi = 0x5555c8056da0}}, <No data fields>}}}
    num_fsas = 64
    num_states = 35078
    num_arcs = 101526
    incoming_arcs_order = {dim_ = 101526, byte_offset_ = 0,
      region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c7fc3b10, _M_refcount = {_M_pi = 0x5555c7fc3b00}}, <No data fields>}}
    ans_row_ids2 = {dim_ = 101526, byte_offset_ = 0,


danpovey commented 3 years ago

... GetTransposeReordering() should return a permutation of the numbers (0, 1, 2, 3, 4, ...). That should be reasonably easy to test, e.g. by summing it and comparing with the formula n(n-1)/2 for n elements. We should first add a Sum() function for arrays, e.g. modeled on the Max() function declared in array_ops.h.
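
To make the suggested check concrete, here is a minimal sketch in plain Python rather than k2's C++ API. The function names (`expected_sum`, `passes_sum_check`, `is_permutation`) are hypothetical and are not part of k2; the input is assumed to be the reordering copied into an ordinary list of ints. The sum test is cheap but only a necessary condition, so a stricter sort-based check is shown next to it.

```python
def expected_sum(n):
    """Sum of 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2


def passes_sum_check(order):
    """Cheap check: the sum matches that of a permutation of 0..n-1.

    This can pass for arrays that are not permutations (e.g. [1, 1, 1]).
    """
    return sum(order) == expected_sum(len(order))


def is_permutation(order):
    """Strict check: every index 0..n-1 appears exactly once."""
    return sorted(order) == list(range(len(order)))


if __name__ == "__main__":
    for candidate in ([2, 0, 1], [0, 0, 1], [1, 1, 1]):
        print(candidate, passes_sum_check(candidate), is_permutation(candidate))
```

The sum check is probably enough to catch uninitialized or out-of-range entries like the ones the backtrace hints at; the strict check costs a sort but rules out false positives.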

On Wed, Jan 13, 2021 at 12:13 AM Daniel Povey dpovey@gmail.com wrote:

.. it could be a bug in GetTransposeReordering(), which is called by GetIncomingArcs(). If anyone has time to suggest what debug code to add to verify its output, that would be good. Getting late for me.

On Wed, Jan 13, 2021 at 12:12 AM Daniel Povey dpovey@gmail.com wrote:

Do the same after doing export K2_SYNC_KERNELS=1 .. wanna see if the error was the first one.
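
For anyone who prefers to launch the rerun from Python instead of typing `export` in the shell, the following is a rough sketch. Only the variable name `K2_SYNC_KERNELS` and the script name `mmi_bigram_train.py` come from this thread; the helper names are hypothetical, and it is an assumption that k2 reads the variable from the process environment at run time. It starts the script directly, without gdb.

```python
import os
import subprocess
import sys


def sync_kernels_env(base_env=None):
    """Return a copy of the environment with K2_SYNC_KERNELS=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["K2_SYNC_KERNELS"] = "1"
    return env


def run_with_sync_kernels(script_args):
    """Run the current Python interpreter on script_args with the variable set.

    Returns the child's exit code.
    """
    cmd = [sys.executable, *script_args]
    return subprocess.run(cmd, env=sync_kernels_env()).returncode


# Example (not executed here):
#   run_with_sync_kernels(["mmi_bigram_train.py"])
```

Passing a copied environment to the child keeps the setting out of the parent shell, so later runs are not affected by it.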

On Tue, Jan 12, 2021 at 11:49 PM shanguanma notifications@github.com wrote:

[md510@node02 simple_v1]$ gdb --args python3 mmi_bigram_train.py
(gdb) r
Starting program: /home4/md510/anaconda3/envs/k2-fsa2/bin/python3 mmi_bigram_train.py
warning: Unable to open "librpm.so.3" (/home4/md510/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug
Missing separate debuginfo for /lib64/libutil.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/15/86cefa927d26f144de15389f28c1cbf04c81ef.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug
[Detaching after fork from child process 66884]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/ca/3a587b4d79216ae274467480fa10f2c44ed2d0.debug
[Detaching after fork from child process 66894]
Missing separate debuginfo for /lib64/libsndfile.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/bf/637fda83ef4f46cd3e5c172031e926dac51faa.debug
Missing separate debuginfo for /lib64/libgsm.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/ca/8c2bd826e5837d3cee7c5cee8ed85827a90d5c.debug
Missing separate debuginfo for /lib64/libFLAC.so.8
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d1/9584153c0799926a60973fb77de214161e7072.debug
Missing separate debuginfo for /lib64/libvorbisenc.so.2
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/e5/4da1382c034ef216379710265df600eb741e6d.debug
Missing separate debuginfo for /lib64/libvorbis.so.0
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/75/48d115412cc33bf67c1598e446c70daa1b7461.debug
Missing separate debuginfo for /lib64/libogg.so.0
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/6c/77e88fb8736ffe5770b2e96ee60c8a6460d782.debug
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
warnings.warn(
[New Thread 0x2aab3309b700 (LWP 66896)]
2021-01-12 23:40:11,250 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-12 23:40:11,533 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-12 23:40:17,630 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-12 23:40:17,727 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-12 23:40:18,201 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:343] About to create dev dataloader
[New Thread 0x2aab451f3700 (LWP 66931)]
2021-01-12 23:40:18,276 INFO [mmi_bigram_train.py:350] About to create model
[New Thread 0x2aab453f4700 (LWP 66933)]
[New Thread 0x2aab455f5700 (LWP 66934)]

Model parameters summary:

  • P_scores: 7568
  • tdnn.0.weight: 60000
  • tdnn.0.bias: 500
  • tdnn.3.weight: 750000
  • tdnn.3.bias: 500
  • tdnn.6.weight: 750000
  • tdnn.6.bias: 500
  • lstms.0.weight_ih_l0: 1000000
  • lstms.0.weight_hh_l0: 1000000
  • lstms.0.bias_ih_l0: 2000
  • lstms.0.bias_hh_l0: 2000
  • lstms.1.weight_ih_l0: 1000000
  • lstms.1.weight_hh_l0: 1000000
  • lstms.1.bias_ih_l0: 2000
  • lstms.1.bias_hh_l0: 2000
  • lstms.2.weight_ih_l0: 1000000
  • lstms.2.weight_hh_l0: 1000000
  • lstms.2.bias_ih_l0: 2000
  • lstms.2.bias_hh_l0: 2000
  • lstms.3.weight_ih_l0: 1000000
  • lstms.3.weight_hh_l0: 1000000
  • lstms.3.bias_ih_l0: 2000
  • lstms.3.bias_hh_l0: 2000
  • lstms.4.weight_ih_l0: 1000000
  • lstms.4.weight_hh_l0: 1000000
  • lstms.4.bias_ih_l0: 2000
  • lstms.4.bias_hh_l0: 2000
  • linear.weight: 43500
  • linear.bias: 87

    Total: 11632655

    2021-01-12 23:40:21,868 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
    [Detaching after fork from child process 66939]
    [Detaching after fork from child process 66940]
    [Detaching after fork from child process 66941]
    [Detaching after fork from child process 66942]
    [New Thread 0x2aab45a08700 (LWP 66943)]
    [New Thread 0x2aab45c09700 (LWP 66944)]
    [New Thread 0x2aab45e0a700 (LWP 66945)]
    [New Thread 0x2aab48200700 (LWP 66946)]
    [F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:T k2::Array1::operator const [with T = int; int32_t = int]:280 Check failed: ret == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.


csukuangfj commented 3 years ago

Will training with CPU give the same error?

Tuesday, 12 January 2021, 23:49 +0800 from notifications@github.com notifications@github.com:

[md510@node02 simple_v1]$ gdb --args python3 mmi_bigram_train.py (gdb) r Starting program: /home4/md510/anaconda3/envs/k2-fsa2/bin/python3 mmi_bigram_train.py warning: Unable to open "librpm.so.3" (/home4/md510/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug Missing separate debuginfo for /lib64/libpthread.so.0 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Missing separate debuginfo for /lib64/libc.so.6 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug Missing separate debuginfo for /lib64/libdl.so.2 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug Missing separate debuginfo for /lib64/libutil.so.1 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/15/86cefa927d26f144de15389f28c1cbf04c81ef.debug Missing separate debuginfo for /lib64/librt.so.1 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug Missing separate debuginfo for /lib64/libm.so.6 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug [Detaching after fork from child process 66884] Missing separate debuginfo for /lib64/libcuda.so.1 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/ca/3a587b4d79216ae274467480fa10f2c44ed2d0.debug [Detaching after fork from child process 66894] Missing separate debuginfo for /lib64/libsndfile.so.1 Try: yum --enablerepo='debug' install 
/usr/lib/debug/.build-id/bf/637fda83ef4f46cd3e5c172031e926dac51faa.debug Missing separate debuginfo for /lib64/libgsm.so.1 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/ca/8c2bd826e5837d3cee7c5cee8ed85827a90d5c.debug Missing separate debuginfo for /lib64/libFLAC.so.8 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d1/9584153c0799926a60973fb77de214161e7072.debug Missing separate debuginfo for /lib64/libvorbisenc.so.2 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/e5/4da1382c034ef216379710265df600eb741e6d.debug Missing separate debuginfo for /lib64/libvorbis.so.0 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/75/48d115412cc33bf67c1598e446c70daa1b7461.debug Missing separate debuginfo for /lib64/libogg.so.0 Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/6c/77e88fb8736ffe5770b2e96ee60c8a6460d782.debug /home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail. 
warnings.warn( [New Thread 0x2aab3309b700 (LWP 66896)] 2021-01-12 23:40:11,250 INFO [mmi_bigram_train.py:310] Loading L.fst 2021-01-12 23:40:11,533 INFO [mmi_bigram_train.py:328] About to get train cuts 2021-01-12 23:40:17,630 INFO [mmi_bigram_train.py:330] About to get dev cuts 2021-01-12 23:40:17,727 INFO [mmi_bigram_train.py:333] About to create train dataset 2021-01-12 23:40:18,201 INFO [mmi_bigram_train.py:337] About to create dev dataset 2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:341] About to create train dataloader 2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:343] About to create dev dataloader [New Thread 0x2aab451f3700 (LWP 66931)] 2021-01-12 23:40:18,276 INFO [mmi_bigram_train.py:350] About to create model [New Thread 0x2aab453f4700 (LWP 66933)] [New Thread 0x2aab455f5700 (LWP 66934)]

Model parameters summary:

  • P_scores: 7568
  • tdnn.0.weight: 60000
  • tdnn.0.bias: 500
  • tdnn.3.weight: 750000
  • tdnn.3.bias: 500
  • tdnn.6.weight: 750000
  • tdnn.6.bias: 500
  • lstms.0.weight_ih_l0: 1000000
  • lstms.0.weight_hh_l0: 1000000
  • lstms.0.bias_ih_l0: 2000
  • lstms.0.bias_hh_l0: 2000
  • lstms.1.weight_ih_l0: 1000000
  • lstms.1.weight_hh_l0: 1000000
  • lstms.1.bias_ih_l0: 2000
  • lstms.1.bias_hh_l0: 2000
  • lstms.2.weight_ih_l0: 1000000
  • lstms.2.weight_hh_l0: 1000000
  • lstms.2.bias_ih_l0: 2000
  • lstms.2.bias_hh_l0: 2000
  • lstms.3.weight_ih_l0: 1000000
  • lstms.3.weight_hh_l0: 1000000
  • lstms.3.bias_ih_l0: 2000
  • lstms.3.bias_hh_l0: 2000
  • lstms.4.weight_ih_l0: 1000000
  • lstms.4.weight_hh_l0: 1000000
  • lstms.4.bias_ih_l0: 2000
  • lstms.4.bias_hh_l0: 2000
  • linear.weight: 43500
  • linear.bias: 87

    Total: 11632655

    2021-01-12 23:40:21,868 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001 [Detaching after fork from child process 66939] [Detaching after fork from child process 66940] [Detaching after fork from child process 66941] [Detaching after fork from child process 66942] [New Thread 0x2aab45a08700 (LWP 66943)] [New Thread 0x2aab45c09700 (LWP 66944)] [New Thread 0x2aab45e0a700 (LWP 66945)] [New Thread 0x2aab48200700 (LWP 66946)] [F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:T k2::Array1::operator const [with T = int; int32_t = int]:280 Check failed: ret == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.

[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x2e) [0x2aab2cf365ee]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::Array1::operator const+0x56c) [0x2aab2cf3ad80]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::Array1::Back() const+0x130) [0x2aab2cf385a0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape2(k2::Array1, k2::Array1, int)+0x27f) [0x2aab2d08e937]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape3(k2::Array1, k2::Array1, int, k2::Array1, k2::Array1, int)+0x70) [0x2aab2d08f662]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::GetIncomingArcs(k2::Ragged&, k2::Array1 const&)+0x38b) [0x2aab2cfc7398]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::Ragged&, k2::DenseFsaVec&, float)+0x551) [0x2aab2d040b2b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged&, k2::DenseFsaVec&, float, k2::Ragged, k2::Array1, k2::Array1)+0x91) [0x2aab2d03b65e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb356e) [0x2aab296be56e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbc772) [0x2aab296c7772]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbb9b0) [0x2aab296c69b0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb99d5) [0x2aab296c49d5]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb9a5f) [0x2aab296c4a5f]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x48c20) [0x2aab29653c20]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0x56) [0x5555556d3f76]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x11d0) [0x555555715b90]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyVectorcall_Call+0x71) [0x555555691041]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object, _object*)+0x93d) [0x2aaacd9aa98d]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0xdb) [0x5555556d3ffb]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x4596) [0x555555718f56]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10077f) [0x55555565477f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x7df) [0x5555556def9f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10011a) [0x55555565411a]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCodeEx+0x44) [0x5555556df754]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCode+0x1c) [0x55555576dedc]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x219f84) [0x55555576df84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x24c1f4) [0x5555557a01f4]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_FileExFlags+0xa1) [0x5555556686e1]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_SimpleFileExFlags+0x3b4) [0x555555668ac6]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x11598b) [0x55555566998b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(Py_BytesMain+0x39) [0x5555557a2d19]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x1dee93) [0x555555732e93]

Program received signal SIGABRT, Aborted.
0x00002aaaaaf21387 in raise () from /lib64/libc.so.6

(gdb) bt full

#0 0x00002aaaaaf21387 in raise () from /lib64/libc.so.6

No symbol table info available.

#1 0x00002aaaaaf22a78 in abort () from /lib64/libc.so.6

No symbol table info available.

#2 0x00002aab2cf36630 in k2::internal::Logger::~Logger (this=0x7fffffffb340, __in_chrg=) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:149

   stack_trace = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
       _M_p = 0x5555c7e0dee8 "[ Stack-Trace: ]\n/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]\n/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/sit"...}}

#3 0x00002aab2cf3ad80 in k2::Array1::operator[] (this=0x7fffffffb680, i=64) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:280

   ans = 21845
   ret = cudaErrorIllegalAddress
   __PRETTY_FUNCTION__ = "T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]"
   k2_nvtx_6 = {<No data fields>}
   data = 0x2aabaae45100
   type = k2::kCuda

#4 0x00002aab2cf385a0 in k2::Array1::Back (this=0x7fffffffb680) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:289

   __PRETTY_FUNCTION__ = "T k2::Array1<T>::Back() const [with T = int]"

#5 0x00002aab2d08e937 in k2::RaggedShape2 (row_splits=0x7fffffffb680, row_ids=0x7fffffffb6a0, cached_tot_size=35078) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:112

   k2_nvtx_65 = {<No data fields>}
   __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int32_t)"
   ctx = {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
   axes = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
       _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
         _M_start = 0x5555c69c4e38, _M_finish = 0x7fffffffb498, _M_end_of_storage = 0xffffffffffffb460}}, <No data fields>}

#6 0x00002aab2d08f662 in k2::RaggedShape3 (row_splits1=0x7fffffffb680, row_ids1=0x7fffffffb6a0, cached_tot_size1=35078, row_splits2=0x7fffffffb6c0, row_ids2=0x7fffffffb6e0, cached_tot_size2=101526) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:193

   k2_nvtx_68 = {<No data fields>}
   __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int32_t, k2::Array1<int>*, k2::Array1<int>*, int32_t)"
   shape1 = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
         _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>},
           _M_start = 0x5555c69bd278, _M_finish = 0x7fffffffb5b8, _M_end_of_storage = 0x2aab29689143 <__gnu_cxx::__atomic_add_dispatch(_Atomic_word*, int)+46>}}, <No data fields>}}
   temp_array = {dim_ = -962881248, byte_offset_ = 140737488337984,
     region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fffffffb5a0, _M_refcount = {_M_pi = 0x12cf6eaa2}}, <No data fields>}}

#7 0x00002aab2cfc7398 in k2::GetIncomingArcs (fsas=..., dest_states=...) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/fsa_utils.cu:837

   k2_nvtx_76 = {<No data fields>}
   __PRETTY_FUNCTION__ = "k2::Ragged<int> k2::GetIncomingArcs(k2::FsaVec&, const k2::Array1<int>&)"
   c = @0x5555c8017fa0: {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
   dest_states_tensor = {shape = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
           _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
             _M_start = 0x5555c8014070, _M_finish = 0x5555c8014100, _M_end_of_storage = 0x5555c8014100}}, <No data fields>}}, values = {dim_ = 101526, byte_offset_ = 0, 
       region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c8056db0, _M_refcount = {_M_pi = 0x5555c8056da0}}, <No data fields>}}}
   num_fsas = 64
   num_states = 35078
   num_arcs = 101526
   incoming_arcs_order = {dim_ = 101526, byte_offset_ = 0, 
     region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c7fc3b10, _M_refcount = {_M_pi = 0x5555c7fc3b00}}, <No data fields>}}
   ans_row_ids2 = {dim_ = 101526, byte_offset_ = 0, 
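A note on frames 3 to 5 above, offered as a hedged reading rather than a diagnosis: `Back()` is reached with `i=64` while `num_fsas = 64`, which is consistent with reading the last entry of a `row_splits` array that holds `num_fsas + 1` offsets. The pure-Python sketch below only illustrates that row_splits / row_ids convention; the helper name `row_splits_to_row_ids` and the toy numbers are made up for illustration and are not part of k2's API.

```python
def row_splits_to_row_ids(row_splits):
    """Expand row offsets into one row index per element."""
    row_ids = []
    for row in range(len(row_splits) - 1):
        count = row_splits[row + 1] - row_splits[row]
        row_ids.extend([row] * count)
    return row_ids


# Toy ragged shape: 3 rows holding 2, 0 and 3 elements.
row_splits = [0, 2, 2, 5]
row_ids = row_splits_to_row_ids(row_splits)

num_rows = len(row_splits) - 1  # rows are described by num_rows + 1 offsets
tot_size = row_splits[-1]       # the last offset is the total element count
print(num_rows, tot_size, row_ids)
```

In the trace the corresponding sizes would be 64 rows and a total of 35078 (`cached_tot_size`). Because the array lives on the GPU (`type = k2::kCuda`), reading that single element needs a device-to-host copy, and such a copy is a point where CUDA can report an error left behind by an earlier asynchronous kernel, so the illegal access may have happened before this frame.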


shanguanma commented 3 years ago

Will training with CPU give the same error?

I ran python3 mmi_bigram_train.py on the CPU and there is no error. The log is as follows:

[md510@node02 simple_v1]$ python3 mmi_bigram_train.py 
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
2021-01-13 09:46:02,825 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-13 09:46:03,058 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-13 09:46:07,104 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-13 09:46:07,176 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-13 09:46:07,599 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-13 09:46:07,613 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-13 09:46:07,613 INFO [mmi_bigram_train.py:343] About to create dev dataloader
2021-01-13 09:46:07,697 INFO [mmi_bigram_train.py:350] About to create model
================================================================================
Model parameters summary:
================================================================================
* P_scores:                                                                 7568
* tdnn.0.weight:                                                           60000
* tdnn.0.bias:                                                               500
* tdnn.3.weight:                                                          750000
* tdnn.3.bias:                                                               500
* tdnn.6.weight:                                                          750000
* tdnn.6.bias:                                                               500
* lstms.0.weight_ih_l0:                                                  1000000
* lstms.0.weight_hh_l0:                                                  1000000
* lstms.0.bias_ih_l0:                                                       2000
* lstms.0.bias_hh_l0:                                                       2000
* lstms.1.weight_ih_l0:                                                  1000000
* lstms.1.weight_hh_l0:                                                  1000000
* lstms.1.bias_ih_l0:                                                       2000
* lstms.1.bias_hh_l0:                                                       2000
* lstms.2.weight_ih_l0:                                                  1000000
* lstms.2.weight_hh_l0:                                                  1000000
* lstms.2.bias_ih_l0:                                                       2000
* lstms.2.bias_hh_l0:                                                       2000
* lstms.3.weight_ih_l0:                                                  1000000
* lstms.3.weight_hh_l0:                                                  1000000
* lstms.3.bias_ih_l0:                                                       2000
* lstms.3.bias_hh_l0:                                                       2000
* lstms.4.weight_ih_l0:                                                  1000000
* lstms.4.weight_hh_l0:                                                  1000000
* lstms.4.bias_ih_l0:                                                       2000
* lstms.4.bias_hh_l0:                                                       2000
* linear.weight:                                                           43500
* linear.bias:                                                                87
================================================================================
Total: 11632655
================================================================================
2021-01-13 09:46:07,771 INFO [mmi_bigram_train.py:401] epoch 0, learning rate 0.001
2021-01-13 09:47:32,896 INFO [mmi_bigram_train.py:220] batch 0, epoch 0/10 global average objf: 1.989916 over 29599.0 frames (100.0% kept), current batch average objf: 1.989915 over 29599 frames (100.0% kept) avg time waiting for batch 3.367s
2021-01-13 09:58:43,705 INFO [mmi_bigram_train.py:220] batch 10, epoch 0/10 global average objf: 1.760037 over 327009.0 frames (100.0% kept), current batch average objf: 1.610216 over 29735 frames (100.0% kept) avg time waiting for batch 0.343s

shanguanma commented 3 years ago

Do the same after doing export K2_SYNC_KERNELS=1 .. wanna see if the error was the first one.

Yes, I tried it again and the same error occurs. Note: I use a single GPU for mmi_bigram_train.py. The error is the same as in https://github.com/k2-fsa/k2/issues/569#issuecomment-758748586
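For anyone repeating this step from a script instead of the shell, here is a minimal sketch of launching the training script with `K2_SYNC_KERNELS=1` set only for the child process. The variable name and the script name come from this thread; the helper names `sync_kernel_env` and `run_training` are hypothetical, and the description of what the variable does (making the first failing kernel easier to locate) is inferred from the request above, not from k2's documentation.

```python
import os
import subprocess
import sys


def sync_kernel_env(base_env=None):
    """Return a copy of the environment with K2_SYNC_KERNELS=1 added."""
    env = dict(os.environ if base_env is None else base_env)
    env["K2_SYNC_KERNELS"] = "1"
    return env


def run_training(script="mmi_bigram_train.py"):
    """Run the script in a child process that sees K2_SYNC_KERNELS=1."""
    return subprocess.run([sys.executable, script], env=sync_kernel_env(), check=False)


# Usage (not executed here): result = run_training(); print(result.returncode)
```

Setting the variable on a copy of the environment keeps the parent shell unchanged, so later runs are not slowed down by accident.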

danpovey commented 3 years ago

I am creating some extra checking code that may discover the source of the problem.


danpovey commented 3 years ago

Please try running with this code: https://github.com/k2-fsa/k2/pull/585 which may make the error show up earlier. Note: haven't finished running tests yet.

shanguanma commented 3 years ago

OK, I will try running it.

csukuangfj commented 3 years ago

Will fix it after lunch.

Wednesday, 13 January 2021, 00:14 +0800 from notifications@github.com:

.. it could be a bug in GetTransposeReordering() which is called by GetIncomingArcs(). If anyone has time to suggest what debug code to add, to verify the output of that, it might be good. getting late for me.


csukuangfj commented 3 years ago

@shanguanma Could you try this pull-request: https://github.com/k2-fsa/k2/pull/586

I think it should fix your problem.

shanguanma commented 3 years ago

Please try running with this code: #585 which may make the error show up earlier. Note: haven't finished running tests yet.

I applied your code to my k2 codebase, re-installed k2, and then ran python3 mmi_bigram_train.py. The error is as follows:

2021-01-13 12:44:34,119 INFO [mmi_bigram_train.py:401] epoch 0, learning rate 0.001
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:void k2::CheckGetTransposeReordering(k2::Ragged<int>&, k2::Array1<int>&):1171 Check failed: IsPermutation(ans) 
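The failed check `IsPermutation(ans)` indicates that the reordering returned for the arcs was not a valid permutation of the arc indexes. As a rough illustration of that property, here is a pure-Python sketch. It assumes the reordering is meant to list arc indexes grouped by destination state, which is a guess based on the function names in this thread and not taken from k2's source; `transpose_reordering` and `is_permutation` are illustrative names, not k2 functions.

```python
def transpose_reordering(dest_states):
    """Arc indexes sorted by destination state (Python's sort is stable)."""
    return sorted(range(len(dest_states)), key=lambda arc: dest_states[arc])


def is_permutation(order):
    """True if `order` contains each of 0..len(order)-1 exactly once."""
    n = len(order)
    seen = [False] * n
    for idx in order:
        if not (0 <= idx < n) or seen[idx]:
            return False
        seen[idx] = True
    return True


dest_states = [2, 0, 1, 0, 2]          # destination state of arcs 0..4
order = transpose_reordering(dest_states)
print(order, is_permutation(order))    # a valid reordering
print(is_permutation([0, 0, 2]))       # index 0 repeated, index 1 missing
```

A reordering that repeats or skips an index, like the second case, would later be used to gather from out-of-range or wrong positions, which fits the illegal memory access seen in the earlier run.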

@shanguanma Could you try this pull-request: #586

I think it should fix your problem.

OK, I will try it now.

shanguanma commented 3 years ago

@shanguanma Could you try this pull-request: #586

I think it should fix your problem.

I applied your code and re-installed k2. When I ran make, I got the error below:

[ 19%] Building CUDA object k2/csrc/CMakeFiles/context.dir/ragged_ops.cu.o
/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: variable "context" is not a type name

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: variable "temp_storage_bytes" is not a type name

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: expected a ")"

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1223): error: expression must have class type

4 errors detected in the compilation of "/tmp/tmpxft_000251ad_00000000-11_ragged_ops.compute_75.cpp1.ii".
make[2]: *** [k2/csrc/CMakeFiles/context.dir/ragged_ops.cu.o] Error 1
make[1]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
make: *** [all] Error 2

csukuangfj commented 3 years ago

Can you check that you ran git checkout on the correct commit?

shanguanma commented 3 years ago

I added the changes from https://github.com/k2-fsa/k2/pull/586#issue-553915405 to my local k2 codebase and then re-installed. Is that the wrong way to do it?

csukuangfj commented 3 years ago

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: variable "context" is not a type name

What does line 1221 in your local ragged_ops.cu look like? Is it the same as the one in #586?

danpovey commented 3 years ago

Maybe he merged with my PR?
