nii-yamagishilab / project-CURRENNT-scripts

This repository contains the scripts to use CURRENNT
BSD 3-Clause "New" or "Revised" License

Wave Net Pretrained CUBLAS matrix multiplication failed #6

Open philippbb opened 3 years ago

philippbb commented 3 years ago

Hi

My goal is to train CURRENNT on another dataset, but before that I wanted to try out the pretrained scripts.

I think I set up everything according to the README files in the documentation, but when I run, for example,

01_gen.sh in project-WaveNet-pretrained

I get the error message below. I used Python 2.7 with Cython, NumPy and SciPy installed. The only cause I can think of at the moment is a wrong CUDA version, but I think the CURRENNT build went through without errors. I will try to get some insight from debug builds...

...
(249) postprocessingL1 feedforward_tanh [size: 256, bias: 1.0, weights: 65792]
(250) postprocessingL2 feedforward_tanh [size: 512, bias: 1.0, weights: 131584]
(251) output feedforward_identity [size: 1024, bias: 1.0, weights: 525312]
(252) postoutput mdn [size: 1]
Total weights: 2440640

 de-normalization is skipped for dimension from 1 to 1

Outputs from layer -1, HTK format (float32, big-endian), de-normalized
Computing outputs for data fraction 1 ... arctic_a0113
SSAMPOpt: 4, SSAMPPara: 0
FAILED in running CURENNT: CUBLAS matrix multiplication failed
Failed to run:/home/philipp/AITeam/project-CURRENNT-public/CURRENNT_codes/build/currennt --train false --ff_output_format htk --parallel_sequences 1 --input_noise_sigma 0 --random_seed 12345231 --shuffle_fractions false --shuffle_sequences false --revert_std true --ScheduleSampOpt 4 --ScheduleSampPara 0 --mdnSoftmaxGenMethod 2 --network /home/philipp/AITeam/project-CURRENNT-scripts/waveform-modeling/project-WaveNet-pretrained/MODELS/wavenet001///trained_network.jsn --ff_output_file /home/philipp/AITeam/project-CURRENNT-scripts/waveform-modeling/project-WaveNet-pretrained/MODELS/wavenet001//output --ff_input_file /home/philipp/AITeam/project-CURRENNT-scripts/waveform-modeling/project-WaveNet-pretrained/TESTDATATEMP/ncData/DATA_TEST/data.nc1 --ExtInputDirs /home/philipp/AITeam/project-CURRENNT-scripts/waveform-modeling/project-WaveNet-pretrained/../TESTDATA-for-pretrained/mfbsp --ExtInputExts .mfbsp --ExtInputDims 80 --resolutions 80 --waveNetMemSave 1
Please check the printed error message
Process terminated with 2

TonyWangX commented 3 years ago

Hello,

"CUBLAS matrix multiplication failed" is a runtime error raised when a matrix multiplication with cublasSgemm fails: https://github.com/nii-yamagishilab/project-CURRENNT-public/blob/7ca0103e13d7e868a451690679e16fa6a59d1146/CURRENNT_codes/currennt_lib/src/helpers/cublas.cu#L83

It may be a CUDA version issue (which version do you use? CUDA 7.0, 8.0, 9.0, and 10.0 work on my side). It may also be a data issue: a wrong data format leads to wrong dimension sizes.

Please try running on the CPU by setting flag_CPU_gen = 1 here: https://github.com/nii-yamagishilab/project-CURRENNT-scripts/blob/3de6d32e5e556a71fac1b4010d00b7c000fa5912/waveform-modeling/project-WaveNet-pretrained/config.py#L113

This avoids using the GPU and cuBLAS for matrix multiplication. If the code then works, it indicates an issue with CUDA.

philippbb commented 3 years ago

It worked on the CPU, thank you very much.

I checked CUDA. The cuda directory under usr is linked to 9.0.

I also rebuilt CURRENNT to check the configure output, which I attached below. The CUDA version should be correct, but maybe the CUDA library mentioned in the output is the wrong version:

/usr/lib/x86_64-linux-gnu/libcuda.so

I am not sure what happens to this library when I switch between CUDA versions.

-- CUDA_VERSION: 9.0
-- CUDA_INCLUDE_DIRS: /usr/local/cuda-9.0/include
-- CUDA_CUDA_LIBRARY: /usr/lib/x86_64-linux-gnu/libcuda.so
-- CUDA_CUDART_LIBRARY: /usr/local/cuda-9.0/lib64/libcudart.so
-- CUDA_cublas_LIBRARY: /usr/local/cuda-9.0/lib64/libcublas.so
-- CUDA_CUFFT_LIBRARIES: /usr/local/cuda-9.0/lib64/libcufft.so
-- CUDA_curand_LIBRARY: /usr/local/cuda-9.0/lib64/libcurand.so
-- Boost_INCLUDE_DIRS: /home/philipp/AITeam/boost_1_59_0
-- Boost_LIBRARIES: /home/philipp/AITeam/boost_1_59_0/stage/lib/libboost_program_options.so;/home/philipp/AITeam/boost_1_59_0/stage/lib/libboost_system.so;/home/philipp/AITeam/boost_1_59_0/stage/lib/libboost_filesystem.so;/home/philipp/AITeam/boost_1_59_0/stage/lib/libboost_random.so;/home/philipp/AITeam/boost_1_59_0/stage/lib/libboost_thread.so;-lpthread
-- NetCDF Lib: /home/philipp/AITeam/netcdf/lib
-- Configuring done
-- Generating done
-- Build files have been written to: /home/philipp/AITeam/project-CURRENNT-public/CURRENNT_codes/build

Edit: I'm going to check later whether the lib inside linux-gnu is causing the problem. Thank you.
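The CMake configure output above can also be checked mechanically. Below is a minimal Python 3 sketch (not part of CURRENNT; the function names are made up for illustration) that collects the `-- KEY: value` entries and lists the CUDA paths that lie outside the toolkit directory. It assumes the toolkit root is the parent directory of CUDA_INCLUDE_DIRS. Note that libcuda.so is normally installed by the NVIDIA driver rather than by the toolkit, so finding it outside the toolkit directory is not necessarily a problem in itself.

```python
import os
import re

# Matches configure entries such as "-- CUDA_VERSION: 9.0".
# Status lines without a single-word key (e.g. "-- Configuring done") are skipped.
ENTRY = re.compile(r"--\s+(\w+):\s+(\S+)")


def parse_configure_output(text):
    """Return {key: value} for every '-- KEY: value' entry in the text."""
    return dict(ENTRY.findall(text))


def cuda_paths_outside_toolkit(entries):
    """Return the CUDA_* path entries that do not live under the toolkit root.

    Assumption: the toolkit root is the parent of CUDA_INCLUDE_DIRS,
    e.g. /usr/local/cuda-9.0 for /usr/local/cuda-9.0/include.
    """
    include_dir = entries.get("CUDA_INCLUDE_DIRS", "")
    toolkit_root = os.path.dirname(include_dir.rstrip("/"))
    outside = {}
    for key, value in entries.items():
        # Only absolute paths are compared; CUDA_VERSION and similar are ignored.
        if key.startswith("CUDA_") and value.startswith("/"):
            if not toolkit_root or not value.startswith(toolkit_root + "/"):
                outside[key] = value
    return outside
```

With the output posted above, only CUDA_CUDA_LIBRARY would be reported, which matches the observation that every other CUDA library resolves to /usr/local/cuda-9.0.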

TonyWangX commented 3 years ago

I know little about linking and compiling, but you may try this tool to check which libraries are actually linked to the executable: https://man7.org/linux/man-pages/man1/ldd.1.html

ldd currennt

You may see something like

libcublas.so.xx => ...
libcufft.so.xx =>  ...

This may tell more.
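To compare the linked CUDA libraries at a glance, the ldd output can be filtered with a few lines of Python. This is only a sketch, assuming Python 3.7 or newer and the usual `name => path (address)` line format of ldd; the helper names and the list of library prefixes are illustrative and not part of CURRENNT.

```python
import re
import subprocess

# Library name prefixes of interest (an assumption; extend as needed).
CUDA_PREFIXES = ("libcuda", "libcublas", "libcufft", "libcurand")

# Matches ldd lines such as
#   "libcublas.so.9.0 => /path/to/libcublas.so.9.0 (0x...)"
# and "libcufft.so.9.0 => not found".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(not found|\S+)")


def cuda_libs_from_ldd(ldd_text):
    """Return {library name: resolved path or 'not found'} for CUDA libraries."""
    libs = {}
    for line in ldd_text.splitlines():
        match = LDD_LINE.match(line)
        if match and match.group(1).startswith(CUDA_PREFIXES):
            libs[match.group(1)] = match.group(2)
    return libs


def cuda_libs_of(binary_path):
    """Run ldd on a binary and return the CUDA libraries it resolves to."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    return cuda_libs_from_ldd(result.stdout)
```

Calling `cuda_libs_of("currennt")` from the build directory would then return a dictionary of the CUDA libraries and the paths they resolve to. A path from a different CUDA version than the one used for the build, or a "not found" entry, would be worth a closer look.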

Finally, CUDA 9.0 should work. I used CUDA 9.0 a long time ago.

TonyWangX commented 3 years ago

FYI

I recently switched to PyTorch and re-implemented the code, including the WaveNet. You may check it here: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts. There is a demo project that runs the WaveNet on CMU ARCTIC.

There is a Jupyter notebook on WaveNet too: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/blob/master/tutorials/s3_demonstration_wavenet.ipynb