noahchalifour / rnnt-speech-recognition

End-to-end speech recognition using RNN Transducers in Tensorflow 2.0
MIT License

Building Warp_transducer failed in google colab. #31

Open omerasif-itu opened 4 years ago

omerasif-itu commented 4 years ago

Hi, I am running an experiment in Google Colab.

gcc version 7.5.0
g++ version 7.5.0

Error:

CMake Error at /usr/local/lib/python2.7/dist-packages/cmake/data/share/cmake-3.12/Modules/CMakeDetermineCCompiler.cmake:48 (message):
  Could not find compiler set in environment variable CC:

  gcc-4.8.
Call Stack (most recent call first):
  CMakeLists.txt:7 (project)

CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
See also "/content/rnnt-speech-recognition/warp-transducer/build/CMakeFiles/CMakeOutput.log".
make: *** No targets specified and no makefile found.  Stop.
CUDA_HOME not found in the environment so building without GPU support. To build with GPU support please define the CUDA_HOME environment variable. This should be a path which contains include/cuda.h
Could not find libwarprnnt.so in ../build.
Build warp-rnnt and set WARP_RNNT_PATH to the location of libwarprnnt.so (default is '../build')
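
The key line is CMake's complaint that the CC environment variable points at gcc-4.8, which Colab does not ship (it has gcc 7.x). A quick way to check what the environment actually provides (output will vary by machine):

```shell
# CC is the compiler CMake was told to use; on Colab it is not gcc-4.8.
echo "CC=${CC:-unset}  CXX=${CXX:-unset}"
# List the compilers actually installed:
ls /usr/bin/gcc* /usr/bin/g++* 2>/dev/null || true
```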

@noahchalifour Please advise.

jtdutta1 commented 4 years ago

I have the same question: why were the gcc commands hardcoded to a specific version? And would it have any other effect if I preferred to use a different gcc?

omerasif-itu commented 4 years ago

A temporary quick fix:

[On Linux] If your environment variables are not set, use:

which gcc
which g++
which nvcc

to find the paths, then pass them as arguments in the script:
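
CMake also reads the CC and CXX environment variables, so as an alternative to passing compiler flags on the cmake command line you can export the discovered paths directly (a hedged sketch; the actual paths are whatever `which` reports on your machine):

```shell
# Export the compilers found on PATH so CMake picks them up via CC/CXX.
export CC="$(which gcc)"
export CXX="$(which g++)"
echo "Using CC=$CC and CXX=$CXX"
```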

cp cmake/warp-rnnt-cmakelist.txt warp-transducer/CMakeLists.txt

cd warp-transducer

mkdir build
cd build

cmake \
        -DCMAKE_C_COMPILER_LAUNCHER=/usr/bin/gcc \
        -DCMAKE_CXX_COMPILER_LAUNCHER=/usr/bin/g++  \
        -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME ..
make clean
cd ../tensorflow_binding

python setup.py install
cd ../../

and then run:

CUDA_HOME=/usr/local/cuda ./scripts/build_rnnt.sh # to setup the rnnt loss
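
Before running the script it is worth verifying that CUDA_HOME really contains include/cuda.h, since (as the earlier output shows) the build silently falls back to CPU-only when it does not. A small check, assuming Colab's usual /usr/local/cuda location:

```shell
# Sanity-check CUDA_HOME before building; warp-transducer needs
# $CUDA_HOME/include/cuda.h to enable GPU support.
CUDA_HOME=/usr/local/cuda
if [ -f "$CUDA_HOME/include/cuda.h" ]; then
    echo "cuda.h found: GPU build is possible"
else
    echo "cuda.h missing: the build will be CPU-only"
fi
```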

See also issues #26, #30 (reply), and #31.

Out:

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "10.1") 
-- cuda found TRUE
-- Building shared library with GPU support
-- Configuring done
-- Generating done
-- Build files have been written to: /content/rnnt-speech-recognition/warp-transducer/build
[  7%] Building NVCC (Device) object CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o
Scanning dependencies of target warprnnt
[ 14%] Linking CXX shared library libwarprnnt.so
[ 14%] Built target warprnnt
[ 21%] Building NVCC (Device) object CMakeFiles/test_time_gpu.dir/tests/test_time_gpu_generated_test_time.cu.o
Scanning dependencies of target test_time_gpu
[ 28%] Building CXX object CMakeFiles/test_time_gpu.dir/tests/random.cpp.o
g++: warning: /usr/bin/c++: linker input file unused because linking not done
[ 35%] Linking CXX executable test_time_gpu
[ 35%] Built target test_time_gpu
[ 42%] Building NVCC (Device) object CMakeFiles/test_gpu.dir/tests/test_gpu_generated_test_gpu.cu.o
Scanning dependencies of target test_gpu
[ 50%] Building CXX object CMakeFiles/test_gpu.dir/tests/random.cpp.o
g++: warning: /usr/bin/c++: linker input file unused because linking not done
[ 57%] Linking CXX executable test_gpu
[ 57%] Built target test_gpu
Scanning dependencies of target test_time
[ 64%] Building CXX object CMakeFiles/test_time.dir/tests/test_time.cpp.o
g++: warning: /usr/bin/c++: linker input file unused because linking not done
[ 71%] Building CXX object CMakeFiles/test_time.dir/tests/random.cpp.o
g++: warning: /usr/bin/c++: linker input file unused because linking not done
[ 78%] Linking CXX executable test_time
[ 78%] Built target test_time
Scanning dependencies of target test_cpu
[ 85%] Building CXX object CMakeFiles/test_cpu.dir/tests/test_cpu.cpp.o
g++: warning: /usr/bin/c++: linker input file unused because linking not done
[ 92%] Building CXX object CMakeFiles/test_cpu.dir/tests/random.cpp.o
g++: warning: /usr/bin/c++: linker input file unused because linking not done
[100%] Linking CXX executable test_cpu
[100%] Built target test_cpu
setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
  warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running install
running bdist_egg
running egg_info
creating warprnnt_tensorflow.egg-info
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/warprnnt_tensorflow
copying warprnnt_tensorflow/__init__.py -> build/lib.linux-x86_64-3.6/warprnnt_tensorflow
running build_ext
building 'warprnnt_tensorflow.kernels' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -I/usr/local/lib/python3.6/dist-packages/tensorflow -I/content/rnnt-speech-recognition/warp-transducer/tensorflow_binding/../include -I/usr/local/lib/python3.6/dist-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/content/rnnt-speech-recognition/warp-transducer/tensorflow_binding/include -I/usr/include/python3.6m -c src/warprnnt_op.cc -o build/temp.linux-x86_64-3.6/src/warprnnt_op.o -std=c++11 -fPIC -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-return-type -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -DWARPRNNT_ENABLE_GPU
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/src/warprnnt_op.o -L../build -Wl,--enable-new-dtags,-R/content/rnnt-speech-recognition/warp-transducer/build -lwarprnnt -o build/lib.linux-x86_64-3.6/warprnnt_tensorflow/kernels.cpython-36m-x86_64-linux-gnu.so -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/warprnnt_tensorflow
copying build/lib.linux-x86_64-3.6/warprnnt_tensorflow/__init__.py -> build/bdist.linux-x86_64/egg/warprnnt_tensorflow
copying build/lib.linux-x86_64-3.6/warprnnt_tensorflow/kernels.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/warprnnt_tensorflow
byte-compiling build/bdist.linux-x86_64/egg/warprnnt_tensorflow/__init__.py to __init__.cpython-36.pyc
creating stub loader for warprnnt_tensorflow/kernels.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/warprnnt_tensorflow/kernels.py to kernels.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
warprnnt_tensorflow.__pycache__.__init__.cpython-36: module references __path__
warprnnt_tensorflow.__pycache__.kernels.cpython-36: module references __file__
creating dist
creating 'dist/warprnnt_tensorflow-0.1-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing warprnnt_tensorflow-0.1-py3.6-linux-x86_64.egg
creating /usr/local/lib/python3.6/dist-packages/warprnnt_tensorflow-0.1-py3.6-linux-x86_64.egg
Extracting warprnnt_tensorflow-0.1-py3.6-linux-x86_64.egg to /usr/local/lib/python3.6/dist-packages
Adding warprnnt-tensorflow 0.1 to easy-install.pth file

Installed /usr/local/lib/python3.6/dist-packages/warprnnt_tensorflow-0.1-py3.6-linux-x86_64.egg
Processing dependencies for warprnnt-tensorflow==0.1
Finished processing dependencies for warprnnt-tensorflow==0.1
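
A side note on the C++11 ABI warning in the output above: you can inspect the flags your TensorFlow wheel was built with via the public tf.sysconfig API (pip wheels of that era normally report -D_GLIBCXX_USE_CXX11_ABI=0, which is why the setup.py default works):

```python
# Print TensorFlow's compile flags to see which C++ ABI the wheel uses.
try:
    import tensorflow as tf
    print(tf.sysconfig.get_compile_flags())
except ImportError:
    print("tensorflow is not installed in this environment")
```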
stefan-falk commented 4 years ago

@omerasif-itu Thank you for your help!

However, when I compile warp-transducer, I keep seeing:

-- Found CUDA: /usr/local/cuda-10.1 (found version "10.1") 
-- cuda found TRUE
-- Building shared library with no GPU support
                                ^^

Am I missing something? It tries to build the library without GPU support ..

stefan-falk commented 4 years ago

Okay, never mind .. I thought I had already done that, but I had to delete warp-transducer/build to make it work.
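
For anyone else hitting the "no GPU support" message: CMake caches its compiler and CUDA detection results inside the build directory, so an earlier configure without CUDA sticks around. A minimal reset, assuming the repository layout used above:

```shell
# Remove the stale CMake cache so CUDA is re-detected on the next configure.
rm -rf warp-transducer/build
mkdir -p warp-transducer/build
```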

stefan-falk commented 4 years ago

@omerasif-itu Alright, so the training seems to run now, but it does seem as if the model does not really learn anything.

This is what my test word-error-rate and training loss look like:

(image: test word-error-rate and training-loss curves)

Do I simply have to wait longer or is there still something not working? Have you been able to produce a working model, @omerasif-itu ?

omerasif-itu commented 4 years ago

I wanted to smoke-test the setup quickly to make sure everything was working before going for a complete training run.

Dataset: Arabic Common Voice

Command:

python3 run_rnnt.py \
    --mode train \
    --data_dir arcvp/ \
    --n_epochs 10 \
    --batch_size 16 \
    --eval_size 500  # this is strange :'D

Result:

Epoch: 9, Batch: 126, Global Step: 1269, Step Time: 3.8020, Loss: 39.3240
EPOCH RESULTS: Loss: 39.3240
Performing evaluation.
VALIDATION RESULTS: Time: 113.8872, Loss: 70.5400, Accuracy: 0.6138, WER: 1.0000
Saving checkpoint ./model/checkpoint_1270_70.5400.hdf5

Complete output: Pastebin

Looks like the same result as yours.

So next I tried loading the model with streaming_transcribe.py, which failed since I was on Google Colab. Then I tried transcribe_file.py, which ended in an error (#32):

all inputs to the Python function must be convertible to tensors

I couldn't solve the issue quickly, so I gave up and am currently experimenting with Mozilla's DeepSpeech.

I would suggest opening a new issue for this. Also, if you want to explore alternatives, here is one: awni/speech, in the PyTorch framework.

Regards and Good Luck :+1:

stefan-falk commented 4 years ago

@omerasif-itu I see, thank you. This looks similar to my output, and by now the word-error-rate has dropped a little, so I guess the training works as such.

But I noticed that restarting the training does not reuse the existing checkpoint but instead starts the training from scratch ..

omerasif-itu commented 4 years ago

But I noticed that restarting the training does not reuse the existing checkpoint but instead starts the training from scratch

I think you need to provide a path to the checkpoint directory.

In run_rnnt.py at line 38:

flags.DEFINE_string(
    'output_dir', './model',
    'Directory to save model.')
flags.DEFINE_string(
    'checkpoint', None,
    'Checkpoint to restore from.')

Given these flags, the command to load a checkpoint would be something like this:

python3 run_rnnt.py \
    --mode train \
    --data_dir data/ \
    --checkpoint ./model

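
For illustration only, the effect of that flag can be mimicked with a stdlib argparse sketch (a hypothetical stand-in; the repo actually wires this through absl's flags.DEFINE_string as quoted above):

```python
import argparse

# Hypothetical stand-in for the repo's absl flags, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument('--mode', default='train')
parser.add_argument('--data_dir', default='data/')
parser.add_argument('--checkpoint', default=None,
                    help='Checkpoint to restore from; when left unset, '
                         'training starts from scratch.')

args = parser.parse_args(['--mode', 'train', '--data_dir', 'data/',
                          '--checkpoint', './model'])
print(args.checkpoint)  # → ./model
```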
stefan-falk commented 4 years ago

Thank you, @omerasif-itu!