inits::normal() is broken for odd number of parameters

fiqas commented 3 years ago

Bug description

I'm trying to create a node initialized from a normal distribution, but it fails on both GPU and CPU.

[2021-06-03 13:53:11] Error: Curand error 105 - ./marian-pruned/src/tensors/rand.cpp:106: curandGenerateNormal(generator_, tensor->data(), tensor->size(), mean, stddev)
[2021-06-03 13:53:11] Error: Aborted from virtual void marian::CurandRandomGenerator::normal(marian::Tensor, float, float) in ./marian-pruned/src/tensors/rand.cpp:106

[CALL STACK]
[0xd9befe]          marian::CurandRandomGenerator::  normal  (IntrusivePtr<marian::TensorBase>,  float,  float) + 0x5de
[0xa12bae]
[0xa1d519]          marian::inits::LambdaInitConvert::  apply  (IntrusivePtr<marian::TensorBase>) + 0x7a9
[0xa0eb18]          marian::ConstantNode::  init  ()                   + 0x48
[0xa00405]          marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x95
[0xa021f4]          marian::ExpressionGraph::  forwardNext  ()         + 0x184
[0xbc920a]          marian::GraphGroup::  collectStats  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::models::ICriterionFunction>,  std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&,  double) + 0xe2a
[0xba51de]          marian::SyncGraphGroup::  collectStats  (std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&) + 0x13e
[0x82674f]          marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x37f
[0x74eb78]          mainTrainer  (int,  char**)                        + 0xc8
[0x70a94a]          main                                               + 0x8a
[0x7fffe6ee5840]    __libc_start_main                                  + 0xf0
[0x74c7b9]          _start                                             + 0x29

It fails at CURAND_CHECK:

102     void CurandRandomGenerator::normal(Tensor tensor, float mean, float stddev) {
103         matchOrAbort<float>(tensor->type());
104
105         tensor->getBackend()->setDevice();
106         CURAND_CHECK(curandGenerateNormal(generator_, tensor->data(), tensor->size(), mean, stddev));
107     }
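
For reference, the numeric status in the log can be decoded against the `curandStatus_t` enum in `curand.h`, where 105 is `CURAND_STATUS_LENGTH_NOT_MULTIPLE`. A minimal sketch of such a lookup follows; `curandStatusName` is a hypothetical helper, not part of Marian or cuRAND, and it lists only a subset of the codes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Marian or cuRAND): maps a numeric cuRAND
// status to its enum name. Values follow curandStatus_t in curand.h; only a
// subset is listed here.
std::string curandStatusName(int status) {
  switch(status) {
    case 0:   return "CURAND_STATUS_SUCCESS";
    case 101: return "CURAND_STATUS_NOT_INITIALIZED";
    case 102: return "CURAND_STATUS_ALLOCATION_FAILED";
    case 103: return "CURAND_STATUS_TYPE_ERROR";
    case 104: return "CURAND_STATUS_OUT_OF_RANGE";
    case 105: return "CURAND_STATUS_LENGTH_NOT_MULTIPLE";
    case 201: return "CURAND_STATUS_LAUNCH_FAILURE";
    default:  return "unknown status " + std::to_string(status);
  }
}
```

Calling `curandStatusName(105)` on the code from the log above yields the length-related status rather than a memory or launch failure.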

How to reproduce

I just did:

auto u = W->graph()->constant({1, 1}, inits::normal());

For comparison, inits::uniform() works fine. I'm working on my own branch, but I don't think my code is at fault; I'm just trying to use inits::normal().

Context

Marian version: v1.10.19; cda55c3 2021-06-01 16:33:16 +0000
CMake command: cmake .. -DCOMPILE_TESTS=ON -DUSE_SENTENCEPIECE=ON -DCMAKE_BUILD_TYPE=Release
--build-info all

AVX2_FOUND=true
AVX512_FOUND=false
AVX_FOUND=true
BUILD_ARCH=native
CMAKE_AR=/usr/bin/ar
CMAKE_BUILD_TYPE=Release
CMAKE_COLOR_MAKEFILE=ON
CMAKE_CXX_COMPILER=/usr/bin/c++
CMAKE_CXX_FLAGS=-std=c++11 -pthread -Wl,--no-as-needed -fPIC -Wno-unused-result  -march=native  -DUSE_SENTENCEPIECE -DCUDA_FOUND -DUSE_NCCL -DMKL_ILP64 -m64
CMAKE_CXX_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_CXX_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_CXX_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_COMPILER=/usr/bin/cc
CMAKE_C_FLAGS=-pthread -Wl,--no-as-needed -fPIC -Wno-unused-result  -march=native  -DMKL_ILP64 -m64
CMAKE_C_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_C_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_EXPORT_COMPILE_COMMANDS=OFF
CMAKE_INSTALL_BINDIR=bin
CMAKE_INSTALL_DATAROOTDIR=share
CMAKE_INSTALL_INCLUDEDIR=include
CMAKE_INSTALL_LIBDIR=lib
CMAKE_INSTALL_LIBEXECDIR=libexec
CMAKE_INSTALL_LOCALSTATEDIR=var
CMAKE_INSTALL_OLDINCLUDEDIR=/usr/include
CMAKE_INSTALL_PREFIX=/usr/local
CMAKE_INSTALL_SBINDIR=sbin
CMAKE_INSTALL_SHAREDSTATEDIR=com
CMAKE_INSTALL_SYSCONFDIR=etc
CMAKE_LINKER=/usr/bin/ld
CMAKE_MAKE_PROGRAM=/usr/bin/make
CMAKE_NM=/usr/bin/nm
CMAKE_OBJCOPY=/usr/bin/objcopy
CMAKE_OBJDUMP=/usr/bin/objdump
CMAKE_RANLIB=/usr/bin/ranlib
CMAKE_SKIP_INSTALL_RPATH=NO
CMAKE_SKIP_RPATH=NO
CMAKE_STRIP=/usr/bin/strip
CMAKE_VERBOSE_MAKEFILE=FALSE
COMPILE_AVX=ON
COMPILE_AVX2=ON
COMPILE_AVX512=ON
COMPILE_CPU=ON
COMPILE_CUDA=ON
COMPILE_EXAMPLES=OFF
COMPILE_KEPLER=OFF
COMPILE_LIBRARY_ONLY=OFF
COMPILE_MAXWELL=OFF
COMPILE_PASCAL=ON
COMPILE_SERVER=OFF
COMPILE_SSE2=ON
COMPILE_SSE3=ON
COMPILE_SSE4_1=ON
COMPILE_SSE4_2=ON
COMPILE_TESTS=ON
COMPILE_TURING=ON
COMPILE_VOLTA=ON
CUDA_64_BIT_DEVICE_CODE=ON
CUDA_ATTACH_VS_BUILD_RULE_TO_CUDA_FILE=ON
CUDA_BUILD_CUBIN=OFF
CUDA_BUILD_EMULATION=OFF
CUDA_CUDART_LIBRARY=/usr/local/cuda-10.2/lib64/libcudart.so
CUDA_CUDA_LIBRARY=/usr/lib/x86_64-linux-gnu/libcuda.so
CUDA_HOST_COMPILATION_CPP=ON
CUDA_HOST_COMPILER=/usr/bin/cc
CUDA_NVCC_EXECUTABLE=/usr/local/cuda-10.2/bin/nvcc
CUDA_NVCC_FLAGS=-DUSE_SENTENCEPIECE-DCUDA_FOUND-DUSE_NCCL--default-streamper-thread-O3-g--use_fast_math-gencode=arch=compute_60,code=sm_60-gencode=arch=compute_61,code=sm_61-arch=sm_70-gencode=arch=compute_70,code=sm_70-gencode=arch=compute_70,code=compute_70-gencode=arch=compute_75,code=sm_75-gencode=arch=compute_75,code=compute_75-ccbin/usr/bin/cc-std=c++11-Xcompiler -fPIC-Xcompiler -Wno-unused-result-Xcompiler -Wno-deprecated-Xcompiler -Wno-pragmas-Xcompiler -Wno-unused-value-Xcompiler -Werror
CUDA_PROPAGATE_HOST_FLAGS=OFF
CUDA_SDK_ROOT_DIR=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_SEPARABLE_COMPILATION=OFF
CUDA_TOOLKIT_INCLUDE=/usr/local/cuda-10.2/include
CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.2
CUDA_TOOLKIT_TARGET_DIR=/usr/local/cuda-10.2
CUDA_USE_STATIC_CUDA_RUNTIME=ON
CUDA_VERBOSE_BUILD=OFF
CUDA_VERSION=10.2
CUDA_cublas_LIBRARY=/fs/zisa0/mbehnke/anaconda3/envs/shrink/lib/libcublas.so
CUDA_cudart_static_LIBRARY=/usr/local/cuda-10.2/lib64/libcudart_static.a
CUDA_cufft_LIBRARY=/usr/local/cuda-10.2/lib64/libcufft.so
CUDA_cupti_LIBRARY=CUDA_cupti_LIBRARY-NOTFOUND
CUDA_curand_LIBRARY=/usr/local/cuda-10.2/lib64/libcurand.so
CUDA_cusolver_LIBRARY=/usr/local/cuda-10.2/lib64/libcusolver.so
CUDA_cusparse_LIBRARY=/usr/local/cuda-10.2/lib64/libcusparse.so
CUDA_nppc_LIBRARY=/usr/local/cuda-10.2/lib64/libnppc.so
CUDA_nppi_LIBRARY=CUDA_nppi_LIBRARY-NOTFOUND
CUDA_npps_LIBRARY=/usr/local/cuda-10.2/lib64/libnpps.so
CUDA_rt_LIBRARY=/usr/lib/x86_64-linux-gnu/librt.so
DOXYGEN_DOT_EXECUTABLE=/usr/bin/dot
DOXYGEN_EXECUTABLE=/usr/bin/doxygen
GENERATE_MARIAN_INSTALL_TARGETS=OFF
GIT_EXECUTABLE=/usr/bin/git
INTEL_ROOT=/opt/intel
INTGEMM_DONT_BUILD_TESTS=ON
MKL_CORE_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_core.a
MKL_INCLUDE_DIR=/opt/intel/mkl/include
MKL_INCLUDE_DIRS=/opt/intel/mkl/include
MKL_INTERFACE_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a
MKL_LIBRARIES=-Wl,--start-group/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a/opt/intel/mkl/lib/intel64/libmkl_sequential.a/opt/intel/mkl/lib/intel64/libmkl_core.a-Wl,--end-group
MKL_ROOT=/opt/intel/mkl
MKL_SEQUENTIAL_LAYER_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_sequential.a
SPM_BUILD_TEST=OFF
SPM_COVERAGE=OFF
SPM_ENABLE_NFKC_COMPILE=OFF
SPM_ENABLE_SHARED=OFF
SPM_ENABLE_TCMALLOC=ON
SPM_ENABLE_TENSORFLOW_SHARED=OFF
SPM_NO_THREADLOCAL=OFF
SPM_TCMALLOC_STATIC=OFF
SPM_USE_BUILTIN_PROTOBUF=ON
SQLITE_ENABLE_ASSERT_HANDLER=OFF
SQLITE_ENABLE_COLUMN_METADATA=ON
SQLITE_USE_LEGACY_STRUCT=OFF
SSE2_FOUND=true
SSE3_FOUND=true
SSE4_1_FOUND=true
SSE4_2_FOUND=true
SSSE3_FOUND=true
TCMALLOC_LIB=/usr/lib/libtcmalloc_minimal.so
Tcmalloc_INCLUDE_DIR=/usr/include
Tcmalloc_LIBRARY=/usr/lib/libtcmalloc_minimal.so
USE_APPLE_ACCELERATE=OFF
USE_CCACHE=OFF
USE_CUDNN=OFF
USE_DOXYGEN=ON
USE_FBGEMM=OFF
USE_MKL=ON
USE_MPI=OFF
USE_NCCL=ON
USE_OPENMP=OFF
USE_SENTENCEPIECE=ON
USE_STATIC_LIBS=OFF

Log file: I can add if necessary.

emjotde commented 3 years ago

It might still be your code. Curand and CUDA errors in general tend to occur after other code has invalidated memory. If there is a chance that you are accessing GPU memory in a bad way in your own code, this might just be a symptom of that.

emjotde commented 3 years ago

Can you check the same thing in master, maybe? To exclude your code as a source.

emjotde commented 3 years ago

Ah, you said it also fails on the CPU. That's more suspicious. Is the error message the same?

graemenail commented 3 years ago

cuRAND wants to generate normal samples in multiples of 2, and a {1, 1} tensor has a single element. We also use cuRAND on the CPU when compiled with CUDA on. On CPU-only builds this works because the STL random generator is used instead, which doesn't require an even count.
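
One way to work around the even-length requirement is to round the request up to the next even count, generate into a scratch buffer, and copy back only the values that are needed. The sketch below is not Marian's fix. `strictNormal` is a hypothetical stand-in that mimics the cuRAND behaviour described above (rejecting odd lengths) so that the example runs without CUDA, and `paddedNormal` is an assumed wrapper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for curandGenerateNormal: like cuRAND's pseudorandom
// generators, it refuses odd lengths. Returns false instead of a status code.
bool strictNormal(std::mt19937& rng, float* out, std::size_t n, float mean, float stddev) {
  if(n % 2 != 0)
    return false;  // cuRAND would report CURAND_STATUS_LENGTH_NOT_MULTIPLE (105)
  std::normal_distribution<float> dist(mean, stddev);
  for(std::size_t i = 0; i < n; ++i)
    out[i] = dist(rng);
  return true;
}

// Assumed wrapper: for odd n, generates n + 1 values into a scratch buffer
// and copies only the first n to the caller; even n is passed straight through.
bool paddedNormal(std::mt19937& rng, float* out, std::size_t n, float mean, float stddev) {
  assert(out != nullptr);
  if(n % 2 == 0)
    return strictNormal(rng, out, n, mean, stddev);
  std::vector<float> scratch(n + 1);
  if(!strictNormal(rng, scratch.data(), scratch.size(), mean, stddev))
    return false;
  std::copy(scratch.begin(), scratch.begin() + static_cast<std::ptrdiff_t>(n), out);
  return true;
}
```

With this wrapper, a single-element request such as the {1, 1} constant succeeds where the strict generator fails. On the GPU the scratch buffer would have to live in device memory, so an actual fix may prefer a different strategy, for example generating the last element separately.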

emjotde commented 3 years ago

OK, thanks. That's annoying. I will take a look at what I can do.

marian-nmt / marian-dev