Partial specialization failure on Intel compiler

maedoc commented 6 years ago

Summary:

Compiling and linking a model with threading enabled using the Intel compiler produces a partial specialization error.

Description:

I am attempting to test scaling of threaded map_rect on a Xeon Phi system, with the following Stan model,

functions {
    vector lr(vector mu_s, vector j, real[] r, int[] i) {
        real mu = mu_s[1];
        real s = mu_s[2];
        return [normal_lpdf(r | mu, s)]';
    }
}
data { int n; int m; }
transformed data {
    int nm = n * m;
    real x[m, n];
    for (i in 1:m)
        for (j in 1:n)
            x[i, j] = normal_rng(0.0, 1.0);
}
parameters { vector[2] mu_s; vector[0] j[m]; }
model {
    int i[m, 0];
    target += sum(map_rect(lr, mu_s, j, x, i));
}

I confirmed locally that multiple CPUs are used, then tried to get it working on an HPC Xeon Phi node, first with GCC 5.5.0, which compiles fine ~but uses only one CPU. I assumed GCC doesn't want to thread for the Xeon Phi, so~ (top doesn't report CPU usage correctly on Xeon Phi). I then tried the Intel compilers (icpc version 18.0.2, GCC 5.5.0 compatible), but encountered

stan/lib/stan_math/stan/math/rev/scal/meta/operands_and_partials.hpp(63): error: more than one partial specialization matches the template argument list of class "stan::math::internal::ops_partials_edge<double, Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>>"
            "stan::math::internal::ops_partials_edge<double, Eigen::Matrix<stan::math::var, R, C, <expression>, R, C>>"
            "stan::math::internal::ops_partials_edge<ViewElt, Eigen::Matrix<Op, R, C, <expression>, R, C>>"
    internal::ops_partials_edge<double, Op1> edge1_;

full output below.
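
For context, here is a minimal sketch of the pattern the diagnostic describes. It assumes simplified stand-ins: `var`, `Matrix`, `edge`, and `selected_specialization` are hypothetical names, not the Stan Math or Eigen definitions. Two partial specializations both match `edge<double, Matrix<var, -1, 1>>`, but the second is more specialized, so the standard's partial ordering rules should select it without ambiguity. I have not verified that this reduced form triggers the same error on icpc.

```cpp
// Hypothetical stand-ins; none of these are the Stan Math or Eigen definitions.
struct var {};

// Like Eigen::Matrix, the trailing parameters default to expressions of R and C.
template <typename T, int R, int C,
          int Options = ((R == 1 && C != 1) ? 1 : 0),
          int MaxR = R, int MaxC = C>
struct Matrix {};

// Primary template: used when the operand is not a Matrix.
template <typename ViewElt, typename Op>
struct edge {
  static constexpr int which = 0;
};

// General matrix case: any view element type, any matrix element type.
template <typename ViewElt, typename Op, int R, int C>
struct edge<ViewElt, Matrix<Op, R, C>> {
  static constexpr int which = 1;
};

// More specialized case: double view over a matrix of var.
template <int R, int C>
struct edge<double, Matrix<var, R, C>> {
  static constexpr int which = 2;
};

// Both specializations match this argument list; partial ordering
// should pick the more specialized one.
constexpr int selected_specialization() {
  return edge<double, Matrix<var, -1, 1>>::which;
}

static_assert(selected_specialization() == 2,
              "expected the more specialized partial specialization");
```

If a compiler rejects the `static_assert` line with an ambiguity error rather than a failed assertion, that would point at its partial ordering rather than at the template definitions.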

Reproducible Steps:

~/ $ git clone --recursive https://github.com/stan-dev/cmdstan && cd cmdstan
~/cmdstan $ cat <<EOF > make/local
CXX := icpc
CC := icpc
CFLAGS += -std=c11 -O3 -xMIC-AVX512 -fma -align -finline-functions -DEIGEN_ENABLE_AVX512
CXXFLAGS += -DSTAN_THREADS
EOF
~/cmdstan $ make -B -j24 build examples/bernoulli/Bernoulli
...
~/cmdstan $ make examples/map_rect/mrnorm.stan # model source in description above

Current Output:

--- Linking C++ model ---
icpc -Wall -I . -isystem stan/lib/stan_math/lib/eigen_3.3.3 -isystem stan/lib/stan_math/lib/boost_1.66.0 -isystem stan/lib/stan_math/lib/sundials_3.1.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -I src -isystem stan/src -isystem stan/lib/stan_math/ -DFUSION_MAX_VECTOR_SIZE=12 -Wno-unused-local-typedefs -DEIGEN_NO_DEBUG -DSTAN_THREADS -DNO_FPRINTF_OUTPUT -pipe    -O3 -o examples/map_rect/mrnorm src/cmdstan/main.cpp -include examples/map_rect/mrnorm.hpp stan/lib/stan_math/lib/sundials_3.1.0/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_3.1.0/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_3.1.0/lib/libsundials_idas.a
icpc: command line warning #10006: ignoring unknown option '-Wno-unused-local-typedefs'
stan/lib/stan_math/stan/math/rev/scal/meta/operands_and_partials.hpp(63): error: more than one partial specialization matches the template argument list of class "stan::math::internal::ops_partials_edge<double, Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>>"
            "stan::math::internal::ops_partials_edge<double, Eigen::Matrix<stan::math::var, R, C, <expression>, R, C>>"
            "stan::math::internal::ops_partials_edge<ViewElt, Eigen::Matrix<Op, R, C, <expression>, R, C>>"
    internal::ops_partials_edge<double, Op1> edge1_;
                                             ^
          detected during:
            instantiation of class "stan::math::operands_and_partials<Op1, Op2, Op3, Op4, Op5, stan::math::var> [with Op1=Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, Op2=Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, Op3=double, Op4=double, Op5=double]" at line 161 of "/usr/local/software/jurecabooster/Stages/2018a/software/GCCcore/5.5.0/include/c++/5.5.0/bits/stl_vector.h"
            instantiation of "std::_Vector_base<_Tp, _Alloc>::~_Vector_base() [with _Tp=stan::math::operands_and_partials<Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, double, double, double, stan::math::var>, _Alloc=std::allocator<stan::math::operands_and_partials<Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, double, double, double, stan::math::var>>]" at line 257 of
                      "/usr/local/software/jurecabooster/Stages/2018a/software/GCCcore/5.5.0/include/c++/5.5.0/bits/stl_vector.h"
            instantiation of "std::vector<_Tp, _Alloc>::vector() [with _Tp=stan::math::operands_and_partials<Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, double, double, double, stan::math::var>, _Alloc=std::allocator<stan::math::operands_and_partials<Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, Eigen::Matrix<stan::math::var, -1, 1, 0, -1, 1>, double, double, double, stan::math::var>>]" at line 60 of
                      "stan/lib/stan_math/stan/math/prim/mat/functor/map_rect_combine.hpp"
            instantiation of "stan::math::internal::map_rect_combine<F, T_shared_param, T_job_param>::map_rect_combine(const Eigen::Matrix<T_shared_param, -1, 1, 0, -1, 1> &, const std::vector<Eigen::Matrix<T_job_param, -1, 1, 0, -1, 1>, std::allocator<Eigen::Matrix<T_job_param, -1, 1, 0, -1, 1>>> &) [with F=mrnorm_model_namespace::lr_functor__, T_shared_param=stan::math::var, T_job_param=stan::math::var]" at line 127 of "stan/lib/stan_math/stan/math/prim/mat/functor/map_rect_concurrent.hpp"
            instantiation of "Eigen::Matrix<stan::return_type<T_shared_param, T_job_param, double, double, double, double>::type, -1, 1, 0, -1, 1> stan::math::internal::map_rect_concurrent<call_id,F,T_shared_param,T_job_param>(const Eigen::Matrix<T_shared_param, -1, 1, 0, -1, 1> &, const std::vector<Eigen::Matrix<T_job_param, -1, 1, 0, -1, 1>, std::allocator<Eigen::Matrix<T_job_param, -1, 1, 0, -1, 1>>> &, const std::vector<stan::math::internal::ops_partials_edge<double,
                      std::vector<std::vector<stan::math::var, std::allocator<stan::math::var>>, std::allocator<std::vector<stan::math::var, std::allocator<stan::math::var>>>>>::partial_t, std::allocator<stan::math::internal::ops_partials_edge<double, std::vector<std::vector<stan::math::var, std::allocator<stan::math::var>>, std::allocator<std::vector<stan::math::var, std::allocator<stan::math::var>>>>>::partial_t>> &, const std::vector<std::vector<int, std::allocator<int>>,
                      std::allocator<std::vector<int, std::allocator<int>>>> &, std::ostream *) [with call_id=1, F=mrnorm_model_namespace::lr_functor__, T_shared_param=stan::math::var, T_job_param=stan::math::var]" at line 176 of "stan/lib/stan_math/stan/math/prim/mat/functor/map_rect.hpp"
            instantiation of "Eigen::Matrix<stan::return_type<T_shared_param, T_job_param, double, double, double, double>::type, -1, 1, 0, -1, 1> stan::math::map_rect<call_id,F,T_shared_param,T_job_param>(const Eigen::Matrix<T_shared_param, -1, 1, 0, -1, 1> &, const std::vector<Eigen::Matrix<T_job_param, -1, 1, 0, -1, 1>, std::allocator<Eigen::Matrix<T_job_param, -1, 1, 0, -1, 1>>> &, const std::vector<stan::math::internal::ops_partials_edge<double, std::vector<std::vector<stan::math::var,
                      std::allocator<stan::math::var>>, std::allocator<std::vector<stan::math::var, std::allocator<stan::math::var>>>>>::partial_t, std::allocator<stan::math::internal::ops_partials_edge<double, std::vector<std::vector<stan::math::var, std::allocator<stan::math::var>>, std::allocator<std::vector<stan::math::var, std::allocator<stan::math::var>>>>>::partial_t>> &, const std::vector<std::vector<int, std::allocator<int>>, std::allocator<std::vector<int, std::allocator<int>>>> &,
                      std::ostream *) [with call_id=1, F=mrnorm_model_namespace::lr_functor__, T_shared_param=stan::math::var, T_job_param=stan::math::var]" at line 294 of "examples/map_rect/mrnorm.hpp"
            instantiation of "T__ mrnorm_model_namespace::mrnorm_model::log_prob<propto__,jacobian__,T__>(std::vector<T__, std::allocator<T__>> &, std::vector<int, std::allocator<int>> &, std::ostream *) const [with propto__=true, jacobian__=true, T__=stan::math::var]" at line 45 of "stan/src/stan/model/log_prob_grad.hpp"
            instantiation of "double stan::model::log_prob_grad<propto,jacobian_adjust_transform,M>(const M &, std::vector<double, std::allocator<double>> &, std::vector<int, std::allocator<int>> &, std::vector<double, std::allocator<double>> &, std::ostream *) [with propto=true, jacobian_adjust_transform=true, M=stan_model]" at line 149 of "stan/src/stan/services/util/initialize.hpp"
            instantiation of "std::vector<double, std::allocator<double>> stan::services::util::initialize(Model &, stan::io::var_context &, RNG &, double, bool, stan::callbacks::logger &, stan::callbacks::writer &) [with Model=stan_model, RNG=boost::random::ecuyer1988]" at line 56 of "stan/src/stan/services/diagnose/diagnose.hpp"
            instantiation of "int stan::services::diagnose::diagnose(Model &, stan::io::var_context &, unsigned int, unsigned int, double, double, double, stan::callbacks::interrupt &, stan::callbacks::logger &, stan::callbacks::writer &, stan::callbacks::writer &) [with Model=stan_model]" at line 143 of "src/cmdstan/command.hpp"
            instantiation of "int cmdstan::command<Model>(int, const char **) [with Model=stan_model]" at line 8 of "src/cmdstan/main.cpp"

compilation aborted for src/cmdstan/main.cpp (code 2)
make: *** [examples/map_rect/mrnorm] Error 2

Expected Output:

Successful compilation

Additional Information:

I can't provide access to the system, but I can do some debugging if anyone has ideas.

Current Version:

develop (834df71b2c8f)

maedoc commented 6 years ago

I forgot to note: the Bernoulli model builds w/o problem.

bob-carpenter commented 6 years ago

@wds15 may be able to track this down, as he wrote the threading and sometimes uses Intel compilers.

Given their proprietary nature, not to mention what they do to arithmetic precision in the "fast" settings, we don't officially support them as part of the Stan project. We're happy to get contributions that don't break anything else that help support Intel compilers, but it's not a priority for us to fix.
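
To illustrate the precision concern in general terms, here is a small sketch assuming the issue is value-changing optimizations such as reassociating floating-point sums; it does not reproduce any specific Intel compiler behavior. `sum_left` and `sum_right` are hypothetical helper names. Under strict IEEE 754 double arithmetic the two groupings give different answers for the same three terms, so a compiler mode that is allowed to regroup them can change results.

```cpp
// Hypothetical helpers for illustration only.

// Sum three doubles grouped as (a + b) + c.
double sum_left(double a, double b, double c) { return (a + b) + c; }

// Sum the same three doubles grouped as a + (b + c).
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With `a = 1e30`, `b = -1e30`, and `c = 1.0`, the first grouping cancels the large terms before adding `c`, while the second absorbs `c` into `-1e30` and loses it.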

bob-carpenter commented 6 years ago

By the way, this wasn't meant to discourage you from posting issues about Intel compilers. All else being equal, we're happy to support them. And we'd rather hear about issues than not. So thanks for submitting the issue.

maedoc commented 6 years ago

> Given their proprietary nature, not to mention what they do to arithmetic precision in the "fast" settings

I haven't been able to reproduce numerical issues with the 2018 versions. In any case, I wouldn't be using them if GCC or Clang effectively vectorized Eigen code for AVX512, an Intel-only instruction set. However, this error only appears for threaded map_rect code, and the CPU in question is 68C/272T, so using GCC with a threaded version of the code (vs Intel single threaded) is a clear win in this case.

wds15 commented 6 years ago

I can't promise to follow this up. Honestly, I had a lot of headaches with the Intel compilers, which do not seem to care much about numerical accuracy and are not good at following the C++ standard. This is why we moved to using gcc (right now version 6.1) with the Intel MKL. This way you get most of the Intel speed bump (MKL), but ensure that you have a compiler which is up to Stan's C++.

boegel commented 5 years ago

I ran into the same issue using StanHeaders 2.19.0 while compiling thurstonianIRT with Intel compilers.

Is there any hope at all to get this resolved, or is using StanHeaders with Intel compilers basically a lost cause?

wds15 commented 5 years ago

It's not totally lost, but I am probably the only developer with access to Intel compilers... and I am really not blessed with free time...

Why not just use g++?

(we should still keep track of this, don't get me wrong)

maedoc commented 5 years ago

Looks like a bug in ICC. We gave up on ICC since GCC did a good job and time was better spent getting the model to behave better than getting the auto vectorizer to work.

boegel commented 5 years ago

> Why not just use g++?

@wds15 We install R with a large collection of R libraries from CRAN with both GCC and Intel compilers, since the Intel compilers often produce better performing binaries.

So far, that hasn't been an issue with any of the R libraries we include (other than trivial compilation errors with Intel compilers that are easy to fix with a trivial patch).

boegel commented 5 years ago

@maedoc That sure looks very similar, but while I can reproduce the issue reported for the example given with Intel C++ compiler version 16.x, I cannot reproduce it with more recent Intel C++ compiler versions (17.0.1 works fine, so does the 19.0.1 I'm using now).

boegel commented 5 years ago

Some more info: it seems this issue was already reported to Intel a while ago, also in the context of Stan; see https://software.intel.com/en-us/forums/intel-c-compiler/topic/781749
