MKL provider : OSX vs Windows performance

lionpeloux commented 8 years ago

Hello,

I've started a simple benchmark. I need to do a lot of vector computations for a spring-mass system. For a standard system, my algorithm will run about 1e5 iterations, moving about 1e4 points (x,y,z) at each iteration.

I'm investigating how the MKL provider could speed up my computations. I'm also investigating 2 different layouts for my data structure:

Array of Structures (AoS)
Structure of Arrays (SoA)

based on A Guide to Vectorization with Intel® C++ Compilers :

The most common and likely well known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index. This data can be organized as an Array Of Structures (AOS) or a Structure Of Arrays (SOA). While AOS organization is excellent for encapsulation it can be poor for use of vector processing. Selecting appropriate data structures can also make vectorization of the resulting code more effective.

Using the code sample given below, I've noticed that performance with the MKL provider is very different between Mono on Mac and Windows.

The bench does pointwise multiplication of double vectors of size N = 10000.

For the AoS layout, the data is organized in a jagged array double[1000][10].
For the SoA layout, the data is organized in a single array (or Math.NET vector) double[10000].
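
To make the two layouts concrete, here is a minimal sketch (not the benchmark itself) that builds the jagged arrays and copies them into one flat array using the index mapping n * i + j. The names LayoutSketch and Flatten are made up for illustration.

```csharp
using System;

static class LayoutSketch
{
    // Copies a jagged array (blocks of n values each) into one contiguous array.
    // Element (i, j) of the jagged array lands at flat index n * i + j.
    public static double[] Flatten(double[][] blocks, int n)
    {
        var flat = new double[blocks.Length * n];
        for (int i = 0; i < blocks.Length; i++)
        {
            for (int j = 0; j < n; j++)
            {
                flat[n * i + j] = blocks[i][j];
            }
        }
        return flat;
    }

    static void Main()
    {
        int n = 10, ne = 1000;              // same sizes as in the bench
        var blocks = new double[ne][];      // "AoS": 1000 separate small arrays
        for (int i = 0; i < ne; i++)
        {
            blocks[i] = new double[n];
            for (int j = 0; j < n; j++) blocks[i][j] = i + 0.01 * j;
        }

        double[] flat = Flatten(blocks, n); // "SoA": one array of 10000 doubles
        Console.WriteLine("flat length = " + flat.Length);
        Console.WriteLine("same value  = " + (flat[n * 3 + 7] == blocks[3][7]));
    }
}
```

With the jagged layout, a vectorized routine is called once per block of 10 values; with the flat layout, it is called once for all 10000 values.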

Here are the results. CPU timing is given in ns per elementary operation (= total CPU time in s x 1e9 / (loop x N), where loop = 10000 is the number of repetitions used in the code).
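
For reference, the conversion used by the benchmark code is elapsed ms x 1e6 / (loop x N). The sketch below wraps it in a small helper. The names TimingSketch, NsPerElop and Measure are hypothetical, and it reads Stopwatch.Elapsed.TotalMilliseconds (a double) instead of ElapsedMilliseconds (whole milliseconds only), which is a small change from the posted code.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Converts a total elapsed time into nanoseconds per elementary operation.
    public static double NsPerElop(double elapsedMs, int loop, int n)
    {
        return elapsedMs * 1e6 / ((double)loop * n);
    }

    // Runs body `loop` times and reports ns per element for vectors of size n.
    public static double Measure(Action body, int loop, int n)
    {
        body(); // one untimed warm-up call (JIT compilation, library loading)
        var w = Stopwatch.StartNew();
        for (int k = 0; k < loop; k++) body();
        w.Stop();
        return NsPerElop(w.Elapsed.TotalMilliseconds, loop, n);
    }

    static void Main()
    {
        int n = 10000, loop = 10000;
        var x = new double[n];
        var y = new double[n];
        for (int i = 0; i < n; i++) x[i] = i;

        double ns = Measure(() =>
        {
            for (int i = 0; i < n; i++) y[i] = x[i] * x[i];
        }, loop, n);
        Console.WriteLine("naive loop = " + ns + " ns/elop");
    }
}
```

As a sanity check on the units: with loop = 10000 and N = 10000, a total run time of 1000 ms corresponds to 10 ns/elop.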

Results for MAC MONO under Yosemite (macbook core i7)
AOS (naive loop) = 6,46 ns/elop
AOS (mathnet managed) = 24,31 ns/elop
AOS (mathnet mkl) = 32,32 ns/elop
SOA (naive loop) = 4,9 ns/elop
SOA (mathnet managed) = 4,26 ns/elop
SOA (mathnet mkl) = 21,36 ns/elop

Results for Windows 7 via VMWARE (macbook core i7)
AOS (naive loop) = 7,19 ns/elop
AOS (mathnet managed) = 19,2 ns/elop
AOS (mathnet mkl) = 14,49 ns/elop
SOA (naive loop) = 4,12 ns/elop
SOA (mathnet managed) = 4,92 ns/elop
SOA (mathnet mkl) = 0,52 ns/elop

Do you know why there's such a huge difference (about x40) between Mono on Mac and Windows for a basic pointwise multiplication?

Is this inherent to P/Invoke with Mono on Mac? Or is it caused by how I built the MKL provider on OSX?

Thanks, Lionel

Here's my code for the pointwise multiplication of 2 vectors:

using System;
using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics;
using System.Diagnostics;

namespace TestConsoleNummerics
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            //test_Matrix(1000);
            int N = 10000;
            //test_Vmul(N);
            test_Vmul_AOSvsOAS(N);
            Console.Read();
        }

        static void test_Vmul_AOSvsOAS(int N)
        {
            // Problem definition
            int loop = 10000;
            var w = Stopwatch.StartNew();
            Random rnd = new Random();

            int n = 10;
            int ne = N / n;

            // AOS : naive loop
            var aos_x = new double[ne][];
            var aos_y = new double[ne][];
            for (int i = 0; i < aos_x.Length; i++)
            {
                aos_x[i] = new double[n];
                aos_y[i] = new double[n];
                for (int j = 0; j < aos_x[i].Length; j++)
                {
                    aos_x[i][j] = rnd.NextDouble();
                }
            }

            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < aos_x.Length; i++)
                {
                    for (int j = 0; j < aos_x[i].Length; j++)
                    {
                        aos_y[i][j] = aos_x[i][j]*aos_x[i][j];
                    }
                }
            }
            Console.WriteLine("AOS (naive loop) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            // AOS : Math.NET managed & MKL
            var aos_mathnet_x = new Vector<double>[ne];
            var aos_mathnet_y = new Vector<double>[ne];

            for (int i = 0; i < aos_x.Length; i++)
            {
                aos_mathnet_x[i] = Vector<double>.Build.Dense(n);
                aos_mathnet_y[i] = Vector<double>.Build.Dense(n);
                for (int j = 0; j < aos_x[i].Length; j++)
                {
                    aos_mathnet_x[i][j] = aos_x[i][j];
                }
            }

            Control.UseManaged();
            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < aos_x.Length; i++)
                {
                    aos_mathnet_x[i].PointwiseMultiply(aos_mathnet_x[i], aos_mathnet_y[i]);
                }
            }
            Console.WriteLine("AOS (mathnet managed) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            Control.UseNativeMKL();
            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < aos_x.Length; i++)
                {
                    aos_mathnet_x[i].PointwiseMultiply(aos_mathnet_x[i], aos_mathnet_y[i]);
                }
            }
            Console.WriteLine("AOS (mathnet mkl) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            // SOA : naive loop
            var soa_x = new double[N];
            var soa_y = new double[N];
            for (int i = 0; i < aos_x.Length; i++)
            {
                for (int j = 0; j < aos_x[i].Length; j++)
                {
                    soa_x[n * i + j] = aos_x[i][j];
                }
            }

            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < soa_x.Length; i++)
                {
                    soa_y[i] = soa_x[i] * soa_x[i];
                }
            }
            Console.WriteLine("SOA (naive loop) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            // SOA : Math.NET managed & MKL
            var soa_mathnet_x = Vector<double>.Build.Random(N);
            var soa_mathnet_y = Vector<double>.Build.Dense(N);

            for (int i = 0; i < soa_x.Length; i++)
            {
                soa_mathnet_x[i] = soa_x[i];
            }

            Control.UseManaged();
            w.Restart();
            for (int i = 0; i < loop; i++)
            {
                soa_mathnet_x.PointwiseMultiply(soa_mathnet_x, soa_mathnet_y);
            }
            Console.WriteLine("SOA (mathnet managed) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            Control.UseNativeMKL();
            w.Restart();
            for (int i = 0; i < loop; i++)
            {
                soa_mathnet_x.PointwiseMultiply(soa_mathnet_x, soa_mathnet_y);
            }
            Console.WriteLine("SOA (mathnet mkl) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

        }

    }
}

kjbartel commented 8 years ago

Perhaps multi-threading isn't being used for MKL on Mac? Might want to check how many threads are being set and also make sure the MKL wrapper is linking to the correct version of MKL when built on Mac.
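
One quick thing to look at on both machines is the environment the process runs in. The sketch below only prints the logical processor count and the two environment variables that MKL reads for its thread count (MKL_NUM_THREADS and OMP_NUM_THREADS). It does not query MKL or the Math.NET provider itself, and the class and method names are made up.

```csharp
using System;

static class ThreadSettingsSketch
{
    // Returns "NAME = value", or "NAME = (not set)" when the variable is absent.
    public static string Describe(string name)
    {
        string value = Environment.GetEnvironmentVariable(name);
        return name + " = " + (string.IsNullOrEmpty(value) ? "(not set)" : value);
    }

    static void Main()
    {
        Console.WriteLine("Logical processors = " + Environment.ProcessorCount);
        Console.WriteLine(Describe("MKL_NUM_THREADS"));
        Console.WriteLine(Describe("OMP_NUM_THREADS"));
    }
}
```

If one of these variables is set to 1 on the Mac but not on Windows, that would be a first candidate explanation to rule out.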

Only thing I can think of.

kjbartel commented 8 years ago

If it is a P/Invoke problem then you could try putting your loop into C/C++ and calling the MKL wrapper from there. Then there'd only be a single P/Invoke call.
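
A rough way to judge how much a fixed per-call cost matters, using only the "mathnet mkl" figures posted above, is to fit a two-parameter model t(n) = perCall + n x perElem through the AOS measurement (calls of 10 elements) and the SOA measurement (calls of 10000 elements). This is a crude two-point fit under an assumed linear model, and perCall lumps together P/Invoke and any managed-side overhead. The names CallCostSketch and Fit are hypothetical.

```csharp
using System;

static class CallCostSketch
{
    // Fits t(n) = perCall + n * perElem through two measurements, each given
    // as ns per element for calls of nSmall and nLarge elements.
    public static void Fit(double nsPerElopSmall, int nSmall,
                           double nsPerElopLarge, int nLarge,
                           out double perCall, out double perElem)
    {
        double tSmall = nsPerElopSmall * nSmall; // ns for one small call
        double tLarge = nsPerElopLarge * nLarge; // ns for one large call
        perElem = (tLarge - tSmall) / (nLarge - nSmall);
        perCall = tSmall - nSmall * perElem;
    }

    static void Main()
    {
        double perCall, perElem;

        // Windows: AOS (mathnet mkl) = 14,49 and SOA (mathnet mkl) = 0,52 ns/elop
        Fit(14.49, 10, 0.52, 10000, out perCall, out perElem);
        Console.WriteLine("Windows: per call = " + perCall + " ns, per element = " + perElem + " ns");

        // OSX: AOS (mathnet mkl) = 32,32 and SOA (mathnet mkl) = 21,36 ns/elop
        Fit(32.32, 10, 21.36, 10000, out perCall, out perElem);
        Console.WriteLine("OSX:     per call = " + perCall + " ns, per element = " + perElem + " ns");
    }
}
```

With the posted numbers this gives a fixed cost of roughly 140 ns per call on Windows and roughly 110 ns on OSX, but a per-element cost of about 0.5 ns versus about 21 ns. Under this model the gap sits mostly in the per-element cost rather than in the per-call overhead, so treat it as a hint about where to look, not as a diagnosis.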

mathnet / mathnet-numerics

MKL provider : OSX vs Windows performance #352