nsf / pnoise

Perlin noise benchmark
77 stars 23 forks

java test misses warmup, you are most probably benchmarking the interpreter #15

Open RuedigerMoeller opened 9 years ago

RuedigerMoeller commented 9 years ago

need at least 10k iterations to let JIT kick in

nsf commented 9 years ago

What makes you think so? I generate 256x256 image 100 times, that's 6553600 per-pixel iterations.

solmetricchris commented 9 years ago

I agree with @RuedigerMoeller. This is an apples to oranges comparison as written and it's likely to be biting the C# code as well.

perf stat -r 10 java -cp . test 2>&1 > /dev/null | grep time

This will run the test 10 times, but all compiled code created by the JIT is thrown away between runs. So yes, the JIT is probably kicking in before those 6.5 million pixels are calculated. But not only is the X10 run not helping, you are actually benchmarking the JIT step 10 times as well. To properly benchmark this on an apples-to-apples level, you need to either run the compilation scripts for all those pre-compiled languages 10 times and add in those times, OR you need to find a way to factor out the JIT step.

How to benchmark JIT-based systems against pre-compiled systems is always a difficult proposition. In many cases it comes down to the use cases of the real-world code. Will you run this one little chunk of code once and be done with it? Well yeah, then a JIT-compiled system might be an awful choice. But if your algorithm is buried in a long-running server and called on a regular basis, then you need to find a way to benchmark without the cost of the JIT compilation step, since in that real-world case you basically never see the JIT process as a performance impact.

One option here would be to kick the value of the outer loop (X100: for (int i = 0; i < 100; i++) { ) up to something which makes the JIT portion of the run irrelevantly small. With runtimes in the fractions of a second, there is no doubt that a large fraction of that small time is actually either the interpreted code running or the JIT itself running, and not the compiled (JIT'd) code.

I would kick that number up by 1 to 2 orders of magnitude. Once you are into run times measured in 10s to 100s of seconds, you can have some confidence that you are measuring actual compiled code. If you want to play more with this number (and I'd be interested in the results), perhaps you could pass it as a single parameter to ALL these tests, and then chart performance of the languages at X10, X100, X1000, and X10000 iterations.
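The parameter sweep suggested above could be sketched like this toy harness: the outer-loop count comes from the command line, so the same binary can be charted at X10, X100, X1000, X10000. `NoiseBench` and its summing workload are illustrative stand-ins, not pnoise's actual test code.

```java
// Toy harness: read the outer-loop count from argv so the JIT warm-up cost
// becomes a smaller and smaller fraction of the total run time.
// "NoiseBench" and the summing workload are hypothetical stand-ins.
public class NoiseBench {
    public static void main(String[] args) {
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 100;
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += i; // stand-in for one 256x256 noise pass
        }
        System.out.println("iterations=" + iterations + " checksum=" + checksum);
    }
}
```

Running it as `java NoiseBench 10000` under `perf stat` would then give the X10000 data point without changing the source.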

nsf commented 9 years ago

Read what I wrote here about the goal of this benchmark: https://github.com/nsf/pnoise/issues/16

If you really believe that some of the results don't represent what they should, show me the right results and the way to get them. What you guys do instead is pure theory, and that's boring. But I don't mind testing it the other way; I can even stand on my head while doing so. Here's an example of doing 1 run with 10000 iterations instead (clang, mono and java):

[nsf @ pnoise]$ perf stat -r 1 ./bin_test_c_clang 2>&1 > /dev/null | grep time
      15,235993535 seconds time elapsed
[nsf @ pnoise]$ perf stat -r 1 ./bin_test_cs 2>&1 > /dev/null | grep time
      99,547617361 seconds time elapsed
[nsf @ pnoise]$ perf stat -r 1 java -cp . test 2>&1 > /dev/null | grep time
      50,065212791 seconds time elapsed

The numbers are practically the same. It doesn't matter that java shows a 100ms win here; it's still ~2x faster than C# and ~4x slower than C. Precision is not important when there is just one benchmark; it doesn't show much anyway.

solmetricchris commented 9 years ago

@nsf. Thanks for running it this way. I appreciate your willingness to do that. Plus, I want proof of that head standing, or it didn't happen :-)

My math shows that going from X100 to X10000 iterations (100x the work), the run times scaled by roughly:

clang X106

C# X94

java X83

So while the magnitude of the improvement is not what I was expecting or hoping for, the direction certainly is.

One final thought. The Java JIT tends to focus on method invocations as its point of optimization. This means it's quite likely that the triple loop within test() will not get JIT'd, since the test() method is only called once. A refactor to put the inner X256 loops into their own method would increase the likelihood that these loops get JIT'd. I haven't tested this :-(

And yes, having read your thoughts on #16, I agree that the purpose here is not to write code in a benchmarky way. I'm 100% with you on this. So perhaps creating additional methods does not follow those rules.
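The refactor described above could be sketched as follows, with the per-frame 256x256 loops pulled out into their own method. The names (`RefactorSketch`, `renderFrame`, `NoiseFn`) and the toy noise function are hypothetical, not pnoise's actual code; note also that HotSpot can compile long-running loops in place via on-stack replacement, so the effect may be smaller than suggested.

```java
// Sketch: extract the hot inner loops into a method that is invoked 100
// times, making it an obvious JIT compilation candidate.
public class RefactorSketch {
    interface NoiseFn { double get(double x, double y); }

    // Hot method: called once per frame, 100 times total.
    static void renderFrame(NoiseFn n2d, double[] pixels) {
        for (int y = 0; y < 256; y++) {
            for (int x = 0; x < 256; x++) {
                pixels[y * 256 + x] = n2d.get(x * 0.1, y * 0.1) * 0.5 + 0.5;
            }
        }
    }

    public static void main(String[] args) {
        double[] pixels = new double[256 * 256];
        // Toy stand-in for Noise2DContext.get().
        NoiseFn toy = (x, y) -> Math.sin(x) * Math.cos(y);
        for (int i = 0; i < 100; i++) {
            renderFrame(toy, pixels);
        }
        System.out.println("last pixel = " + pixels[pixels.length - 1]);
    }
}
```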

RuedigerMoeller commented 9 years ago

Well, you are also including VM startup time, right? Additionally, the Java version does a lot of heap allocs which the C version doesn't.

RuedigerMoeller commented 9 years ago

I'd also exclude the heavy console printing; at the very least, you don't want to benchmark System.out performance.

JIT effects + internal measurement:
Run 1: time:430
Run 2: time:370
Run 3: time:344
Run 4: time:355
Run 5: time:341
Run 6: time:340

The C1 compiler kicks in during the first iteration; after that, C2 gains roughly another 30%. I assume the java version is dominated by allocation costs.

changed:

public static void main(String[] args) {
    while (true) {
        long now = System.currentTimeMillis();
        new StupidInternetBench();
        System.out.println("time:" + (System.currentTimeMillis() - now));
    }
}
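A common refinement of the endless-loop measurement above is to discard the first few warm-up runs and report only steady-state times. This sketch uses a stand-in workload, not the actual pnoise body; all names here are illustrative.

```java
// Sketch: run a few unmeasured warm-up iterations so the JIT compiles the
// hot code, then time only the steady-state runs.
public class WarmupSketch {
    // Stand-in workload; replace with the real benchmark body.
    static long workload() {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i;
        return acc;
    }

    public static void main(String[] args) {
        int warmup = 5, measured = 5;
        // Warm-up runs: times are ignored, they exist only to trigger the JIT.
        for (int i = 0; i < warmup; i++) workload();
        // Measured runs: these should mostly exercise compiled code.
        for (int i = 0; i < measured; i++) {
            long t0 = System.nanoTime();
            long acc = workload();
            long dt = System.nanoTime() - t0;
            System.out.println("run " + i + ": " + dt / 1_000_000.0 + " ms (acc=" + acc + ")");
        }
    }
}
```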
RuedigerMoeller commented 9 years ago

What do you allocate for here?

float gradient(Vec2 orig, Vec2 grad, Vec2 p) {
    Vec2 sp = new Vec2(p.x - orig.x, p.y - orig.y);
    return grad.x * sp.x + grad.y * sp.y;
}
asterite commented 9 years ago

@RuedigerMoeller: an off-topic question, is this benchmark being discussed right now on some website like reddit or hackernews?

RuedigerMoeller commented 9 years ago

@asterite yeah, on HN, but I don't want to blame the author in public too much :-)

RuedigerMoeller commented 9 years ago

Got this down to ~270 ms (from the original 430) on my machine:

import java.util.Random;

/**
 * Created by ruedi on 06/04/15.
 */
public class StupidInternetBench {

        public static class Vec2 {
        public Vec2(double x, double y) {
            this.x = x;
            this.y = y;
        }

        public double x;
        public double y;
    }

    private static class Noise2DContext {
        Vec2[] rgradients;
        int[] permutations;
        Vec2[] gradients;
        Vec2[] origins;

        private static final double lerp(double a, double b, double v) {
            return a * (1 - v) + b * v;
        }

        private static final double smooth(double v) {
            return v * v * (3 - 2 * v);
        }

        Vec2 random_gradient(Random rnd) {
            double v = rnd.nextDouble() * Math.PI * 2.0;
            return new Vec2(Math.cos(v), Math.sin(v));
        }

        double gradient(Vec2 orig, Vec2 grad, Vec2 p) {
            double xx = p.x - orig.x;
            double yy = p.y - orig.y;
            return grad.x * xx + grad.y * yy;
        }

        public Noise2DContext(int seed) {
            Random rnd = new Random(seed);
            rgradients = new Vec2[256];
            permutations = new int[256];
            for (int i = 0; i < 256; i++) {
                rgradients[i] = random_gradient(rnd);
            }
            for (int i = 0; i < 256; i++) {
                int j = rnd.nextInt(i + 1);
                permutations[i] = permutations[j];
                permutations[j] = i;
            }

            gradients = new Vec2[4];
            origins = new Vec2[4];
            origins[0] = new Vec2(0, 0);
            origins[1] = new Vec2(0, 0);
            origins[2] = new Vec2(0, 0);
            origins[3] = new Vec2(0, 0);
        }

        Vec2 get_gradient(int x, int y) {
            int idx = permutations[x & 255] + permutations[y & 255];
            return rgradients[idx & 255];
        }

        void get_gradients(double x, double y) {
            double x0f = Math.floor(x);
            double y0f = Math.floor(y);
            int x0 = (int) x0f;
            int y0 = (int) y0f;
            int x1 = x0 + 1;
            int y1 = y0 + 1;

            gradients[0] = get_gradient(x0, y0);
            gradients[1] = get_gradient(x1, y0);
            gradients[2] = get_gradient(x0, y1);
            gradients[3] = get_gradient(x1, y1);

            origins[0].x = x0f + 0; origins[0].y = y0f + 0;
            origins[1].x = x0f + 1; origins[1].y = y0f + 0;
            origins[2].x = x0f + 0; origins[2].y = y0f + 1;
            origins[3].x = x0f + 1; origins[3].y = y0f + 1;
        }

        Vec2 p = new Vec2(0, 0);
        public double get(double x, double y) {
            p.x = x; p.y = y;
            get_gradients(x, y);
            double v0 = gradient(origins[0], gradients[0], p);
            double v1 = gradient(origins[1], gradients[1], p);
            double v2 = gradient(origins[2], gradients[2], p);
            double v3 = gradient(origins[3], gradients[3], p);

            double fx = smooth(x - origins[0].x);
            double vx0 = lerp(v0, v1, fx);
            double vx1 = lerp(v2, v3, fx);
            double fy = smooth(y - origins[0].y);
            return lerp(vx0, vx1, fy);
        }
    }

    static char[] symbols = { ' ', '░', '▒', '▓', '█', '█' };

    public StupidInternetBench() {
        Noise2DContext n2d = new Noise2DContext((int) System.currentTimeMillis());
        double[] pixels = new double[256 * 256];

        for (int i = 0; i < 100; i++) {
            for (int y = 0; y < 256; y++) {
                for (int x = 0; x < 256; x++) {
                    double v = n2d.get(x * 0.1f, y * 0.1f) * 0.5f + 0.5f;
                    pixels[y * 256 + x] = v;
                }
            }
        }

//      for (int y = 0; y < 256; y++) {
//          for (int x = 0; x < 256; x++) {
//              int idx = (int) (pixels[y * 256 + x] / 0.2f);
//              System.out.print(symbols[idx]);
//          }
//          System.out.println();
//      }
    }

    public static void main(String[] args) {
        while( true ) {
            long now = System.currentTimeMillis();
            new StupidInternetBench();
            System.out.println("time:"+(System.currentTimeMillis()-now));
        }
    }

}
nsf commented 9 years ago

  1. Converting everything to double is not okay. If the VM cannot optimize float usage, that's its problem, because such reasoning about hardware is bogus. Hardware leans towards the SIMD model; it's not a single FPU which "converts everything to 80-bit FP anyway". With SIMD you can process 4 floats in parallel but only 2 doubles. GPU people know this very well, to the point that they often use the half-precision floating point format. Bandwidth is the bottleneck. If VM vendors disagree, that's their right. But hey, look at Mono 4.0 for example: http://www.mono-project.com/docs/about-mono/releases/4.0.0/ ("With this release we are introducing support for performing 32-bit floating operations using 32-bit math. This produces faster code at the expense of fidelity. This mode of operation can be enabled with the -O=float32 option.")
  2. Removing allocs is okay.
  3. Don't pull out printing.
  4. Don't measure time inside the process. I measure programs which execute a particular algorithm, not the algorithm itself. If your JIT takes 20 seconds to kick in, that's a problem.
  5. Don't call this benchmark stupid. If you think it's stupid, why are you wasting your time here?
  6. Want changes in? Send pull requests.
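
The bandwidth argument in point 1 can be made concrete with a quick sketch (plain Java, nothing pnoise-specific): a float array of the same length occupies half the memory of a double array, so twice as many values fit in a cache line or a 16-byte SSE register.

```java
// Illustrates the bandwidth point: floats are half the size of doubles,
// so SIMD lanes and cache lines hold twice as many of them.
public class FloatVsDouble {
    public static void main(String[] args) {
        int n = 256 * 256; // one pnoise-sized image worth of pixels
        long floatBytes = (long) n * Float.BYTES;   // 4 bytes per element
        long doubleBytes = (long) n * Double.BYTES; // 8 bytes per element
        System.out.println("float[]  : " + floatBytes + " bytes");
        System.out.println("double[] : " + doubleBytes + " bytes");
        // A 16-byte SSE register holds 4 floats but only 2 doubles.
        System.out.println("floats per 16-byte lane:  " + 16 / Float.BYTES);
        System.out.println("doubles per 16-byte lane: " + 16 / Double.BYTES);
    }
}
```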
RuedigerMoeller commented 9 years ago

1) Floats are ok, it does not make a big difference.
5) Apologies, I raged quite a bit.

3+4) Then call it a startup time + System.out printing benchmark :-) BTW, isn't it the case that Go spawns a goroutine or so for each sysout?

6) You can manage this easily by cut+paste.