RuedigerMoeller opened this issue 9 years ago
What makes you think so? I generate a 256x256 image 100 times; that's 6,553,600 per-pixel iterations.
I agree with @RuedigerMoeller. This is an apples to oranges comparison as written and it's likely to be biting the C# code as well.
perf stat -r 10 java -cp . test 2>&1 > /dev/null | grep time
This will run the test 10 times, but all compiled code created by the JIT is thrown away between runs. So yes, the JIT is probably kicking in before those 6.5 million pixels are calculated, but not only is the X10 run not helping: you are actually benchmarking the JIT step 10 times as well. To benchmark this at an apples-to-apples level, you either need to run the compilation scripts for all those pre-compiled languages 10 times and add in those times, or you need to find a way to factor out the JIT step.
How to benchmark JIT-based systems against pre-compiled systems is always a difficult proposition. In many cases it comes down to the use cases of the real-world code. Will you run this one little chunk of code once and be done with it? Then yes, a JIT-compiled system might be an awful choice. But if your algorithm is buried in a long-running server and called on a regular basis, then you need to find a way to benchmark without the cost of the JIT compilation step, since in that real-world case you basically never see the JIT process as a performance impact.
One option here would be to kick the value for the outer loop (X100: for (int i = 0; i < 100; i++) {) up to something that makes the JIT portion of the run irrelevantly small. With runtimes in fractions of a second, there is no doubt that a large fraction of that small time is actually either the interpreted code running or the JIT itself running, not the compiled (JIT'd) code.
I would kick that number up by 1 to 2 orders of magnitude. Once you are into run times measured in tens to hundreds of seconds, you can have some confidence that you are measuring actual compiled code. If you want to play more with this number (and I'd be interested in the results), perhaps you could pass it as a single parameter to ALL these tests, and then chart the performance of the languages at X10, X100, X1000, and X10000 iterations.
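The "run long enough that the JIT becomes noise" idea can also be expressed as an explicit warmup-then-measure harness. Here is a minimal sketch of that pattern; the class name, method name, and stand-in workload are my own invention, not code from the pnoise repo:

```java
// Hedged sketch of a warmup-then-measure harness. The workload() body is a
// placeholder; in the real benchmark it would be the 256x256 noise loop.
public class WarmupBench {
    // Deterministic stand-in workload so the JIT has something hot to compile.
    static double workload() {
        double acc = 0;
        for (int i = 0; i < 100_000; i++) {
            acc += Math.sin(i) * Math.cos(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Warmup phase: give HotSpot a chance to compile workload()
        // before any measurement starts. The sink prevents dead-code
        // elimination of the warmup calls.
        double sink = 0;
        for (int i = 0; i < 50; i++) {
            sink += workload();
        }
        // Timed phase: by now we are (mostly) measuring compiled code.
        long start = System.nanoTime();
        for (int i = 0; i < 50; i++) {
            sink += workload();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("timed section: " + elapsedMs + " ms (sink=" + sink + ")");
    }
}
```

This only factors out JIT time from the measurement; a fair cross-language comparison would still need to decide whether compile time counts at all, as discussed above.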
Read what I wrote here about the goal of this benchmark: https://github.com/nsf/pnoise/issues/16
If you really believe that some of the results don't represent what they should, show me the right results and the way to get them. What you guys do instead is pure theory, and that's boring. But I don't mind testing it the other way; I can even stand on my head while doing so. Here's an example of doing 1 run with 10000 iterations instead (clang, mono and java):
[nsf @ pnoise]$ perf stat -r 1 ./bin_test_c_clang 2>&1 > /dev/null | grep time
15,235993535 seconds time elapsed
[nsf @ pnoise]$ perf stat -r 1 ./bin_test_cs 2>&1 > /dev/null | grep time
99,547617361 seconds time elapsed
[nsf @ pnoise]$ perf stat -r 1 java -cp . test 2>&1 > /dev/null | grep time
50,065212791 seconds time elapsed
The numbers are practically the same. It doesn't matter that Java shows a 100 ms win here, because it's still ~2x faster than C# and ~3x slower than C. Precision is not important when there is just one benchmark; it doesn't show much anyway.
@nsf. Thanks for running it this way. I appreciate your willingness to do that. Plus, I want proof of that head standing, or it didn't happen :-)
My math on this: going from the X100 runs to this X10000 run (100x the work), the measured times scaled by roughly:
clang: x106
C#: x94
java: x83
So while the magnitude of the improvement is not what I was expecting or hoping for, the direction certainly is.
One final thought. The Java JIT tends to focus on method invocations as the point of optimization. This means it's quite likely that the triple loop within test() will not get JIT'd, since the test() method is only called once. A refactor that puts the inner X256 loops into their own method would increase the likelihood that these loops get JIT'd. I haven't tested this :-(
And yes, having read your thoughts on #16, I agree that the purpose here is not to write code in a benchmarky way. I'm 100% with you on this. So perhaps creating additional methods would not follow those rules.
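For what it's worth, the refactor being suggested would look roughly like this. This is a hedged sketch: the class and method names are mine, and the pixel arithmetic is a placeholder standing in for the real n2d.get() call from the repo:

```java
// Sketch of hoisting the inner 256x256 loops out of the once-called entry
// point into their own method, so the JIT sees a method invoked 100 times.
public class HotLoopRefactor {
    // Per-frame work as a separate method: a clear compilation candidate.
    // The expression below is a placeholder for the real noise lookup
    // (n2d.get(x * 0.1, y * 0.1) * 0.5 + 0.5 in the original).
    static void renderFrame(double[] pixels) {
        for (int y = 0; y < 256; y++) {
            for (int x = 0; x < 256; x++) {
                pixels[y * 256 + x] = (x * 0.1 + y * 0.1) * 0.5 + 0.5;
            }
        }
    }

    public static void main(String[] args) {
        double[] pixels = new double[256 * 256];
        for (int i = 0; i < 100; i++) {
            renderFrame(pixels);
        }
        System.out.println("last pixel: " + pixels[pixels.length - 1]);
    }
}
```

Whether this actually changes the numbers would need measuring; as noted above, it also arguably violates the "don't write benchmarky code" rule.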
Well, you also include VM startup time. Additionally, the Java version does a lot of heap allocations which the C version doesn't.
I'd also exclude the heavy console printing; at the least, you don't want to benchmark System.out performance.
JIT effects + internal measurement:
Run 1: time:430
Run 2: time:370
Run 3: time:344
Run 4: time:355
Run 5: time:341
Run 6: time:340
The C1 compiler kicks in during the first iteration; after that, C2 gains roughly another 30%. I assume the Java version is dominated by allocation costs.
changed:

public static void main(String[] args) {
    while (true) {
        long now = System.currentTimeMillis();
        new StupidInternetBench();
        System.out.println("time:" + (System.currentTimeMillis() - now));
    }
}
Where do you allocate here? For example:
float gradient(Vec2 orig, Vec2 grad, Vec2 p) {
    Vec2 sp = new Vec2(p.x - orig.x, p.y - orig.y);
    return grad.x * sp.x + grad.y * sp.y;
}
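For comparison, here is that allocating gradient() side by side with an allocation-free rewrite; the rewrite matches the version that appears later in this thread, and the Vec2 class is reduced to the minimum needed to make the example self-contained:

```java
// The original gradient() allocates a temporary Vec2 on every call (four
// calls per pixel, so millions of short-lived objects per benchmark run).
// The rewrite does the same arithmetic on plain doubles with no allocation.
public class GradientDemo {
    static final class Vec2 {
        double x, y;
        Vec2(double x, double y) { this.x = x; this.y = y; }
    }

    // Original style: per-call temporary object.
    static double gradientAlloc(Vec2 orig, Vec2 grad, Vec2 p) {
        Vec2 sp = new Vec2(p.x - orig.x, p.y - orig.y);
        return grad.x * sp.x + grad.y * sp.y;
    }

    // Rewrite: identical math, no heap allocation.
    static double gradientNoAlloc(Vec2 orig, Vec2 grad, Vec2 p) {
        double xx = p.x - orig.x;
        double yy = p.y - orig.y;
        return grad.x * xx + grad.y * yy;
    }

    public static void main(String[] args) {
        Vec2 o = new Vec2(1, 2), g = new Vec2(0.5, -0.5), p = new Vec2(3, 5);
        System.out.println(gradientAlloc(o, g, p) + " == " + gradientNoAlloc(o, g, p));
    }
}
```

HotSpot's escape analysis may eliminate some of these temporaries anyway, so the actual cost would need measuring; the rewrite just makes the question moot.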
@RuedigerMoeller: an offtopic question, is this benchmark being discussed right now on some website like reddit or hackernews?
@asterite yeah, on HN, but I don't want to blame the author in public too much :-)
Got this down to ~270 ms (from the original 430 ms) on my machine:
import java.util.Random;

/**
 * Created by ruedi on 06/04/15.
 */
public class StupidInternetBench {

    public static class Vec2 {
        public Vec2(double x, double y) {
            this.x = x;
            this.y = y;
        }
        public double x;
        public double y;
    }

    private static class Noise2DContext {
        Vec2[] rgradients;
        int[] permutations;
        Vec2[] gradients;
        Vec2[] origins;

        private static final double lerp(double a, double b, double v) {
            return a * (1 - v) + b * v;
        }

        private static final double smooth(double v) {
            return v * v * (3 - 2 * v);
        }

        Vec2 random_gradient(Random rnd) {
            double v = rnd.nextDouble() * Math.PI * 2.0;
            return new Vec2(Math.cos(v), Math.sin(v));
        }

        double gradient(Vec2 orig, Vec2 grad, Vec2 p) {
            double xx = p.x - orig.x;
            double yy = p.y - orig.y;
            return grad.x * xx + grad.y * yy;
        }

        public Noise2DContext(int seed) {
            Random rnd = new Random(seed);
            rgradients = new Vec2[256];
            permutations = new int[256];
            for (int i = 0; i < 256; i++) {
                rgradients[i] = random_gradient(rnd);
            }
            for (int i = 0; i < 256; i++) {
                int j = rnd.nextInt(i + 1);
                permutations[i] = permutations[j];
                permutations[j] = i;
            }
            gradients = new Vec2[4];
            origins = new Vec2[4];
            origins[0] = new Vec2(0, 0);
            origins[1] = new Vec2(0, 0);
            origins[2] = new Vec2(0, 0);
            origins[3] = new Vec2(0, 0);
        }

        Vec2 get_gradient(int x, int y) {
            int idx = permutations[x & 255] + permutations[y & 255];
            return rgradients[idx & 255];
        }

        void get_gradients(double x, double y) {
            double x0f = Math.floor(x);
            double y0f = Math.floor(y);
            int x0 = (int) x0f;
            int y0 = (int) y0f;
            int x1 = x0 + 1;
            int y1 = y0 + 1;
            gradients[0] = get_gradient(x0, y0);
            gradients[1] = get_gradient(x1, y0);
            gradients[2] = get_gradient(x0, y1);
            gradients[3] = get_gradient(x1, y1);
            origins[0].x = x0f + 0; origins[0].y = y0f + 0;
            origins[1].x = x0f + 1; origins[1].y = y0f + 0;
            origins[2].x = x0f + 0; origins[2].y = y0f + 1;
            origins[3].x = x0f + 1; origins[3].y = y0f + 1;
        }

        Vec2 p = new Vec2(0, 0);

        public double get(double x, double y) {
            p.x = x; p.y = y;
            get_gradients(x, y);
            double v0 = gradient(origins[0], gradients[0], p);
            double v1 = gradient(origins[1], gradients[1], p);
            double v2 = gradient(origins[2], gradients[2], p);
            double v3 = gradient(origins[3], gradients[3], p);
            double fx = smooth(x - origins[0].x);
            double vx0 = lerp(v0, v1, fx);
            double vx1 = lerp(v2, v3, fx);
            double fy = smooth(y - origins[0].y);
            return lerp(vx0, vx1, fy);
        }
    }

    static char[] symbols = { ' ', '░', '▒', '▓', '█', '█' };

    public StupidInternetBench() {
        Noise2DContext n2d = new Noise2DContext((int) System.currentTimeMillis());
        double[] pixels = new double[256 * 256];
        for (int i = 0; i < 100; i++) {
            for (int y = 0; y < 256; y++) {
                for (int x = 0; x < 256; x++) {
                    double v = n2d.get(x * 0.1f, y * 0.1f) * 0.5f + 0.5f;
                    pixels[y * 256 + x] = v;
                }
            }
        }
        // for (int y = 0; y < 256; y++) {
        //     for (int x = 0; x < 256; x++) {
        //         int idx = (int) (pixels[y * 256 + x] / 0.2f);
        //         System.out.print(symbols[idx]);
        //     }
        //     System.out.println();
        // }
    }

    public static void main(String[] args) {
        while (true) {
            long now = System.currentTimeMillis();
            new StupidInternetBench();
            System.out.println("time:" + (System.currentTimeMillis() - now));
        }
    }
}
1) Floats are OK, it does not make a big difference.
5) Apologies, I raged quite a bit.
3+4) Then call it a startup-time + System.out-printing benchmark :-) BTW, isn't it the case that Go spawns a goroutine or something for each sysout?
6) You can manage this easily with cut and paste.

You need at least 10k iterations to let the JIT kick in.