haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Suspicious discontinuity in run time with LU decomposition #774

Closed mikedeskevich closed 1 month ago

mikedeskevich commented 1 month ago

I'm evaluating moving my project to SMILE, so I rewrote one of my algorithms to use SMILE's LU decomposition and am comparing it to my Java-only version. I expect SMILE to be faster since it uses BLAS, especially on large matrices. But I'm noticing that, even while the matrices are still small, there's a significant slowdown when the matrix goes from 99x99 to 100x100.

I'm comparing my old code, which looks like this:

        long start = System.nanoTime();
        LU = new LUDecomposition(mat);
        Yout = LU.solve(Y);
        long stop = System.nanoTime();
        System.out.println("size = " + mat.length + " time = " + (stop - start));

to a SMILE version, which is nearly identical:

        long start = System.nanoTime();
        LU = mat.lu();
        Yout = LU.solve(Y);
        long stop = System.nanoTime();
        System.out.println("size = " + mat.length + " time = " + (stop - start));

SMILE's results agree with my Java-only version, so I'm pretty sure I'm using it correctly. But the wall time to execute the code looks suspicious.

As my matrix gets bigger, I'm seeing this kind of timing in my version (the matrices are still small enough that the time is dominated by noise):

size = 98 time = 139800
size = 99 time = 136900
size = 100 time = 150300
size = 101 time = 195600
size = 102 time = 119000

The SMILE version has a large jump at size = 100:

size = 98 time = 222100
size = 99 time = 164900
size = 100 time = 314503800
size = 101 time = 41970800
size = 102 time = 31335900

I tried this with multiple data sets, so it's not contingent on the matrix being weirdly ill-conditioned or something; it always happens at matrix size = 100.

I did a quick search through the code to see if there's a switch in algorithm choice or something at size = 100, but nothing was obvious to me.

Am I using the matrix library wrong?
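For what it's worth, I know single-shot System.nanoTime readings like the ones above are noisy, so here's the kind of harness I could use instead (a stdlib-only sketch; `Bench` and `medianNanos` are hypothetical names, not part of SMILE): warm the task up first so class loading and JIT compilation settle, then report the median of several measured runs.

```java
import java.util.Arrays;

// Hypothetical helper (not part of SMILE): runs a task a few times to warm
// up, then times several measured runs and returns the median in nanoseconds.
public final class Bench {
    public static long medianNanos(Runnable task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) task.run();   // let class loading / JIT settle
        long[] samples = new long[measured];
        for (int i = 0; i < measured; i++) {
            long t0 = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        return samples[measured / 2];                  // median filters out one-off spikes
    }

    public static void main(String[] args) {
        // Trivial stand-in task; with SMILE you would wrap mat.lu().solve(Y).
        long med = medianNanos(() -> {
            double s = 0;
            for (int i = 0; i < 1000; i++) s += Math.sqrt(i);
        }, 5, 21);
        System.out.println("median = " + med + " ns");
    }
}
```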

Windows and Java version

Microsoft Windows [Version 10.0.22631.3527]
(c) Microsoft Corporation. All rights reserved.

C:\Users\mike>java -version
openjdk version "17.0.11" 2024-04-16 LTS
OpenJDK Runtime Environment Zulu17.50+19-CA (build 17.0.11+9-LTS)
OpenJDK 64-Bit Server VM Zulu17.50+19-CA (build 17.0.11+9-LTS, mixed mode, sharing)

SMILE version is 3.1.0, jar downloaded directly from GitHub.

haifengl commented 1 month ago

I cannot reproduce such a 10x latency jump. Here is the output from one of my runs:

size =  90  time = 487590219
size =  91  time = 1182113
size =  92  time = 1146628
size =  93  time = 1395523
size =  94  time = 1183671
size =  95  time = 1224702
size =  96  time = 1216694
size =  97  time = 1268323
size =  98  time = 1097092
size =  99  time = 1043337
size = 100  time = 1093569
size = 101  time = 1119556
size = 102  time = 1485775
size = 103  time = 4151115
size = 104  time = 1035125
size = 105  time = 1067357
size = 106  time = 1014787
size = 107  time = 1042866
size = 108  time = 1110109
size = 109  time = 1126487
size = 110  time = 1161611

The first run always has very high latency, which is caused by loading the library. Afterwards, I don't observe a significant latency jump. I think the higher latency at size = 103 may coincide with GC. It doesn't always happen at size = 103; I see it happen randomly at different sizes (tending toward the later stage of the experiment). Besides, it is much smaller than in your tests.
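One stdlib-only way to check the GC theory (a sketch; the placeholder comment marks where the SMILE call would go) is to compare collector counts before and after the timed section:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: detect whether a GC cycle ran during a timed section by comparing
// the JVM's total collection count before and after the call.
public class GcCheck {
    static long totalGcCount() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) n += c;   // getCollectionCount() returns -1 if unavailable
        }
        return n;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        // ... run the timed section here, e.g. mat.lu().solve(Y) ...
        long after = totalGcCount();
        if (after > before) System.out.println("GC ran during the timed section");
    }
}
```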

mikedeskevich commented 1 month ago

Thanks for your reply; I'll keep digging. It's probably something related to the JVM or GC, as you say. I added a third timing step, and I never see the problem. If I take the midpoint check out, the issue comes back.

        long start = System.nanoTime();
        LU = mat.lu();
        long mid = System.nanoTime();
        Yout = LU.solve(Y);
        long stop = System.nanoTime();
        System.out.println("size = " + mat.length + " time = " + (stop - mid) + " " + (mid - start));
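To rule out a one-off JIT or GC spike dominating any single reading, I could also repeat the two-phase measurement and report medians. Here's a sketch (plain Java, no SMILE dependency; the trivial lambdas stand in for `mat.lu()` and `LU.solve(Y)`):

```java
import java.util.Arrays;

// Sketch: time two phases of a computation repeatedly and report the median
// of each phase, so a single outlier run cannot dominate the result.
public class TwoPhaseTiming {
    public static long[] medians(Runnable phase1, Runnable phase2, int reps) {
        long[] t1 = new long[reps], t2 = new long[reps];
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            phase1.run();                  // stand-in for mat.lu()
            long mid = System.nanoTime();
            phase2.run();                  // stand-in for LU.solve(Y)
            long stop = System.nanoTime();
            t1[i] = mid - start;
            t2[i] = stop - mid;
        }
        Arrays.sort(t1);
        Arrays.sort(t2);
        return new long[] { t1[reps / 2], t2[reps / 2] };
    }

    public static void main(String[] args) {
        long[] m = medians(() -> Math.log(2.0), () -> Math.exp(1.0), 11);
        System.out.println("phase1 ~ " + m[0] + " ns, phase2 ~ " + m[1] + " ns");
    }
}
```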