lessthanoptimal / ejml

A fast and easy to use linear algebra library written in Java for dense, sparse, real, and complex matrices.
https://ejml.org

ImplMultiplication_MT_DSCC.mult listWork argument not really optional #164

Closed alugowski closed 1 year ago

alugowski commented 1 year ago

The docs for multiplying two CSC matrices in parallel say that the final listWork argument is optional https://ejml.org/javadoc/org/ejml/sparse/csc/mult/ImplMultiplication_MT_DSCC.html#mult(org.ejml.data.DMatrixSparseCSC,org.ejml.data.DMatrixSparseCSC,org.ejml.data.DMatrixSparseCSC,pabeles.concurrency.GrowArray)

However, passing null results in:

Exception in thread "main" java.lang.NullPointerException
    at pabeles.concurrency.ConcurrencyOps.runLoopBlocks(ConcurrencyOps.java:238)
    at pabeles.concurrency.ConcurrencyOps.loopBlocks(ConcurrencyOps.java:208)
    at org.ejml.sparse.csc.mult.ImplMultiplication_MT_DSCC.mult(ImplMultiplication_MT_DSCC.java:53)
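
For reference, this is roughly the call that triggers it (A and B here stand in for whatever sparse inputs are being multiplied; the signature is taken from the Javadoc linked above):

    DMatrixSparseCSC C = new DMatrixSparseCSC(A.numRows, B.numCols);
    // null for the listWork argument, which the Javadoc describes as optional
    ImplMultiplication_MT_DSCC.mult(A, B, C, null);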
alugowski commented 1 year ago

A quick glance offers no insight on what is expected, so I used IntelliJ's suggestions to come up with this:

    new GrowArray<>(new ConcurrencyOps.NewInstance<Workspace_MT_DSCC>() {
        @Override
        public Workspace_MT_DSCC newInstance() {
            return new Workspace_MT_DSCC();
        }
    });

It appears to work, but I have no idea whether this is what's expected, or how reusable this object is.
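
If ConcurrencyOps.NewInstance is a functional interface (it looks like one, given the single overridden method above), the same workspace factory can presumably be written more compactly as a constructor reference:

    // Equivalent to the anonymous class above, assuming NewInstance is a functional interface
    GrowArray<Workspace_MT_DSCC> listWork = new GrowArray<>(Workspace_MT_DSCC::new);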

lessthanoptimal commented 1 year ago

I think I know what's going on here. It's designed so that you call it through CommonOps_MT_DSCC, not through the "Impl" class directly. The wacky Javadoc (which has multiple issues) is probably an artifact of the code being written in one class, then split into two classes without the docs being updated.

lessthanoptimal commented 1 year ago

https://github.com/lessthanoptimal/ejml/tree/feature/various

The second commit there updates the Javadoc to clarify what's going on.

alugowski commented 1 year ago

Ah, that makes sense. So a parallel multiply like this looks a lot more natural :)

    DMatrixSparseCSC product = new DMatrixSparseCSC(A.numRows, B.numCols);
    CommonOps_MT_DSCC.mult(A, B, product);
alugowski commented 1 year ago

Sorry to come back to this, but is CommonOps_MT_DSCC.mult actually parallel today? I see there's a ForkJoin pool involved (though I haven't dug deeper), but no matter what I pass to EjmlConcurrency.setMaxThreads (or whether I call it at all), the runtimes are identical.
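
For reference, this is roughly what I'm timing (just a sketch; A and B are already-loaded inputs):

    EjmlConcurrency.setMaxThreads(8);   // also tried other values, and skipping this call entirely

    DMatrixSparseCSC product = new DMatrixSparseCSC(A.numRows, B.numCols);
    long start = System.nanoTime();
    CommonOps_MT_DSCC.mult(A, B, product);
    System.out.printf("mult: %.1f ms%n", (System.nanoTime() - start) / 1e6);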

lessthanoptimal commented 1 year ago

Yep, it's concurrent. I just modified BenchmarkMatrixMult_MT_DSCC so that it directly compares the single-threaded and multi-threaded versions. Here are the results.

Benchmark                            (countPerColumn)  (dimension)  Mode  Cnt    Score    Error  Units
BenchmarkMatrixMult_MT_DSCC.mult                  100        10000  avgt    3  692.258 ± 63.746  ms/op
BenchmarkMatrixMult_MT_DSCC.mult_MT               100        10000  avgt    3  152.069 ±  8.985  ms/op

Are these matrices very small, or do they have only a few elements?

alugowski commented 1 year ago

They weren't small; the multiply took about a second. I tried plotting and saw a small speedup. I tried again with a much larger problem and the trend is much more visible. Thanks!

[attached plot]

lessthanoptimal commented 1 year ago

Is this a standard test matrix? I'm kinda curious what structure would negate the speedup. You can see with my random matrix in the benchmark above that the speedup isn't subtle. When applied to optimizations related to computer vision it's also fairly large. Also, what's "sprs"?

lessthanoptimal commented 1 year ago

I also just had another thought about what might be causing the issue. The JVM requires a cycle or two to warm up. If you benchmark by launching a new VM for each multiplication and then killing it, the code will never be fully optimized.
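
A quick way to rule that out (just a sketch, with A and B standing in for your inputs) is to run the multiply a few times before timing it:

    DMatrixSparseCSC product = new DMatrixSparseCSC(A.numRows, B.numCols);

    // Warm-up passes so the JIT has a chance to compile the hot paths
    for (int i = 0; i < 5; i++) {
        CommonOps_MT_DSCC.mult(A, B, product);
    }

    // Only time calls made after warm-up
    long start = System.nanoTime();
    CommonOps_MT_DSCC.mult(A, B, product);
    System.out.printf("warm mult: %.1f ms%n", (System.nanoTime() - start) / 1e6);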

alugowski commented 1 year ago

The matrix is a random Erdős-Rényi matrix, n=2**scale, fillfactor=32. The hiccup is probably just that one run: it was quick and dirty, I didn't average multiple runs, and I kept using the machine throughout. I just wanted to see the line move. You're right about JVM warmup, but I don't think that should completely kill parallelism on smaller problems? Either way, the interesting problems are the big ones, so I don't think I'll investigate further.

sprs is a Rust sparse matrix library. One of the few that offer parallelism. I'm just trying a whole bunch of libraries to see what's out there.

alugowski commented 1 year ago

Do you know of any ways to keep an OpenJDK JVM warmed up between runs?

ennerf commented 1 year ago

Java benchmarks should be written using the Java Microbenchmark Harness (JMH). Anything self-written is going to be very unreliable.
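
A minimal JMH benchmark for this operation could look like the sketch below. The sizes and fill are placeholders, and it assumes the JMH annotations are on the build path:

    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    import org.ejml.data.DMatrixSparseCSC;
    import org.ejml.sparse.csc.CommonOps_MT_DSCC;
    import org.ejml.sparse.csc.RandomMatrices_DSCC;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Benchmark)
    public class BenchmarkSparseMult {
        DMatrixSparseCSC A, B, C;

        @Setup
        public void setup() {
            Random rand = new Random(42);
            // Placeholder problem size: 10 000 x 10 000 with ~100 non-zeros per column
            A = RandomMatrices_DSCC.rectangle(10_000, 10_000, 1_000_000, rand);
            B = RandomMatrices_DSCC.rectangle(10_000, 10_000, 1_000_000, rand);
            C = new DMatrixSparseCSC(10_000, 10_000);
        }

        @Benchmark
        public DMatrixSparseCSC mult_MT() {
            CommonOps_MT_DSCC.mult(A, B, C);
            return C;
        }
    }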

lessthanoptimal commented 1 year ago

@alugowski I've not done it myself, but there is a way. I was reading some articles on how to optimize performance for AWS Lambdas; it sounded a bit involved.

In my personal experience, 99% of the time you launch the application once and it persists. I tend to set up my cross-platform/language benchmarks that way. For example, when testing image processing I have each library scan the test directory for images, then clock only the processing time after each image is loaded, to avoid measuring IO overhead. Launching a new instance of the app for each image and measuring the total runtime of the app typically isn't realistic, outside of Lambdas.
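
A rough sketch of that pattern (the directory name and process() are placeholders, not from any actual benchmark):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import javax.imageio.ImageIO;

    public class TimeProcessingOnly {
        public static void main(String[] args) throws Exception {
            // Hypothetical directory of test images; IO stays outside the timed section
            for (File f : new File("test_images").listFiles()) {
                BufferedImage image = ImageIO.read(f);

                long start = System.nanoTime();
                process(image);   // the library call under test
                double ms = (System.nanoTime() - start) / 1e6;

                System.out.printf("%s: %.1f ms%n", f.getName(), ms);
            }
        }

        // Placeholder for whatever operation the benchmark is actually measuring
        static void process(BufferedImage image) {
            image.getRGB(0, 0);
        }
    }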

A proper scientific microbenchmark should use a language-specific tool that can handle all the quirks. Languages that run in a virtual machine are more difficult to benchmark, since the history of what has already run influences the speed of what runs next. Hence @ennerf's comment about JMH.

alugowski commented 1 year ago

Java benchmarks should be written using the Java Microbenchmark Harness (JMH). Anything self-written is going to be very unreliable.

Thanks for the link.

What I'm doing is many orders of magnitude larger than a microbenchmark, though.

alugowski commented 1 year ago

Looks like my workloads are both large enough and hot-spotty enough that any JVM warmup time is in the noise.

alugowski commented 1 year ago

It also turns out the lack of speedup I saw earlier had a very different cause. I missed that the operation wasn't squaring an ER matrix, but multiplying a row permutation of that matrix. So P*ER, not ER*ER. I know it sounds strange, but the sparsity pattern of the operands has an impact on parallel speedup; I've seen several codes show no speedup on this problem.

lessthanoptimal commented 1 year ago

In EJML's implementation, the parallelization for multiplication is done along columns. I glanced at ER on Wikipedia and it looks like a square matrix, so there should be a speedup. Is there another step that's much slower?
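
For intuition, the column-blocked split looks conceptually like the sketch below. This is not EJML's actual code; the per-block compute and the final stitch are only described in comments.

    import java.util.stream.IntStream;

    public class ColumnBlockSketch {
        public static void main(String[] args) {
            int numCols = 10_000;   // columns of B (and of the product)
            int numThreads = Runtime.getRuntime().availableProcessors();
            int blockSize = (numCols + numThreads - 1) / numThreads;

            IntStream.range(0, numThreads).parallel().forEach(t -> {
                int col0 = t * blockSize;
                int col1 = Math.min(numCols, col0 + blockSize);
                // In the real implementation each block computes columns [col0, col1)
                // of A*B into its own workspace; afterwards the per-block pieces are
                // concatenated into the output matrix in a single sequential pass.
                System.out.printf("block %d handles columns [%d, %d)%n", t, col0, col1);
            });
        }
    }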

alugowski commented 1 year ago

Sorry, I forgot the comments are in Markdown and it ate my asterisks. I've edited them back in.

The operation is P*ER, where P is a permutation matrix, i.e. the identity matrix with rows and columns permuted (in my case randomly). The ER matrix is as you described.
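
For concreteness, the inputs can be generated along these lines. This is only a sketch: the size and fill are placeholders, the permutation is built through a triplet matrix for simplicity, and the triplet-to-CSC conversion class name (DConvertMatrixStruct here) may differ across EJML versions.

    import java.util.Random;

    import org.ejml.data.DMatrixSparseCSC;
    import org.ejml.data.DMatrixSparseTriplet;
    import org.ejml.ops.DConvertMatrixStruct;
    import org.ejml.sparse.csc.CommonOps_MT_DSCC;
    import org.ejml.sparse.csc.RandomMatrices_DSCC;

    public class PermutedMultExample {
        public static void main(String[] args) {
            int n = 1 << 16;             // n = 2**scale; scale is a placeholder here
            Random rand = new Random(42);

            // Erdős-Rényi-style random sparse matrix, roughly 32 non-zeros per row
            DMatrixSparseCSC ER = RandomMatrices_DSCC.rectangle(n, n, 32 * n, rand);

            // Random permutation of 0..n-1 via a Fisher-Yates shuffle
            int[] perm = new int[n];
            for (int i = 0; i < n; i++) perm[i] = i;
            for (int i = n - 1; i > 0; i--) {
                int j = rand.nextInt(i + 1);
                int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
            }

            // Build P (one 1 per row/column) as triplets, then convert to CSC
            DMatrixSparseTriplet Pt = new DMatrixSparseTriplet(n, n, n);
            for (int i = 0; i < n; i++) Pt.addItem(i, perm[i], 1.0);
            DMatrixSparseCSC P = DConvertMatrixStruct.convert(Pt, (DMatrixSparseCSC) null);

            // The timed operation: P*ER
            DMatrixSparseCSC product = new DMatrixSparseCSC(n, n);
            long start = System.nanoTime();
            CommonOps_MT_DSCC.mult(P, ER, product);
            System.out.printf("P*ER took %.1f ms%n", (System.nanoTime() - start) / 1e6);
        }
    }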

I have not profiled EJML specifically, but I believe what is happening is that on this particular problem the dominating phase is the addition that stitches together the result matrix, and EJML's version of that step is sequential (stitchMatrix()).

Is there another step that's much slower?

I only time the call to CommonOps_MT_DSCC.mult.