openworm / OpenWorm

Repository for the main Dockerfile with the OpenWorm software stack and project-wide issues
http://openworm.org
MIT License

Improve performance of Sibernetic computation step #74

Closed: gidili closed this issue 8 years ago

gidili commented 11 years ago

The Sibernetic codebase is the C++ implementation of the C. elegans body. It includes an implementation of the smoothed particle hydrodynamics (SPH) algorithm. On top of this algorithm we implement multiple types of matter, including liquid, elastic matter, and hard surfaces.

Code base is here: https://github.com/openworm/Smoothed-Particle-Hydrodynamics

Several of the comments below also describe the Geppetto implementation of the SPH algorithm. However, right now the C++ version (Sibernetic) has the latest updates, so performance work should be focused there.

Simulations with more than 10,000 particles are fairly slow. We need to improve performance.

vellamike commented 11 years ago

Is this Java-specific or just an SPH-limitation?


gidili commented 11 years ago

That's a good question - to answer it we'd have to run both the Java and C++ versions on the same machine and see how they fare. For example, one step of the pureLiquid scene @skhayrulin has on the C++ version was taking 4 seconds on Matteo's laptop (a top-of-the-range laptop) and twice as much on mine (a 4-year-old laptop). My guess is that C++ may provide a small performance boost, but I think it's more about the bottlenecks in our code, and it's a bit slow in general on modest hardware :)

vellamike commented 11 years ago

OK. My hunch is that four years of hardware difference should not account for such a slowdown, but I could easily be wrong on that. Once Geppetto is Ubuntu-friendly we can test this.

Mike


gidili commented 11 years ago

In fairness my laptop is pretty bad - Matteo's is pretty good though. I would be interested in hearing from you and @skhayrulin how long it takes to compute a step on that same scene on your systems (it's in the configuration folder; you can run it by swapping file names as you did the other day, in case you have time to try).

tarelli commented 11 years ago

The method that sends all the buffers down to the kernels a number of times takes 4 seconds on average. What happens inside there is 95% OpenCL; Java itself has nothing to do with it. The current Java implementation could account for a very small percentage, but I think the real bottleneck, as Gio suggested previously, is the buffers being sent up and down for every different bit of the algorithm.

vellamike commented 11 years ago

Which scene?


gidili commented 11 years ago

@vellamike positions/velocities PureLiquid.txt - files here: https://github.com/openworm/Smoothed-Particle-Hydrodynamics/tree/master/configuration

gidili commented 11 years ago

Posted a question with details on this on the JavaCL discussion group (https://groups.google.com/forum/?fromgroups=#!topic/nativelibs4java/UkiS8kjnJJ8) - the bottleneck seems to be the "read" operation at the end of each step. It is still unclear whether this is due to misuse of the JavaCL bindings or a problem in JavaCL itself.

vellamike commented 11 years ago

Hopefully we'll hear back!


gidili commented 11 years ago

Heard back from @ochafik in that thread - tried what he suggested but no luck. Hoping for some more suggestions, as I am all out of ideas :)

slarson commented 11 years ago

@skhayrulin can you help?

skhayrulin commented 11 years ago

@slarson Sure I'll try to.

gidili commented 11 years ago

Tried this on a GeForce GT 650M - a step takes from 200ms to 400ms. Better compared to 4s on the CPU, but still not good enough I think.

slarson commented 11 years ago

@charles-cooper in case you are curious about performance -- here's the issue.

charles-cooper commented 11 years ago

Where is the computation step?

gidili commented 11 years ago

@charles-cooper here it is: https://github.com/openworm/org.geppetto.solver.sph/blob/interfacesRefactoring/src/main/java/org/geppetto/solver/sph/SPHSolverService.java#L787

You'll notice there's a lot of logging garbage in there so that we know where most of the time is spent. It seems like the time-consuming bit is when the output buffers get mapped to Java objects, forcing a read-back from the device (CPU/GPU) to the Java host code. This seems to be much faster with the C++ bindings.

This is how it happens in the C++ version, for reference.
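
For anyone who wants to poke at this, here is a minimal JavaCL-style sketch of the two read-back strategies being compared. The kernel, buffer names and sizes are made up for illustration and the API usage is from memory, so treat it as a sketch rather than a drop-in replacement for the solver code:

import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

public class ReadBackSketch {
    public static void main(String[] args) {
        CLContext context = JavaCL.createBestContext();
        CLQueue queue = context.createDefaultQueue();

        int n = 10000; // hypothetical particle count, 4 floats per particle
        CLBuffer<Float> positions = context.createFloatBuffer(CLMem.Usage.InputOutput, n * 4);
        positions.write(queue, Pointer.allocateFloats(n * 4), true);

        // Stand-in for the SPH kernels: a trivial kernel that nudges every float.
        CLProgram program = context.createProgram(
            "__kernel void nudge(__global float* p) { int i = get_global_id(0); p[i] += 0.001f; }");
        CLKernel kernel = program.createKernel("nudge", positions);
        CLEvent done = kernel.enqueueNDRange(queue, new int[] { n * 4 });
        done.waitFor(); // the host blocks here, like the waitFor in SPHSolverService.step

        // Strategy 1: explicit read - allocates a host-side Pointer and copies the buffer into it.
        Pointer<Float> copy = positions.read(queue);

        // Strategy 2: mapping - exposes the device buffer to the host, which can avoid an
        // extra host-side allocation/copy (this is what the Java solver switched to).
        Pointer<Float> mapped = positions.map(queue, CLMem.MapFlags.Read);
        float first = mapped.get(0); // access the data while it is mapped
        positions.unmap(queue, mapped);
    }
}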

charles-cooper commented 11 years ago

Why are we using maps instead of enqueuing read/write buffers like in the C++ code?

charles-cooper commented 11 years ago

Hrm, according to my ghetto profiling, it seems the code is almost always sitting in

"main" prio=10 tid=0x00007f815c008000 nid=0x393b runnable [0x00007f8163eb5000] java.lang.Thread.State: RUNNABLE at com.nativelibs4java.opencl.library.OpenCLLibrary.clWaitForEvents(Native Method) at com.nativelibs4java.opencl.CLEvent.waitFor(CLEvent.java:202) at com.nativelibs4java.opencl.CLEvent.waitFor(CLEvent.java:183) at org.geppetto.solver.sph.SPHSolverService.step(SPHSolverService.java:779) Which is here: https://github.com/openworm/org.geppetto.solver.sph/blob/interfacesRefactoring/src/main/java/org/geppetto/solver/sph/SPHSolverService.java#L758

Depending on what I'm running (testSolvePureLiquidScene_NoNaN or testSolveElastic_NoNaN), the wait during each step is around 80-90ms or 50-60ms, respectively. For reference, the C++ version (I really have no idea what it is doing, I just compiled and ran it) claims that runPCISPH takes about 25ms per step.

gidili commented 11 years ago

@charles-cooper that's correct - the waitFor is basically forcing the host code (Java) to wait for the device (CPU/GPU) to do what it's supposed to do (process many particles in parallel and figure out the new particle positions over an increment of time).

I would expect most of the processing time to fall between that waitFor and the mapping: the waitFor waits for the device to finish processing, while the mapping instructs it to copy the data (an I/O operation) so that we can safely access the output buffer with meaningful data in it.

To answer your question, we had it "more similar" to the C++ version a while ago, but we changed to the mapping syntax to improve performance, and it did improve quite a bit indeed.

I say "more similar" because it is fairly unclear (to me) what corresponds to what in the Java and C++ bindings, since the Java bindings (JavaCL) do a good job at abstracting that away with the drawback of making it difficult to troubleshoot stuff like this.

The suggestion to switch to mapping syntax came from a discussion with @jhurliman who looked into the JavaCL bindings quite a bit and helped figure out some important concepts. You can catch-up on that discussion here.

Also we have basic tests that show different memory usage (host / device), for both CPU and GPU.

vellamike commented 11 years ago

I would treat the self-reported C++ PCISPH profiling with scepticism; a while back there was an email thread where I reported inconsistencies between what it was reporting and what was plausible.

charles-cooper commented 11 years ago

I did see that discussion with @jhurliman. But it is unclear to me that there is really going to be that big an overhead. From what I can tell, JavaCL is using direct byte buffers here for the JVM/native trip, which in Java 1.7 is essentially zero-copy and very low overhead. Then again, I don't have a GPU, so maybe the CPU code just isn't slower.

gidili commented 11 years ago

Empirical evidence says that switching from explicit write/read operations to mapping halved the processing time on CPU and lowered it even more on GPU (for both myself and @tarelli - we are on identical systems).

My guess is that using implicit copying (mapping) is saving us an extra memory allocation on the host (we are talking about big buffers)... but it's just a guess :)

charles-cooper commented 11 years ago

That makes sense.

What I really don't understand is why it would be slower than the C++ version, given that most of the time is spent letting OpenCL do its calculations. On my machine anyway, the bottleneck is clearly in the waitFor operation.

gidili commented 11 years ago

That's the big question.

Another factor when working with low-level parallel stuff (regardless of whether it's CPU or GPU) comes down to the device "layout" (global/local work group size) and buffer sizes. Telling the device to allocate buffers of the "wrong" size can affect performance quite a bit. This is quite counter-intuitive, because "wrong" often means not a multiple of whatever the global work group size is.

Here's an example from some other code I have using other bindings (not JavaCL):

int elementCount = ELEM_COUNT;                                  // Length of arrays to process
int localWorkSize = min((int)kernel.getWorkGroupSize(device), 256);  // Local work size dimensions
int globalWorkSize = roundUp(localWorkSize, elementCount);   // rounded up to the nearest multiple of the localWorkSize

 /* input buffers declarations */
CLBuffer<FloatBuffer> V_in_Buffer = context.createFloatBuffer(globalWorkSize, READ_WRITE);

You can see that even though we need buffers of size ELEM_COUNT there is a round-up. If you don't do that it will still work, but much more slowly.
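
The roundUp helper is just integer arithmetic; something like the following sketch (a generic version, not the code from either solver):

class WorkSizeUtil {
    // Round elementCount up to the nearest multiple of localWorkSize so that the
    // global work size divides evenly into full work groups.
    static int roundUp(int localWorkSize, int elementCount) {
        int remainder = elementCount % localWorkSize;
        return remainder == 0 ? elementCount : elementCount + localWorkSize - remainder;
    }
}

// e.g. WorkSizeUtil.roundUp(256, 10000) == 10240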

JavaCL takes care of stuff like the above for us, letting us declare buffers of exactly the size we need. This is great on one hand, but my gut feeling is that something sub-optimal must be going on under the hood of the library that is not happening when using the C++ bindings (even though there is no rounding up in the buffer declarations in the C++ version either).

The .cl code is the same, so the bindings are in my opinion somehow responsible for the difference in processing time.

Again - this is just an educated guess :)

gidili commented 11 years ago

Here are some numbers on CPU (i7 2.7GHz) and GPU (nVidia GeForce something).

I ran the last test 5 times for each case and averaged the max and min times recorded on this line. Usually the first few steps take a bit more time and then the processing time goes down until it plateaus (in particular for the CPU there's a big difference; I think there must be some intelligence in the firmware that optimizes the operation after it gets repeated a few times).

on the interfacesRefactoring branch:

CPU: 1.7s to 945ms
GPU: 250ms to 87ms

on the clMemAllocHost branch:

CPU: 1.7s to 938ms
GPU: 550ms to 385ms

So it looks pretty much identical on CPU but quite a bit slower on GPU (and I am quite sure it's not the other way around - I double-checked).

Weird!

Maybe there is not enough parallelism on the CPU for the different allocation strategy to make an impact?
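
For what it's worth, timing a step comes down to a harness like this (an illustrative sketch, not the actual test code):

class StepTimer {
    // Average the wall-clock time of a step over `iterations` runs, discarding a few
    // warm-up steps first, since the first steps are consistently slower.
    static double averageStepMillis(Runnable step, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) step.run();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) step.run();
        return (System.nanoTime() - start) / 1e6 / iterations;
    }
}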

charles-cooper commented 11 years ago

Yeah, not surprising that it actually leads to a slowdown. It's going to interact in some weird way with the mapping and the reads/writes. I wish I had a GPU so I could play around with the flags and see what's going on.

@JohnIdol, from the profiling logging, is there a particular section which seems to take up more time? Or is it all just slower in general?

gidili commented 11 years ago

@charles-cooper looking at the log files from the unit test runs on those two branches, the difference is basically all in the waitFor - it just takes longer to finish the integration step for all the particles. The rest is pretty much the same.

BTW the interfacesRefactoring branch has been merged - we are back on master.

charles-cooper commented 11 years ago

Okay, interesting. Well, seeing as I don't have a GPU I think I am pretty much useless on this issue :P.

gidili commented 11 years ago

@charles-cooper if you are still interested in working on this item, we could fire up an Amazon EC2 instance with a GPU and you could work on that.

charles-cooper commented 11 years ago

Sounds enticing, but will EC2 virtualization really give the proper PCIe bandwidth?

gidili commented 11 years ago

I have no idea - I guess the only way to tell is to try it! :)

charles-cooper commented 11 years ago

Okay, why don't we try it? I won't have time for a week or so though.

gidili commented 11 years ago

OK - let's sync-up via email when you free up.

msasinski commented 11 years ago

The performance hit is negligible - that's why they are able to sell this as a product :)

slarson commented 11 years ago

@shabanovd would be great to have your perspective on this. Thanks!

shabanovd commented 11 years ago

To see the real speed of the GPU, the strategy must be the following: load all the data onto the device and perform the calculations with minimal data reads back (or none at all) per calculation step. That leads to the need to have all calculations coded in OpenCL, including sort/max etc., and to have two evaluation modes: "visual" and "batch".

Note: remember that the CPU uses main memory, so its performance is not affected by copy-to/from-device operations.
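
In other words, something along these lines - an illustrative JavaCL-style sketch of the batch idea with made-up names, not code from the solver:

import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

class BatchStepper {
    // Run totalSteps integration steps keeping all particle data on the device, and only
    // map the position buffer back to the host every readInterval steps ("visual" mode);
    // with readInterval > totalSteps this degenerates into a pure "batch" run.
    static void run(CLQueue queue, CLKernel integrateKernel, CLBuffer<Float> positions,
                    int particleCount, int totalSteps, int readInterval) {
        for (int step = 0; step < totalSteps; step++) {
            CLEvent done = integrateKernel.enqueueNDRange(queue, new int[] { particleCount });
            if (step % readInterval == 0) {
                done.waitFor(); // only synchronize with the host when we actually need the data
                Pointer<Float> snapshot = positions.map(queue, CLMem.MapFlags.Read);
                // ... hand the snapshot to visualisation / logging here ...
                positions.unmap(queue, snapshot);
            }
            // on an in-order queue the next step can be enqueued without waiting
        }
    }
}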

gidili commented 11 years ago

Hi @shabanovd - thanks for your feedback, let me try to answer your questions:

Are there any performance tests? I found only PCISPHSolverBigTest, but I personally can't call it big. It should be close to the actual/planned data size, with a run of around 5 minutes.

Those are the biggest tests we have at the moment (>50k particles); they are not huge, but they are good enough to understand whether changes to the solver improve performance or not. I am mainly using this test as a reference to compare performance.

I see that the sort is always performed on the CPU - why?

Because by default it uses the CPU. You can change that from the default constructor, or use the other constructor to specify whether you want to use the CPU or the GPU.

What is going to be the final result of the calculation? An image to display, or something else?

It's a physics solver, so the result is going to be particle positions. In order to visualize them you have to go through this tutorial to set up the simulation engine.

The time measurements are done totally wrong: they are affected by I/O operations (the logger.info(...) calls), and enqueueNDRange doesn't mean the calculation has completed.

I see you're not big on diplomacy :) Good! Of course the logging takes time, but the goal here is not to measure computation time with absolute precision, but to improve total computation time on this line. Logging is just a useful tool to quickly verify whether changes have any effect and to spot high-latency areas. All the logging will go away eventually.

I would suggest you go ahead and fork the repositories you need and show us what you would do differently - the code will speak for itself :)

msasinski commented 11 years ago

@gidili I believe that @shabanovd meant that a more deterministic result would be achievable if you swapped lines 832 & 833. Also, maybe instead of removing the logging operations at a later time, it would be better to wrap them in an if statement with isDebugEnabled(). Also, using log4j's default formatting capabilities (logger.info("SPH STEP END, took {}ms", (end - start));) may help with performance.
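
For reference, the kind of guard being suggested looks like this (a generic sketch assuming the plain log4j API - the solver may use a different logging facade):

import org.apache.log4j.Logger;

class StepTimingLogger {
    private static final Logger logger = Logger.getLogger(StepTimingLogger.class);

    static void logStepTime(long start, long end) {
        // Only build and emit the message when the level is actually enabled,
        // so the string concatenation is skipped when logging is turned down.
        if (logger.isInfoEnabled()) {
            logger.info("SPH STEP END, took " + (end - start) + "ms");
        }
    }
}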

gidili commented 11 years ago

@msasinski I don't disagree with any of that (besides the "deterministic result" thing, as I don't quite understand what you mean by that), but that's not the point - logging takes a finite amount of time that is always the same. Absolute performance will improve by removing logging, but it makes no difference when I am trying to measure relative changes in performance while making changes to the code.

tarelli commented 11 years ago

@msasinski I'm not sure in what way you are using the word deterministic, but swapping those two lines will just make the measurement slightly more precise (by not counting the time spent logging that message); it won't affect in any way the kind of time scale we are dealing with.

We are talking about a step that with some tests takes 400ms, and this issue is about investigating possible ways to make it 10 times smaller; the impact of logging, although not zero, is negligible.

About the if statement: log4j lets you decide from the properties file whether messages with info priority should be logged or not, so there is no need for if statements to switch them off.

msasinski commented 11 years ago

@tarelli it's often more efficient to check with isDebugEnabled() whether the logging should be performed than to depend on log4j's internal settings, especially in cases where the logging string must first be created. I'm not sure I can agree with you that logging won't affect the time scale. We're not running this on a real-time system, and any process (e.g. disk writes) can affect the time it takes to perform a logging operation.

gidili commented 11 years ago

@msasinski you make it sound like logging is the issue and if we remove logging everything is fine - LOL. We put the logging in in the first place because it was slow, so that we could understand where the latency was without profiling every time. Logging is not the issue; if you think it is, please write some code, run a benchmark and show us that you "solved" performance :)

If we want to talk about logging best practices please do it on this other issue I opened: https://github.com/openworm/org.geppetto.solver.sph/issues/14

msasinski commented 11 years ago

@gidili in my personal experience the amount of time it takes to log something is never the same. As a matter of fact, the time required to run any process on a system that is not a real-time OS can't be predicted, and is just an educated guess. I believe the logging changes proposed by @shabanovd do make sense. Just my 2 cents.

msasinski commented 11 years ago

@gidili I never said that logging is the issue, but, like @shabanovd, I believe that it could be improved.

shabanovd commented 11 years ago

To stop this war I have to clarify myself :-) By "logging" I mean that it is useful, and if so, it would be better to have just a few lines showing only the easily readable, valuable figures. I understand it may be there for historical reasons... my point is: "it does not matter how we got here, what matters much more is where we are going next". My proposal is to clean up the logging so there are just a few lines instead of the current tens.

@gidili, the sort should be done in OpenCL to avoid device-host-device transfers. It may look like a small benefit on the current dataset, but on a big one it changes the rules of the game.

What are the target figures? Number of particles? Calculation time?

Is it possible to generate a setup with the target number of particles? (Let's call it a stress test.)

slarson commented 10 years ago

Hi all -- Just wanted to highlight that this has become a focus for the group again, as @a-palyanov has reached a stopping point on the Sibernetic code base for the moment. He has implemented the worm body and muscles quite successfully, but it is slow, so optimization is no longer premature. @vellamike has been working to run the latest code base on a 64-core machine to speed up the run time. This has caused a push to make the Sibernetic code run in a headless mode. @vellamike also points out that if you are trying to get the code base to run on an AMD processor, the Intel OpenCL drivers are what you want.

However, updates to the basic algorithm are welcome. The membranes branch where @a-palyanov did all his work to implement the worm body has been merged into the master here: https://github.com/openworm/Smoothed-Particle-Hydrodynamics.

If you have been on the sidelines waiting to help out, this is a good time to re-engage with the code base and help out.

Neurophile commented 10 years ago

Would it be possible for someone to generate a Makefile for the project so it can be compiled on *nix?

slarson commented 10 years ago

I believe @vellamike has a start on this for macOS which should be close. True, Mike?


gidili commented 10 years ago

@Neurophile I think @vellamike has already done that for ubuntu.

Mike?

vellamike commented 10 years ago

I added the Linux makefile (https://github.com/openworm/Smoothed-Particle-Hydrodynamics/blob/electrophysiology/Release/makefile) to GitHub in a "Release" folder. This may not be the best place to put it (I don't know much about C++ projects) - any opinions on this?

A few points to note:

  1. This is on the "electrophysiology" branch, which has some slight variations from the master branch but not in any of the core functionality.
  2. Once you have compiled it, include the src folder (https://github.com/openworm/Smoothed-Particle-Hydrodynamics/tree/electrophysiology/Release/src) in your PYTHONPATH environment variable.
  3. You will need the Intel OpenCL drivers; do not use the AMD ones, even if you have an AMD chip.
  4. To get it to work, initially run on a CPU rather than a GPU; I have hit problems with NVIDIA.
  5. It probably won't work on old (pre-2009) processors.

Please let me know if you have any questions, Mike
