Is this Java-specific or just an SPH limitation?
That's a good question - to answer it we'd have to run both the Java and C++ versions on the same machine and see how they fare. For example, one step of the pureLiquid scene @skhayrulin has on the C++ version was taking 4 seconds on Matteo's laptop (a top-of-the-range laptop) and twice as long on mine (a four-year-old laptop). My guess is that C++ may provide a small performance boost, but I think it's more about the bottlenecks in our code, and it's a bit slow in general on modest hardware :)
OK. My hunch is that 4 years should not account for such a slowdown, but I could easily be wrong on that. Once Geppetto is Ubuntu-friendly we can test this.
Mike
In fairness my laptop is pretty bad - Matteo's is pretty good though. I would be interested in hearing from you and @skhayrulin how long it takes to compute a step on that same scene on your systems (it's in the configuration folder; you can run it by swapping file names as you did the other day, in case you have time to try).
The method that sends all the buffers down to the kernels a number of times takes on average 4 seconds. What happens inside it is 95% OpenCL; Java itself has nothing to do with it, and the current Java implementation can only account for a very small percentage. I think the real bottleneck, as Gio suggested previously, is the buffers being sent back and forth for every different bit of the algorithm.
Which one?
@vellamike positions/velocities PureLiquid.txt - files here: https://github.com/openworm/Smoothed-Particle-Hydrodynamics/tree/master/configuration
Posted a question with details on this on the JavaCL discussion group (https://groups.google.com/forum/?fromgroups=#!topic/nativelibs4java/UkiS8kjnJJ8) - the bottleneck seems to be the "read" operation at the end of each step. It is still unclear whether this is due to misuse of the JavaCL bindings or a problem in JavaCL itself.
Hopefully we'll hear back!
Heard back from @ochafik in that thread - tried what he suggested but no luck. Hoping for some more suggestions as I am all out of ideas :)
@skhayrulin can you help?
@slarson Sure I'll try to.
Tried this on a GeForce GT 650M - the step takes from 200ms to 400ms - better than the 4s on CPU, but still not good enough I think.
@charles-cooper in case you are curious about performance -- here's the issue.
where is the computation step?
@charles-cooper here it is: https://github.com/openworm/org.geppetto.solver.sph/blob/interfacesRefactoring/src/main/java/org/geppetto/solver/sph/SPHSolverService.java#L787
You'll notice there's a lot of logging garbage in there so we know where most of the time is spent. It seems the time-consuming bit is when the output buffers get mapped to Java objects, forcing a copy from the device (CPU/GPU) to the Java host code. This seems to be much faster with the C++ bindings.
This is how it happens in the C++ version, for reference.
Why are we using maps instead of enqueuing read/write buffers like in the C++ code?
Hrm, according to my ghetto profiling, it seems the code is almost always sitting in
"main" prio=10 tid=0x00007f815c008000 nid=0x393b runnable [0x00007f8163eb5000]
   java.lang.Thread.State: RUNNABLE
        at com.nativelibs4java.opencl.library.OpenCLLibrary.clWaitForEvents(Native Method)
        at com.nativelibs4java.opencl.CLEvent.waitFor(CLEvent.java:202)
        at com.nativelibs4java.opencl.CLEvent.waitFor(CLEvent.java:183)
        at org.geppetto.solver.sph.SPHSolverService.step(SPHSolverService.java:779)
Which is here: https://github.com/openworm/org.geppetto.solver.sph/blob/interfacesRefactoring/src/main/java/org/geppetto/solver/sph/SPHSolverService.java#L758
Depending on what I'm running (testSolvePureLiquidScene_NoNaN or testSolveElastic_NoNaN), the time of the wait during each step is around 80-90ms or 50-60ms, respectively. For reference, the C++ version (I really have no idea what that is doing, I just compiled and ran it) claims that runPCISPH takes about 25ms per step.
@charles-cooper that's correct - the waitFor is basically forcing the host code (Java) to wait for the device (CPU / GPU) to do what it's supposed to do (process many particles in parallel and figure out new particle positions over an increment of time).
I would expect most of the processing time to be between that waitFor and the mapping: the waitFor waits for the device to finish processing, while the mapping instructs it to copy the data (an I/O operation) so that we can safely access the output buffer with meaningful data in it.
To answer your question, we had it "more similar" to the C++ version a while ago but changed to the mapping syntax to improve performance, and it did improve quite a bit indeed.
I say "more similar" because it is fairly unclear (to me) what corresponds to what between the Java and C++ bindings, since the Java bindings (JavaCL) do a good job of abstracting that away, with the drawback of making it difficult to troubleshoot stuff like this.
The suggestion to switch to the mapping syntax came from a discussion with @jhurliman, who looked into the JavaCL bindings quite a bit and helped figure out some important concepts. You can catch up on that discussion here.
Also we have basic tests that show different memory usage (host / device), for both CPU and GPU.
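For anyone following along, here is a minimal sketch of the two approaches as I understand the JavaCL (BridJ-based) API - the buffer name and size are made up for illustration, this is not the actual solver code:

import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

public class MapVsReadSketch {
    public static void main(String[] args) {
        CLContext context = JavaCL.createBestContext();
        CLQueue queue = context.createDefaultQueue();

        // Hypothetical output buffer (e.g. particle positions, 4 floats per particle).
        long n = 10000L * 4;
        CLBuffer<Float> positions = context.createFloatBuffer(CLMem.Usage.InputOutput, n);

        // In the real solver the SPH kernels would be enqueued here, and the CLEvent
        // they return would be passed to read/map below so those calls wait for them.

        // (a) Explicit read: allocates a fresh host-side Pointer and copies the whole
        //     buffer from the device into it.
        Pointer<Float> copy = positions.read(queue);

        // (b) Mapping: asks OpenCL to expose the buffer to the host, which the driver
        //     can potentially satisfy without an extra host-side allocation/copy.
        Pointer<Float> mapped = positions.map(queue, CLMem.MapFlags.Read);
        // ... consume the mapped data ...
        positions.unmap(queue, mapped);
    }
}

Either way the host ends up waiting for the device to finish the step (that's the waitFor we see in the profiling); the difference is what happens to the data afterwards.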
I would treat the self-reported C++ PCISPH profiling with scepticism; a while back there was an email thread where I reported inconsistencies between what it was claiming and what was plausible.
I did see that discussion with @jhurliman, but it is unclear to me that there is really going to be that big an overhead. From what I can tell, JavaCL is using direct byte buffers here for the JVM/native trip, which in Java 1.7 is essentially zero-copy and very low overhead. Then again, I don't have a GPU, so maybe the CPU code just isn't slower.
Empirical evidence says that switching from explicit write/read operations to mapping halved the processing time on CPU and lowered it even more on GPU (for both myself and @tarelli - we are on identical systems).
My guess is that using implicit copying (mapping) is saving us an extra memory allocation on the host (we are talking about big buffers)... but it's just a guess :)
That makes sense.
What I really don't understand is why it would be slower than the C++ version given that most of the time is spent in letting OpenCL do its calculation thing. On my machine anyways, the bottleneck is clearly in the waitFor operation.
That's the big question.
Another factor when working with low-level parallel stuff (regardless of whether it's CPU or GPU) comes down to the device "layout" (global / local work group size) and buffer sizes. Telling the device to allocate buffers of the "wrong" size can affect performance quite a bit. This is quite counter-intuitive, because "wrong" often means not a multiple of the local work group size.
Here's an example from some other code I have using other bindings (not JavaCL):
int elementCount = ELEM_COUNT; // Length of arrays to process
int localWorkSize = min((int)kernel.getWorkGroupSize(device), 256); // Local work size dimensions
int globalWorkSize = roundUp(localWorkSize, elementCount); // rounded up to the nearest multiple of the localWorkSize
/* input buffers declarations */
CLBuffer<FloatBuffer> V_in_Buffer = context.createFloatBuffer(globalWorkSize, READ_WRITE);
You can see that even though we only need buffers of size ELEM_COUNT there is a round-up. If you don't do that it will still work, but much more slowly.
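For completeness, the roundUp above just rounds the element count up to the next multiple of the local work size - something along these lines (a sketch of what such a helper typically looks like, not necessarily the exact one in that code):

// Round globalSize up to the nearest multiple of groupSize, so that the global work
// size passed to enqueueNDRange divides evenly into full work-groups.
private static int roundUp(int groupSize, int globalSize) {
    int remainder = globalSize % groupSize;
    return (remainder == 0) ? globalSize : globalSize + groupSize - remainder;
}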
JavaCL takes care of stuff like the above for us, letting us declare buffers of exactly the size we need. This is great on one hand, but my gut feeling is that something sub-optimal must be going on under the hood of the library that is not happening when using the C++ bindings (even though there is no rounding up in the buffer declarations in the C++ version either).
The .cl code is the same, so the bindings are in my opinion somehow responsible for the difference in processing time.
Again - this is just an educated guess :)
Here are some numbers on CPU (i7 2.7GHz) and GPU (nVidia GeForce something).
I ran the last test 5 times for each case and averaged the max and min recorded on this line. Usually the first few steps take a bit more time and then the processing time goes down until it levels off (in particular for the CPU there's a big difference; I think there must be some intelligence in the firmware that optimizes the operation after it gets repeated a few times).
on the interfacesRefactoring branch:
CPU: 1.7s (max) - 945ms (min); GPU: 250ms (max) - 87ms (min)
on the clMemAllocHost branch:
CPU: 1.7s (max) - 938ms (min); GPU: 550ms (max) - 385ms (min)
So it looks pretty much identical on CPU but quite a bit slower on GPU (and I am quite sure it's not the other way around - I double-checked).
Weird!
Maybe there is not enough parallelism on the CPU for the different allocation strategy to make an impact?
Yeah, it's not surprising that it actually leads to a slowdown. It's going to interact in some weird way with the mapping and the read/writes. I wish I had a GPU so I could play around with the flags and see what's going on.
@JohnIdol from the profiling logging is there a particular section which seems to take up more time? Or is it just all slower in general?
@charles-cooper looking at the log files from the unit test runs on those 2 branches, the difference is basically all in the waitFor - it simply takes longer to finish the integration step for all the particles. The rest is pretty much the same.
BTW the interfacesRefactoring branch has been merged - we are back on master.
Okay, interesting. Well, seeing as I don't have a GPU I think I am pretty much useless on this issue :P.
@charles-cooper if you are still interested in working on this item we could fire up an Amazon EC2 instance with a GPU and you could work on there.
Sounds enticing, but will EC2 virtualization really give the proper PCIe bandwidth?
I have no idea - I guess the only way to tell is to try it! :)
Okay, why don't we try it? I won't have time for a week or so though.
OK - let's sync-up via email when you free up.
The performance hit is negligible - that's why they are able to sell this as a product :)
@shabanovd would be great to have your perspective on this. Thanks!
To see the real speed of the GPU, the strategy must be: load all the data onto the device and perform the calculations with minimal data reads (or none at all) for a given calculation step. That leads to the need to have all calculations coded in OpenCL (including sort/max etc.) and to have two evaluation modes: "visual" and "batch".
Note: remember that the CPU uses main memory, so its performance is not affected by copying to/from the device.
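To illustrate the idea with a rough sketch (the solver interface below is hypothetical, for illustration only - not the actual API): in "batch" mode many steps are enqueued back to back and the positions are only read back to the host when actually needed, e.g. for visualization or a checkpoint.

// Hypothetical interface, for illustration only.
interface SphSolver {
    void enqueueStepKernels(); // enqueue all SPH kernels for one step, no host reads
    void readPositions();      // device-to-host transfer of the particle positions
    void finishQueue();        // block until everything enqueued has completed
}

class BatchModeSketch {
    // "visual" mode would use readInterval = 1; "batch" mode something much larger.
    static void run(SphSolver solver, int totalSteps, int readInterval) {
        for (int step = 0; step < totalSteps; step++) {
            solver.enqueueStepKernels();  // all buffers stay resident on the device
            if (step % readInterval == 0) {
                solver.readPositions();   // the only copy back to the host
            }
        }
        solver.finishQueue();
    }
}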
Hi @shabanovd - thanks for your feedback, let me try to answer your questions:
Are there any performance tests? I found only PCISPHSolverBigTest, but personally I can't call it big. It should run on something close to the actual/planned data, with a run of ~5 minutes.
Those are the biggest tests we have at the moment (>50k particles); they are not huge, but they are good enough to understand whether changes to the solver improve performance or not. I am mainly using this test as a reference to compare performance.
I see that the sort is always performed on the CPU - why?
Because by default it uses the CPU. You can change that default in the default constructor, or use the other constructor to specify whether you want to use CPU or GPU.
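For reference, this is roughly the kind of device selection involved, assuming the JavaCL API (a sketch, not the actual constructor code):

import com.nativelibs4java.opencl.*;
import com.nativelibs4java.opencl.CLPlatform.DeviceFeature;

public class DeviceSelectionSketch {
    public static void main(String[] args) {
        // Ask JavaCL for the "best" context, preferring a GPU device;
        // DeviceFeature.CPU would prefer a CPU device instead.
        CLContext context = JavaCL.createBestContext(DeviceFeature.GPU);
        CLQueue queue = context.createDefaultQueue();
        System.out.println("Running on: " + context.getDevices()[0]);
    }
}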
What is going to be the final result of the calculation? An image to display, or something else?
It's a physics solver, so the result is going to be particle positions. In order to visualize them you have to go through this tutorial to set up the simulation engine.
The time measurements are done totally wrong: they are affected by I/O operations (the logger.info("...") calls), and enqueueNDRange doesn't mean the calculation has completed.
I see you're not big on diplomacy :) Good! Of course the logging takes time, but the goal here is not to measure computation time with absolute precision, but to improve total computation time on this line. Logging is just a useful tool to quickly verify whether changes have any effect and to spot high-latency areas. All the logging will go away eventually.
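(As a side note on the enqueueNDRange point: the enqueue call is indeed asynchronous, so a step timing only means something if the step also waits on the returned event - which is exactly the waitFor showing up in the profiling above - before the end timestamp is taken. Roughly the pattern below, with illustrative names, not the actual solver code.)

import com.nativelibs4java.opencl.*;

class StepTimingSketch {
    // enqueueNDRange is asynchronous, so without the waitFor (or a queue.finish())
    // we would only be timing how long it takes to submit the work to the queue.
    static long timeStep(CLKernel kernel, CLQueue queue, int globalWorkSize) {
        long start = System.currentTimeMillis();
        CLEvent stepEvent = kernel.enqueueNDRange(queue, new int[] { globalWorkSize });
        stepEvent.waitFor();
        long end = System.currentTimeMillis();
        return end - start;
    }
}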
I would suggest you go ahead and fork the repositories you need and show us what you would do differently - the code will speak for itself :)
@gidili I believe that @shabanovd meant that a more deterministic result would be achievable if you swap lines 832 & 833. Also, maybe instead of removing the logging operations at a later time, it would be better to wrap them in an if statement with isDebugEnabled(). Also, using log4j's default formatting capabilities (logger.info("SPH STEP END, took {}ms", (end - start));) may help with performance.
@msasinski I don't disagree with any of that (besides the "deterministic result" thing, as I don't quite understand what you mean by it), but that's not the point - logging takes a finite amount of time that is roughly the same each run. Absolute performance will improve by removing logging, but it makes no difference when I am trying to measure relative changes in performance while making changes to the code.
@msasinski I'm not sure in what way you are using the word "deterministic", but swapping those two lines will just make the measurement slightly more precise (by not counting the time spent logging that message); it won't in any way affect the kind of time scale we are dealing with.
We are talking about a step that in some tests takes 400ms, and this issue is about investigating possible ways to make it 10 times smaller; the impact of logging, although not zero, is negligible.
About the if statement: log4j lets you decide from the properties file whether messages with info priority get logged or not, so there is no need for if statements to switch them off.
@tarelli it's often more efficient to check with isDebugEnabled() whether the logging should be performed than to depend on the logger's internal settings, especially in the case where the logging string must first be created. I'm not sure I can agree with you that logging won't affect the time scale. We're not running this on a real-time system, and any process (e.g. disk writes) can affect the time it takes to perform a logging operation.
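Something along these lines is what I mean (just a sketch with placeholder names; the guard makes sure the message string is never built when that log level is disabled):

import org.apache.log4j.Logger;

class LoggingGuardSketch {
    private static final Logger logger = Logger.getLogger(LoggingGuardSketch.class);

    static void logStep(long start, long end) {
        // The string concatenation only happens if debug logging is actually enabled.
        if (logger.isDebugEnabled()) {
            logger.debug("SPH STEP END, took " + (end - start) + "ms");
        }
    }
}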
@msasinski you make it sound like logging is the issue and that if we remove logging everything will be fine - LOL, we put the logging in in the first place because it was slow, so that we could understand where the latency was without profiling every time. Logging is not the issue; if you think it is, please write some code, run a benchmark and show us that you've "solved" performance :)
If we want to talk about logging best practices please do it on this other issue I opened: https://github.com/openworm/org.geppetto.solver.sph/issues/14
@gidili in my personal experience the amount of time it takes to log something is never the same. As a matter of fact, the time required to run any process on a system that is not a real-time OS can't be predicted and is just an educated guess. I believe the logging changes proposed by @shabanovd do make sense. Just my 2 cents.
@gidili I never said that logging is the issue, but just as @shabanovd, I believe that it could be improved.
To stop this war I have to clarify :-) By "logging" I mean that it is useful, and if so it is better to have just a few lines showing only the valuable figures at a glance. I do understand that it may be there because... My point is: "it does not matter how we got here, it's much more important where we are going next". My proposal is to clean up the logging so that we have just a few lines instead of the current tens.
@gidili, the sort should be done in OpenCL to avoid device-host-device operations. It may look like a small benefit on the current dataset, but a big one changes the game's rules.
What are the target figures? Number of particles? Calculation time?
Is it possible to generate a setup with the target number of particles? (Let's call it a stress test.)
Hi all -- just wanted to highlight that this has become a focus for the group again, as @a-palyanov has reached a stopping point on the Sibernetic code base for the moment. He has implemented the worm body and muscles quite successfully, but it is slow, so optimization is no longer premature. @vellamike has been working to run the latest code base on a 64-core machine to speed up the run time. This has caused a push to make the Sibernetic code run in a headless mode. @vellamike also points out that if you are trying to get the code base to run on an AMD processor, the Intel OpenCL drivers are what you want.
However, updates to the basic algorithm are welcome. The membranes branch where @a-palyanov did all his work to implement the worm body has been merged into the master here: https://github.com/openworm/Smoothed-Particle-Hydrodynamics.
If you have been on the sidelines waiting to help out, this is a good time to re-engage with the code base and help out.
Would it be possible for someone to generate a Makefile for the project so it can be compiled on *nix?
I believe @vellamike has a start on this for macOS which should be close. True, Mike?
@Neurophile I think @vellamike has already done that for Ubuntu.
Mike?
I added the Linux makefile (https://github.com/openworm/Smoothed-Particle-Hydrodynamics/blob/electrophysiology/Release/makefile) to GitHub in a "Release" folder. This may not be the best place to put it (I don't know much about C++ projects) - any opinions on this?
A few points to note:
Please let me know if you have any questions, Mike
The Sibernetic codebase is the C++ implementation of the C. elegans body. It includes an implementation of the smoothed particle hydrodynamics (SPH) algorithm. On top of this algorithm we implement multiple types of matter, including liquids, elastic matter, and hard surfaces.
Code base is here: https://github.com/openworm/Smoothed-Particle-Hydrodynamics
Several of the comments in this thread also describe the Geppetto implementation of the SPH algorithm. However, right now the C++ version (Sibernetic) has the latest updates, and therefore the performance work should be focused there.
Simulations with > 10000 particles are fairly slow. Need to improve performance.