openworm / OpenWorm

Repository for the main Dockerfile with the OpenWorm software stack and project-wide issues
http://openworm.org
MIT License

Position of particles is assuming NaN values #92

Closed gidili closed 11 years ago

gidili commented 11 years ago

The Java ported version of SPH is suffering from an issue where freely moving particles assume the same position as boundary particles. Once this happens positions of moving particles assume NaN values.

gidili commented 11 years ago

It is important to understand that the NaN issue seems to be only a symptom of the fact that freely moving particles are assuming the same position as boundary particles. The assumption is that this should not be happening. Need @a-palyanov and @skhayrulin to confirm this is the case.

For this to happen (freely moving particles assuming position of boundary particles) somewhere in the algorithm the following must be happening:

  • Division of anything by 0 (including 0 by 0)
  • Square root of a negative number

vellamike commented 11 years ago

This argument might be a bit hand-wavy but here we go:

  1. There should be a repulsive force modelled which prevents boundary particles from assuming the same position as free particles?
  2. It's quite a big coincidence that particles are randomly assuming the same position as the boundary particles...
  3. From 1 and 2 I suspect that the function which produces the repulsive force may be broken, and may even possibly be producing an attractive force.
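For anyone following along, the two NaN sources listed in this thread (division by zero, including 0 by 0, and the square root of a negative number) can be sketched in plain Java. This is only an illustration of the arithmetic, not the solver's code: the class `NaNSourcesDemo` and the method `unitDirection` are hypothetical names, and where exactly the real kernels divide by a distance is an assumption.

```java
// Illustrative sketch only; names are hypothetical and not taken from the solver.
final class NaNSourcesDemo {

    /** Unit vector pointing from a to b. When a equals b the length is 0 and every component is 0/0. */
    static float[] unitDirection(float[] a, float[] b) {
        float dx = b[0] - a[0];
        float dy = b[1] - a[1];
        float dz = b[2] - a[2];
        float length = (float) Math.sqrt(dx * dx + dy * dy + dz * dz);
        return new float[] { dx / length, dy / length, dz / length };
    }

    public static void main(String[] args) {
        float[] boundary = { 0f, 0f, 0f };
        float[] free = { 0f, 0f, 0f }; // free particle sitting exactly on the boundary particle

        float[] direction = unitDirection(free, boundary);
        System.out.println(Float.isNaN(direction[0]));       // true: 0f / 0f is NaN
        System.out.println(Double.isNaN(Math.sqrt(-1.0)));   // true: sqrt of a negative number is NaN
        System.out.println(Float.isInfinite(1f / 0f));       // true: nonzero / 0 is Infinity, not NaN

        // Once a NaN enters an update, it propagates to everything computed from it.
        float nextX = free[0] + 0.1f * direction[0];
        System.out.println(Float.isNaN(nextX));              // true
    }
}
```

Note that only the coincident case gives NaN straight away; a nonzero value divided by zero gives Infinity, which can still turn into NaN a step later (for example Infinity minus Infinity).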

gidili commented 11 years ago

@vellamike I didn't mean to imply the particles get there at random - in fact I can observe position slowly approaching the boundary and instead of being repulsed assuming the same position as boundary particles. So I tend to completely agree with your assessment, something is broken in the logic that governs repulsion.

My guess is that this is happening only in this implementation because of rounding differences between floats in C++ and Java, but in order to prove this I need to run both implementations side by side and see where things diverge. It's next on my list.
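A rough sketch of what "run both side by side and see where things diverge" could look like, assuming each implementation can dump its position buffer per step as a `float[]`. The class `DivergenceFinder`, the method name, and the tolerance are all hypothetical; nothing here is taken from either code base.

```java
// Hypothetical helper for comparing two recorded runs step by step.
final class DivergenceFinder {

    /**
     * Returns the index of the first step at which any coordinate of the two runs
     * differs by more than the tolerance, or -1 if they agree on all compared steps.
     * Assumes both runs store the same number of floats per step.
     */
    static int firstDivergentStep(float[][] runA, float[][] runB, float tolerance) {
        int steps = Math.min(runA.length, runB.length);
        for (int step = 0; step < steps; step++) {
            for (int i = 0; i < runA[step].length; i++) {
                float a = runA[step][i];
                float b = runB[step][i];
                boolean bothNaN = Float.isNaN(a) && Float.isNaN(b);
                // The negated comparison also flags the case where only one side is NaN.
                if (!bothNaN && !(Math.abs(a - b) <= tolerance)) {
                    return step;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        float[][] reference = { { 0.5f, 0.5f }, { 0.4f, 0.4f }, { 0.3f, 0.3f } };
        float[][] port = { { 0.5f, 0.5f }, { 0.4f, 0.4f }, { 0.0f, Float.NaN } };
        System.out.println(firstDivergentStep(reference, port, 1e-6f)); // prints 2
    }
}
```

Knowing the first divergent step narrows the search to whichever kernel ran just before it, which should help separate a rounding drift (small, growing differences) from a logic error (a sudden jump).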

vellamike commented 11 years ago

Your explanation (rounding errors causing the attraction) sounds implausible to me for this kind of change in behaviour. It sounds much more like something like a sign error.

gidili commented 11 years ago

I doubt it Mike - I double checked everything for sign errors many times. Still it's not impossible, but I'd be surprised.

vellamike commented 11 years ago

What happens with one free particle and only boundary particles? Does it always get "sucked in"?

gidili commented 11 years ago

No - I had the same idea and put together a test that does just that, let it go for 10000 steps and it doesn't get sucked in. Tried with different initial conditions (speed / position) for the particle too.

vellamike commented 11 years ago

Does it get pushed away from the boundary particle if they are very close?

gidili commented 11 years ago

I tweaked initial position and velocity till I managed to reproduce the issue with just 1 particle.

I have a particle very close to a corner of the boundary box and I basically shoot it towards the corner:

<positionVector p="1.1" z="0.000177" y="0.000245" x="0.000275"/>
<velocityVector p="0.0" z="-5.3000275" y="-5.45896786" x="-5.42222357"/>

After one step its coordinates are the same as the origin (0, 0, 0). After another step coordinates are (NaN, NaN, NaN)

The problem of course being that it doesn't bounce back.

This happens when running on CPU. When running on GPU... the particle bounces back. The issue still exists for other configurations with many particles on GPU but this outlines the fact that CPU and GPU are generating different results.

I am at a loss.

a-palyanov commented 11 years ago

Giovanni, does the Java version include boundaries composed of particles, or are they still represented by repulsive planes? I think that currently the latter variant is implemented, while 'particle walls' are more advanced and stable. Possibly changing this may help?

gidili commented 11 years ago

@a-palyanov the latest Java implementation should be the exact same as the C++ one so "unfortunately" we already have boundaries made up of particles :)

vellamike commented 11 years ago

@JohnIdol If CPU and GPU are generating different results this is a huge headache. JavaCL could be the culprit, or something like this: http://www.khronos.org/message_boards/showthread.php/6664-Different-output-on-GPU-vs-CPU

@JohnIdol I would like to propose that you set up an OSGi bundle in Geppetto (am I using the correct terminology?) which is a placeholder for a wrapped C++ PCISPH implementation - I could then get to work on this with your help.

This has some advantages:

  • Gives me more opportunity to learn about Geppetto
  • Fallback if JavaCL is an unfixable bottleneck
  • Excuse for me to optimize SPH code

Mike

gidili commented 11 years ago

@vellamike yeah, looking at that article it could be an OpenCL driver bug on Apple (FML). At this point my priority is getting the C++ version to run on my machine to see if I observe the same difference between CPU and GPU.

About your proposal to set up the OSGi bundle (terminology is correct), I think there's no harm in attempting that in parallel. Will look into what's involved.

vellamike commented 11 years ago

Great,

We need to reduce our goals to solving this small number of specific problems.

Mike

gidili commented 11 years ago

@vellamike believe it or not I got the C++ version to run on Mac :)

[image: I sold my soul for SPH]

cc: @jhurliman @tarelli @a-palyanov @skhayrulin

vellamike commented 11 years ago

I assume it was quite easy?

a-palyanov commented 11 years ago

@JohnIdol liquid or elastic particles in the C++ version never have an X, Y or Z coordinate equal to zero during simulation -- only boundary particles do. Repulsion forces from boundary particles start acting when a moving particle comes closer than r0 (particle radius), which has a value nearly equal to (smoothing radius)/2. So maybe there is a bug, not so obvious, preceding the problem, which causes very weak repulsion or a complete absence of it? Maybe it could be helpful to compare all variables (starting from delta, which is calculated within a special function calcDelta(), Wpoly6Coefficient, gradWspikyCoefficient etc.), to be sure that the C++ and Java versions start their work with identical conditions?
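As a starting point for that comparison, here is a small sketch of the textbook SPH kernel coefficients and the repulsion-range test described above. The formulas are the standard poly6 and spiky-gradient normalisations; whether the project's Wpoly6Coefficient and gradWspikyCoefficient also fold in a simulation scale or other factors is not shown in this thread, so treat the class `SphConstantsSketch`, its method names, the example value of h, and the choice r0 = h / 2 as assumptions.

```java
// Sketch under stated assumptions; not the project's actual constants code.
final class SphConstantsSketch {

    /** Standard poly6 kernel normalisation: 315 / (64 * pi * h^9). */
    static double wPoly6Coefficient(double h) {
        return 315.0 / (64.0 * Math.PI * Math.pow(h, 9));
    }

    /** Standard spiky kernel gradient normalisation: -45 / (pi * h^6). */
    static double gradWSpikyCoefficient(double h) {
        return -45.0 / (Math.PI * Math.pow(h, 6));
    }

    /** True when a moving particle is closer to a boundary particle than r0, taken here as h / 2. */
    static boolean withinRepulsionRange(double distance, double h) {
        double r0 = h / 2.0; // assumption: the thread only says r0 is "nearly equal" to h / 2
        return distance < r0;
    }

    /** Relative difference, useful for flagging constants that disagree between two ports. */
    static double relativeDifference(double a, double b) {
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return scale == 0.0 ? 0.0 : Math.abs(a - b) / scale;
    }

    public static void main(String[] args) {
        double h = 3.34; // hypothetical smoothing radius, for illustration only
        System.out.println("Wpoly6Coefficient     = " + wPoly6Coefficient(h));
        System.out.println("gradWspikyCoefficient = " + gradWSpikyCoefficient(h));
        System.out.println("repulsion at 0.1?       " + withinRepulsionRange(0.1, h));
    }
}
```

Printing these from both implementations with the same h and diffing the output would be a cheap way to confirm they start from identical conditions.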

gidili commented 11 years ago

@vellamike if someone tells you how to do it almost everything is easy. Unfortunately Apple handles libraries and framework stuff in a completely different way and I had never worked with C++ on mac before so it actually was fairly tricky. After getting the project to compile I also had to doctor the OpenCL code to get the kernels to compile, basically a lot of experience-based educated guessing. I was fairly surprised when it worked :)

@a-palyanov thanks for the suggestion - I will double check all the initial variables and constants to make sure they are the same and get back to you.

vellamike commented 11 years ago

I'm surprised about the complexity, on Ubuntu it ran immediately - though to be fair @skhayrulin is probably responsible for that.

gidili commented 11 years ago

@a-palyanov I went over the constants as you suggested and I found that MASS was 0.001 in the Java version compared to 0.003 in the C++ one. Must've missed it before. The interesting thing is that MASS is used to calculate the DELTA you mentioned. Here's the change

https://github.com/openworm/org.geppetto.core/commit/5d939a3f281ddc8c03433580f91251779a381b65

Anyway - the result of this is that all the tests seem to now pass for NaN checks on GPU (particles no longer have the same coordinates as boundary particles), and only one test fails on CPU. Before we had 4 tests always failing on NaN checks on CPU and 3 on GPU.

The one that fails on CPU is the one with only one freely moving particle placed very very close to the origin, and I am shooting it towards the corner of the boundary box:

<positionVector p="1.1" z="0.000177" y="0.000245" x="0.000275"/>
<velocityVector p="0.0" z="-5.3000275" y="-5.45896786" x="-5.42222357"/>

A few things to consider:

I am not considering this issue as closed since it still happens on CPU - but this is certainly progress.

I haven't visually inspected the simulation in terms of system behaviour as a whole and either way we are still in need of tests that validate the PCI-SPH behaviour as a system.

I would be curious to see if on Windows / Ubuntu people get the same results both on CPU and GPU (by simply running the tests) to see if it really is an Apple driver bug. I could not run the original C++ on CPU (for some reason it only sees my GPU) so I cannot verify there either for now, will need to devise a mini-test just for that.

P.S. I am working on the "interfacesRefactoring" branch if anybody wants to try
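For readers who want to reproduce the NaN checks mentioned above without digging through the test class, a minimal sketch follows. It assumes positions are stored as a flat `float[]` with four floats per particle (x, y, z, p), mirroring the attributes of the positionVector element; the class `PositionChecks` and both method names are hypothetical and not the names used in the actual tests.

```java
// Hypothetical helpers; the buffer layout (x, y, z, p per particle) is an assumption.
final class PositionChecks {

    static final int FLOATS_PER_PARTICLE = 4;

    /** Returns the index of the first particle with a NaN coordinate, or -1 if there is none. */
    static int firstNaNParticle(float[] positions) {
        for (int base = 0; base + FLOATS_PER_PARTICLE <= positions.length; base += FLOATS_PER_PARTICLE) {
            if (Float.isNaN(positions[base])
                    || Float.isNaN(positions[base + 1])
                    || Float.isNaN(positions[base + 2])) {
                return base / FLOATS_PER_PARTICLE;
            }
        }
        return -1;
    }

    /** True when particles a and b have exactly the same x, y and z. */
    static boolean samePosition(float[] positions, int a, int b) {
        int ia = a * FLOATS_PER_PARTICLE;
        int ib = b * FLOATS_PER_PARTICLE;
        return positions[ia] == positions[ib]
                && positions[ia + 1] == positions[ib + 1]
                && positions[ia + 2] == positions[ib + 2];
    }

    public static void main(String[] args) {
        float[] positions = {
            0.000275f, 0.000245f, 0.000177f, 1.1f,   // the particle from the failing test
            Float.NaN, Float.NaN, Float.NaN, 1.1f    // what a broken particle looks like
        };
        System.out.println(firstNaNParticle(positions)); // prints 1
    }
}
```

Running `samePosition` against every boundary particle after each step would catch the coincidence one step before the NaN shows up, which is the earlier symptom described at the top of this issue.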

tarelli commented 11 years ago

Elastic matter does not explode anymore after changing that constant.

[image: screen shot 2013-05-24 at 15 56 52]

tarelli commented 11 years ago

But nothing changes in the scene neither in this one or in the liquid one...

vellamike commented 11 years ago

Nice!

gidili commented 11 years ago

It's something else to do with refactoring - check the unit tests, I am pretty sure particles are moving.

tarelli commented 11 years ago

Well they move with 0.01

tarelli commented 11 years ago

After many cycles they also move slightly with 0.03. Something dodgy.

gidili commented 11 years ago

I looked at values in the unit tests with the new value for mass = 0.003 and they were moving, at least in some of the tests.

charles-cooper commented 11 years ago

Maybe the CPU and GPU have different behavior for some floating point arithmetic?
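One generic way this can happen, sketched in plain Java: float addition is not associative, so two devices that accumulate the same terms in a different order can end up with different sums. This is not a diagnosis of the solver; whether the kernels actually sum in a different order on CPU and GPU is an assumption, and the class `FloatOrderDemo` is a hypothetical name.

```java
// Illustration of order-dependent float sums; not taken from the solver's kernels.
final class FloatOrderDemo {

    static float sumLeftToRight(float[] values) {
        float sum = 0f;
        for (float v : values) {
            sum += v;
        }
        return sum;
    }

    static float sumRightToLeft(float[] values) {
        float sum = 0f;
        for (int i = values.length - 1; i >= 0; i--) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] terms = { 1e8f, -1e8f, 1f };
        System.out.println(sumLeftToRight(terms)); // prints 1.0
        System.out.println(sumRightToLeft(terms)); // prints 0.0 (the 1 is lost when added to -1e8 first)
    }
}
```

Differences of this kind are usually tiny, but close to a boundary, where forces change sharply over short distances, a small difference could plausibly decide whether a particle is repelled in time.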

gidili commented 11 years ago

@charles-cooper could be - but on the Khronos forums there are hints at possible compiler bugs in Apple's implementation of the OpenCL drivers.

I am seeing this issue on OSX; I still haven't gotten around to running those tests on Windows / Linux. If behavior is consistent across CPU / GPU on other platforms, it would corroborate the theory.

charles-cooper commented 11 years ago

Sorry for being dumb but -- what test exactly is failing? I checked out interfacesRefactoring and ran mvn test but no tests were run.

gidili commented 11 years ago

@charles-cooper you're not being dumb at all - please ask any questions you might have, our lack of documentation is unfortunately staggering at the moment.

The tests are under org.geppetto.solver.sph - they should run if you run it as Maven Install or Maven Test. Here's the test class, if for some weird reason it doesn't run with the build you can run it directly with JUnit.

To run that bundle you need:

All under the interfacesRefactoring branch.

It would be extremely helpful if you could run all the tests in there on CPU and GPU (default is CPU but there's a parameter in the solver constructor) and check if you get the same results, maybe post screenshots of JUnit results or smt so it's easier to compare :)

P.S. are you on Windows / Linux / OSX?

charles-cooper commented 11 years ago

Linux (Debian) on i5. No GPU. Working on getting OpenCL now.

charles-cooper commented 11 years ago

Does anybody know how to get this to work on Linux? I got the AMD OpenCL implementation thingy by running sudo apt-get install ocl-icd-libopencl1, after which running mvn test gives all manner of failures along the lines of

com.nativelibs4java.opencl.CLException: OpenCL Error : CL_PLATFORM_NOT_FOUND_KHR (make sure to log all errors with environment variable CL_LOG_ERRORS=stdout) and java.lang.NoClassDefFoundError: Could not initialize class com.nativelibs4java.opencl.JavaCL

Is this the right place to be posting all of this?

tarelli commented 11 years ago

Charles, thanks a lot for trying this out! We are aware of issues getting JavaCL to work on linux. We have it down as a task to try and figure out what the problem is but bandwidth is a problem at the moment and we could really use some help!

JavaCL is a third-party library. If you could help us get it to work on Linux, by figuring out what the problem is or what combination of drivers is needed, that would be an immense help to the project.

Here are few pointers:

Issue page on JavaCL https://github.com/ochafik/nativelibs4java/issues?labels=JavaCL&state=open feel free to raise an issue against their repo

Issue on our repo to keep track of the problem (there might be different problems in different distros but maybe there's a common solution) https://github.com/openworm/OpenWorm/issues/64

The JavaCL discussion group https://groups.google.com/forum/?fromgroups#!forum/nativelibs4java some people might have some answers

Again this is something that is blocking different folks on linux and it would be a huge help if you could help us figuring out what's wrong!

charles-cooper commented 11 years ago

Ah, glad to see it is not just my problem ;). I'll see what I can do.

charles-cooper commented 11 years ago

Well, this thing passes all tests on my Linux box. Here is the output of mvn test in org.geppetto.solver.sph:

[INFO] Scanning for projects...
[WARNING] POM for 'biz.aQute:bndlib:pom:1.50.0:runtime' is invalid.

Its dependencies (if any) will NOT be available to the current build.
[INFO] ------------------------------------------------------------------------
[INFO] Building Geppetto SPH Solver Bundle
[INFO] task-segment: [test]
[INFO] ------------------------------------------------------------------------
[INFO] [dependency:copy-dependencies {execution: copy-dependencies}]
[INFO] jackson-annotations-2.0.6.jar already exists in destination.
[INFO] jackson-core-2.0.6.jar already exists in destination.
[INFO] jackson-databind-2.1.0.jar already exists in destination.
[INFO] dx-1.7.jar already exists in destination.
[INFO] bridj-0.6.2.jar already exists in destination.
[INFO] javacl-1.0.0-RC3.jar already exists in destination.
[INFO] javacl-core-1.0.0-RC3.jar already exists in destination.
[INFO] nativelibs4java-utils-1.5.jar already exists in destination.
[INFO] opencl4java-1.0.0-RC3.jar already exists in destination.
[INFO] jaxb-impl-2.1.2.jar already exists in destination.
[INFO] commons-lang-2.6.jar already exists in destination.
[INFO] activation-1.1.jar already exists in destination.
[INFO] com.springsource.javax.servlet-2.5.0.jar already exists in destination.
[INFO] jaxb-api-2.1.jar already exists in destination.
[INFO] stax-api-1.0-2.jar already exists in destination.
[INFO] com.springsource.org.aopalliance-1.0.0.jar already exists in destination.
[INFO] com.springsource.org.apache.commons.logging-1.1.1.jar already exists in destination.
[INFO] core-0.0.1.jar already exists in destination.
[INFO] model.sph-0.0.1.jar already exists in destination.
[INFO] org.springframework.aop-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.asm-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.beans-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.context-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.context.support-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.core-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.expression-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.jdbc-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.jms-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.orm-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.oxm-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.spring-library-3.0.0.RELEASE.libd already exists in destination.
[INFO] org.springframework.transaction-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.web-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.web.portlet-3.0.0.RELEASE.jar already exists in destination.
[INFO] org.springframework.web.servlet-3.0.0.RELEASE.jar already exists in destination.
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [resources:testResources {execution: default-testResources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 8 resources
[INFO] [compiler:testCompile {execution: default-testCompile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [surefire:test {execution: default-test}]
[INFO] Surefire report directory: /media/files/Downloads/org.geppetto.solver.sph/target/surefire-reports


T E S T S

Running org.geppetto.solver.sph.internal.PCISPHSolverTest
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.JavaCL
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.JavaCL$OpenCLProbeLibrary
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.TimeT
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.TimeT$timeval_customizer
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.StructIO$DefaultCustomizer
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.TimeT$timeval
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.StructObject
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.NativeObject
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.AbstractIntegral
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type java.lang.Number
Setting of real/effective user Id to 0/0 failed
FATAL: Module fglrx not found.
Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_kernel
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.Pointer$ListType
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type java.lang.Enum
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.Pointer$StringType
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.Pointer$Releaser
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.TypedPointer
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.Pointer$OrderedPointer
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.Pointer
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_program
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_sampler
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_context
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_command_queue
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_platform_id
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_event
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_GLsync
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_mem
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$cl_device_id
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clSetMemObjectDestructorAPPLE_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clSetMemObjectDestructorAPPLE_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type org.bridj.Callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clSetPrintfCallback_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clSetPrintfCallback_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clEnqueueNativeKernel_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clEnqueueNativeKernel_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clSetEventCallback_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clSetEventCallback_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clLinkProgram_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clLinkProgram_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clCompileProgram_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clCompileProgram_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clBuildProgram_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clBuildProgram_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clSetMemObjectDestructorCallback_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clSetMemObjectDestructorCallback_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clCreateContextFromType_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clCreateContextFromType_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clCreateContext_arg1_callback
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clCreateContext_arg1_callback_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clCreateSubDevicesEXT_fn
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clCreateSubDevicesEXT_fn_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clRetainDeviceEXT_fn
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clRetainDeviceEXT_fn_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clReleaseDeviceEXT_fn
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clReleaseDeviceEXT_fn_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clIcdGetPlatformIDsKHR_fn
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clIcdGetPlatformIDsKHR_fn_NativeImpl
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary$clGetGLContextInfoKHR_fn
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.library.OpenCLLibrary_clGetGLContextInfoKHR_fn_NativeImpl
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
May 28, 2013 12:44:41 AM org.bridj.BridJ log INFO: Registering type com.nativelibs4java.opencl.CLEvent$3
log4j:WARN No appenders could be found for logger (org.geppetto.solver.sph.SPHSolverService).
log4j:WARN Please initialize the log4j system properly.
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
created CLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Version OpenCL C 1.2 Version 1113.2 (sse2,avx) using Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) max workgroup size: 1024 max workitems size: 1024
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.652 sec
Running org.geppetto.solver.sph.internal.KernelTest
Testing CPU using host memory
Created OpenCL contextCLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) Found device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Using device: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) OpenCL version: OpenCL C 1.2 Driver version: 1113.2 (sse2,avx) Max workgroup size: 1024 Max workitems size: 1024
Testing CPU using device memory
Created OpenCL contextCLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) Found device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Using device: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) OpenCL version: OpenCL C 1.2 Driver version: 1113.2 (sse2,avx) Max workgroup size: 1024 Max workitems size: 1024
Testing GPU using host memory
Created OpenCL contextCLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) Found device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Using device: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) OpenCL version: OpenCL C 1.2 Driver version: 1113.2 (sse2,avx) Max workgroup size: 1024 Max workitems size: 1024
Testing GPU using device memory
Created OpenCL contextCLContext(platform = AMD Accelerated Parallel Processing; devices = Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz) Found device - 0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) Using device: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz (AMD Accelerated Parallel Processing) OpenCL version: OpenCL C 1.2 Driver version: 1113.2 (sse2,avx) Max workgroup size: 1024 Max workitems size: 1024
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.183 sec

Results :

Tests run: 10, Failures: 0, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9 seconds
[INFO] Finished at: Tue May 28 00:44:47 PDT 2013
[INFO] Final Memory: 36M/683M
[INFO] ------------------------------------------------------------------------

gidili commented 11 years ago

@charles-cooper thanks for trying this Charles - extremely helpful. On my system at least one of the tests (the first one) fails on CPU, while they all pass on GPU. Not conclusive but it's another piece of the puzzle :)

Do you have Eclipse installed on the machine you're running the tests on? There are a few things we could try by slightly modifying the tests.

charles-cooper commented 11 years ago

Yes I have Eclipse but it seems so complicated that I can never get anything to build ;). What are you thinking of?

gidili commented 11 years ago

@charles-cooper lol - if it builds on Maven it should build from Eclipse too; usually it's the other way around that doesn't work :)

We have Eclipse distributions packaged with all the plugins you need, but unfortunately we have no Debian distro.

You can follow these instructions though to get the dev environment up from scratch.

Basically I had to comment out a few lines from some of the tests (the last two, the only "big" ones) to limit them, because they were breaking the build with "out of memory" errors. I would've asked you to play around with those tests, uncomment a few lines, change some variables etc. to see if we get the same results :)

gidili commented 11 years ago

@charles-cooper oh wait - I forgot we have a script that will set up the whole dev environment for you :)

https://github.com/openworm/OpenWorm/blob/master/utilities/build_dev_environment.py

charles-cooper commented 11 years ago

Ah, Eclipse. As soon as I get it to build I get fun messages like:

Setting of real/effective user Id to 0/0 failed
FATAL: Module fglrx not found.
Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly

I did uncomment the lines involving checkModelForOverlappingParticles:

git diff
diff --git a/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java b/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java
index 58a618a..9b3d82e 100644
--- a/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java
+++ b/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java
@@ -291,7 +291,7 @@ public class PCISPHSolverTest
-// checkModelForOverlappingParticles((SPHModelX)model, true);
+checkModelForOverlappingParticles((SPHModelX)model, true);
@@ -314,7 +314,7 @@ public class PCISPHSolverTest
-// checkModelForOverlappingParticles((SPHModelX)model, true);
+checkModelForOverlappingParticles((SPHModelX)model, true);

Ran mvn test and got output:

Results :

Failed tests:
org.geppetto.solver.sph.internal.PCISPHSolverTest.testSolvePureLiquidScene_NoNaN(): Found no overlapping particles when they were expected.
org.geppetto.solver.sph.internal.PCISPHSolverTest.testSolveElastic_NoNaN(): Found no overlapping particles when they were expected.

Tests run: 10, Failures: 2, Errors: 0, Skipped: 0

gidili commented 11 years ago

That's my bad - those lines should read

checkModelForOverlappingParticles((SPHModelX)model, false); 

You're a victim of copy-paste!
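For anyone following along, the second argument is the expectation: the check fails when what it finds differs from what the test says it should find, which is why passing true produced "Found no overlapping particles when they were expected." A minimal sketch of that shape, assuming particles are plain position arrays with a boundary flag (the class, the method names and the data layout here are hypothetical, not the real Geppetto helper):

```java
final class OverlapCheckSketch {

    /** Returns true if any free particle has exactly the same position as a boundary particle. */
    static boolean hasOverlap(float[][] positions, boolean[] isBoundary) {
        for (int i = 0; i < positions.length; i++) {
            if (isBoundary[i]) {
                continue; // outer loop walks free particles only
            }
            for (int j = 0; j < positions.length; j++) {
                if (!isBoundary[j]) {
                    continue; // inner loop walks boundary particles only
                }
                if (positions[i][0] == positions[j][0]
                        && positions[i][1] == positions[j][1]
                        && positions[i][2] == positions[j][2]) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Hypothetical stand-in for the (model, expected) check: throws when finding and expectation differ. */
    static void checkForOverlap(float[][] positions, boolean[] isBoundary, boolean expected) {
        boolean found = hasOverlap(positions, isBoundary);
        if (found != expected) {
            throw new AssertionError(found
                    ? "Found overlapping particles when they were not expected."
                    : "Found no overlapping particles when they were expected.");
        }
    }
}
```

The pairwise scan is quadratic in the number of particles, which fits the note in the test about the check taking a while for big scenes.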

The ones I was gonna ask about are the ones at the bottom though:

checkFinalStateStringForNaN(stateSet.lastStateToString(), false);

If you can uncomment those on the last two tests and set the time configuration to run 10 steps instead of 1:

new TimeConfiguration(0.1f, 10, 1)

Do tests pass like that?

P.S. you're gonna have to do the -Xmx1g thing otherwise you'll probably run out of memory :)
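As a rough illustration of what a NaN check on the final state can look like, here is a sketch that scans either raw values or a printed state string. The class and method names are made up for this example and the real checkFinalStateStringForNaN may work differently; the only thing relied on is that Java prints a NaN float as the text "NaN".

```java
final class NaNCheckSketch {

    /** Returns true if any value in the array is NaN. */
    static boolean containsNaN(float[] values) {
        for (float v : values) {
            if (Float.isNaN(v)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the printed state mentions NaN anywhere. */
    static boolean containsNaN(String stateString) {
        return stateString.contains("NaN");
    }

    /** Hypothetical stand-in for a (stateString, expected) check: throws when finding and expectation differ. */
    static void checkForNaN(String stateString, boolean expected) {
        boolean found = containsNaN(stateString);
        if (found != expected) {
            throw new AssertionError(found
                    ? "Found NaN values when they were not expected."
                    : "Found no NaN values when they were expected.");
        }
    }
}
```

Scanning the string is crude but catches a NaN in any field of the state; checking the raw float values is the stricter option when they are available.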

charles-cooper commented 11 years ago

Tests all seem to pass. I wish I could attach files instead of this primitive copy pasting, but here goes:

git diff
diff --git a/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java b/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java
index 58a618a..144c602 100644
--- a/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java
+++ b/src/test/java/org/geppetto/solver/sph/internal/PCISPHSolverTest.java
@@ -291,15 +291,15 @@ public class PCISPHSolverTest
 // we have one particle that overlaps with boundary particles in this test
 // NOTE: commenting out as it takes a while for big scenes
-// checkModelForOverlappingParticles((SPHModelX)model, true);
+checkModelForOverlappingParticles((SPHModelX)model, false);
 SPHSolverService solver = new SPHSolverService();
 solver.initialize(model);
-StateSet stateSet = solver.solve(new TimeConfiguration(0.1f, 1, 1));
+StateSet stateSet = solver.solve(new TimeConfiguration(0.1f, 10, 1));
-//checkFinalStateStringForNaN(stateSet.lastStateToString(), false);
+checkFinalStateStringForNaN(stateSet.lastStateToString(), false);
 }
@@ -314,14 +314,14 @@ public class PCISPHSolverTest
 // we have one particle that overlaps with boundary particles in this test
 // NOTE: commenting out as it takes a while for big scenes
-// checkModelForOverlappingParticles((SPHModelX)model, true);
+checkModelForOverlappingParticles((SPHModelX)model, false);
 SPHSolverService solver = new SPHSolverService();
 solver.initialize(model);
-StateSet stateSet = solver.solve(new TimeConfiguration(0.1f, 1, 1));
+StateSet stateSet = solver.solve(new TimeConfiguration(0.1f, 10, 1));
-//checkFinalStateStringForNaN(stateSet.lastStateToString(), false);
+checkFinalStateStringForNaN(stateSet.lastStateToString(), false);
 }
 }

mvn test

Results :

Tests run: 10, Failures: 0, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 50 seconds
[INFO] Finished at: Tue May 28 09:55:34 PDT 2013
[INFO] Final Memory: 46M/680M
[INFO] ------------------------------------------------------------------------

gidili commented 11 years ago

@charles-cooper awesome - this gives me some confidence the NaN issue might actually be resolved!

Thanks a lot for helping with this :)

gidili commented 11 years ago

NaN happens only in an edge case (a single particle extremely close to a boundary particle and shot towards it with considerable speed) on Apple CPU - we have a test for it. It never happens on GPU, or on CPU on any other system we tried.
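To illustrate why a particle landing exactly on a boundary particle can end in NaN: one common way this happens is normalizing the separation vector by its length, which becomes 0/0 when the two positions coincide. The sketch below assumes that kind of normalization; it is not taken from the actual SPH kernels, and the class, method names and the eps guard are hypothetical.

```java
final class CoincidentParticlesSketch {

    /** Unit vector pointing from b to a; every component is NaN when a equals b. */
    static float[] direction(float[] a, float[] b) {
        float dx = a[0] - b[0];
        float dy = a[1] - b[1];
        float dz = a[2] - b[2];
        float r = (float) Math.sqrt(dx * dx + dy * dy + dz * dz);
        // For coincident particles r is 0, and 0f / 0f evaluates to NaN in IEEE 754 arithmetic.
        return new float[] { dx / r, dy / r, dz / r };
    }

    /** Guarded variant: returns a zero vector when the particles are closer than eps. */
    static float[] safeDirection(float[] a, float[] b, float eps) {
        float dx = a[0] - b[0];
        float dy = a[1] - b[1];
        float dz = a[2] - b[2];
        float r = (float) Math.sqrt(dx * dx + dy * dy + dz * dz);
        if (r < eps) {
            return new float[] { 0f, 0f, 0f };
        }
        return new float[] { dx / r, dy / r, dz / r };
    }

    public static void main(String[] args) {
        float[] p = { 1f, 2f, 3f };
        System.out.println(Float.isNaN(direction(p, p)[0])); // prints "true"
        System.out.println(safeDirection(p, p, 1e-6f)[0]);   // prints "0.0"
    }
}
```

Once one component is NaN it spreads through every later update of that particle, which matches the observation that positions stay NaN after the overlap occurs.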

Changing the mass of particles from 0.001 to 0.003 (what's the measurement unit @a-palyanov?) seems to have fixed it. Also, increasing the mass makes the liquid behave more visibly like a liquid.

I am closing this for now.