hemelb-codes / hemelb

A high performance parallel lattice-Boltzmann code for large scale fluid flow in complex geometries
GNU Lesser General Public License v3.0

diffTest failing on linux. #85

Closed by schmie 9 years ago

schmie commented 9 years ago

Reported by jamespjh on 1 Nov 2011 15:48 UTC

The failure is intermittent.

It seems to occur only at -O3 or higher but, because it is intermittent, I'm not certain.

It seems to occur only when run under mpirun, not when run directly.

Running under the debugger, I get this stack:

    Program received signal SIGSEGV, Segmentation fault.
    0x000000000045248b in hemelb::vis::StreaklineDrawer::velSiteDataPointer (
        this=0x903c40, iLatDat=0x8febd0, site_i=-9223372036854775808, site_j=0,
        site_k=-9223372036854775808)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/vis/StreaklineDrawer.cc:71
    71      if (velocity_field[block_id].vel_site_data == NULL)
    (gdb) bt
    #0  0x000000000045248b in hemelb::vis::StreaklineDrawer::velSiteDataPointer (
        this=0x903c40, iLatDat=0x8febd0, site_i=-9223372036854775808, site_j=0,
        site_k=-9223372036854775808)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/vis/StreaklineDrawer.cc:71
    #1  0x00000000004529a0 in hemelb::vis::StreaklineDrawer::localVelField (
        this=0x903c40, p_index=<value optimised out>, v=0x7fffffffd850,
        is_interior=0x7fffffffd8bc, iLatDat=<value optimised out>)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/vis/StreaklineDrawer.cc:194
    #2  0x0000000000453b02 in hemelb::vis::StreaklineDrawer::updateVelField (
        this=0x903c40, stage_id=0, iLatDat=0x8febd0)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/vis/StreaklineDrawer.cc:630
    #3  0x00000000004548e8 in hemelb::vis::StreaklineDrawer::StreakLines (
        this=0x903c40, time_steps=<value optimised out>,
        time_steps_per_cycle=<value optimised out>, iLatDat=0x8febd0)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/vis/StreaklineDrawer.cc:852
    #4  0x000000000043ebd8 in hemelb::vis::Control::ProgressStreaklines (
        this=0x8fede0, time_step=716, period=1000)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/vis/Control.cc:649
    #5  0x0000000000433c34 in SimulationMaster::RunSimulation (
        this=0x7fffffffdd40, image_directory=<value optimised out>,
        snapshot_directory=<value optimised out>,
        lSnapshotsPerCycle=<value optimised out>,
        lImagesPerCycle=<value optimised out>)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/SimulationMaster.cc:456
    #6  0x000000000043bacf in main (argc=<value optimised out>,
        argv=<value optimised out>)
        at /home/jamespjh/devel/HemeLB/main/hemelb/Code/main.cc:145

schmie commented 9 years ago

Modified by jamespjh on 1 Nov 2011 18:48 UTC

schmie commented 9 years ago

Comment by jamespjh on 1 Nov 2011 18:50 UTC

There may be more than one bug here, judging by the variability in failure points.

One definite bug is that the code uses reserve() and capacity() where it should use resize() and size(). reserve() only allocates storage: the slots between size() and capacity() are not elements of the vector, so there is no guarantee that data written there will be copied to the new location when a later reserve() reallocates.

I have fixed this by replacing reserve() and capacity() with resize() and size(). I will push this pre-review so that I can confirm the fix on Jenkins.
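For reviewers, here is a minimal sketch of the difference, not the actual patch. VelSiteData, VelocityStore and sizeAfterReserve are hypothetical names made up for illustration; they are not taken from StreaklineDrawer.cc. It assumes the storage is a std::vector that grows on demand.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-site velocity record; not HemeLB's real type.
struct VelSiteData {
    float vx = 0.0f;
    float vy = 0.0f;
    float vz = 0.0f;
};

// Grow-on-demand store that uses resize()/size(). Every slot up to size()
// is a real, value-initialised element, so a later reallocation carries it over.
class VelocityStore {
public:
    // Returns the element at index, growing the vector if needed.
    // The reference is only valid until the next call that grows the store.
    VelSiteData& at(std::size_t index) {
        if (index >= data_.size()) {
            data_.resize(index + 1);
        }
        return data_[index];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<VelSiteData> data_;
};

// reserve() changes capacity() only: the vector still holds zero elements,
// so indexing into the reserved storage would be undefined behaviour.
inline std::size_t sizeAfterReserve(std::size_t n) {
    std::vector<VelSiteData> v;
    v.reserve(n);
    return v.size();
}
```

With this shape, a value written through at(3) is still there after at(1000) forces a reallocation, which is the property the reserve()/capacity() version could not guarantee.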

schmie commented 9 years ago

Comment by jamespjh on 1 Nov 2011 18:55 UTC

Hmm, can't push:

 hg push
pushing to ssh://hg@entropy.chem.ucl.ac.uk/hemelb
pushing subrepo RegressionTests
searching for changes
no changes found
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files
remote: beginning 1 autopushes
remote: pushing to ssh://hgreceive@pauli.chem.ucl.ac.uk//var/lib/hg/hemelb
remote: searching for changes
remote: remote: adding changesets
remote: remote: adding manifests
remote: remote: adding file changes
remote: remote: added 1 changesets with 1 changes to 1 files
remote: remote: abort: unknown revision 'bf9ee1be7df74facebdec9ef31d435a6342fe9ce+'!
remote: remote: transaction abort!
remote: remote: rollback completed
remote: remote: abort: pretxnchangegroup.autoupdate hook exited with status 255
remote: error: pretxnchangegroup.autopush hook raised an exception: ('unexpected response:', '')
remote: transaction abort!
remote: rollback completed
remote: abort: unexpected response: empty string
abort: unexpected response: empty string
schmie commented 9 years ago

Comment by jamespjh on 1 Nov 2011 19:01 UTC

The revision given in the error message above is the one recorded in .hgsubstate in the folder on pauli:

bf9ee1be7df74facebdec9ef31d435a6342fe9ce+ RegressionTests

schmie commented 9 years ago

Comment by jamespjh on 1 Nov 2011 19:04 UTC

Anyway, since I can't push, here's the patch file. Rupert, could you review it, please?

schmie commented 9 years ago

Modified by jamespjh on 1 Nov 2011 19:15 UTC

schmie commented 9 years ago

Comment by jamespjh on 1 Nov 2011 19:21 UTC

I'm manually patching the Jenkins copy for now, until we can push properly.

schmie commented 9 years ago

Comment by rupert on 2 Nov 2011 10:49 UTC

It's a simple change, and it works for me on OSX too.

I've fixed the problem on pauli and pushed.

Resolved by changeset 5da4a2ba0a4e