sailfish-team / sailfish

Lattice Boltzmann (LBM) simulation package for GPUs (CUDA, OpenCL)
http://sailfish.us.edu.pl

Hanging during simulations #37

Open sanguinariojoe opened 8 years ago

sanguinariojoe commented 8 years ago

Xubuntu 15.10, sailfish from master (f111f6e4a0953357f0871374aa825bc2eaafc2a0), ATI R9 290


When I launch the lid-driven cavity example (ldc_2d.py), everything seems to work fine at first, but then the simulation suddenly hangs:

[  1751  INFO Master/sobremesa] Machine master starting with PID 25192 at 2016-04-07 18:18:13 UTC
[  1751  INFO Master/sobremesa] Simulation started with: ./ldc_2d.py
[  1760  INFO Master/sobremesa] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[  1761  INFO Master/sobremesa] Handling subdomains: [0]
[  1761  INFO Master/sobremesa] Subdomain -> GPU map: {0: 0}
[  1764  INFO Master/sobremesa] Selected backend: opencl
[  2291  INFO Subdomain/0] Initializing subdomain.
[  2291  INFO Subdomain/0] Required memory: 
[  2291  INFO Subdomain/0] . distributions: 5 MiB
[  2291  INFO Subdomain/0] . fields: 0 MiB
[  2422  INFO Subdomain/0] On-GPU invalid result check disabled as the device does not support all required features.
/home/pepe/Downloads/sailfish/sailfish/backend_opencl.py:159: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[  5056 WARNING Subdomain/0] Running infinite simulation.
[  5056  INFO Subdomain/0] Starting simulation.
[  5510  INFO Subdomain/0] iteration:2000  speed:277.77 MLUPS
[  5727  INFO Subdomain/0] iteration:3000  speed:295.56 MLUPS
[  5951  INFO Subdomain/0] iteration:4000  speed:288.61 MLUPS
[  6175  INFO Subdomain/0] iteration:5000  speed:288.83 MLUPS
[  6441  INFO Subdomain/0] iteration:6000  speed:243.48 MLUPS
[  6753  INFO Subdomain/0] iteration:7000  speed:208.11 MLUPS
[  7033  INFO Subdomain/0] iteration:8000  speed:230.93 MLUPS
[  7318  INFO Subdomain/0] iteration:9000  speed:227.47 MLUPS
[  7574  INFO Subdomain/0] iteration:10000  speed:252.54 MLUPS
[  7808  INFO Subdomain/0] iteration:11000  speed:276.91 MLUPS
[  8067  INFO Subdomain/0] iteration:12000  speed:250.54 MLUPS
[  8304  INFO Subdomain/0] iteration:13000  speed:273.10 MLUPS
[  8595  INFO Subdomain/0] iteration:14000  speed:222.76 MLUPS
[  8858  INFO Subdomain/0] iteration:15000  speed:246.14 MLUPS
[  9052  INFO Subdomain/0] iteration:16000  speed:333.59 MLUPS
[  9260  INFO Subdomain/0] iteration:17000  speed:311.17 MLUPS
[  9503  INFO Subdomain/0] iteration:18000  speed:266.69 MLUPS
[  9774  INFO Subdomain/0] iteration:19000  speed:238.98 MLUPS
[ 10013  INFO Subdomain/0] iteration:20000  speed:271.23 MLUPS
[ 10268  INFO Subdomain/0] iteration:21000  speed:253.38 MLUPS
[ 10535  INFO Subdomain/0] iteration:22000  speed:243.09 MLUPS
[ 10782  INFO Subdomain/0] iteration:23000  speed:262.50 MLUPS
[ 11032  INFO Subdomain/0] iteration:24000  speed:258.22 MLUPS
[ 11283  INFO Subdomain/0] iteration:25000  speed:258.77 MLUPS
[ 11527  INFO Subdomain/0] iteration:26000  speed:265.50 MLUPS
[ 11791  INFO Subdomain/0] iteration:27000  speed:245.31 MLUPS
[ 12058  INFO Subdomain/0] iteration:28000  speed:242.33 MLUPS
[ 12311  INFO Subdomain/0] iteration:29000  speed:255.68 MLUPS
[ 12564  INFO Subdomain/0] iteration:30000  speed:256.76 MLUPS
[ 12818  INFO Subdomain/0] iteration:31000  speed:254.30 MLUPS
[ 13066  INFO Subdomain/0] iteration:32000  speed:261.79 MLUPS
[ 13491  INFO Subdomain/0] iteration:33000  speed:152.45 MLUPS
[ 13741  INFO Subdomain/0] iteration:34000  speed:259.01 MLUPS
[ 14018  INFO Subdomain/0] iteration:35000  speed:233.74 MLUPS
[ 14260  INFO Subdomain/0] iteration:36000  speed:267.39 MLUPS
[ 14510  INFO Subdomain/0] iteration:37000  speed:258.93 MLUPS
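The progress lines in this log have a regular shape, so the last iteration reached before a hang can be extracted from a saved log and compared between runs. A minimal sketch, not part of Sailfish: `PROGRESS` and `last_progress` are hypothetical names, and the first bracketed number is returned as an opaque stamp because the log does not state its unit.

```python
import re

# Matches lines such as:
# [ 14510  INFO Subdomain/0] iteration:37000  speed:258.93 MLUPS
PROGRESS = re.compile(
    r"\[\s*(\d+)\s+\w+\s+[^\]]+\]\s+iteration:(\d+)\s+speed:([\d.]+)\s+MLUPS"
)


def last_progress(log_text):
    """Return (stamp, iteration, mlups) of the last progress line, or None."""
    last = None
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            last = (int(match.group(1)), int(match.group(2)), float(match.group(3)))
    return last


if __name__ == "__main__":
    sample = (
        "[ 14260  INFO Subdomain/0] iteration:36000  speed:267.39 MLUPS\n"
        "[ 14510  INFO Subdomain/0] iteration:37000  speed:258.93 MLUPS\n"
    )
    print(last_progress(sample))
```

If repeated runs stall at different iterations, that would hint at a timing-dependent problem rather than a fixed point in the simulation.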

If I cancel the job, the traceback points to a synchronization problem, with the master process blocked while waiting for the simulation process to exit:

  File "./ldc_2d.py", line 41, in <module>
    ctrl.run()
  File "/home/pepe/Downloads/sailfish/sailfish/controller.py", line 793, in run
    return self._finish_simulation(subdomain_specs, summary_receiver)
  File "/home/pepe/Downloads/sailfish/sailfish/controller.py", line 708, in _finish_simulation
    self._simulation_process.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
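
The traceback ends in `os.waitpid`, which is where `multiprocessing.Process.join()` blocks until the child process exits. As a minimal sketch for narrowing this down (not Sailfish code: `stuck_worker`, `join_or_report` and `run_demo` are hypothetical names, and the worker only imitates a process that never returns), `join()` can be given a timeout so the parent reports a stuck child instead of waiting forever:

```python
import multiprocessing as mp
import time


def stuck_worker():
    """Stand-in for a simulation process that never returns (hypothetical)."""
    while True:
        time.sleep(0.1)


def join_or_report(proc, timeout):
    """Wait up to `timeout` seconds; return True if the process has exited."""
    proc.join(timeout)  # returns after `timeout` even if the child is still alive
    return not proc.is_alive()


def run_demo(timeout=0.5):
    # The "fork" start method avoids pickling the target; it is POSIX-only.
    ctx = mp.get_context("fork")
    proc = ctx.Process(target=stuck_worker)
    proc.start()
    finished = join_or_report(proc, timeout)
    if not finished:
        proc.terminate()  # sends SIGTERM to the child
        proc.join()
    return finished, proc.exitcode


if __name__ == "__main__":
    print(run_demo())
```

A child that is blocked inside a GPU driver call may not react to SIGTERM, so this kind of check mainly helps to distinguish "the child is still alive and stuck" from "the child has exited and the parent is deadlocked".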

If I instead launch the case in a single process with the following command:

./ldc_2d.py --debug_single_process

it hangs again:

[  1718  INFO MainProcess] Machine master starting with PID 25261 at 2016-04-07 18:21:15 UTC
[  1718  INFO MainProcess] Simulation started with: ./ldc_2d.py --debug_single_process
[  1728  INFO MainProcess] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[  1729  INFO MainProcess] Handling subdomains: [0]
[  1729  INFO MainProcess] Subdomain -> GPU map: {0: 0}
[  1730  INFO MainProcess] Selected backend: opencl
[  2273  INFO MainProcess] Initializing subdomain.
[  2273  INFO MainProcess] Required memory: 
[  2273  INFO MainProcess] . distributions: 5 MiB
[  2273  INFO MainProcess] . fields: 0 MiB
[  2448  INFO MainProcess] On-GPU invalid result check disabled as the device does not support all required features.
/home/pepe/Downloads/sailfish/sailfish/backend_opencl.py:159: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[  5546 WARNING MainProcess] Running infinite simulation.
[  5564  INFO MainProcess] Starting simulation.
[  6078  INFO MainProcess] iteration:2000  speed:266.26 MLUPS
[  6288  INFO MainProcess] iteration:3000  speed:304.68 MLUPS
[  6513  INFO MainProcess] iteration:4000  speed:287.41 MLUPS
[  6740  INFO MainProcess] iteration:5000  speed:285.69 MLUPS
[  6966  INFO MainProcess] iteration:6000  speed:286.89 MLUPS
[  7199  INFO MainProcess] iteration:7000  speed:278.13 MLUPS
[  7452  INFO MainProcess] iteration:8000  speed:255.82 MLUPS
[  7703  INFO MainProcess] iteration:9000  speed:257.62 MLUPS
[  7921  INFO MainProcess] iteration:10000  speed:297.96 MLUPS
[  8164  INFO MainProcess] iteration:11000  speed:266.58 MLUPS
[  8382  INFO MainProcess] iteration:12000  speed:296.28 MLUPS
[  8632  INFO MainProcess] iteration:13000  speed:259.16 MLUPS
[  8895  INFO MainProcess] iteration:14000  speed:246.05 MLUPS
[  9125  INFO MainProcess] iteration:15000  speed:282.82 MLUPS
[  9355  INFO MainProcess] iteration:16000  speed:281.31 MLUPS
[  9590  INFO MainProcess] iteration:17000  speed:275.48 MLUPS
[  9839  INFO MainProcess] iteration:18000  speed:260.35 MLUPS
[ 10076  INFO MainProcess] iteration:19000  speed:272.75 MLUPS
[ 10351  INFO MainProcess] iteration:20000  speed:235.59 MLUPS
[ 10625  INFO MainProcess] iteration:21000  speed:236.49 MLUPS
[ 11062  INFO MainProcess] iteration:22000  speed:148.00 MLUPS
[ 11284  INFO MainProcess] iteration:23000  speed:292.25 MLUPS
[ 11503  INFO MainProcess] iteration:24000  speed:295.61 MLUPS
[ 11764  INFO MainProcess] iteration:25000  speed:248.77 MLUPS
[ 12020  INFO MainProcess] iteration:26000  speed:252.55 MLUPS
[ 12274  INFO MainProcess] iteration:27000  speed:254.87 MLUPS
[ 12531  INFO MainProcess] iteration:28000  speed:252.07 MLUPS
[ 12779  INFO MainProcess] iteration:29000  speed:261.26 MLUPS

And this time I cannot even cancel the job :-S
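
One plausible reason the job cannot be cancelled (an assumption, not confirmed in this report) is that the main thread is stuck inside a native OpenCL call: Python only raises `KeyboardInterrupt` once the main thread returns to the interpreter. A minimal sketch of a watchdog that still prints the Python stack in that situation, using the `faulthandler` module (standard library in Python 3; on Python 2.7 it is only available as a separate backport package). `install_hang_diagnostics` and `remove_hang_diagnostics` are hypothetical helper names, not Sailfish functions:

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=60.0, stream=None):
    """Dump all thread stacks every `timeout_s` seconds, and on SIGUSR1."""
    if stream is None:
        stream = sys.stderr
    # On-demand dump from another terminal: kill -USR1 <pid> (POSIX only).
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    # Periodic dump from a watchdog thread, independent of the main thread.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def remove_hang_diagnostics():
    """Undo install_hang_diagnostics()."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)


if __name__ == "__main__":
    import time

    install_hang_diagnostics(timeout_s=1.0)
    time.sleep(2.5)  # stands in for the simulation loop
    remove_hang_diagnostics()
```

Calling `install_hang_diagnostics()` before the simulation starts would show which Python line was executing when progress stopped, which is more informative than the last logged iteration alone.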