GoogleCodeExporter opened this issue 9 years ago. Status: Open.
please respond to dpephd1@gmail.com or dpephd-nvidia@yahoo.com
Original comment by dpep...@gmail.com
on 31 Oct 2009 at 6:22
additional information concerning driver from /proc/nvidia/version:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 190.18 Wed Jul 22 15:36:09 PDT 2009
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)
from doug.enright@gmail / dpephd1@gmail.com - please use dpephd1@gmail.com in replies
Original comment by dpep...@gmail.com
on 31 Oct 2009 at 6:32
Tried running this test on Windows XP Pro (32-bit) using VC 8 (Visual Studio 2005 Express Ed.).
The test still failed, but not due to an "unspecified launch failure".
Now I get the following:
================================================================================
run resolution 32
deltaT = 657.000000
init min/max t = 10.265625 646.734375
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.141615 (0.374520x), Linf = 3.481782 (0.973917x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg did not converge, retrying with zeroed initial search vector
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.048482 (12.194238x), Linf = 1.842807 (1.782636x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg failed
[WARNING] Failure: Sol_ProjectDivergence3DDevice::solve - could not solve for pressure
[ERROR] Equation::advance - failed to advance equation
[ASSERT] RayleighTest::assert_true at ..\..\src\tests\rayleigh.cpp line 239
[FAILED] RayleighTest
There were failures.
================================================================================
This error appears to be numerical in nature rather than a runtime/hardware issue. I would appreciate any guidance in understanding both issues raised regarding the observed RayleighTest unit test results.
Thanks,
dpe
Original comment by dpep...@gmail.com
on 3 Nov 2009 at 7:30
Attachments:
Some additional data when running under CentOS.
I tried running RayleighTest again this morning. The first time I ran it, the test never completed, stalling near the start of the resolution-32 case: it made no progress for ~10 minutes at a particular time step, at least as I interpret the "Log ratio: " output.
I ctrl-c'd the run, restarted, and obtained the launch error failure noted above. It appears likely that the Windows result, i.e. the failure-to-converge warning, is related to this behavior: Ctrl-C does not appear to terminate the program cleanly, and subsequent restarts result in a runtime/hardware failure.
For the opencurrent developers: why would the warning noted above be issued under Windows, but not under CentOS?
dpe
Original comment by dpep...@gmail.com
on 3 Nov 2009 at 4:11
Additional windows driver and system information
Original comment by dpep...@gmail.com
on 4 Nov 2009 at 9:55
Attachments:
Some additional unit test information (CentOS), for both sm13-rel and sm13-dbg. The sm13-rel process hung silently during unit test 17 (Ctrl-C needed to quit); sm13-dbg ran to completion with the noted test failures.
Original comment by dpep...@gmail.com
on 4 Nov 2009 at 3:57
Attachments:
Update - I was able to get utest to compile under the CUDA 2.3 SDK for further testing and to run it under emulation mode.
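For anyone reproducing this, device-emulation builds under the CUDA 2.3 toolkit are normally produced with nvcc's -deviceemu flag. A sketch only: the file name below is illustrative, and the actual utest build is driven by the SDK makefiles rather than a direct nvcc call.

```shell
# Illustrative invocation only -- the real build goes through the SDK makefiles.
# -deviceemu compiles kernels to run on the host CPU (CUDA 2.3 era;
# the flag was removed from later toolkits).
nvcc -deviceemu -g -o utest utest.cu
```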
Original comment by dpep...@gmail.com
on 15 Nov 2009 at 8:46
Ran unit tests compiled with debug (dbg) and release (rel) flags. Note that the debug and release flags used when compiling under the SDK differ from those used with CMake.
Only NSTest failed when compiled with debug flags. The warning/error messages issued were:
================================================================================
[INFO] Running on GPU 0
Running tests: NSTest
running NSTest
Frame 0
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 0.000061035156 (16056.320312x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg did not converge, retrying with zeroed initial search vector
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 0.000061035156 (16056.320312x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg failed
[WARNING] Failure: Sol_ProjectDivergence3DDevice::solve - could not solve for pressure
[ERROR] Equation::advance - failed to advance equation
================================================================================
MultigridMixedTest issued a warning about failure to converge:
================================================================================
[INFO] Running on GPU 0
Running tests: MultigridMixedTest
running MultigridMixedTest
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.020313601941 (4447.862672x), Linf = 0.000000000000 (nanx)
================================================================================
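A note for anyone reading the warning lines: the parenthesized "(…x)" factors look like the ratio of the current error norm to a reference value, so a "(nanx)" factor would correspond to a 0/0 division when the norm is exactly zero. A rough sketch of how such norms and factors might be computed, purely illustrative; the function names are mine, not OpenCurrent's:

```python
import math

def l2_norm(residual):
    """Root-mean-square (L2) norm of a residual vector."""
    return math.sqrt(sum(v * v for v in residual) / len(residual))

def linf_norm(residual):
    """Maximum-magnitude (Linf) norm of a residual vector."""
    return max(abs(v) for v in residual)

def reduction_factor(norm_after, norm_before):
    """The '(...x)' factor printed in the warnings, as I read them.

    A zero-by-zero division here would print as 'nan', matching the
    '(nanx)' entries in the logs above.
    """
    try:
        return norm_after / norm_before
    except ZeroDivisionError:
        return float("nan")
```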
More unit tests failed under release compilation, namely MultigridDoubleTest, NSTest, RayleighNoSlipTest, RayleighTest, and RayleighTimingTest. In addition, warnings were issued for Advection3DDoubleSwirlTest, LockExDoubleTest, and LockExTest. All warnings were due to failure of do_fmg (I assume the multigrid solver) to converge, e.g. for Advection3DDoubleSwirlTest:
================================================================================
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 1449.953977296822 (0.021035x)
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg did not converge, retrying with zeroed initial search vector
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 926.607348723281 (0.032916x)
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg failed
Advection3DDoubleSwirlTest_rel.out:[WARNING] Failure: Sol_ProjectDivergence3DDevice::solve - could not solve for pressure
================================================================================
Errors were issued primarily due to "unspecified launch failure" CUDA error messages, e.g. for MultigridDoubleTest:
================================================================================
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(256) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(256) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::relax - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Grid3DDeviceT::clear_zero - cudaMemset failed
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(128) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 1
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD::bind_tex_calculate_residual - Could not bind texture U
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_calculate_residual(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_restrict(128) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Grid1DDeviceF::init - cudaMalloc failed
MultigridDoubleTest_rel.out:[ERROR] reduce_kernel - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::restrict_residuals - failed at level 0 -> 1
MultigridDoubleTest_rel.out:[ERROR] Grid3DDeviceT::clear_zero - cudaMemset failed
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(256) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD::bind_tex_calculate_residual - Could not bind texture U
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_calculate_residual(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Grid1DDeviceF::init - cudaMalloc failed
MultigridDoubleTest_rel.out:[ERROR] reduce_kernel - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::restrict_residuals - failed at level 0 -> 0
MultigridDoubleTest_rel.out:[ERROR] Grid1DDeviceF::init - cudaMalloc failed
MultigridDoubleTest_rel.out:[ERROR] reduce_kernel - CUDA error "invalid resource handle"
================================================================================
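The cascade above follows the usual CUDA failure pattern: kernel launches are asynchronous, so an "unspecified launch failure" is only reported at the next synchronizing call, and once the context is in that error state subsequent operations (cudaMemset, cudaMalloc, texture binds, later kernels) fail as well. That would explain why one bad relax kernel drags down clear_zero, init, and reduce_kernel. A minimal sketch of the standard checking pattern, assuming the runtime API of that era; the macro is mine, not OpenCurrent's:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper; OpenCurrent has its own error reporting.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess)                                        \
            fprintf(stderr, "[ERROR] %s:%d: CUDA error \"%s\"\n",       \
                    __FILE__, __LINE__, cudaGetErrorString(err_));      \
    } while (0)

// Usage after an asynchronous launch:
//   relax_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
//   CUDA_CHECK(cudaThreadSynchronize());  // execution errors, e.g.
//                                         // "unspecified launch failure"
```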
I had to modify some of the tests to use smaller grids, since my GTX 260 has only 866 MB of onboard device memory as opposed to the 4 GB available on the Tesla C1060 line.
Any assistance in resolving these unit test failure messages would be appreciated.
dpe
Original comment by dpep...@gmail.com
on 25 Nov 2009 at 3:29
Additional comments about CMake and gdb debugging can be found here:
http://forums.nvidia.com/index.php?s=&showtopic=105027&view=findpost&p=960293
Original comment by dpep...@gmail.com
on 5 Dec 2009 at 11:34
I am now able to run the unit tests successfully (make test target); however, this required an extensive investigation of the various optimization flags and of directing those flags to specific parts of the CUDA compilation trajectory (nvopencc and ptxas). CMake 2.8 with the updated FindCUDA.cmake capability was required to enable cuda-gdb use. I am also running the CUDA 3.0 beta with the new driver (195.17).
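For reference, with CMake 2.8's FindCUDA module the per-stage flags described above can be routed through nvcc's -X options roughly as follows. This is a sketch of my understanding of the 3.0-era toolchain, not the exact opencurrent build settings:

```cmake
# CUDA_NVCC_FLAGS is the stock FindCUDA.cmake variable for extra nvcc flags.
set(CUDA_NVCC_FLAGS
    -Xopencc -O1     # optimization level for nvopencc (device front end)
    -Xptxas  -O1     # optimization level for ptxas (PTX assembler)
    )
# For cuda-gdb, device debug info is needed instead:
# list(APPEND CUDA_NVCC_FLAGS -g -G)
```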
I am only able to run at the -O0 and -O1 optimization levels. There is minimal to no performance difference between these two levels, but a significant performance difference between -O0 and the debug flags (-g -G). The various unit test performance results are attached.
I consider the issue closed, but would appreciate it if someone knowledgeable about what the specific optimization levels do could comment on this behavior and why it may be happening. I still observe unit test failures when trying to use optimization level -O2, and I do not know whether -O2 would provide additional performance benefit. If someone could comment on this, I would appreciate it.
dpe
Original comment by dpep...@gmail.com
on 15 Dec 2009 at 5:06
Attachments:
Original issue reported on code.google.com by
doug.enr...@gmail.com
on 31 Oct 2009 at 5:44
Attachments: