polymec / polymec-dev

The development "branch" of the polymec HPC libraries.
Mozilla Public License 2.0

Fix flakiness in polyhedral mesh file writing. #68

Closed. pbtoast closed this issue 8 years ago

pbtoast commented 8 years ago

The create_*mesh_n_proc (n > 1) unit tests occasionally hang, which suggests there are deadlock issues in the mesh partitioning process and/or the writing of mesh files. This doesn't happen often, but it's not the robust behavior we're striving for. It should be fixed.

pbtoast commented 8 years ago

This occurs more rarely than I had thought, so maybe it's not super high priority at the moment.
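
To put a number on "rarely", one option is to rerun the affected test many times and count passes, failures, and hangs. The following is a minimal sketch, not part of polymec: `run_repeatedly` is a hypothetical helper, and the run count and timeout are arbitrary assumptions. A run that exceeds the timeout is counted as a hang.

```python
import subprocess
import sys
from collections import Counter


def run_repeatedly(cmd, runs=50, timeout=60.0):
    """Run `cmd` `runs` times and count passes, failures, and hangs.

    `cmd` is an argument list, e.g.
    ["mpirun", "-np", "4", "./test_create_uniform_mesh"].
    A run that does not finish within `timeout` seconds counts as a hang.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before re-raising.
            outcomes["hang"] += 1
            continue
        outcomes["pass" if result.returncode == 0 else "fail"] += 1
    return outcomes
```

Calling `run_repeatedly(["mpirun", "-np", "4", "./test_create_uniform_mesh"], runs=200)` from the test directory would give a rough failure rate. Note that on a timeout only the launcher process itself is killed, so stray MPI ranks may need to be cleaned up by hand.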

pbtoast commented 8 years ago

One kind of failure happens during a parallel Silo write, with debris of this sort:

56/81 Testing: test_create_uniform_mesh_4_proc

56/81 Test: test_create_uniform_mesh_4_proc

Command: "/usr/local/bin/mpirun" "-np" "4" "/Users/travis/build/polymec/polymec-dev/build/Darwin-x86_64-mpi-static-double-mpicc-Release/geometry/tests/test_create_uniform_mesh"

Directory: /Users/travis/build/polymec/polymec-dev/build/Darwin-x86_64-mpi-static-double-mpicc-Release/geometry/tests

"test_create_uniform_mesh_4_proc" start time: Feb 22 03:53 UTC

Output:


[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

DBCreate: Low-level function call failed: link group

DBSetDir: File was closed or never opened/created.: link group

DBWrite: File was closed or never opened/created.: link group

DBPutMultimesh: File was closed or never opened/created.: link group


MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.


0: Fatal error: Error writing multi-mesh to Silo master file uniform_mesh_10x10x10_4f_4procs/uniform_mesh_10x10x10_4f-0.silo.

Test time = 0.30 sec

---

Test Failed.

"test_create_uniform_mesh_4_proc" end time: Feb 22 03:53 UTC

"test_create_uniform_mesh_4_proc" time elapsed: 00:00:00

---

pbtoast commented 8 years ago

Here's another failure:

60/82 Testing: test_create_rectilinear_mesh_4_proc

60/82 Test: test_create_rectilinear_mesh_4_proc

Command: "/usr/local/bin/mpirun" "-np" "4" "/Users/travis/build/polymec/polymec-dev/build/Darwin-x86_64-mpi-shared-double-mpicc-Release/geometry/tests/test_create_rectilinear_mesh"

Directory: /Users/travis/build/polymec/polymec-dev/build/Darwin-x86_64-mpi-shared-double-mpicc-Release/geometry/tests

"test_create_rectilinear_mesh_4_proc" start time: Feb 22 03:57 UTC

Output:


[==========] Running 2 test(s).

[ RUN ] test_create_rectilinear_mesh

[==========] Running 2 test(s).

[ RUN ] test_create_rectilinear_mesh

[==========] Running 2 test(s).

[ RUN ] test_create_rectilinear_mesh

[==========] Running 2 test(s).

[ RUN ] test_create_rectilinear_mesh

[ OK ] test_create_rectilinear_mesh

[ RUN ] test_plot_rectilinear_mesh

[ OK ] test_create_rectilinear_mesh

[ RUN ] test_plot_rectilinear_mesh

[ OK ] test_create_rectilinear_mesh

[ RUN ] test_plot_rectilinear_mesh


MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.


0: Fatal error: silo_file_write_mesh: Could not write mesh 'mesh'.

DBCreate: File not found or invalid permissions: ./rectilinear_4x4x4-0.silo

DBSetDir: File was closed or never opened/created.: ./rectilinear_4x4x4-0.silo

DBWrite: File was closed or never opened/created.: ./rectilinear_4x4x4-0.silo

DBPutUcdmesh: File was closed or never opened/created.: ./rectilinear_4x4x4-0.silo

[ OK ] test_plot_rectilinear_mesh

[==========] 2 test(s) run.

[ PASSED ] 2 test(s).

[ PASSED ] 2 test(s).

Test time = 0.89 sec

---

Test Failed.

"test_create_rectilinear_mesh_4_proc" end time: Feb 22 03:57 UTC

"test_create_rectilinear_mesh_4_proc" time elapsed: 00:00:00

---

pbtoast commented 8 years ago

Another:

57/82 Testing: test_create_uniform_mesh_4_proc

57/82 Test: test_create_uniform_mesh_4_proc

Command: "/usr/local/bin/mpirun" "-np" "4" "/Users/travis/build/polymec/polymec-dev/build/Darwin-x86_64-mpi-shared-double-mpicc-Release/geometry/tests/test_create_uniform_mesh"

Directory: /Users/travis/build/polymec/polymec-dev/build/Darwin-x86_64-mpi-shared-double-mpicc-Release/geometry/tests

"test_create_uniform_mesh_4_proc" start time: Feb 22 04:57 UTC

Output:


[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[==========] Running 3 test(s).

[ RUN ] test_create_uniform_mesh

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_create_uniform_mesh

[ RUN ] test_plot_uniform_mesh_to_single_file

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files

[ OK ] test_plot_uniform_mesh_to_single_file

[ RUN ] test_plot_uniform_mesh_to_n_files


MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD

with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.


3: Fatal error: Error writing multi-mesh to Silo master file uniform_mesh_10x10x10_4f_4procs/uniform_mesh_10x10x10_4f-0.silo.

DBCreate: Low-level function call failed: link group

DBSetDir: File was closed or never opened/created.: link group

DBWrite: File was closed or never opened/created.: link group

DBPutMultimesh: File was closed or never opened/created.: link group

Test time = 0.35 sec

---

Test Failed.

"test_create_uniform_mesh_4_proc" end time: Feb 22 04:57 UTC

"test_create_uniform_mesh_4_proc" time elapsed: 00:00:00

---
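
For comparing logs like the three above, a small script can pull out which rank aborted and which Silo call failed first. This is only a triage sketch: `summarize_failure` is a hypothetical helper, and the two patterns are assumptions based on the lines that appear in these logs (`MPI_ABORT was invoked on rank N` and `DBxxx: message`).

```python
import re
from collections import Counter

# Patterns inferred from the log excerpts in this thread (assumptions).
ABORT_RE = re.compile(r"MPI_ABORT was invoked on rank (\d+)")
SILO_RE = re.compile(r"^(DB\w+): (.+)$")


def summarize_failure(log_text):
    """Return (aborting_rank, first_silo_error, silo_call_counts).

    aborting_rank is None if no MPI_ABORT line is found.
    first_silo_error is a (call, message) tuple, or None if no
    Silo error line is found.
    """
    rank = None
    first = None
    calls = Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        abort = ABORT_RE.search(line)
        if abort:
            rank = int(abort.group(1))
            continue
        silo = SILO_RE.match(line)
        if silo:
            calls[silo.group(1)] += 1
            if first is None:
                first = (silo.group(1), silo.group(2))
    return rank, first, calls
```

In all three logs the first Silo error is from `DBCreate`, and the later `DBSetDir`, `DBWrite`, and `DBPut*` errors all say the file was closed or never opened. That would be consistent with them following from the failed create, though the logs alone don't confirm it.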