ornladios / ADIOS

The old ADIOS 1.x code repository. See ADIOS2 for the new repository.
https://csmd.ornl.gov/adios

Segmentation violation when using MPI_BGQ transport layer #138

Closed. fouriaux closed this issue 6 years ago.

fouriaux commented 7 years ago

When developing a benchmarking test (https://github.com/fouriaux/adios_experiments/tree/master/adios_buffer_size), I observed that a segmentation fault occurs in ADIOS when using the MPI_BGQ transport layer. The segmentation fault does not appear when using the MPI or MPI_Aggregate transport layers. Attached is the stack trace of the segfault from a TotalView run.

[Image attachment: adios_core_dump_srun (TotalView stack trace of the segfault)]
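
For context, below is a minimal sketch (assumed names; not the benchmark itself) of how a transport layer is selected with the ADIOS 1.x no-XML API. The only thing that changes between the working and the crashing runs is the method string passed to adios_select_method:

#include <mpi.h>
#include <adios.h>   // ADIOS 1.x no-XML write API
#include <cstdint>

int main (int argc, char** argv)
{
    MPI_Init (&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank;
    MPI_Comm_rank (comm, &rank);

    adios_init_noxml (comm);

    int64_t group_id;
    adios_declare_group (&group_id, "report", "", adios_stat_no);

    // The "MPI" and "MPI_Aggregate" runs complete in this benchmark;
    // "MPI_BGQ" is the method that triggers the segmentation fault.
    adios_select_method (group_id, "MPI_BGQ", "verbose=2", "");

    // ... define variables, then adios_open / adios_group_size /
    //     adios_write / adios_close as the benchmark does ...

    adios_finalize (rank);
    MPI_Finalize ();
    return 0;
}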

pnorbert commented 7 years ago

Can you please tell us the MPI size and the parameters you have run with? Thank you.

fouriaux commented 7 years ago

Hello, I have tested with MPI sizes between 4 and 8192 ranks, with a write size of 4k per rank, repeated 1000 times on each rank.
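
For scale, a quick back-of-the-envelope check of those parameters (assuming 4k means 4096 bytes per write): the aggregate size passes the signed 32-bit limit at a little over 500 ranks, which is one reason the 64-bit change in the next comment matters.

#include <cstdint>
#include <cstdio>
#include <initializer_list>

int main ()
{
    const uint64_t batch_size = 4096;   // assumed: 4 KiB per write
    const uint64_t nb_batchs  = 1000;   // writes per rank
    for (uint64_t ranks : {4ull, 1024ull, 8192ull}) {
        const uint64_t total = batch_size * nb_batchs * ranks;
        // 1024 ranks -> ~4.2 GB and 8192 ranks -> ~33.6 GB, both above
        // INT32_MAX (2,147,483,647), so a 32-bit total_size overflows.
        std::printf ("%8llu ranks -> %llu bytes in total\n",
                     (unsigned long long) ranks,
                     (unsigned long long) total);
    }
    return 0;
}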

pnorbert commented 7 years ago

If I do not turn off verbosity completely (note that the default is 2, for errors+warnings), I get this error:

ERROR: error allocating memory to build var index. Index aborted

This error happens on Mira/Cetus at ANL with 16 mpi tasks per node. If I run it with 8 mpi tasks per node, the test runs fine.
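
For reference, that verbosity is a per-method parameter passed through adios_select_method: the benchmark originally set verbose=0, which hides even this error, and the patch below raises it back to 2 (errors + warnings):

// From the benchmark's initAdios() after the patch below: with
// verbose=0 the index-allocation error is silenced; verbose=2
// (the default level) prints errors and warnings.
adios_select_method ( adios_group_id, method, "verbose=2", "");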

Also, I fixed the example to correctly use 64-bit offsets and global dimensions in the global array:

--- a/adios_buffer_size/adios_buffer_size.cpp
+++ b/adios_buffer_size/adios_buffer_size.cpp
@@ -23,7 +23,7 @@ static const char* method;      // adios met
 
 static int rank;            // mpi rank id of a process
 static int nb_ranks;        // nb ranks in MPI_COMM_WORLD
-static int total_size;      // total_size to be written
+static uint64_t total_size; // total_size to be written
 static int* buffer;         // allocated buffer to write of file per rank
 static int batch_size;      // size of one write in
 static int nb_batchs;       // number of batchs to write per rank
@@ -42,7 +42,7 @@ void open (const char* filename) {
 void write (int* buffer){
   adios_write (adios_handle, "global_size", (void*) &total_size);
   adios_write (adios_handle, "batch_size", (void*) &batch_size);
-  int offset = batch_size * rank;
+  uint64_t offset = batch_size * rank;
   for (int i = 0 ; i < nb_batchs; i++) {
     adios_write (adios_handle, "offset", (void*) &offset);
     adios_write (adios_handle, "data", buffer);
@@ -59,11 +59,11 @@ void initAdios (const char* method, int max_buffer_size) {
   adios_init_noxml ( comm);
   // adios_set_max_buffer_size ( max_buffer_size);
   adios_declare_group ( &adios_group_id,"report", "", adios_stat_no);
-  adios_select_method ( adios_group_id, method, "verbose=0", "");
-  adios_define_var ( adios_group_id, "global_size", "", adios_integer, "", "", "");
+  adios_select_method ( adios_group_id, method, "verbose=2", "");
+  adios_define_var ( adios_group_id, "global_size", "", adios_long, "", "", "");
   adios_define_var ( adios_group_id, "batch_size", "", adios_integer, "", "", "");
   for (int i = 0; i < nb_batchs; i++) {
-    offset_ids.push_back ( adios_define_var ( adios_group_id, "offset", "", adios_integer, "", "", ""));
+    offset_ids.push_back ( adios_define_var ( adios_group_id, "offset", "", adios_long, "", "", ""));
     data_ids.push_back ( adios_define_var ( adios_group_id, "data", "", adios_integer, "batch_size", "global_size", "offset"));
   }
 }
@@ -95,7 +95,7 @@ int main (int argc, char* argv) {
   MPI_Comm_size(comm, &nb_ranks);
   initAdios(method, max_buffer_size);
 
-  total_size = batch_size * nb_ranks * nb_batchs;
+  total_size = (uint64_t)batch_size * (uint64_t)nb_ranks * (uint64_t)nb_batchs;
   initBuffer (buffer);
   open (out_file);

I don't know why it runs out of memory, and why only with the MPI_BGQ method.
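
To make the patch above concrete, here is a minimal self-contained sketch of the write path after the fix; the file name "report.bp", the single write per rank, and the use of the MPI method are illustrative choices, not copied from the benchmark. The point is that the global size and offset are published as adios_long variables and kept in uint64_t on the application side, while the data stays adios_integer:

#include <mpi.h>
#include <adios.h>
#include <cstdint>
#include <vector>

int main (int argc, char** argv)
{
    MPI_Init (&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank, nb_ranks;
    MPI_Comm_rank (comm, &rank);
    MPI_Comm_size (comm, &nb_ranks);

    const int batch_size = 4096;                 // elements written per rank
    const uint64_t global_size = (uint64_t) batch_size * (uint64_t) nb_ranks;
    const uint64_t offset      = (uint64_t) batch_size * (uint64_t) rank;
    std::vector<int> data (batch_size, rank);    // payload: one block per rank

    adios_init_noxml (comm);
    int64_t group_id;
    adios_declare_group (&group_id, "report", "", adios_stat_no);
    adios_select_method (group_id, "MPI", "verbose=2", "");

    // 64-bit dimension variables are declared as adios_long; the data
    // block is described by the names of its local size, global size
    // and offset variables.
    adios_define_var (group_id, "global_size", "", adios_long,    "", "", "");
    adios_define_var (group_id, "batch_size",  "", adios_integer, "", "", "");
    adios_define_var (group_id, "offset",      "", adios_long,    "", "", "");
    adios_define_var (group_id, "data", "", adios_integer,
                      "batch_size", "global_size", "offset");

    int64_t fd;
    uint64_t total;
    adios_open (&fd, "report", "report.bp", "w", comm);
    adios_group_size (fd, 2 * sizeof (uint64_t) + sizeof (int)
                          + (uint64_t) batch_size * sizeof (int), &total);
    adios_write (fd, "global_size", (void*) &global_size);
    adios_write (fd, "batch_size",  (void*) &batch_size);
    adios_write (fd, "offset",      (void*) &offset);
    adios_write (fd, "data",        (void*) data.data ());
    adios_close (fd);

    adios_finalize (rank);
    MPI_Finalize ();
    return 0;
}

Here adios_long is the 8-byte signed integer type in ADIOS 1.x, so global_size and offset can exceed 2^31, which a plain int cannot.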

fouriaux commented 7 years ago

Awesome,

I will test it in the morning.

And if I have time, I will try to dig into the ADIOS source code.

Have a nice day and thanks for the correction.

Best regards,

Jeremy.


pnorbert commented 6 years ago

Hi Jeremy, do you still have this issue?