Hi @tongshu83, thanks for reporting this. What MPI implementation are you using? Does this happen if you don't add the -ppn flag?
The MPI implementation I am using is Intel MPI.
The loaded modules are as follows.
module unload intel-mkl/2017.3.196-v7uuj6z
module load intel/17.0.4-74uvhji
module load intel-mpi/2017.3-dfphq6k
module load libpng/1.6.34-nhq7uj3
module load cmake/3.12.2-4zllpyo
module load jdk/8u141-b15-mopj6qr
module load tcl/8.6.6-x4wnbsg
module load bzip2/1.0.8-5ba64je
module load zlib/1.2.11-6632jqd
module load anaconda3/5.2.0
The error is the same whether I omit the -ppn flag or set other values for ppn, such as 25 or 27.
$ mpiexec -n 73 -hosts bdw-0040,bdw-0042,bdw-0043 ./build/gray-scott simulation/settings-files.json > output_gray-scott.txt 2>&1
Fatal error in PMPI_Type_vector: Invalid argument, error stack:
PMPI_Type_vector(163): MPI_Type_vector(count=514, blocklength=-64, stride=-62, MPI_DOUBLE, new_type_p=0x7ffd5a874638) failed
PMPI_Type_vector(128): Invalid value for blocklen, must be non-negative but is -64
Could you please help fix this issue and give me more suggestions? Thank you very much for your help!
It seems the computed sub-grid size becomes invalid (negative) for some of the process counts you found. I will look into it.
-- KT
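For concreteness, here is a minimal sketch of how ceiling-division sub-grid sizing can go negative on trailing ranks. This is an illustration consistent with the error stack above, not the actual gray-scott decomposition code, whose details may differ.

```cpp
// Sketch only: ceiling-division decomposition of a 512-cell dimension
// over 73 ranks (73 is prime, so the process layout degenerates to 73x1x1).
#include <cstdio>
#include <initializer_list>

int main()
{
    const int L = 512;   // global grid size along x
    const int npx = 73;  // processes along x

    const int size_x = (L + npx - 1) / npx;  // ceil(512/73) = 8 cells per rank
    for (int rank_x : {63, 64, 72}) {
        const int offset = rank_x * size_x;
        // Trailing ranks get whatever is left of the grid, which becomes
        // zero or negative once the offset passes L.
        const int local = (offset + size_x <= L) ? size_x : L - offset;
        // rank 63 -> 8, rank 64 -> 0, rank 72 -> -64: the same -64 that
        // appears as blocklength in the MPI_Type_vector error above.
        std::printf("rank %2d: offset = %3d, local size = %d\n",
                    rank_x, offset, local);
    }
}
```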
@tongshu83 PR #50 should fix this issue. Can you please try it out?
Thank you very much for your great help! However, there are still some issues, for example with the following parameters.
mpiexec -n 601 -ppn 26 ./build/gray-scott simulation/settings-files.json > output_gray-scott.txt 2>&1
Simulation writes data using engine type: BP4
grid: 512x512x512
steps: 250
plotgap: 10
F: 0.01
k: 0.05
dt: 2
Du: 0.2
Dv: 0.1
noise: 1e-07
output: gs.bp
adios_config: adios2.xml
process layout: 601x1x1
local grid size: 1x512x512
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Put
Could you please help fix this issue and give me more suggestions? Thanks a lot!
Hi Tong,

What are you trying to achieve here? 601 is a prime number, so the only decomposition of the 3D array is 1x512x512 for 512 processes and 0x512x512 for the rest. The rest is not calculating anything, and it fails at I/O (which it shouldn't, I agree). Fixing that bug in the I/O will not help you in any way, will it?

Best regards
Norbert
Thank you very much for your timely explanation. I would like to see the impact of different parameters on performance. According to your explanation, 601 is an invalid value for the number of processes, and we should avoid using it, right?
We know that if you choose a process count (N) that does not result in a balanced (identical) domain size on every process, then the calculation will not be balanced, and neither will the I/O.
I guess it depends on your study what the good and bad choices for N are. But for an MxMxM array, any choice of M < N < 2M will produce a 1D decomposition with many processes left with a zero domain size.
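A small sketch of the arithmetic Norbert describes (illustrative code, not the project's own):

```cpp
// For a 512x512x512 grid and N = 601 (prime), the layout degenerates to
// 601x1x1: 512 ranks own a 1x512x512 slab, the remaining 89 own nothing.
#include <algorithm>
#include <cstdio>

int main()
{
    const int M = 512;  // grid is MxMxM
    const int N = 601;  // prime, so the only layout is Nx1x1

    const int busy = std::min(N, M);  // at most M ranks can own a slab
    std::printf("%d ranks own a 1x%dx%d slab, %d ranks are idle\n",
                busy, M, M, N - busy);
    // Any M < N < 2M behaves the same way: a 1D decomposition in which
    // the N - M idle ranks still take part in the collective I/O.
}
```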
I've fixed the crash that occurs when the domain size is zero on one of the ranks. Please note that the parallel efficiency will be low, as Norbert pointed out.
Thanks a lot! Keichi, how did you fix the crash when the domain size is zero on one of the ranks?
The fix is included in my PR (#50). See https://github.com/pnorbert/adiosvm/pull/50/commits/ed4ac0a58fd5e594a141ba07817d5b9656bf64cd.
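For readers who don't follow the link, here is a hedged sketch of the usual way such a crash is avoided, namely never handing ADIOS2 a null data pointer on ranks whose sub-domain is empty. This illustrates the idea only; the actual diff is in the commit above.

```cpp
// Illustration only: skip Put on ranks whose local sub-domain is empty,
// so ADIOS2 never sees a null data pointer with a non-zero count.
#include <adios2.h>
#include <cstddef>
#include <vector>

void write_step(adios2::Engine &writer, adios2::Variable<double> &var_u,
                const std::vector<double> &u,  // local data; empty on idle ranks
                std::size_t local_count)       // 0 on ranks with no sub-domain
{
    writer.BeginStep();
    if (local_count > 0)
    {
        writer.Put<double>(var_u, u.data());
    }
    // Idle ranks still join the collective BeginStep/EndStep pair;
    // they simply publish no data block.
    writer.EndStep();
}
```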
Thank you so much for your help! With the simulation output for the 512x512x512 grid, I fail to execute the pdf_calc analysis when the number of processes is larger than 512. I tried the following parameters.
$ mpiexec -n 517 -ppn 32 build/pdf_calc gs.bp pdf.bp 200
$ mpiexec -n 522 -ppn 32 build/pdf_calc gs.bp pdf.bp 200
$ mpiexec -n 523 -ppn 33 build/pdf_calc gs.bp pdf.bp 200
$ mpiexec -n 525 -ppn 24 build/pdf_calc gs.bp pdf.bp 200
$ mpiexec -n 600 -ppn 33 build/pdf_calc gs.bp pdf.bp 200
$ mpiexec -n 620 -ppn 34 build/pdf_calc gs.bp pdf.bp 200
Error information is as follows.
PDF analysis reads from Simulation using engine type: BP4
PDF analysis writes using engine type: BP4
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
terminate called after throwing an instance of 'std::invalid_argument'
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Get
terminate called after throwing an instance of 'std::invalid_argument'
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Get
terminate called after throwing an instance of 'std::invalid_argument'
what(): ERROR: found null pointer for data argument in non-zero count block, in call to Get
Could you please help fix this issue and give me some suggestions? Thank you very much!
pdf_calc uses a 1D decomposition. If you have more processes than L, some processes will have a zero domain size. I can modify pdf_calc so that it won't error, but you won't achieve higher performance than with L processes.
-- KT
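The 1D split Keichi describes can be sketched as follows (illustrative code with hypothetical names, not pdf_calc's actual implementation):

```cpp
// Sketch of a 1D slab decomposition of L = 512 planes over N = 517 ranks
// (one of the failing runs above). Ranks 512..516 receive zero planes and
// must skip the Get call, mirroring the zero-size handling discussed here.
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main()
{
    const int L = 512;  // planes in the decomposed dimension
    const int N = 517;  // number of MPI ranks

    for (int rank : {0, 511, 512, 516}) {
        // Block distribution: the first L % N ranks get one extra plane.
        const int count = L / N + (rank < L % N ? 1 : 0);
        const int start = rank * (L / N) + std::min(rank, L % N);
        std::printf("rank %3d: start = %3d, count = %d%s\n",
                    rank, start, count,
                    count == 0 ? "  <- empty slab, skip Get" : "");
    }
}
```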
I would highly appreciate it if you could modify pdf_calc. There is also another type of error, as follows.
I used SST to launch the gray-scott simulation and the pdf_calc analysis simultaneously, but pdf_calc crashed for some values of the number of processes.
settings-staging.json:
{
    "L": 512,
    "Du": 0.2,
    "Dv": 0.1,
    "F": 0.01,
    "k": 0.05,
    "dt": 2.0,
    "plotgap": 100,
    "steps": 500,
    "noise": 0.0000001,
    "output": "gs.bp",
    "checkpoint": false,
    "checkpoint_freq": 10,
    "checkpoint_output": "gs_ckpt.bp",
    "adios_config": "adios2.xml",
    "adios_span": false,
    "adios_memory_selection": false,
    "mesh_type": "image"
}
$ mpiexec -n 523 -hosts bdw-0041,bdw-0042,bdw-0044,bdw-0046,bdw-0059,bdw-0060,bdw-0061,bdw-0062,bdw-0201,bdw-0202,bdw-0584,bdw-0585,bdw-0586,bdw-0587,bdw-0598 build/gray-scott simulation/settings-staging.json > output_gray-scott.txt 2>&1
Simulation writes data using engine type: SST
grid: 512x512x512
steps: 500
plotgap: 100
F: 0.01
k: 0.05
dt: 2
Du: 0.2
Dv: 0.1
noise: 1e-07
output: gs.bp
adios_config: adios2.xml
process layout: 523x1x1
local grid size: 1x512x512
Simulation at step 100 writing output step 1
Simulation at step 200 writing output step 2
Simulation at step 300 writing output step 3
Simulation at step 400 writing output step 4
Simulation at step 500 writing output step 5
$ mpiexec -n 227 -hosts bdw-0599,bdw-0600,bdw-0601,bdw-0605,bdw-0606,bdw-0607,bdw-0608 build/pdf_calc gs.bp pdf.bp 200 > output_pdf_calc.txt 2>&1
PDF analysis reads from Simulation using engine type: SST
PDF analysis writes using engine type: BP4
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:166: fi_ibv_rdm_prepare_conn_memory: Assertion `conn->ack_md.mr' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:151: fi_ibv_rdm_prepare_conn_memory: Assertion `!ret' failed.
pdf_calc: prov/verbs/src/ep_rdm/verbs_rdm_cm.c:166: fi_ibv_rdm_prepare_conn_memory: Assertion `conn->ack_md.mr' failed.
The same issue occurs in the following case.
$ mpiexec -n 502 -ppn 32 build/gray-scott simulation/settings-staging.json > output_gray-scott.txt 2>&1
$ mpiexec -n 345 -ppn 34 build/pdf_calc gs.bp pdf.bp 200 > output_pdf_calc.txt 2>&1
Could you please help fix this issue and give me some suggestions? Thank you very much!
Hi Tong Shu,
Do you know what version of libfabric your system is using? You might be able to find it by running the fi_info --version command.
It looks like libfabric is running out of some resource (pinnable memory, most likely). You mention that you see this problem for certain configuration parameters. Do you have a sense of which configuration parameters will fail (e.g., larger data sizes)?
$ fi_info --version
fi_info: 1.6.2
libfabric: 1.6.2
libfabric api: 1.6
At present, I have only listed the failing parameters above. I will record more failing parameters if I encounter them later.
I run the gray-scott simulation on Bebop. It works with the following values for the number of processes and the number of processes per node.
$ mpiexec -n 162 -ppn 31
$ mpiexec -n 68 -ppn 13
$ mpiexec -n 64 -ppn 32
simulation/settings-files.json:
{
    "L": 512,
    "Du": 0.2,
    "Dv": 0.1,
    "F": 0.01,
    "k": 0.05,
    "dt": 2.0,
    "plotgap": 10,
    "steps": 250,
    "noise": 0.0000001,
    "output": "gs.bp",
    "checkpoint": false,
    "checkpoint_freq": 10,
    "checkpoint_output": "gs_ckpt.bp",
    "adios_config": "adios2.xml",
    "adios_span": false,
    "adios_memory_selection": false,
    "mesh_type": "image"
}
However, gray-scott cannot work with some reasonable values for the number of processes and the number of processes per node. For example:
$ mpiexec -n 73 -ppn 26 -hosts bdw-0034,bdw-0478,bdw-0514 ./build/gray-scott simulation/settings-files.json > output_gray-scott.txt 2>&1
Fatal error in PMPI_Type_vector: Invalid argument, error stack:
PMPI_Type_vector(163): MPI_Type_vector(count=514, blocklength=-64, stride=-62, MPI_DOUBLE, new_type_p=0x7ffeb16750a8) failed
PMPI_Type_vector(128): Invalid value for blocklen, must be non-negative but is -64
gray-scott cannot work with the following parameter values either.
$ mpiexec -n 304 -ppn 34
$ mpiexec -n 536 -ppn 35
Could you please help fix this issue and let me know the reason? Thank you very much for your help!