sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

Defect: Image number dependent MPI_Win_lock error #737

Open Oiubrab opened 2 years ago

Oiubrab commented 2 years ago

System Information:

Note: running mpicc -show fails with /opt/nvidia/hpc_sdk/Linux_x86_64/21.1/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpicc: error while loading shared libraries: libnvcpumath.so: cannot open shared object file: No such file or directory

The issue

What I was trying to do

I was trying to run my code, found at https://github.com/Oiubrab/byinheritance, on four concurrent images. I compile and run it with sudo chmod u+x i_am_in_command.zsh && ./i_am_in_command.zsh clean 2 test print. The motivation is described in the README at that link; in short, I have written a neural network in Fortran that computes a trading action. The relevant line of the script is cafrun -n 4 --use-hwthread-cpus ./lack_of_comprehension $3.

What Happened

Running this line produces an MPI error. To locate it, I added two print statements; Place is the position of each statement in the code (1 for the first, 2 for the second). The output is:

Invalid Trades:
[1, 0, 0, 0, 0]
0

Network Choice:
[0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0]
[0, 0, 0]

Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.33, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.115, 'units_owned': 0}

Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241351.6863492}

run:  1
 image number:           1 Place:           1
 image number:           2 Place:           1
 image number:           3 Place:           1
 image number:           4 Place:           1
 image number:           2 Place:           2
 image number:           3 Place:           2
 image number:           1 Place:           2
[manjaro:25102] *** An error occurred in MPI_Win_detach
[manjaro:25102] *** reported by process [3482124289,0]
[manjaro:25102] *** on win rdma window 5
[manjaro:25102] *** MPI_ERR_UNKNOWN: unknown error
[manjaro:25102] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:25102] ***    and potentially your MPI job)
[manjaro:25098] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:25098] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./lack_of_comprehension test`
failed to run.

Invalid Trades:
[0, 0, 1, 0, 0]
2

Network Choice:
[0, 0, 0, 0, 0, 0, 0] [1, 0, 0, 0, 0, 0, 0] [0, 1, 0, 1, 0, 0, 0]
[0, -1, -10]

Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.33, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.115, 'units_owned': 0}

Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241360.06828}

run:  2
 image number:           1 Place:           1
 image number:           2 Place:           1
 image number:           3 Place:           1
 image number:           4 Place:           1
 image number:           1 Place:           2
 image number:           2 Place:           2
 image number:           3 Place:           2
[manjaro:25174] *** An error occurred in MPI_Win_detach
[manjaro:25174] *** reported by process [3486973953,2]
[manjaro:25174] *** on win rdma window 5
[manjaro:25174] *** MPI_ERR_UNKNOWN: unknown error
[manjaro:25174] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:25174] ***    and potentially your MPI job)
[manjaro:25168] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:25168] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./lack_of_comprehension test`
failed to run.

Invalid Trades:
[0, 0, 1, 0, 0]
2

Network Choice:
[1, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0] [0, 1, 0, 0, 0, 0, 0]
[-1, 0, -2]

Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.39, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.12, 'units_owned': 0}

Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241366.8761365}

What I expected to happen

Markets and network choices vary. This is expected. What is not expected is the error, and the fact that the fourth image never reaches the second place. I should see the output above, but without the error and with an image number: 4 Place: 2 line. This exact code (minus the print statements) ran without a hitch with the previous version of Open MPI (openmpi-4.0.5-3-x86_64). I have since run other OpenCoarrays programs I have written and hit various errors whenever I use fewer than six threads.

Step by step reproduction

The error can be reproduced by running the command above. Since it does not seem to depend on the particular code, you can also reproduce a similar error with the steps below (again, this code ran before the update):

Step 1

Save the following code as an f95 file (e.g. test_arraymove.f95):

program test_arraycom

real, dimension(10), codimension[*] :: x, y
integer :: num_img, me
num_img = num_images()
me = this_image()
print *, me, num_img

! Some code here
x(2) = x(3)[6]   ! get value from image 6
x(6)[4] = x(1)   ! put value on image 4
x(:)[2] = y(:)   ! put array on image 2
sync all

! Remote-to-remote array transfer
if (me == 1) then
    y(:)[num_img] = x(:)[4]
    sync images (num_img)
else if (me == num_img) then
    sync images ([1])
end if

x(1:10:2) = y(1:10:2)[4]   ! strided get from 4

end program

Step 2

Compile the code with caf test_arraymove.f95 -o programname

Step 3

Run the code with the number of CPU threads, $2, set below 6:

i.e. cafrun -n $2 --use-hwthread-cpus ./programname

Step 4

Get an error of the form:

           1           4
           2           4
           3           4
           4           4
[manjaro:28814] *** An error occurred in MPI_Win_lock
[manjaro:28814] *** reported by process [3708682241,1]
[manjaro:28814] *** on win rdma window 6
[manjaro:28814] *** MPI_ERR_RANK: invalid rank
[manjaro:28814] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:28814] ***    and potentially your MPI job)
[manjaro:28809] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:28809] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./testarraymove`
failed to run.
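
One thing worth checking in the reduced sample: it hard-codes image indices 6, 4 and 2, and a coindexed reference is only valid for image indices from 1 to num_images(). With four images, x(3)[6] names an image that does not exist, which would be consistent with the MPI_ERR_RANK reported by MPI_Win_lock. It does not necessarily explain the MPI_Win_detach error in the original program. A minimal sketch of a guarded variant follows; the program name test_arraycom_guarded, the max_remote constant and the initial values of x and y are assumptions made for illustration and are not part of the original code.

```fortran
program test_arraycom_guarded
  implicit none
  real, dimension(10), codimension[*] :: x, y
  integer :: num_img, me
  ! Highest image index hard-coded in the original sample (hypothetical name).
  integer, parameter :: max_remote = 6

  num_img = num_images()
  me = this_image()

  ! Give x and y defined values before any transfer reads them (assumed values).
  x = real(me)
  y = 0.0
  sync all

  if (num_img < max_remote) then
    ! Image 6 does not exist here, so x(3)[6] would be an invalid reference.
    if (me == 1) print *, 'need at least', max_remote, 'images, got', num_img
  else
    x(2) = x(3)[max_remote]   ! get value from image 6
    sync all
    print *, me, x(2)
  end if
end program test_arraycom_guarded
```

If this variant runs cleanly with cafrun -n 4 while the unguarded sample still fails, the reduced sample is probably hitting the out-of-range image index rather than the defect seen in the full program.
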
stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.