sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

Defect: UCX warnings in CentOS #780

Open SineBell opened 11 months ago

SineBell commented 11 months ago

System information including:

To help us debug your issue please explain:

What you were trying to do (and why)

Running any Fortran code with more than one image.

What happened (include command output, screenshots, logs, etc.)

At the end of the execution, numerous UCX warnings are printed to the screen, e.g.

[1691070750.032589] [debye4:964906:0]      tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor 0x7f5c5710dfc0 was not matched
[1691070750.032604] [debye4:964906:0]      tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor 0x7f5c570fdf40 was not matched
[1691070750.032616] [debye4:964906:0]      tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor 0x7f5c570edec0 was not matched
[1691070750.032650] [debye4:964905:0]      tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor 0x7f5a84b00f40 was not matched

What you expected to happen

The execution appears to end successfully. The large number of warnings, however, clutters the output and makes it difficult to read on screen.

Step-by-step reproduction instructions to reproduce the error/bug

Any code I tested with cafrun and -n > 1.

For example, this simple code

program bugcheck
    write(*,*) "hello by ", this_image()
end program

Compiled with caf -o bugcheck bugcheck.f90 and run with cafrun -n 2 bugcheck, it outputs:

 hello by            1
 hello by            2
[1691071152.339731] [debye4:965700:0]      tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor 0x7fb352acbfc0 was not matched
[1691071152.339731] [debye4:965701:0]      tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor 0x7fb0ef50efc0 was not matched
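
Until the cause is addressed, one way to keep the screen readable is to filter the warnings out of the program output. The following is only a sketch, not an OpenCoarrays or UCX feature: filter_ucx_warnings is a hypothetical helper name, and the pattern assumes the warning lines look exactly like the ones quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenCoarrays or UCX): drop the UCX
# tag-matching warnings from a stream and pass every other line through.
filter_ucx_warnings() {
    # The pattern assumes the log format shown in this report.
    # "|| true" keeps the exit status 0 when every input line is filtered out.
    grep -v 'UCX  *WARN  *unexpected tag-receive descriptor' || true
}

# Example usage (merges stderr into stdout so either stream is filtered):
#   cafrun -n 2 ./bugcheck 2>&1 | filter_ucx_warnings
```

Note that piping through a filter hides the exit status of cafrun and any other UCX messages that happen to match the pattern. UCX also reads a UCX_LOG_LEVEL environment variable; setting it to error may silence these warnings at the source, but whether it reaches every image depends on how the MPI launcher forwards the environment, so treat that as something to test rather than a confirmed fix.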

jthies commented 7 months ago

We see the same issue with OpenCoarrays checked out today and compiled with GCC 11.3.0 or GCC 8.5.0 and OpenMPI 4.1.4. When run with p images, p*(p-1) such warning messages are printed. AFAIK they are triggered by MPI_Finalize when MPI_Send calls were not matched by an MPI_Recv.
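
The p*(p-1) observation can be checked on other systems by counting the warning lines and comparing against the formula. This is a rough sketch under assumptions: expected_ucx_warnings and count_ucx_warnings are hypothetical helper names, and the pattern assumes the log format quoted in the original report.

```shell
#!/bin/sh
# Hypothetical helpers for checking the p*(p-1) observation.

# Print the warning count predicted by the formula for $1 images.
expected_ucx_warnings() {
    echo $(( $1 * ($1 - 1) ))
}

# Count the UCX tag-matching warning lines read from stdin.
count_ucx_warnings() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the exit status 0 in that case.
    grep -c 'UCX  *WARN  *unexpected tag-receive descriptor' || true
}

# Example usage:
#   p=4
#   cafrun -n "$p" ./bugcheck 2>&1 | count_ucx_warnings
#   expected_ucx_warnings "$p"
```

For the two-image run in the original report the formula gives 2, which agrees with the two warnings shown in that output.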