openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Open MPI with UCX breaks in user namespaces #4224

Open adrianreber opened 4 years ago

adrianreber commented 4 years ago

Trying to run a UCX-based Open MPI with each process in its own user namespace (container) seems to break UCX completely:

 mm_posix.c:445  UCX  ERROR Error returned from open in attach. Permission denied. File name is: /proc/24149/fd/16    
    mm_ep.c:75   UCX  ERROR failed to connect to remote peer with mm. remote mm_id: 103719165231238
  pml_ucx.c:383  Error: ucp_ep_create(proc=6) failed: Shared memory error

I fixed a similar thing recently in Open MPI's vader BTL: https://github.com/open-mpi/ompi/pull/6844

The fix there was to autodetect that the processes are running in different user namespaces and to not use ptrace()-based copy mechanisms.

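For illustration, a minimal sketch (hypothetical, not UCX or Open MPI code) of how such autodetection could look: two processes share a user namespace exactly when the device and inode of their /proc/<pid>/ns/user files match (see namespaces(7)). The helper below assumes the peer PID is visible in /proc.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns 1 if this process and the peer share a user namespace,
 * 0 if they do not, and -1 if it cannot be determined. */
static int same_user_ns(pid_t peer)
{
    struct stat self_ns, peer_ns;
    char path[64];

    if (stat("/proc/self/ns/user", &self_ns) != 0) {
        return -1;  /* cannot tell, e.g. /proc not mounted */
    }
    snprintf(path, sizeof(path), "/proc/%d/ns/user", (int)peer);
    if (stat(path, &peer_ns) != 0) {
        return -1;  /* peer not visible from this namespace */
    }
    /* Same namespace <=> same device and inode of the ns file. */
    return (self_ns.st_dev == peer_ns.st_dev) &&
           (self_ns.st_ino == peer_ns.st_ino);
}
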
This can be easily reproduced on Fedora 31 with:

[root@fedora01 ~]# rpm -q ucx openmpi
ucx-1.6.0-1.fc31.x86_64
openmpi-4.0.2-0.2.rc1.fc31.x86_64
[root@fedora01 ~]# mpirun --allow-run-as-root -np 4 unshare --map-root-user --user /home/mpi/ring
[fedora01:00765] mca_base_component_repository_open: unable to open mca_btl_uct: /usr/lib64/openmpi/lib/openmpi/mca_btl_uct.so: undefined symbol: uct_ep_create_connected (ignored)
[fedora01:00767] mca_base_component_repository_open: unable to open mca_btl_uct: /usr/lib64/openmpi/lib/openmpi/mca_btl_uct.so: undefined symbol: uct_ep_create_connected (ignored)
[fedora01:00764] mca_base_component_repository_open: unable to open mca_btl_uct: /usr/lib64/openmpi/lib/openmpi/mca_btl_uct.so: undefined symbol: uct_ep_create_connected (ignored)
[fedora01:00766] mca_base_component_repository_open: unable to open mca_btl_uct: /usr/lib64/openmpi/lib/openmpi/mca_btl_uct.so: undefined symbol: uct_ep_create_connected (ignored)
[1569392914.129581] [fedora01:766  :0]       mm_posix.c:445  UCX  ERROR Error returned from open in attach. Permission denied. File name is: /proc/767/fd/16
[1569392914.129594] [fedora01:766  :0]          mm_ep.c:75   UCX  ERROR failed to connect to remote peer with mm. remote mm_id: 3294239916166
[fedora01:00766] pml_ucx.c:383  Error: ucp_ep_create(proc=3) failed: Shared memory error
[1569392914.129813] [fedora01:764  :0]       mm_posix.c:445  UCX  ERROR Error returned from open in attach. Permission denied. File name is: /proc/765/fd/16
[1569392914.129829] [fedora01:764  :0]          mm_ep.c:75   UCX  ERROR failed to connect to remote peer with mm. remote mm_id: 3285649981574
[fedora01:00764] pml_ucx.c:383  Error: ucp_ep_create(proc=1) failed: Shared memory error
[1569392914.130027] [fedora01:767  :0]       mm_posix.c:445  UCX  ERROR Error returned from open in attach. Permission denied. File name is: /proc/764/fd/16
[1569392914.130070] [fedora01:767  :0]          mm_ep.c:75   UCX  ERROR failed to connect to remote peer with mm. remote mm_id: 3281355014278
[fedora01:00767] pml_ucx.c:383  Error: ucp_ep_create(proc=0) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[1569392914.130773] [fedora01:765  :0]       mm_posix.c:445  UCX  ERROR Error returned from open in attach. Permission denied. File name is: /proc/766/fd/16
[1569392914.130818] [fedora01:765  :0]          mm_ep.c:75   UCX  ERROR failed to connect to remote peer with mm. remote mm_id: 3289944948870
[fedora01:00765] pml_ucx.c:383  Error: ucp_ep_create(proc=2) failed: Shared memory error
[fedora01:00764] *** An error occurred in MPI_Init
[fedora01:00764] *** reported by process [336265217,0]
[fedora01:00764] *** on a NULL communicator
[fedora01:00764] *** Unknown error
[fedora01:00764] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fedora01:00764] ***    and potentially your MPI job)
[fedora01:00759] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[fedora01:00759] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[fedora01:00759] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
yosefe commented 4 years ago

@adrianreber this is a known limitation and it will be fixed in the next UCX release. In the meantime, can you try mpirun ... -x UCX_POSIX_USE_PROC_LINK=n ... as a workaround?

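For example, based on the reproducer in the original report, the workaround could be passed through mpirun like this (a hedged illustration, not verified here):

mpirun --allow-run-as-root -np 4 -x UCX_POSIX_USE_PROC_LINK=n unshare --map-root-user --user /home/mpi/ring
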
adrianreber commented 4 years ago

@yosefe Thanks, that works for me.

yosefe commented 4 years ago

I'd like to keep this open to make it work out of the box.

yosefe commented 4 years ago

Adding @hoopoepg; to be handled as part of the Docker support feature.

adrianreber commented 4 years ago

I'd like to keep this open to make it work out of the box.

Sure, it just sounded like it was already being tracked somewhere.

Adding @hoopoepg; to be handled as part of the Docker support feature.

I was actually using Open MPI with Podman when it failed. I am using the following command on Fedora 31:

[mpi@fedora01 ~]$ mpirun -x UCX_POSIX_USE_PROC_LINK=n --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id  --net=host --pid=host --ipc=host quay.io/adrianreber/mpi-test:31 /home/ring
Rank 1 has cleared MPI_Init
Rank 2 has cleared MPI_Init
Rank 3 has cleared MPI_Init
Rank 0 has cleared MPI_Init
Rank 1 has completed ring
Rank 0 has completed ring
Rank 3 has completed ring
Rank 1 has completed MPI_Barrier
Rank 2 has completed ring
Rank 3 has completed MPI_Barrier
Rank 0 has completed MPI_Barrier
Rank 2 has completed MPI_Barrier
hoopoepg commented 4 years ago

@yosefe as a short-term plan we may block CMA when endpoints are in different namespaces (we would have to add the namespace ID to the system GUID generation).

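A rough sketch of that idea (hypothetical; not the actual UCX change): fold the namespace identity into the per-host GUID, so that peers in different namespaces no longer appear to be on the same shared-memory domain.

#include <stdint.h>
#include <sys/stat.h>

static uint64_t system_guid_with_ns(uint64_t base_guid)
{
    struct stat ns;

    /* Fall back to the plain GUID if the namespace file cannot be read. */
    if (stat("/proc/self/ns/user", &ns) != 0) {
        return base_guid;
    }
    /* Processes in different user namespaces see different nsfs inodes,
     * so their GUIDs stop matching and SHM/CMA paths are not selected. */
    return base_guid ^ ((uint64_t)ns.st_ino << 32) ^ (uint64_t)ns.st_dev;
}
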
hoopoepg commented 4 years ago

Hi @adrianreber, thank you for the bug report and the link to the OMPI fix.

Could you try this PR: https://github.com/openucx/ucx/pull/4225? Unfortunately, right now we have no environment to test this functionality.

thank you again

adrianreber commented 4 years ago

Could you try this PR: #4225? Unfortunately, right now we have no environment to test this functionality.

@hoopoepg Can you provide a patch against 1.6.1? Then I could patch the Fedora RPM and try it out.

hoopoepg commented 4 years ago

git-diff.txt: here is the git diff patch.

adrianreber commented 4 years ago

@hoopoepg Thanks for the 1.6.1-based patch. It works.

I added the patch to the Fedora 31 RPM https://koji.fedoraproject.org/koji/taskinfo?taskID=37856715

I rebuilt my test container with it (quay.io/adrianreber/mpi-test:31) and now I can run Podman with UCX-based Open MPI without errors:

[mpi@host-08 ~]$ mpirun --hostfile hostfile  --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id  --net=host --pid=host --ipc=host quay.io/adrianreber/mpi-test:31 /home/ring
Rank 0 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 2 has cleared MPI_Init
Rank 3 has cleared MPI_Init
Rank 0 has completed ring
Rank 1 has completed ring
Rank 2 has completed ring
Rank 0 has completed MPI_Barrier
Rank 3 has completed ring
Rank 2 has completed MPI_Barrier
Rank 1 has completed MPI_Barrier
Rank 3 has completed MPI_Barrier

Thanks for the quick fix!

@yosefe I see your name in the Fedora UCX spec file. Would it be okay with you if I update ucx on Fedora Rawhide and Fedora 31 to include this patch? Currently it is only a scratch build; no changes have been made to Fedora's dist-git yet.

yosefe commented 4 years ago

@adrianreber this fix appears to block shared memory between containers completely; I'm not sure that's desired. Can we hold off on this patch for now?

adrianreber commented 4 years ago

@adrianreber this fix appears to block shared memory between containers completely; I'm not sure that's desired. Can we hold off on this patch for now?

Sure. In my setup I am sharing the IPC namespace between all containers, so shared memory should work. Running podman with --ipc=host mounts /dev/shm from the host.

hoopoepg commented 4 years ago

hi @adrianreber

We pushed a few changes into the UCX master branch for container support. For now, only the IPC namespace needs to be shared across containers to allow SHM devices to be used. If you have time, it would be great if you could try it in your environment.

thank you

shamisp commented 4 years ago

@hoopoepg @yosefe - please create a PR for the 1.6.x branch with the patch. Who knows, maybe at some point we will be asked to do a 1.6.2.

FaDee1 commented 4 years ago

@adrianreber this is a known limitation and it will be fixed in the next UCX release. In the meantime, can you try mpirun ... -x UCX_POSIX_USE_PROC_LINK=n ... as a workaround?

adrianreber commented 4 years ago

We pushed a few changes into the UCX master branch for container support. For now, only the IPC namespace needs to be shared across containers to allow SHM devices to be used. If you have time, it would be great if you could try it in your environment.

Last time I tried to test the master branch, it required a lot of rebuilds, as I was just adding patches to the distribution packages. I have not created an environment where I can install all the necessary libraries and packages based on the latest version of UCX. If there are patches against 1.6.x (without SO name changes), it would be easier for me to test.

hoopoepg commented 4 years ago

Hi, unfortunately this fix is based on another set of fixes that is hard to backport to the 1.6 branch.

vanzod commented 2 years ago

A very similar issue also happens outside a containerized environment. Moreover, it seems to be transient since not all MPI runs end in an error as shown below for two consecutive MPI launches on the same system.

[admin@ndv2-1 ~]$ mpirun -np 40 osu_scatter

# OSU MPI Scatter Latency Test v5.7.1
# Size       Avg Latency(us)
1                       1.69
2                       1.62
4                       1.64
8                       1.82
16                      2.15
32                      2.31
64                      2.68
128                     3.22
256                     4.00
512                    11.77
1024                   15.50
2048                   19.41
4096                   27.65
8192                   36.38
16384                 175.19
32768                 208.84
65536                 253.05
131072                600.50
262144               1862.54
524288               4966.82
1048576             10852.59

[admin@ndv2-1 ~]$ mpirun -np 40 osu_scatter
[ndv2-1:25388] [[64574,1],8] selected pml cm, but peer [[64574,1],0] on ndv2-1 selected pml ucx
[ndv2-1:25382] [[64574,1],3] selected pml cm, but peer [[64574,1],0] on ndv2-1 selected pml ucx
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[1645056146.646175] [ndv2-1:25386:0]        mm_posix.c:206  UCX  ERROR   open(file_name=/proc/25388/fd/29 flags=0x0) failed: No such file or directory
[1645056146.646224] [ndv2-1:25386:0]           mm_ep.c:158  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc00000074000632c: Shared memory error
[ndv2-1:25386] pml_ucx.c:419  Error: ucp_ep_create(proc=8) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[ndv2-1:25388] *** An error occurred in MPI_Init
[ndv2-1:25388] *** reported by process [4231921665,8]
[ndv2-1:25388] *** on a NULL communicator
[ndv2-1:25388] *** Unknown error
[ndv2-1:25388] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ndv2-1:25388] ***    and potentially your MPI job)
[1645056146.666594] [ndv2-1:25384:0]        mm_posix.c:206  UCX  ERROR   open(file_name=/proc/25386/fd/29 flags=0x0) failed: Permission denied
[1645056146.666630] [ndv2-1:25384:0]           mm_ep.c:158  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc00000074000632a: Shared memory error
[1645056146.654733] [ndv2-1:25387:0]        mm_posix.c:206  UCX  ERROR   open(file_name=/proc/25386/fd/29 flags=0x0) failed: Permission denied
[1645056146.654783] [ndv2-1:25387:0]           mm_ep.c:158  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc00000074000632a: Shared memory error
[1645056146.658680] [ndv2-1:25385:0]        mm_posix.c:206  UCX  ERROR   open(file_name=/proc/25386/fd/29 flags=0x0) failed: Permission denied
[1645056146.658770] [ndv2-1:25385:0]           mm_ep.c:158  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc00000074000632a: Shared memory error
[1645056146.773674] [ndv2-1:25391:0]        mm_posix.c:206  UCX  ERROR   open(file_name=/proc/25382/fd/29 flags=0x0) failed: No such file or directory
[1645056146.773704] [ndv2-1:25391:0]           mm_ep.c:158  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc000000740006326: Shared memory error
[1645056146.768402] [ndv2-1:25389:0]        mm_posix.c:206  UCX  ERROR   open(file_name=/proc/25382/fd/29 flags=0x0) failed: No such file or directory
[1645056146.768436] [ndv2-1:25389:0]           mm_ep.c:158  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc000000740006326: Shared memory error
[...]
[ndv2-1:25384] pml_ucx.c:419  Error: ucp_ep_create(proc=7) failed: Shared memory error
[ndv2-1:25387] pml_ucx.c:419  Error: ucp_ep_create(proc=7) failed: Shared memory error
[ndv2-1:25385] pml_ucx.c:419  Error: ucp_ep_create(proc=7) failed: Shared memory error
[ndv2-1:25391] pml_ucx.c:419  Error: ucp_ep_create(proc=3) failed: Shared memory error
[ndv2-1:25389] pml_ucx.c:419  Error: ucp_ep_create(proc=3) failed: Shared memory error
[ndv2-1:25390] pml_ucx.c:419  Error: ucp_ep_create(proc=3) failed: Shared memory error
[...]
[ndv2-1:25372] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
[ndv2-1:25372] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ndv2-1:25372] 37 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[ndv2-1:25372] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[...]
hoopoepg commented 2 years ago

Hi, does setting UCX_POSIX_USE_PROC_LINK=n help?

brminich commented 2 years ago

@vanzod, can you please check whether setting UCX_POSIX_USE_PROC_LINK=n environment variable helps?

vanzod commented 2 years ago

@hoopoepg @brminich Unfortunately, even with that environment variable, the error still occurs sometimes. One thing that I noticed is that this issue presents itself only on AMD Epyc Milan processors (7V12, 7V13). I have a working test environment, so I'm happy to run more tests if needed.

hoopoepg commented 2 years ago

Hi @vanzod, sorry for the late response - we are in the middle of a release.

Is it possible to build UCX with debug info (add --enable-debug to the configure arguments) and run the failing test with the debug log level (env UCX_LOG_LEVEL=debug)?

thank you

vanzod commented 2 years ago

@hoopoepg No problem. Here are the debug logs you requested. Note that UCX_POSIX_USE_PROC_LINK=n is defined in the environment.

Successful osu_scatter run: https://gist.github.com/vanzod/3a8d04f14614d8a0914b0bbfb1ecafca

Failed osu_scatter run: https://gist.github.com/vanzod/4e4d51d76c0acc1081256174a482bd4a

hoopoepg commented 2 years ago

Hmm, as far as I can see the POSIX SHM infrastructure is inaccessible from the process for some reason. Let's try to force the SysV shm transport: can you add the variable UCX_TLS=sysv,cma,ib to the test run?

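For example, with Open MPI's mpirun the variable can be passed to the run like this (a hedged illustration based on the earlier osu_scatter runs):

mpirun -np 40 -x UCX_TLS=sysv,cma,ib osu_scatter
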
thank you

vanzod commented 2 years ago

@hoopoepg Here is the log you asked for:

https://gist.github.com/vanzod/cddbda25b9674a38de5b6e886db255da

hoopoepg commented 2 years ago

Does it work as expected?

hoopoepg commented 2 years ago

I don't see any critical errors there.

vanzod commented 2 years ago

No, now it fails consistently. For some reason the previous gist does not provide the full file view. Please find the full log at:

https://gist.github.com/vanzod/ce6cfc5b823bfe5d71f4e1c8097a1e43

hoopoepg commented 2 years ago

I see from the logs that the endpoint is created and UCX is able to allocate shared memory; I still don't see any issues. Is the log file incomplete? Could you zip the log file and send it to sergeyo@nvidia.com? Thank you.

hoopoepg commented 2 years ago

Thank you for the logs. As far as I can see, UCX was able to start up, but the process exits with an error.

Could you run the ucx_perftest application (installed with the UCX package) to check whether UCX is able to run on your system? Run these commands on the compute nodes:

UCX_TLS=cma,sysv ~/local/ucx/bin/ucx_perftest -t tag_lat &
UCX_TLS=cma,sysv ~/local/ucx/bin/ucx_perftest -t tag_lat localhost

and in case it fails, set UCX_LOG_LEVEL=debug and send the logs to me.

thank you

vanzod commented 2 years ago

@hoopoepg ucx_perftest completed successfully. Here is the output:

$ UCX_TLS=cma,sysv ucx_perftest -t tag_lat & UCX_TLS=cma,sysv ucx_perftest -t tag_lat localhost
[1] 63262
[1650649643.029021] [ndv4:63262:0]        perftest.c:1580 UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
[1650649643.029023] [ndv4:63263:0]        perftest.c:1580 UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API:          protocol layer                                                             |
| Test:         tag match latency                                                          |
| Data layout:  (automatic)                                                                |
| Send memory:  host                                                                       |
| Recv memory:  host                                                                       |
| Message size: 8                                                                          |
+------------------------------------------------------------------------------------------+
+--------------+--------------+-----------------------------+---------------------+-----------------------+
|              |              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | typical | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
Final:               1000000     0.000     0.099     0.099       76.75      76.75    10059549    10059549
hoopoepg commented 2 years ago

Looks like OMPI is trying to use POSIX SHM and fails to initialize it.

hoopoepg commented 2 years ago

@vanzod could you run the OMPI application with the parameter --mca opal_common_ucx_verbose 9 to enable UCX PML debug output? Maybe it will help to find the source of the issue.

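For example (a hedged illustration based on the earlier runs):

mpirun --mca opal_common_ucx_verbose 9 -np 40 osu_scatter
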
thank you

jamesongithub commented 2 years ago

@hoopoepg @vanzod Created https://github.com/openucx/ucx/issues/8511 specifically for the "shared memory error / failed to connect to remote FIFO" case.

kcgthb commented 1 year ago

Just wanted to add another use case that produces those errors.

Using OpenMPI+UCX 1.10 in Singularity/Apptainer containers in non-setuid mode (the new default) produces the same kind of error:

[1665612809.408366] [sh03-01n71:20358:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/20353/fd/18 flags=0x0) failed: No such file or directory
[1665612809.408388] [sh03-01n71:20358:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000480004f81: Shared memory error
[sh03-01n71.int:20358] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
[1665612809.408436] [sh03-01n71:20353:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/20358/fd/18 flags=0x0) failed: No such file or directory
[1665612809.408460] [sh03-01n71:20353:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000480004f86: Shared memory error
[sh03-01n71.int:20353] pml_ucx.c:419  Error: ucp_ep_create(proc=1) failed: Shared memory error

Using UCX_POSIX_USE_PROC_LINK=n does solve the problem and allows the MPI program to work properly in the container in non-setuid mode.

The issue is being discussed in https://github.com/apptainer/apptainer/issues/769, but if anyone here could shed some light on the problem, that would be much appreciated.

Thanks!

panda1100 commented 1 year ago

@hoopoepg UCX_TLS=sysv,cma,ib works in our environment. UCX_POSIX_USE_PROC_LINK=n also works (UCX_POSIX_USE_PROC_LINK=n + UCX_TLS=posix,cma,ib works too). I now have a working test environment (OS: Rocky Linux 8).

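For illustration, a hedged example of combining these settings with an Open MPI launch of an Apptainer container (the image name mpi-test.sif is a placeholder; /home/ring is the test binary used earlier in this thread):

mpirun -np 4 -x UCX_POSIX_USE_PROC_LINK=n -x UCX_TLS=posix,cma,ib apptainer exec mpi-test.sif /home/ring
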
I found that https://github.com/openucx/ucx/pull/4511 has already been merged to master.

But I'm still facing this issue; I tested against OMPI 4.1.5 + UCX v1.10.1 (both workarounds work, though).

rodrigo-ceccato commented 1 year ago

Is there any workaround for MPICH? I'm facing the same error with Apptainer version 1.1.9-1.el8 and MPICH 4.1 + UCX 1.14.

panda1100 commented 1 year ago

@rodrigo-ceccato This is a temporary workaround (the permanent solution will be released in v1.3.0), but the Apptainer instance workaround should work for MPICH as well. Please jump to the section "The Apptainer instance workaround for intra-node communication issue with MPI applications and Apptainer without setuid" in the following article, where I explain a bit of why it works: https://ciq.com/blog/workaround-for-communication-issue-with-mpi-apps-apptainer-without-setuid/

If it doesn't work because of ssh restrictions, please see the following discussion (this is not a really clean solution, but at least it works; please consider it a temporary workaround): https://github.com/openucx/ucx/issues/8958

DavidCdeB commented 11 months ago

@adrianreber this is a known limitation and it will be fixed in the next UCX release. In the meantime, can you try mpirun ... -x UCX_POSIX_USE_PROC_LINK=n ... as a workaround?

@yosefe Thanks for this suggestion. I've tried:

mpirun -n 600 -ppn 8 -x UCX_POSIX_USE_PROC_LINK=n  executable.x ${input}.inp > ${input}.out

But I get this error:

[mpiexec@g-08-c0549] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument x           
[mpiexec@g-08-c0549] Similar arguments:
[mpiexec@g-08-c0549]     demux
[mpiexec@g-08-c0549]     s   
[mpiexec@g-08-c0549]     n   
[mpiexec@g-08-c0549]     enable-x
[mpiexec@g-08-c0549]     f   
[mpiexec@g-08-c0549] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@g-08-c0549] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
[mpiexec@g-08-c0549] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1787): error parsing parameters

Shall I use this differently? Many thanks

panda1100 commented 11 months ago

@DavidCdeB How about mpirun -env UCX_POSIX_USE_PROC_LINK=n instead?

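Note that -env / -genv is the syntax for Hydra-based launchers (Intel MPI, MPICH), while -x is Open MPI specific, which is why it was rejected above. A hedged example of applying the variable to all ranks with Hydra's -genv, based on the command shown earlier:

mpirun -genv UCX_POSIX_USE_PROC_LINK n -n 600 -ppn 8 executable.x ${input}.inp > ${input}.out
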
DavidCdeB commented 11 months ago

@DavidCdeB How about mpirun -env UCX_POSIX_USE_PROC_LINK=n instead?

@panda1100 Thanks. I added that, but I'm still receiving:

[1697387421.097742] [g-02-c0107:2826 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, ud/mlx5_0:1 - Destination is unreachable, ud_mlx5/mlx5_0:1 - Destination is unreachable, rdmac
[1697387421.097715] [g-02-c0107:2827 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, ud/mlx5_0:1 - Destination is unreachable, ud_mlx5/mlx5_0:1 - Destination is unreachable, rdmac
[1697387421.097757] [g-02-c0107:2828 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, ud/mlx5_0:1 - Destination is unreachable, ud_mlx5/mlx5_0:1 - Destination is unreachable, rdmac
[1697387421.097778] [g-02-c0107:2825 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, ud/mlx5_0:1 - Destination is unreachable, ud_mlx5/mlx5_0:1 - Destination is unreachable, rdmac
panda1100 commented 11 months ago

@DavidCdeB What container solution do you use? Apptainer, Podman, etc.

DavidCdeB commented 10 months ago

@DavidCdeB What container solution do you use? Apptainer, Podman, etc.

@panda1100 I'm sorry, can you please explain which command I should execute to obtain this information? Many thanks again.

panda1100 commented 10 months ago

@DavidCdeB How did you build your executable??

DavidCdeB commented 10 months ago

@DavidCdeB How did you build your executable??

Thanks, could you please specify more precisely which information is required? ldd or similar commands on the executable file? Thanks.

panda1100 commented 9 months ago

Hi @hoopoepg-san, we are finally implementing a solution on our side: https://github.com/apptainer/apptainer/pull/1760. We are planning to merge this into the Apptainer v1.3.0 release (probably the next release).