pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPI_Allreduce Segmentation fault Docker #5372

Closed cesarpomar closed 3 years ago

cesarpomar commented 3 years ago

I am trying to run the application https://github.com/LLNL/LULESH inside a single Ubuntu 20.04 Docker container. When I increase the number of MPI processes, the processes abort with a segmentation fault. The core dump trace says the error is in the function MPI_Allreduce (lulesh.cc). I recompiled MPICH with “--enable-g=dbg,log” and the error is in:

.........
0 0 7f04a38bf700[35] 2 2.827156 src/mpid/ch4/netmod/ofi/ofi_progress.c 86 Leaving MPID_STATE_MPIDI_OFI_PROGRESS
0 0 7f04a38bf700[35] 1 2.827159 src/mpid/ch4/shm/src/shm_init.c 69 Entering MPID_STATE_MPIDI_SHM_PROGRESS
0 0 7f04a38bf700[35] 1 2.827162 src/mpid/ch4/shm/posix/posix_progress.c 167 Entering MPID_STATE_MPIDI_POSIX_PROGRESS
0 0 7f04a38bf700[35] 1 2.827165 src/mpid/ch4/shm/posix/posix_progress.c 42 Entering MPID_STATE_PROGRESS_RECV
0 0 7f04a38bf700[35] 1 2.827168 src/mpid/ch4/shm/posix/eager/iqueue/iqueue_recv.h 20 Entering MPID_STATE_MPIDI_POSIX_EAGER_RECV_BEGIN
0 0 7f04a38bf700[35] 1 2.827170 ./src/mpid/common/genq/mpidu_genq_shmem_queue.h 224 Entering MPID_STATE_MPIDU_GENQ_SHMEM_QUEUE_INIT
0 0 7f04a38bf700[35] 2 2.827173 ./src/mpid/common/genq/mpidu_genq_shmem_queue.h 239 Leaving MPID_STATE_MPIDU_GENQ_SHMEM_QUEUE_INIT
0 0 7f04a38bf700[35] 2 2.827176 src/mpid/ch4/shm/posix/eager/iqueue/iqueue_recv.h 44 Leaving MPID_STATE_MPIDI_POSIX_EAGER_RECV_BEGIN
0 0 7f04a38bf700[35] 2 2.827179 src/mpid/ch4/shm/posix/posix_progress.c 113 Leaving MPID_STATE_PROGRESS_RECV
0 0 7f04a38bf700[35] 1 2.827182 src/mpid/ch4/shm/posix/posix_progress.c 126 Entering MPID_STATE_PROGRESS_SEND
0 0 7f04a38bf700[35] 2 2.827185 src/mpid/ch4/shm/posix/posix_progress.c 160 Leaving MPID_STATE_PROGRESS_SEND
0 0 7f04a38bf700[35] 2 2.827188 src/mpid/ch4/shm/posix/posix_progress.c 178 Leaving MPID_STATE_MPIDI_POSIX_PROGRESS
0 0 7f04a38bf700[35] 2 2.827190 src/mpid/ch4/shm/src/shm_init.c 75 Leaving MPID_STATE_MPIDI_SHM_PROGRESS
0 0 7f04a38bf700[35] 2 2.827198 src/mpid/ch4/src/ch4_progress.c 129 Leaving MPID_STATE_PROGRESS_TEST
0 0 7f04a38bf700[35] 2 2.827209 src/mpid/ch4/src/ch4_progress.c 237 Leaving MPID_STATE_MPID_PROGRESS_WAIT
0 0 7f04a38bf700[35] 256 2.827212 src/mpi/coll/helper_fns.c 73 OUT: errflag = 0
0 0 7f04a38bf700[35] 2 2.827215 src/mpi/coll/helper_fns.c 74 Leaving MPID_STATE_MPIC_WAIT
0 0 7f04a38bf700[35] 1 2.827218 ./src/mpid/ch4/include/mpidpost.h 28 Entering MPID_STATE_MPID_REQUEST_FREE_HOOK
0 0 7f04a38bf700[35] 2 2.827221 ./src/mpid/ch4/include/mpidpost.h 39 Leaving MPID_STATE_MPID_REQUEST_FREE_HOOK
0 0 7f04a38bf700[35] 65536 2.827224 ./src/include/mpir_request.h 437 freeing request, handle=0xac000005
0 0 7f04a38bf700[35] 1 2.827227 ./src/mpid/ch4/include/mpidpost.h 46 Entering MPID_STATE_MPID_REQUEST_DESTROY_HOOK
0 0 7f04a38bf700[35] 2 2.827230 ./src/mpid/ch4/include/mpidpost.h 48 Leaving MPID_STATE_MPID_REQUEST_DESTROY_HOOK
0 0 7f04a38bf700[35] 2048 2.827233 ./src/include/mpir_handlemem.h 347 Freeing object ptr 0x7f04a6a52a48 (0xac000005 kind=REQUEST) refcount=0
0 0 7f04a38bf700[35] 256 2.827236 src/mpi/coll/helper_fns.c 351 OUT: errflag = 0
0 0 7f04a38bf700[35] 2 2.827239 src/mpi/coll/helper_fns.c 353 Leaving MPID_STATE_MPIC_SENDRECV
0 0 7f04a38bf700[35] 16384 2.827254 src/mpi/errhan/errutil.c 854 Error created: last=0000000000 class=0x0000000f MPIDI_POSIX_mpi_release_gather_comm_init(387) **fail
0 0 7f04a38bf700[35] 16384 2.827265 src/mpi/errhan/errutil.c 1038 New ErrorRing[126]
0 0 7f04a38bf700[35] 16384 2.827268 src/mpi/errhan/errutil.c 1040 id = 0x0000a50f
0 0 7f04a38bf700[35] 16384 2.827272 src/mpi/errhan/errutil.c 1042 prev_error = 0000000000
0 0 7f04a38bf700[35] 16384 2.827275 src/mpi/errhan/errutil.c 1045 user=0

For some reason, the shared memory allocation crashes the application; if I disable it, the problem disappears. Could there be any incompatibility with Docker? I need shared memory for the performance tests. Any ideas?

hzhou commented 3 years ago

Beware that Docker by default limits shared memory to 64MB. You can try increasing that size with --shm-size. Reference: https://stackoverflow.com/questions/30210362/how-to-increase-the-size-of-the-dev-shm-in-docker-container
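For example, something like this (a minimal sketch; the image name and application are placeholders, not taken from this report):

# start the container with a larger /dev/shm (image and command are placeholders)
docker run --shm-size=1g my-mpich-image mpiexec -n 8 ./lulesh
# inside the container, verify the new limit
df -h /dev/shm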

cesarpomar commented 3 years ago

Hi, I increased --shm-size to 1GB (I checked /dev/shm to be sure) and nothing changed.

hzhou commented 3 years ago

I see. Thanks for checking. Do you use any special CPU affinity binding?

cesarpomar commented 3 years ago

No, it is my docker container inspect. ` [ { "Id": "8c870b0c81cfde096f1c7fd7a4cfa987f1a08bbbdc4d825c786384f02dfdcc91", "Created": "2021-06-17T18:31:12.123755829Z", "Path": "ignis-server", "Args": [ "5000" ], "State": { "Status": "running", "Running": true, "Paused": false, "Restarting": false, "OOMKilled": false, "Dead": false, "Pid": 60383, "ExitCode": 0, "Error": "", "StartedAt": "2021-06-17T18:31:13.355624344Z", "FinishedAt": "0001-01-01T00:00:00Z" }, "Image": "sha256:7da623194c45caf924ec871b52467131f3f6066621d12025f0954aa202263692", "ResolvConfPath": "/var/lib/docker/containers/8c870b0c81cfde096f1c7fd7a4cfa987f1a08bbbdc4d825c786384f02dfdc c91/resolv.conf", "HostnamePath": "/var/lib/docker/containers/8c870b0c81cfde096f1c7fd7a4cfa987f1a08bbbdc4d825c786384f02dfdcc9 1/hostname", "HostsPath": "/var/lib/docker/containers/8c870b0c81cfde096f1c7fd7a4cfa987f1a08bbbdc4d825c786384f02dfdcc91/h osts", "LogPath": "/var/lib/docker/containers/8c870b0c81cfde096f1c7fd7a4cfa987f1a08bbbdc4d825c786384f02dfdcc91/8c8 70b0c81cfde096f1c7fd7a4cfa987f1a08bbbdc4d825c786384f02dfdcc91-json.log", "Name": "/mesos-787dee5b-7e15-42f3-81e3-daf4a19a423b", "RestartCount": 0, "Driver": "overlay2", "Platform": "linux", "MountLabel": "", "ProcessLabel": "", "AppArmorProfile": "", "ExecIDs": null, "HostConfig": { "Binds": [ "/var/lib/mesos/agent/slaves/f02e0986-4380-4e2e-bd13-82d9d65d7a25-S1/frameworks/df2c56ec-426f-4f3c- a2b0-9d6f77ca553a-0000/executors/ignis-2df86224-4f33-4272-92ea-22a774abcafb_cluster0-61f9b520-556f-4e9b-92ea-6a006f 416bff.instance-11075bd2-cf9a-11eb-9978-8e95b1181854._app.1/runs/787dee5b-7e15-42f3-81e3-daf4a19a423b:/mnt/mesos/sa ndbox", "/media/ignis-dfs/:/media/dfs:rw" ], "ContainerIDFile": "", "LogConfig": { "Type": "json-file", "Config": {} }, "NetworkMode": "bridge", "PortBindings": { "31187/tcp": [ { "HostIp": "", "HostPort": "31187" } ], "31188/tcp": [ { "HostIp": "", "HostPort": "31188" } ], "31189/tcp": [ { "HostIp": "", "HostPort": "31189" } ], "31190/tcp": [ { "HostIp": "", "HostPort": "31190" } ], "31191/tcp": [ { "HostIp": "", "HostPort": "31191" } ], "31192/tcp": [ { "HostIp": "", "HostPort": "31192" } ], "31193/tcp": [ { "HostIp": "", "HostPort": "31193" } ], "31194/tcp": [ { "HostIp": "", "HostPort": "31194" } ], "31195/tcp": [ { "HostIp": "", "HostPort": "31195" } ], "31196/tcp": [ { "HostIp": "", "HostPort": "31196" } ], "31197/tcp": [ { "HostIp": "", "HostPort": "31197" } ], "31198/tcp": [ { "HostIp": "", "HostPort": "31198" } ], "31199/tcp": [ { "HostIp": "", "HostPort": "31199" } ], "31200/tcp": [ { "HostIp": "", "HostPort": "31200" } ], "31201/tcp": [ { "HostIp": "", "HostPort": "31201" } ], "31202/tcp": [ { "HostIp": "", "HostPort": "31202" } ], "5000/tcp": [ { "HostIp": "", "HostPort": "31203" } ] }, "RestartPolicy": { "Name": "no", "MaximumRetryCount": 0 }, "AutoRemove": false, "VolumeDriver": "", "VolumesFrom": null, "CapAdd": null, "CapDrop": null, "CgroupnsMode": "host", "Dns": [], "DnsOptions": [], "DnsSearch": [], "ExtraHosts": [ "localhost:127.0.0.1", "localhost.localdomain:127.0.0.1", "localhost4:127.0.0.1", "localhost4.localdomain4:127.0.0.1", "master:192.168.1.1", "master.bd1:192.168.1.1", "mongos:192.168.1.1", "nodo1:192.168.1.11", "nodo1.bd1:192.168.1.11", "config1:192.168.1.11", "nodo2:192.168.1.12", "nodo2.bd1:192.168.1.12", "config2:192.168.1.12", "nodo3:192.168.1.13", "nodo3.bd1:192.168.1.13", "config3:192.168.1.13", "nodo4:192.168.1.14", "nodo4.bd1:192.168.1.14", "shard4:192.168.1.14", "nodo5:192.168.1.15", "nodo5.bd1:192.168.1.15", "shard5:192.168.1.15", 
"nodo6:192.168.1.16", "nodo6.bd1:192.168.1.16", "shard6:192.168.1.16", "nodo7:192.168.1.17", "nodo7.bd1:192.168.1.17", "shard7:192.168.1.17", "nodo8:192.168.1.18", "nodo8.bd1:192.168.1.18", "shard8:192.168.1.18", "nodo9:192.168.1.19", "nodo9.bd1:192.168.1.19", "shard9:192.168.1.19", "nodo10:192.168.1.20", "nodo10.bd1:192.168.1.20", "shard10:192.168.1.20", "nodo11:192.168.1.21", "nodo11.bd1:192.168.1.21", "shard11:192.168.1.21", "nodo12:192.168.1.22", "nodo12.bd1:192.168.1.22", "shard12:192.168.1.22", "nodo13:192.168.1.23", "nodo13.bd1:192.168.1.23", "shard13:192.168.1.23", "nodo14:192.168.1.24", "nodo14.bd1:192.168.1.24", "shard14:192.168.1.24", "nodo15:192.168.1.25", "nodo15.bd1:192.168.1.25", "shard15:192.168.1.25" ], "GroupAdd": null, "IpcMode": "shareable", "Cgroup": "", "Links": null, "OomScoreAdj": 0, "PidMode": "", "Privileged": false, "PublishAllPorts": false, "ReadonlyRootfs": false, "SecurityOpt": null, "UTSMode": "", "UsernsMode": "", "ShmSize": 1047527424, "Runtime": "runc", "ConsoleSize": [ 0, 0 ], "Isolation": "", "CpuShares": 4096, "Memory": 99999547392, "NanoCpus": 0, "CgroupParent": "", "BlkioWeight": 0, "BlkioWeightDevice": [], "BlkioDeviceReadBps": null, "BlkioDeviceWriteBps": null, "BlkioDeviceReadIOps": null, "BlkioDeviceWriteIOps": null, "CpuPeriod": 0, "CpuQuota": 0, "CpuRealtimePeriod": 0, "CpuRealtimeRuntime": 0, "CpusetCpus": "", "CpusetMems": "", "Devices": [], "DeviceCgroupRules": null, "DeviceRequests": null, "KernelMemory": 0, "KernelMemoryTCP": 0, "MemoryReservation": 0, "MemorySwap": 199999094784, "MemorySwappiness": null, "OomKillDisable": false, "PidsLimit": null, "Ulimits": null, "CpuCount": 0, "CpuPercent": 0, "IOMaximumIOps": 0, "IOMaximumBandwidth": 0, "MaskedPaths": [ "/proc/asound", "/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ], "ReadonlyPaths": [ "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] }, "GraphDriver": { "Data": { "LowerDir": "/var/lib/docker/overlay2/b7408f02fa9ee8d9e1a72444c67390289f64ebbefb3524c54fe69a23ebe65 f1a-init/diff:/var/lib/docker/overlay2/7898603773d72f57340bd004cb6b232f65bd6ad1bfc39d041af49557559f7bc7/diff:/var/l ib/docker/overlay2/904e3c605cdc3aeace908acbff004b45e5f1d5de18c65a5f19f0ef65f816ed89/diff:/var/lib/docker/overlay2/f 05cb1d49e18cdb7c14859425e9513f1ee51df9c650fa5827747f669436aa745/diff:/var/lib/docker/overlay2/feb54055d9ef41d84e7ec c27a8a4c5bd7161dd97cd69545127ccac58f5ba2a40/diff:/var/lib/docker/overlay2/c7c386d0ddc67561911c9b5d0673a27de665a0a2e 10c99c57a7897a4d10e7fd3/diff:/var/lib/docker/overlay2/bb942b19c97b0c08900dc0b6ed2fb5b921338d13e042e4e6396f25b8f3cb2 6df/diff:/var/lib/docker/overlay2/3d2e656d62450db473fd6614efcd78400d6bd02972d626a16dea4efaea6482b4/diff:/var/lib/do cker/overlay2/adc6fcc64841a62eeab314bbcc1f092a552031d1b84b602eae6603d0dc9dada4/diff:/var/lib/docker/overlay2/d6d0ac 249a4dd42bfd3c9b738b09f1c5fdbee76eb1b03860edaab222db6182ad/diff:/var/lib/docker/overlay2/afa015f8c6598e662c1440ab66 aaf7dcc383867c28492ee117efa49cad363a2f/diff:/var/lib/docker/overlay2/3ab2e0568dee55cf2858f6620d4eb2e63748e8923ea8eb 99ce7df198da65f82a/diff:/var/lib/docker/overlay2/138a53e23ea21d78443ce5374342518fbf28253cfcd22800afc7caf9016a84f9/d iff", "MergedDir": "/var/lib/docker/overlay2/b7408f02fa9ee8d9e1a72444c67390289f64ebbefb3524c54fe69a23ebe6 5f1a/merged", "UpperDir": "/var/lib/docker/overlay2/b7408f02fa9ee8d9e1a72444c67390289f64ebbefb3524c54fe69a23ebe65 f1a/diff", "WorkDir": 
"/var/lib/docker/overlay2/b7408f02fa9ee8d9e1a72444c67390289f64ebbefb3524c54fe69a23ebe65f 1a/work" }, "Name": "overlay2" }, "Mounts": [ { "Type": "bind", "Source": "/var/lib/mesos/agent/slaves/f02e0986-4380-4e2e-bd13-82d9d65d7a25-S1/frameworks/df2c56ec- 426f-4f3c-a2b0-9d6f77ca553a-0000/executors/ignis-2df86224-4f33-4272-92ea-22a774abcafb_cluster0-61f9b520-556f-4e9b-9 2ea-6a006f416bff.instance-11075bd2-cf9a-11eb-9978-8e95b1181854._app.1/runs/787dee5b-7e15-42f3-81e3-daf4a19a423b", "Destination": "/mnt/mesos/sandbox", "Mode": "", "RW": true, "Propagation": "rprivate" }, { "Type": "bind", "Source": "/media/ignis-dfs", "Destination": "/media/dfs", "Mode": "rw", "RW": true, "Propagation": "rprivate" } ], "Config": { "Hostname": "8c870b0c81cf", "Domainname": "", "User": "", "AttachStdin": false, "AttachStdout": true, "AttachStderr": true, "ExposedPorts": { "31187/tcp": {}, "31188/tcp": {}, "31189/tcp": {}, "31190/tcp": {}, "31191/tcp": {}, "31192/tcp": {}, "31193/tcp": {}, "31194/tcp": {}, "31195/tcp": {}, "31196/tcp": {}, "31197/tcp": {}, "31198/tcp": {}, "31199/tcp": {}, "31200/tcp": {}, "31201/tcp": {}, "31202/tcp": {}, "5000/tcp": {} }, "Tty": false, "OpenStdin": false, "StdinOnce": false, "Env": [ "PORT10=31197", "PORT15=31202", "PORT_31187=31187", "PORT_31202=31202", "MARATHON_APP_ID=/ignis-2df86224-4f33-4272-92ea-22a774abcafb/cluster0-61f9b520-556f-4e9b-92ea-6a006 f416bff", "MARATHON_APP_LABELS=SHM", "MARATHON_APP_LABEL_SHM=953", "PORT=31187", "PORT3=31190", "IGNIS_DRIVER_PUBLIC_KEY=ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCdY9Dr69tGFhGKCB6qTw3OoFSEspMtFMjj2Y px8ElA488Na4FjDzRhPEjJX6zDOjWsgaEg3ga/NVQnPHzE5YMoxU5Ag2NeeGvFSkabsBWoX1Hzo0+1IAERfwgNCRpAA5eVWA+8YybRioRavocrojX/R 9BxkzJcFsvmZ/xmdpbkpTCkYlHfVxj7uBh/oTS6cwERlHSubz+YIgBsPXdpCSeMjrWGPCGxUvlhf9Bl/6l72QlmRQxHhgGKXJkZNEIFAb9UFUiVXvUg kKM586OxIwW6uovDA5mP7w1WqEnQO8dKQ6lCQ2xxyUtgqn36o7g1K5g2lwLAEPaNQR2gIlOr5VpR \n", "IGNIS_JOB_NAME=/ignis-2df86224-4f33-4272-92ea-22a774abcafb/cluster0-61f9b520-556f-4e9b-92ea-6a006f 416bff", "MPICH_DBG_LEVEL=VERBOSE", "PORT2=31189", "PORT_31192=31192", "MARATHON_APP_RESOURCE_DISK=0.0", "MPICH_DBG_CLASS=ALL", "PORT14=31201", "IGNIS_DRIVER_HEALTHCHECK_RETRIES=5", "PORT6=31193", "PORT_31191=31191", "PORT_31201=31201", "IGNIS_DRIVER_HEALTHCHECK_TIMEOUT=20", "PORT0=31187", "PORT4=31191", "HOST=nodo2", "MARATHON_APP_VERSION=2021-06-17T18:30:18.114Z", "PORT_31194=31194", "MESOS_CONTAINER_NAME=mesos-787dee5b-7e15-42f3-81e3-daf4a19a423b", "PORT7=31194", "IGNIS_DRIVER_HEALTHCHECK_URL=http://nodo13:31680", "PORT_31189=31189", "MESOS_ALLOCATION_ROLE=*", "PORT_31198=31198", "PORT16=31203", "PORT9=31196", "PORT_31190=31190", "PORT_31196=31196", "MARATHON_APP_RESOURCE_CPUS=4.0", "MARATHON_APP_RESOURCE_GPUS=0", "PORT13=31200", "IGNIS_DRIVER_HEALTHCHECK_INTERVAL=60", "PORT_31200=31200", "PORT_5000=31203", "PORT1=31188", "PORTS=31187,31188,31189,31190,31191,31192,31193,31194,31195,31196,31197,31198,31199,31200,31201,31 202,31203", "PORT_31197=31197", "MARATHON_APP_DOCKER_IMAGE=nodo3:5000/ignishpc/full", "MESOS_SANDBOX=/mnt/mesos/sandbox", "TZ=Europe/Madrid", "PORT12=31199", "PORT_31193=31193", "PORT_31199=31199", "PORT5=31192", "IGNIS_JOB_GROUP=ignis-2df86224-4f33-4272-92ea-22a774abcafb", "MARATHON_APP_RESOURCE_MEM=95367.0", "MESOS_TASK_ID=ignis-2df86224-4f33-4272-92ea-22a774abcafb_cluster0-61f9b520-556f-4e9b-92ea-6a006f41 6bff.instance-11075bd2-cf9a-11eb-9978-8e95b1181854._app.1", "PORT_31195=31195", "PORT11=31198", "PORT8=31195", "PORT_31188=31188", 
"PATH=/opt/ignis/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "IGNIS_HOME=/opt/ignis" ], "Cmd": [ "ignis-server", "5000" ], "Image": "nodo3:5000/ignishpc/full", "Volumes": null, "WorkingDir": "", "Entrypoint": null, "OnBuild": null, "Labels": { "MESOS_TASK_ID": "ignis-2df86224-4f33-4272-92ea-22a774abcafb_cluster0-61f9b520-556f-4e9b-92ea-6a006 f416bff.instance-11075bd2-cf9a-11eb-9978-8e95b1181854._app.1", "ignis": "1.0" } }, "NetworkSettings": { "Bridge": "", "SandboxID": "c1009be1a8a3c72ed2e7efccce84b4e2bcb7a87069bc8d84857448c9bc5fff7d", "HairpinMode": false, "LinkLocalIPv6Address": "", "LinkLocalIPv6PrefixLen": 0, "Ports": { "31187/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31187" } ], "31188/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31188" } ], "31189/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31189" } ], "31190/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31190" } ], "31191/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31191" } ], "31192/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31192" } ], "31193/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31193" } ], "31194/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31194" } ], "31195/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31195" } ], "31196/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31196" } ], "31197/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31197" } ], "31198/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31198" } ], "31199/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31199" } ], "31200/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31200" } ], "31201/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31201" } ], "31202/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31202" } ], "5000/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "31203" } ] }, "SandboxKey": "/var/run/docker/netns/c1009be1a8a3", "SecondaryIPAddresses": null, "SecondaryIPv6Addresses": null, "EndpointID": "8954ccc88940aa1bf3897751254ee3ca4631591301d8f0c40fe79d0727f65354", "Gateway": "172.17.0.1", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0, "IPAddress": "172.17.0.3", "IPPrefixLen": 16, "IPv6Gateway": "", "MacAddress": "02:42:ac:11:00:03", "Networks": { "bridge": { "IPAMConfig": null, "Links": null, "Aliases": null, "NetworkID": "f4374f3dcf2764fe5945cce002a98b0c17fe4b52790395d45c95a64eb82344da", "EndpointID": "8954ccc88940aa1bf3897751254ee3ca4631591301d8f0c40fe79d0727f65354", "Gateway": "172.17.0.1", "IPAddress": "172.17.0.3", "IPPrefixLen": 16, "IPv6Gateway": "", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0, "MacAddress": "02:42:ac:11:00:03", "DriverOpts": null } } } } ]

`

hzhou commented 3 years ago

Could you try setting the environment variable MPIR_CVAR_DEVICE_COLLECTIVES=0?

cesarpomar commented 3 years ago

If I set this variable, the process fails with "Abort(566543): Fatal error in PMPI_Init: Other MPI error, **cvar_val MPIR_CVAR_DEVICE_COLLECTIVES 0"

hzhou commented 3 years ago

Which mpich release are you using?

cesarpomar commented 3 years ago

3.4.1

cesarpomar commented 3 years ago

coredump trace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd26104e47b in MPL_atomic_release_store_uint64 (val=8, ptr=0x180) at /tmp/mpi/mpich-3.4.1/src/mpl/include/mpl_atomic_c11.h:103
103     /tmp/mpi/mpich-3.4.1/src/mpl/include/mpl_atomic_c11.h: No such file or directory.
[Current thread is 1 (Thread 0x7fd25e7ad700 (LWP 66))]
(gdb) where
#0  0x00007fd26104e47b in MPL_atomic_release_store_uint64 (val=8, ptr=0x180) at /tmp/mpi/mpich-3.4.1/src/mpl/include/mpl_atomic_c11.h:103
#1  MPIDI_POSIX_mpi_release_gather_comm_init (comm_ptr=comm_ptr@entry=0x7fd258046de0, operation=operation@entry=MPIDI_POSIX_RELEASE_GATHER_OPCODE_ALLREDUCE)
    at src/mpid/ch4/shm/posix/release_gather/release_gather.c:396
#2  0x00007fd260c2ee65 in MPIDI_POSIX_mpi_allreduce_release_gather (sendbuf=0x7fd25e7ab658, recvbuf=recvbuf@entry=0x7fd25e7ab6a0, count=count@entry=1, datatype=datatype@entry=1275070475,
    op=op@entry=1476395010, comm_ptr=0x7fd258046de0, errflag=0x7fd25e7ab4e4) at ./src/mpid/ch4/shm/src/../posix/posix_coll_release_gather.h:370
#3  0x00007fd260c30ac0 in MPIDI_POSIX_mpi_allreduce (errflag=0x7fd25e7ab4e4, comm=0x7fd258046de0, op=1476395010, datatype=1275070475, count=<optimized out>, recvbuf=<optimized out>,
    sendbuf=0x7fd25e7ab658) at ./src/mpid/ch4/shm/src/../posix/posix_coll.h:237
#4  MPIDI_SHM_mpi_allreduce (errflag=0x7fd25e7ab4e4, comm=<optimized out>, op=1476395010, datatype=1275070475, count=<optimized out>, recvbuf=<optimized out>, sendbuf=<optimized out>)
    at ./src/mpid/ch4/shm/src/shm_coll.h:49
#5  MPIDI_Allreduce_intra_composition_gamma (errflag=0x7fd25e7ab4e4, comm=<optimized out>, op=1476395010, datatype=1275070475, count=<optimized out>, recvbuf=<optimized out>,
    sendbuf=<optimized out>) at ./src/mpid/ch4/src/ch4_coll_impl.h:316
#6  MPID_Allreduce (errflag=0x7fd25e7ab4e4, comm=0x7fd258046de0, op=1476395010, datatype=1275070475, count=<optimized out>, recvbuf=<optimized out>, sendbuf=0x7fd25e7ab658)
    at ./src/mpid/ch4/src/ch4_coll.h:148
#7  MPIR_Allreduce (sendbuf=sendbuf@entry=0x7fd25e7ab658, recvbuf=<optimized out>, recvbuf@entry=0x7fd25e7ab6a0, count=count@entry=1, datatype=datatype@entry=1275070475,
    op=op@entry=1476395010, comm_ptr=comm_ptr@entry=0x7fd258046de0, errflag=0x7fd25e7ab4e4) at src/mpi/coll/allreduce/allreduce.c:262
#8  0x00007fd260c31628 in PMPI_Allreduce (sendbuf=0x7fd25e7ab658, recvbuf=0x7fd25e7ab6a0, count=1, datatype=1275070475, op=1476395010, comm=-1006632960)
    at src/mpi/coll/allreduce/allreduce.c:387
#9  0x00007fd25d784fa7 in main () from LULESH/lulesh

hzhou commented 3 years ago

Could you try MPIR_CVAR_DEVICE_COLLECTIVES=none?
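For example, a sketch of how it could be set for the run (assuming the Hydra mpiexec that ships with MPICH; the process count and application name are placeholders):

# export it in the environment before launching
export MPIR_CVAR_DEVICE_COLLECTIVES=none
mpiexec -n 4 ./lulesh

# or pass it through the launcher
mpiexec -genv MPIR_CVAR_DEVICE_COLLECTIVES none -n 4 ./lulesh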

cesarpomar commented 3 years ago

It seems to work; I'll do more testing to make sure. How does MPIR_CVAR_DEVICE_COLLECTIVES affect MPICH performance? I am testing multiple scientific applications and need the best possible results.

hzhou commented 3 years ago

Alright, so we have identified that the issue is only related to the release-gather algorithms. Next, let's try to identify why it fails in Docker. Could you run the application outside Docker?

cesarpomar commented 3 years ago

Yes, and it works. How can I help find the error?

zhenggb72 commented 3 years ago

Just trying to narrow down the issue: can you run with

export MPIR_CVAR_ENABLE_INTRANODE_TOPOLOGY_AWARE_TREES=0

to disable the topology-aware trees used by release_gather?

cesarpomar commented 3 years ago

If I remove export MPIR_CVAR_DEVICE_COLLECTIVES=none and use export MPIR_CVAR_ENABLE_INTRANODE_TOPOLOGY_AWARE_TREES=0, the program fails again with the same error.

zhenggb72 commented 3 years ago

Just saw the stack trace above. It looks like the following call may fail, so MPL_atomic_release_store_uint64 was called with an invalid pointer. It may be related to what @hzhou mentioned about the limit?

        mpi_errno =
            MPIDU_shm_alloc(comm_ptr, flags_shm_size,
                            (void **) &(release_gather_info_ptr->flags_addr), &mapfail_flag);

cesarpomar commented 3 years ago

I increased the limit with --shm-size=1GB and the problem was the same.

zhenggb72 commented 3 years ago

Did you build MPICH from source?

cesarpomar commented 3 years ago

Yes, "./configure --with-device=ch4:ofi --with-libfabric=embedded --enable-g=dbg,log --enable-thread-cs=per-vci --with-ch4-max-vcis=${MPICH_THREADS}"

hzhou commented 3 years ago

@cesarpomar Could you print the mpi_errno right after

mpi_errno =
            MPIDU_shm_alloc(comm_ptr, flags_shm_size,
                            (void **) &(release_gather_info_ptr->flags_addr), &mapfail_flag);

So we can confirm whether it is a shared memory allocation issue?

zhenggb72 commented 3 years ago

Also, it may help to print the value of "flags_shm_size". This is the size MPICH tried to allocate; it increases as the number of ranks increases.

cesarpomar commented 3 years ago

mpi_errno is already printed in the log.

mpi_errno =
    MPIDU_shm_alloc(comm_ptr, flags_shm_size,
                    (void **) &(release_gather_info_ptr->flags_addr), &mapfail_flag);
if (mpi_errno || mapfail_flag) {
    /* for communication errors, just record the error but continue */
    errflag =
        MPIX_ERR_PROC_FAILED ==
        MPIR_ERR_GET_CLASS(mpi_errno) ? MPIR_ERR_PROC_FAILED : MPIR_ERR_OTHER;
    MPIR_ERR_SET(mpi_errno, errflag, "**fail");
    MPIR_ERR_ADD(mpi_errno_ret, mpi_errno);
}

The output of MPIR_ERR_SET is "Error created: last=0000000000 class=0x0000000f MPIDI_POSIX_mpi_release_gather_comm_init(387) **fail", so if mpi_errno_ret is "last", then mpi_errno_ret is 0000000000. The problem must be mapfail_flag: shm_alloc sets mapfail_flag=true even if there are no errors in the allocation, for example:

if (strlen(serialized_hnd) == 0) goto map_fail;

hzhou commented 3 years ago

I think it points to shared memory allocation failures. How many ranks did you try to run before the segfault happened?

cesarpomar commented 3 years ago

It only works with 2 ranks.

cesarpomar commented 3 years ago

I have narrowed the problem down to a base case. If I launch all the processes in the same Docker container, no problem appears. If I launch the processes spread over two containers, it fails.

Normal functions like send, recv, gather, and scatter always work, but MPI_Allreduce fails with the shm problem. Maybe MPICH tries to use shared memory between processes in different containers. I tried to launch the containers on different hosts, but it failed again.

cesarpomar commented 3 years ago

Looking at the MPICH source code and the previous trace, I think the error is in MPID_Allreduce (src/mpid/ch4/src/ch4_coll.h), which should call MPIR_Allreduce_impl but calls MPIDI_Allreduce_intra_composition_gamma instead. With MPIR_CVAR_DEVICE_COLLECTIVES=none, MPICH calls the MPIR_Allreduce_impl function from MPIR_Allreduce (src/mpi/coll/allreduce/allreduce.c), so it works. The function MPIDI_Allreduce_intra_composition_gamma tries to attach the processes to shared memory even if they are not on the same node/container. I think MPIR_Csel_search inside MPID_Allreduce (src/mpid/ch4/src/ch4_coll.h) should return NULL. When I launch the processes outside Docker on multiple nodes, I use mpirun to launch them together, but when I use Docker, I launch the processes independently and then connect them using the open/connect/accept functions (all processes start with a COMM_WORLD of size 1, then create a communicator dynamically and execute all code with it). Maybe MPIR_Csel_search doesn't detect that the processes in the communicator are on different nodes and returns node->success, enabling the shared memory optimization. What do you think?

hzhou commented 3 years ago

Now I see what you are doing. Apparently, you don't want to use shared memory in this case. I guess processes launched in different containers are in different namespaces and can't access the same shared memory anyway. So setting MPIR_CVAR_DEVICE_COLLECTIVES=none is probably what you want.

raffenet commented 3 years ago

Does your app create/free a lot of communicators? There was a known issue with leaking release_gather resources that was fixed in https://github.com/pmodels/mpich/pull/4864.

If you update to MPICH 3.4.2, it has the fix included.
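A rough sketch of the upgrade, reusing the configure options quoted earlier in this thread (the download URL assumes the usual mpich.org layout, and the install prefix is just an example):

# download and rebuild with the same options used for 3.4.1 (prefix is an example)
wget https://www.mpich.org/static/downloads/3.4.2/mpich-3.4.2.tar.gz
tar xzf mpich-3.4.2.tar.gz
cd mpich-3.4.2
./configure --with-device=ch4:ofi --with-libfabric=embedded \
    --enable-g=dbg,log --enable-thread-cs=per-vci \
    --with-ch4-max-vcis=${MPICH_THREADS} --prefix=/usr/local
make -j$(nproc) && make install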

hzhou commented 3 years ago

When I launch the processes outside Docker on multiple nodes, I use mpirun to launch them together, but when I use Docker, I launch the processes independently and then connect them using the open/connect/accept functions (all processes start with a COMM_WORLD of size 1, then create a communicator dynamically and execute all code with it). Maybe MPIR_Csel_search doesn't detect that the processes in the communicator are on different nodes and returns node->success, enabling the shared memory optimization. What do you think?

Processes from different MPI_COMM_WORLD cannot use shared memory (due to missing collective initialization). @cesarpomar I opened a separate issue #5376 tracking it. Please let me know if it is ok to close this issue.

hzhou commented 3 years ago

@cesarpomar Just to confirm that this is not particular to Docker: if you do the same outside Docker -- on a single node, launch the processes separately and connect them with open/accept/connect -- does it result in the same issue?

cesarpomar commented 3 years ago

Yes, same issue. I will try MPICH 3.4.2.

cesarpomar commented 3 years ago

After upgrading to MPICH 3.4.2, nothing changed. If I remove "MPIR_CVAR_DEVICE_COLLECTIVES=none", the processes crash.

hzhou commented 3 years ago

@cesarpomar What is the minimum number of processes needed to reproduce the issue?

hzhou commented 3 years ago

@cesarpomar I still need some details to reproduce the failures. Do you think you can provide a minimal reproducer (ideally outside Docker)?

cesarpomar commented 3 years ago

Sorry. Create three processes (at least one on a different host) and join them in an intra-communicator. If you use functions like Allgather or Allreduce, the application will crash.

cesarpomar commented 3 years ago

My source code in C++:

MPI::Intracomm addComm(MPI::Intracomm &group, MPI::Intercomm &comm, bool leader, bool detroyGroup) {
    MPI::Intercomm peer;
    MPI::Intercomm new_comm;
    MPI::Intracomm new_group;

    if (comm != MPI::COMM_NULL) { peer = comm.Merge(!leader); }

    new_comm = group.Create_intercomm(0, peer, leader ? 1 : 0, 1963);

    new_group = new_comm.Merge(!leader);

    if (comm != MPI::COMM_NULL) { peer.Free(); }
    new_comm.Free();

    if (detroyGroup) { group.Free(); }

    return new_group;
 }

Process 1:

    MPI::Intracomm comm = MPI::COMM_WORLD;
    MPI::Intercomm peer;
    peer = MPI::COMM_SELF.Accept(port, MPI::INFO_NULL, 0);
    comm = addComm(comm, peer, true, comm != MPI::COMM_WORLD);
    peer = MPI::COMM_SELF.Accept(port, MPI::INFO_NULL, 0);
    comm = addComm(comm, peer, true, comm != MPI::COMM_WORLD);

Process 2:

    MPI::Intracomm comm = MPI::COMM_WORLD;
    MPI::Intercomm peer;
    peer = comm.Connect(port, MPI::INFO_NULL, 0);
    comm = addComm(comm, peer, false, comm != MPI::COMM_WORLD);
    comm = addComm(comm, peer, true, comm != MPI::COMM_WORLD);

Process 3:

    MPI::Intracomm comm = MPI::COMM_WORLD;
    MPI::Intercomm peer;
    peer = comm.Connect(port, MPI::INFO_NULL, 0);
    comm = addComm(comm, peer, false, comm != MPI::COMM_WORLD);

comm is the resulting Intracomm with size=3. If you execute a Gather or Bcast, it works, but an AllReduce crashes.

hzhou commented 3 years ago

Thanks for your code. We'll look into it.

hzhou commented 3 years ago

@cesarpomar Because you are using the same port to accept multiple peers, how do you prevent process 3 from connecting before process 2?

cesarpomar commented 3 years ago

This code is part of my PhD thesis framework, where multiple executors run Big Data codes using MPI inside Docker containers. The executors are synchronized with RPC calls from a master, so the group is created step by step: when P1 and P2 return from the RPC call, a new call is made with the three processes. To reproduce the bug, I launched the processes manually in order, using a print statement to know when P1 has accepted P2 before running P3.

hzhou commented 3 years ago

ok

hzhou commented 3 years ago

I just tested with 3 processes on two nodes. Placing process 0, 1, or 2 on a separate node all works fine.

I used slightly different testing code:

// -- t.c --
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int mpi_size;
int mpi_id;

int main(int argc, char** argv)
{
    MPI_Comm intercomm;
    MPI_Comm comm1;
    MPI_Comm comm2;
    MPI_Comm comm;
    const char *port_txt = "port.txt";
    char port[MPI_MAX_PORT_NAME];
    FILE *file_out;
    FILE *file_in;
    int result;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);

    int id = atoi(argv[1]);
    int np = atoi(argv[2]);
    fprintf(stdout, "    :id=%d, np=%d\n", id, np);

    if (id == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        file_out = fopen(port_txt, "wb");
        if (file_out == NULL) {
            fprintf(stderr, "Can't write %s\n", port_txt);
            exit(-1);
        } else {
            fprintf(file_out, "%s\n", port);
            fclose(file_out);
        }
        fprintf(stdout, "    :port=%s\n", port);

        comm = MPI_COMM_SELF;
        for (int i = 0; i<np-1; i++) {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, comm, &intercomm);
            MPI_Intercomm_merge(intercomm, 0, &comm1);
            MPI_Comm_free(&intercomm);
            comm = comm1;
        }

        MPI_Close_port(port);
    } else {
        file_in = fopen(port_txt, "rb");
        if (file_in == NULL) {
            fprintf(stderr, "Can't open %s\n", port_txt);
            exit(-1);
        } else {
            fgets(port, MPI_MAX_PORT_NAME, file_in);
            fclose(file_in);
        }
        fprintf(stdout, "    :port=%s\n", port);

        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Intercomm_merge(intercomm, 1, &comm1);
        MPI_Comm_free(&intercomm);
        comm = comm1;

        for (int i = 0; i<np-1-id; i++) {
            MPI_Comm_accept(NULL, MPI_INFO_NULL, 0, comm, &intercomm);
            MPI_Intercomm_merge(intercomm, 0, &comm1);
            MPI_Comm_free(&intercomm);
            MPI_Comm_free(&comm);
            comm = comm1;
        }
    }

    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Allreduce(&rank, &result, 1, MPI_INT, MPI_SUM, comm);
    printf("id=%d, rank=%d - All done! result = %d\n", id, rank, result);

    if (comm != MPI_COMM_SELF) {
        MPI_Comm_free(&comm);
    }

    MPI_Finalize();
    return 0;
}

Launch on first node with

# -- t0.sh --
port_file="port.txt"
np=3
np1=$(expr $np - 1)

rm -vf $port_file

for i in $(seq 0 $np1) ; do
    if test $i = 0 ; then
        echo launch ./t $i $np ...
        mpirun -n 1 ./t $i $np &
        sleep 1
    else
        echo wait for process $i ...
        sleep 5
    fi
done

wait

launch on 2nd node with

port_file="port.txt"
np=3
np1=$(expr $np - 1)

# rm -vf $port_file

for i in 1 2 ; do
    echo launch ./t $i $np ...
    mpirun -n 1 ./t $i $np &
    sleep 1
done

wait

Adjust the skip process list to test different scenarios.

They all seem to run fine in my tests. Which MPICH version were you testing with?

cesarpomar commented 3 years ago

Could it be the way the group is created? Your way is simpler and cleaner than mine. In my example code, I use three functions to create the final communicator (Merge, Create_intercomm, Merge), and you only use MPI_Intercomm_merge. Moreover, I don't call accept on the processes already in the group; only process 0 calls accept and the new process connects.

It could be that my implementation creates communicators that cause problems with the Allreduce and Allgather functions. Although my implementation works with the other functions, could this be the problem?

hzhou commented 3 years ago

Once you have the intra-communicator, it should work the same, I believe. I was more worried about the interference during your connections. Can you confirm that they are not interfering? But before we do the guessing game, can you try my example and confirm it is working (or not)?

cesarpomar commented 3 years ago

OK, I'm testing it.

cesarpomar commented 3 years ago

I just tested your code and it works. After comparing it with my code, I found a way to make your code fail: a single MPI_Allreduce works, but if we add more, the error appears.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int mpi_size;
int mpi_id;

int main(int argc, char** argv)
{
    MPI_Comm intercomm;
    MPI_Comm comm1;
    MPI_Comm comm2;
    MPI_Comm comm;
    const char *port_txt = "port.txt";
    char port[MPI_MAX_PORT_NAME];
    FILE *file_out;
    FILE *file_in;
    int result;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);

    int id = atoi(argv[1]);
    int np = atoi(argv[2]);
    fprintf(stdout, "    :id=%d, np=%d\n", id, np);

    if (id == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        file_out = fopen(port_txt, "wb");
        if (file_out == NULL) {
            fprintf(stderr, "Can't write %s\n", port_txt);
            exit(-1);
        } else {
            fprintf(file_out, "%s\n", port);
            fclose(file_out);
        }
        fprintf(stdout, "    :port=%s\n", port);

        comm = MPI_COMM_SELF;
        for (int i = 0; i<np-1; i++) {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, comm, &intercomm);
            MPI_Intercomm_merge(intercomm, 0, &comm1);
            MPI_Comm_free(&intercomm);
            comm = comm1;
        }

        MPI_Close_port(port);
    } else {
        file_in = fopen(port_txt, "rb");
        if (file_in == NULL) {
            fprintf(stderr, "Can't open %s\n", port_txt);
            exit(-1);
        } else {
            fgets(port, MPI_MAX_PORT_NAME, file_in);
            fclose(file_in);
        }
        fprintf(stdout, "    :port=%s\n", port);

        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Intercomm_merge(intercomm, 1, &comm1);
        MPI_Comm_free(&intercomm);
        comm = comm1;

        for (int i = 0; i<np-1-id; i++) {
            MPI_Comm_accept(NULL, MPI_INFO_NULL, 0, comm, &intercomm);
            MPI_Intercomm_merge(intercomm, 0, &comm1);
            MPI_Comm_free(&intercomm);
            MPI_Comm_free(&comm);
            comm = comm1;
        }
    }

    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Allreduce(&rank, &result, 1, MPI_INT, MPI_SUM, comm);
    MPI_Allreduce(&rank, &result, 1, MPI_INT, MPI_SUM, comm);
    MPI_Allreduce(&rank, &result, 1, MPI_INT, MPI_SUM, comm);
    MPI_Allreduce(&rank, &result, 1, MPI_INT, MPI_SUM, comm);
    MPI_Allreduce(&rank, &result, 1, MPI_INT, MPI_SUM, comm);
    printf("id=%d, rank=%d - All done! result = %d\n", id, rank, result);

    if (comm != MPI_COMM_SELF) {
        MPI_Comm_free(&comm);
    }

    MPI_Finalize();
    return 0;
}

tc.c:80 is the last MPI_Allreduce.

Core was generated by `./t 1 3'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1788b1cfd7 in MPIDI_POSIX_mpi_release_gather_comm_init () from /usr/local/lib/libmpi.so.12
[Current thread is 1 (Thread 0x7f1788391740 (LWP 314))]
(gdb) bt
#0  0x00007f1788b1cfd7 in MPIDI_POSIX_mpi_release_gather_comm_init () from /usr/local/lib/libmpi.so.12
#1  0x00007f17888cc9ea in MPIR_Allreduce () from /usr/local/lib/libmpi.so.12
#2  0x00007f17888ccb3f in PMPI_Allreduce () from /usr/local/lib/libmpi.so.12
#3  0x0000559ec281484e in main (argc=3, argv=0x7ffe387d3118) at tc.c:80

MPIDI_POSIX_mpi_release_gather_comm_init is the same problematic function.

hzhou commented 3 years ago

Yes, I have reproduced the bug.

hzhou commented 3 years ago

@cesarpomar Could you try applying the patch in #5440 to see if it fixes the issue?

cesarpomar commented 3 years ago

Yes. Everything works perfectly. No bugs, no segfault. Thank you very much for all your efforts.