openai / glow

Code for reproducing results in "Glow: Generative Flow with Invertible 1x1 Convolutions"
https://arxiv.org/abs/1807.03039
MIT License

mpiexec hangs on creating pad #92

Open 0xymoro opened 5 years ago

0xymoro commented 5 years ago

Hi, quick issue with mpiexec. Without it, the program runs fine on 1 GPU (I am running Horovod inside a Docker container), but mpiexec hangs whenever it's invoked.

I ran strace on it, and it hangs after this sequence of "Creating pad" messages; any hints would be appreciated!

write(1, "Creating pad 1_1_6_6\n", 21) = 21
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=30, events=POLLIN}, {fd=28, events=POLLIN}, {fd=0, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, -1) = 1 ([{fd=24, revents=POLLIN}])
read(24, "Creating pad 1_1_4_4\n", 4096) = 21
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=30, events=POLLIN}, {fd=28, events=POLLIN}, {fd=0, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, 0) = 0 (Timeout)
write(1, "Creating pad 1_1_4_4\n", 21) = 21
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=30, events=POLLIN}, {fd=28, events=POLLIN}, {fd=0, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, -1
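For reading a trace like this, it can help to pick out the call that never returned. The final poll has a timeout of -1 and no return value, which suggests the traced process is blocked waiting for input on those descriptors; that would be consistent with the launcher waiting on its ranks, though the trace alone does not show why they went quiet. Below is a minimal sketch of that check. The helper names `split_syscalls` and `unfinished_calls` are hypothetical, the splitting rule is a heuristic (it assumes no quoted argument contains something that looks like `name(`), and the sample is a shortened illustration modeled on the trace, not the real output.

```python
import re


def split_syscalls(trace: str) -> list[str]:
    """Split flattened strace text into one record per syscall.

    Heuristic: a new record starts where whitespace is followed by a
    lowercase identifier and an opening parenthesis, e.g. " poll(".
    """
    parts = re.split(r"\s+(?=[a-z_][a-z0-9_]*\()", trace.strip())
    return [p for p in parts if p]


def unfinished_calls(trace: str) -> list[str]:
    """Return records with no ") = <result>" suffix, i.e. calls still blocked."""
    return [c for c in split_syscalls(trace) if not re.search(r"\)\s+=\s+\S", c)]


# Shortened, illustrative sample (assumption: fewer descriptors than the real trace).
sample = (
    'write(1, "Creating pad 1_1_4_4\\n", 21) = 21 '
    "poll([{fd=0, events=POLLIN}, {fd=24, events=POLLIN}], 2, 0) = 0 (Timeout) "
    "poll([{fd=0, events=POLLIN}, {fd=24, events=POLLIN}], 2, -1"
)

calls = split_syscalls(sample)
blocked = unfinished_calls(sample)

for call in blocked:
    print("still blocked in:", call)
```

Running strace with `-f` so that child processes are followed would show whether the ranks themselves are blocked, which this trace of the launcher cannot tell you.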

yuffon commented 3 years ago

Have you solved this issue?