nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

Use conda compilers in dev envs? #446

Closed bryevdv closed 2 years ago

bryevdv commented 2 years ago

This past week I struggled to get a working GASNet build locally, with either the MPI or UDP conduits. After basically an entire day of frustration, I was finally able to get things working only by using the conda compilers and openmpi packages from conda-forge.

Should we just add these to the dev/test conda env files to use by default?

I am not certain whether there are any potential downsides to this, so I wanted to open this issue for discussion.
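For concreteness, a minimal sketch of what that might look like from the command line (the environment name is just illustrative; the conda-forge compilers metapackage pulls in the gcc/g++ activation packages):

conda create -n legate-dev -c conda-forge python=3.8 compilers openmpi
conda activate legate-dev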

cc @magnatelee @ipdemes @marcinz @trxcllnt

leofang commented 2 years ago

cc: @m3vaz 🙂 (also @Ethyling is very experienced, he's done the migration to conda compiler for RAPIDS)

manopapad commented 2 years ago

Check out https://github.com/conda-forge/ctng-compiler-activation-feedstock/issues/80

lightsighter commented 2 years ago

This past week I struggled to get a working GASNet build locally, with either the MPI or UDP conduits.

What exactly were the problems that you were encountering in building GASNet?

bryevdv commented 2 years ago

Where to begin...

At various points:

dev38 ❯ /home/bryan/work/legate.core/install38/bin/legate /home/bryan/work/cunumeric/tests/integration/test_fill.py -cunumeric:test --cpu-bind "0,1,2,3" --cpus 4 --launcher=mpirun --nodes 1
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
/home/bryan/work/legate.core/install38/bin/bind.sh: line 106: 96671 Aborted                 numactl "$@"
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[44450,1],0]
  Exit code:    134
--------------------------------------------------------------------------

and

dev38 ❯ /home/bryan/work/legate.core/install38/bin/legate /home/bryan/work/cunumeric/tests/integration/test_fill.py -cunumeric:test --cpu-bind "0,1,2,3" --cpus 4 --launcher=mpirun
[mpiexec@bvhp] match_arg (utils/args/args.c:163): unrecognized argument npernode
[mpiexec@bvhp] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@bvhp] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@bvhp] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
[mpiexec@bvhp] main (ui/mpich/mpiexec.c:148): error parsing parameters

lots of

Error: Cannot detect node-local rank

Also actual build issues with some combinations (but not others), even though there were very clearly very new compilers on the paths (and also after lots of other code that required a new C++ compiler had already built 🤷):

/home/bryan/work/legate.core/legion/runtime/runtime.mk:859: *** Legion requires a C++ compiler that supports at least C++11.  Stop.

I guess not all MPIs are created equal?

dev38 ❯ /home/bryan/work/legate.core/install38/bin/legate /home/bryan/work/cunumeric/tests/integration/test_fill.py -cunumeric:test --cpu-bind "0,1,2,3" --cpus 4 --launcher=mpirun
[mpiexec@bvhp] match_arg (utils/args/args.c:163): unrecognized argument npernode
[mpiexec@bvhp] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@bvhp] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@bvhp] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
[mpiexec@bvhp] main (ui/mpich/mpiexec.c:148): error parsing parameters

I also got sucked down a rabbit hole trying to figure out https://github.com/nv-legate/legate.core/issues/294 and experimenting with passing the right args (since I had no idea or indication that a special driver was required):

WARNING: Using GASNet's udp-conduit, which exists for portability convenience.
WARNING: This system appears to contain recognized network hardware: InfiniBand IBV
WARNING: which is supported by a GASNet native conduit, although
WARNING: it was not detected at configure time (missing drivers?)
WARNING: You should *really* use the high-performance native GASNet conduit
WARNING: if communication performance is at all important in this program run.
[0 - 7f60eb4dd000]    0.000090 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
[0 - 7f60e880c000]    0.880692 {6}{coll}: MPI has not been initialized, it should be initialized by GASNet
[0 - 7f60e880c000]    0.881364 {5}{legate}: Legate called abort in core/comm/coll.cc at line 251 in function collInit
Signal 6 received by node 0, process 97060 (thread 7f60e880c000) - obtaining backtrace
lightsighter commented 2 years ago

Let's start with the build issues:

/home/bryan/work/legate.core/legion/runtime/runtime.mk:859: *** Legion requires a C++ compiler that supports at least C++11. Stop.

That test error is coming from Legion, which is doing a really simple test that the compiler accepts a -std=c++11 flag and returns a zero error code when it is passed. If the compiler can't do that, then it is a very strange C++ compiler and I'd like to know what kind of compiler it is that can't accept that. https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/runtime.mk#L851
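In shell terms the check boils down to roughly the following, where $(CXX) stands for the compiler being tested (a paraphrase for illustration; the authoritative version is the makefile line linked above):

$(CXX) -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $?
# a compiler that supports C++11 should print 0 here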

I guess not all MPIs are created equal?

That's correct. The MPI API is standardized, but in general you can't mix and match dynamic libraries and launchers from different versions or implementations of MPI, as there is no standardized mechanism by which they are required to bootstrap themselves. Most versions of MPI (as well as other networking layers like GASNet) rely on passing information through the user's argv and then rewriting it in order to bootstrap themselves. If you look at the spec for MPI_Init, you'll see in the notes that they say you shouldn't really look at argv before calling MPI_Init: https://www.mpich.org/static/docs/v3.3/www3/MPI_Init.html Welcome to the wild west of high-performance networking.
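If you want to sanity-check that the launcher and the MPI library actually come from the same installation, something like this usually tells you (illustrative only; the wrapper flag differs between implementations, -show for MPICH and --showme for Open MPI):

which mpirun && mpirun --version | head -1
mpicc -show 2> /dev/null || mpicc --showme 2> /dev/null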

[0 - 7f60e880c000]    0.880692 {6}{coll}: MPI has not been initialized, it should be initialized by GASNet
[0 - 7f60e880c000]    0.881364 {5}{legate}: Legate called abort in core/comm/coll.cc at line 251 in function collInit

@eddy16112 can you look at this ^^^^ Seems like we're trying to set up something in your collective library before the network is initialized, which is surprising.

At various points:

Right, so these errors look to me like Legate or something else is messing with the arguments before they get passed into GASNet's initialization routine for bootstrapping and rewriting. This is something that doesn't really happen with the other conduits where GASNet can use the PMIx library to bootstrap itself. It might also just be the case that we need to use GASNet's spawner for the UDP conduit.
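For what it's worth, the udp-conduit normally bootstraps through GASNet's own spawner environment variables rather than mpirun (this is from memory of the udp-conduit README, so treat the exact settings as an assumption and check that README for the authoritative list):

export GASNET_SPAWNFN=L                  # fork the worker processes locally
# or, for the ssh spawner that produced the SSH_SERVERS error above:
# export GASNET_SPAWNFN=S
# export SSH_SERVERS="localhost localhost"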

Let me also ask a side question: what is the goal in getting the UDP conduit to run? Is it for CI or is it because we've got an actual customer that wants to use it?

bryevdv commented 2 years ago

If the compiler can't do that, then it is a very strange C++ compiler and I'd like to know what kind of compiler it is that can't accept that.

No idea, this is an Ubuntu 20.04 install and AFAIK the only two compilers on the system are:

~/work/cunumeric new_install*
env38 ❯ /usr/bin/g++ --version  
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

~/work/cunumeric new_install*
env38 ❯ ~/anaconda3/envs/env38/bin/g++ --version
g++ (conda-forge gcc 10.3.0-16) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Let me also ask a side question: what is the goal in getting the UDP conduit to run? Is it for CI or is it because we've got an actual customer that wants to use it?

I just needed any conduit and was trying with both MPI and UDP since I don't have IB on this workstation.

Hi all, I am trying to work on CPU/GPU sharding for the test runner, using --cpu-bind etc. It seems that as soon as you add --cpu-bind etc., GASNet is required.

dev38 ❯ legate test_fill.py -cunumeric:test --cpu-bind "0,1,2,3" --cpus 4               
....
Exception: Could not detect a supported GASNet conduit

So I have embarked on trying to build and run with GASNet, but so far have had no success getting a test to run.

Eventually this will be for tests on a real cluster with real hardware. Currently I am just trying to do local dev. I did eventually get MPI to work (with conda compilers).

eddy16112 commented 2 years ago

I think you do not need the UDP conduit and can just use the MPI conduit if you do not have IB.

[0 - 7f60e880c000]    0.880692 {6}{coll}: MPI has not been initialized, it should be initialized by GASNet
[0 - 7f60e880c000]    0.881364 {5}{legate}: Legate called abort in core/comm/coll.cc at line 251 in function collInit

I guess you see this error because you are using the UDP conduit, so GASNet does not call MPI_Init_thread to initialize MPI.

bryevdv commented 2 years ago

I think you do not need UDP conduit

Right, I only tried UDP after I failed to get MPI working at first. Since things seem to be getting a little sidetracked by UDP, I think I should reiterate: I only eventually got MPI to work once I switched to conda compilers, and that was actually the motivation for the question in this issue.

manopapad commented 2 years ago

Since there seem to be unrelated issues with the UDP conduit, I would prefer to focus on the issues with building the MPI conduit. @bryevdv What was the error when you tried to compile the MPI conduit with non-conda compilers? And what compilers and MPI installation were you using in that case?

Regarding the bigger question of switching to the conda compilers, I would be interested to know how @Ethyling dealt with the environment pollution issue described in conda-forge/ctng-compiler-activation-feedstock#80.

bryevdv commented 2 years ago

@bryevdv What was the error when you tried to compile the MPI conduit with non-conda compilers? And what compilers and MPI installation were you using in that case?

There were so many problems all day that unfortunately I can't really give you a good answer. AFAIK the only other compiler on the system is the one in /usr/bin listed above. Also, most of the issues were not build errors, but various runtime failures (also shown above).

lightsighter commented 2 years ago

What is the output of running this command with the compilers above, where $(CXX) is the absolute path to the compiler?

$(CXX) -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $$?
bryevdv commented 2 years ago

@lightsighter

~/work/bokeh/examples/plotting/file branch-3.0 ⇣ 14s
env38 ❯ ~/anaconda3/envs/env38/bin/g++ -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $$ 
2076053

~/work/bokeh/examples/plotting/file branch-3.0 ⇣
env38 ❯ /usr/bin/g++ -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $$                   
2076053
lightsighter commented 2 years ago

Ok, what error message are you getting if you just run this?

$(CXX) -x c++ -std=c++11 -c /dev/null -o /dev/null
bryevdv commented 2 years ago

None:

~/work/cunumeric bryanv/cpu_gpu_binding*
gas38 ❯ /usr/bin/g++ -x c++ -std=c++11 -c /dev/null -o /dev/null                               

~/work/cunumeric bryanv/cpu_gpu_binding*
gas38 ❯ ~/anaconda3/envs/dev38/bin/g++ -x c++ -std=c++11 -c /dev/null -o /dev/null             
lightsighter commented 2 years ago

And yet, they are returning a non-zero error code. That is just baffling... what shell are you using?

bryevdv commented 2 years ago

@lightsighter did you mean $? (the last return code) above, rather than $$ ? Those return zero:

bryan@bvhp:~/work/cunumeric$ /usr/bin/g++ -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $?
0
bryan@bvhp:~/work/cunumeric$ ~/anaconda3/envs/dev38/bin/g++ -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $?
0

Also, zsh, but I get the same results in bash.

lightsighter commented 2 years ago

Ok, I added the extra $ as an escape for use in a makefile. I can't explain how you are getting this error then:

/home/bryan/work/legate.core/legion/runtime/runtime.mk:859: *** Legion requires a C++ compiler that supports at least C++11.  Stop.
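To make the escaping concrete, a minimal illustration (the PID is just whatever the shell in the transcripts above happened to have):

echo $$        # in an interactive shell, $$ is the shell's PID, e.g. 2076053
g++ -x c++ -std=c++11 -c /dev/null -o /dev/null 2> /dev/null; echo $?   # prints 0
# in a make recipe the doubled $$ reaches the shell as a single $, so 'echo $$?'
# there prints the compiler's exit status rather than a PID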
manopapad commented 2 years ago

@bryevdv We're now pulling the compilers from conda in our default suggested development environment. Is this issue obsolete?

bryevdv commented 2 years ago

Yes, I think this can be closed now.