vineetbansal / wbi

Ray command execution on Della #10

Open vineetbansal opened 11 months ago

vineetbansal commented 11 months ago

EDIT: Handle after tackling all other issues.

We have a ray.jinja similar to hello.jinja that starts a ray cluster remotely. We need to test it out on Della. Once the ray cluster has started, we should be able to run remote ray commands like these:

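    import ray
    import mymodule  # local package that exists only on your machine

    # Connect through the forwarded port and ship 'mymodule' to the cluster
    ray.init("ray://localhost:10001", runtime_env={"py_modules": [mymodule]})

    # At this point 'mymodule' is available on the remote node
    from mymodule.api import hello_ray, hello_gpu

    future = hello_ray.remote()
    result = ray.get(future)
    print(result)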

Note that mymodule is local code that only sits on your machine (and is not part of the wbi module).

The ray://... part should point to the head node of the Ray cluster, i.e. the compute node running the Ray head process (not the head node on Della). To find out which compute node was assigned the Ray head, look at the output of the Slurm job (the ray template prints all this information to stdout). Let's say it's della-l07g3. Then you can set up local port forwarding like so:

    ssh -J della -N -L "10001:localhost:10001" "della-l07g3"

i.e. forward port 10001 on localhost to port 10001 on della-l07g3 through the jump server della. The address passed to ray.init can then simply be ray://localhost:10001.
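If we end up scripting the tunnel rather than typing it by hand, a minimal sketch could look like the following. The helper names `tunnel_command` and `ray_address` are hypothetical (nothing like them exists in wbi yet), and the defaults just mirror the values used above:

```python
import shlex


def tunnel_command(compute_node, jump_host="della", port=10001):
    """Build the ssh argument list that forwards a local port to the Ray head node.

    Hypothetical helper: mirrors `ssh -J della -N -L 10001:localhost:10001 <node>`.
    """
    forward = f"{port}:localhost:{port}"
    return ["ssh", "-J", jump_host, "-N", "-L", forward, compute_node]


def ray_address(port=10001):
    """Address to pass to ray.init once the tunnel is up."""
    return f"ray://localhost:{port}"


if __name__ == "__main__":
    # The node name is the example from this issue; in practice, read it
    # from the Slurm job output.
    cmd = tunnel_command("della-l07g3")
    print(shlex.join(cmd))
    print(ray_address())
```

The argument list can be handed to `subprocess.Popen` as-is, which avoids shell quoting issues; the tunnel process has to stay alive for as long as the Ray client is connected.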

The current template activates an environment:

    conda activate ray

For this to work for all users, the environment has to live in a location that is readable by everyone in the group; the template can then do something like conda activate /tigress/LEIFER/path/to/conda/env.
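A quick way to sanity-check the shared environment's permissions before pointing the template at it might be something like this. It is only a sketch: `group_can_use` is a hypothetical helper, and it inspects just the mode bits of the top-level directory, not group ownership, parent directories, or the files inside.

```python
import os
import stat


def group_can_use(path):
    """Return True if `path` is a directory whose mode grants the group
    both read and execute (traverse) permission."""
    mode = os.stat(path).st_mode
    needed = stat.S_IRGRP | stat.S_IXGRP
    return stat.S_ISDIR(mode) and (mode & needed) == needed
```

For example, `group_can_use("/tigress/LEIFER/path/to/conda/env")` (with the real path substituted) should return True before we switch the template over.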