project-codeflare / codeflare-sdk

An intuitive, easy-to-use Python interface for batch resource requesting, access, job submission, and observation. Simplifying the developer's life while enabling access to high-performance compute resources, either in the cloud or on-prem.
Apache License 2.0

env parameter in DDPJobDefinition doesn't pass env variables to Ray #408

Open sutaakar opened 9 months ago

sutaakar commented 9 months ago

Describe the Bug

I want to submit a Ray job with environment variables specified; however, the provided environment variables aren't passed into Ray.

The SDK docs specify that DDPJobDefinition has an env property. I tried to pass environment variables through it:

jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"},
    env={"PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
         "PIP_TRUSTED_HOST": "some-hostname"}
)
job = jobdef.submit(cluster)

However, the submitted job didn't contain the passed environment variables.

Is this the correct way of passing environment variables using the SDK?

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK: 0.12.1
Ray image: quay.io/project-codeflare/ray:latest-py39-cu118

Steps to Reproduce the Bug

  1. Start ODH with the default science notebook
  2. Import the SDK Git repo into the notebook
  3. Open 2_basic_jobs.ipynb
  4. Add an env entry to the job definition:

    jobdef = DDPJobDefinition(
        name="mnisttest",
        script="mnist.py",
        # script="mnist_disconnected.py", # training script for disconnected environment
        scheduler_args={"requirements": "requirements.txt"},
        env={"PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
             "PIP_TRUSTED_HOST": "some-hostname"}
    )
    job = jobdef.submit(cluster)
  5. Run the notebook until you submit the job
  6. Query the Ray REST API to get the submitted job definition, e.g. curl -X GET -i 'http://<dashboard_hostname>/api/jobs/'
  7. Check the response: the env variables are missing from the submitted job
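The check in steps 6 and 7 can also be scripted. A minimal sketch of that verification logic, assuming a JSON response shaped like the Ray jobs API output shown later in this issue (the sample payload below is a hypothetical abridged response mimicking the bug, not a real capture):

```python
import json

def missing_env_vars(jobs_json: str, expected: dict) -> dict:
    """Return the expected env vars that are absent from every submitted job."""
    jobs = json.loads(jobs_json)
    found = {}
    for job in jobs:
        env_vars = job.get("runtime_env", {}).get("env_vars") or {}
        found.update({k: v for k, v in env_vars.items() if expected.get(k) == v})
    return {k: v for k, v in expected.items() if k not in found}

# Hypothetical abridged response reproducing the bug: runtime_env has no env_vars.
sample = json.dumps([{
    "type": "SUBMISSION",
    "submission_id": "raysubmit_example",
    "runtime_env": {"pip": {"packages": ["torchvision==0.12.0"], "pip_check": False}},
}])

expected = {
    "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
    "PIP_TRUSTED_HOST": "some-hostname",
}
print(missing_env_vars(sample, expected))
# Both vars are reported missing, matching the observation in step 7.
```

The same function returns an empty dict when run against a response that matches the expected behavior below.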

What Have You Already Tried to Debug the Issue?

N/A

Expected Behavior

The submitted job contains the environment variables, for example:

{
  "type": "SUBMISSION",
  "job_id": null,
  "submission_id": "raysubmit_qtYVHfiyC7VhAPN7",
  "driver_info": null,
  "status": "FAILED",
  "entrypoint": "python /home/ray/jobs/mnist.py",
  "message": "Job entrypoint command failed with exit code 2, last available logs (truncated to 20,000 chars):\npython: can't open file '/home/ray/jobs/mnist.py': [Errno 2] No such file or directory\n",
  "error_type": null,
  "start_time": 1700576474095,
  "end_time": 1700576476706,
  "metadata": null,
  "runtime_env": {
    "pip": {
      "packages": ["pytorch_lightning==1.5.10", "ray_lightning", "torchmetrics==0.9.1", "torchvision==0.12.0"],
      "pip_check": false
    },
    "env_vars": {
      "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
      "PIP_TRUSTED_HOST": "some-hostname"
    }
  },
  "driver_agent_http_address": "http://10.129.3.14:52365",
  "driver_node_id": "c3af4445c3cabfdc2291fb2fd6393da5850717eb3fd2aaeda3abe5f8"
}

Screenshots, Console Output, Logs, etc.

Affected Releases

SDK 0.12.1

Additional Context


KPostOffice commented 9 months ago

That env is passed directly to the ddp function in torchx.components. runtime_env is a Ray-specific option that is populated by torchx here, and it does not populate the env field. Is it possible that these env variables are available during the job but not tracked by the Ray API, because they are part of the torch job definition rather than part of the runtime_env in the Ray job? Or are you seeing other behavior that would indicate the env variables are not available?
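One way to answer that question is to have the training script itself report the variables at startup, so the job logs show whether they were set regardless of what the Ray API tracks. A minimal sketch (the variable names come from the report above; placing this at the top of mnist.py is a suggestion, not part of the original script):

```python
import os

def report_env(names):
    """Look up each requested env var, substituting a marker when it is unset."""
    return {name: os.environ.get(name, "<not set>") for name in names}

# At the top of the training script, before any pip-dependent work:
seen = report_env(["PIP_INDEX_URL", "PIP_TRUSTED_HOST"])
for name, value in seen.items():
    print(f"{name}={value}")
```

If the job logs show real values here while the Ray API response omits env_vars, the variables are available but untracked; if they show "<not set>", they never reached the job.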

sutaakar commented 9 months ago

My use case is this: submit a job that installs the dependencies defined in requirements.txt using pip (and then runs the mnist.py script). Pip should use the dedicated index location provided via the env variables PIP_INDEX_URL and PIP_TRUSTED_HOST.

Using the DDPJobDefinition mentioned above I wasn't able to achieve this, as the env variables weren't picked up by pip. Pip used the default index location.

How can I submit a job while providing the env variables PIP_INDEX_URL and PIP_TRUSTED_HOST to pip?

KPostOffice commented 9 months ago

This might be a bug in torchx. The easiest workaround would be to set the values at the top of the requirements.txt file:

--trusted-host doubly.so
--index-url https://doubly.so/pub/py/simple
<packageA>
<packageB>
...