project-codeflare / codeflare-sdk

An intuitive, easy-to-use python interface for batch resource requesting, access, job submission, and observation. Simplifying the developer's life while enabling access to high-performance compute resources, either in the cloud or on-prem.
Apache License 2.0

FIPS issue submitting DDPJobDefinition job from the CodeFlare Notebook #357

Open jbusche opened 10 months ago

jbusche commented 10 months ago

Describe the Bug

On non-FIPS, when you submit the guided-demos/2_basic_jobs DDPJobDefinition mnisttest, the job is scheduled as pending, then switches to running and then completes.
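For reference, the submission cell from that demo (also visible in the traceback below) looks like this; the job.status() call is just the usual follow-up for watching the job move from pending to running to completed:

from codeflare_sdk.job.jobs import DDPJobDefinition

# Job definition used by guided-demos/2_basic_jobs.ipynb; "cluster" is the
# Ray cluster object created earlier in the notebook.
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"}
)
job = jobdef.submit(cluster)  # packages a /tmp/torchx_workspace* dir and submits it to Ray
job.status()                  # pending -> running -> completed on a non-FIPS cluster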

On a FIPS cluster, I'm seeing the following error (I'll post the entire output in a comment below):

Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     63 try:
---> 64     working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
     65 except ValueError:  # working_dir is not a directory
....
ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK:

pip list |grep codeflare-sdk
codeflare-sdk            0.8.0

MCAD: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Instascale: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Codeflare Operator: v1.0.0-rc.1
Other: OpenShift 4.12.22 with FIPS enabled. All master and worker nodes report FIPS enabled, for example:

ssh core@master0.jimfips.cp.fyre.ibm.com cat /proc/sys/crypto/fips_enabled
1
and
ssh core@worker0.jimfips.cp.fyre.ibm.com cat /proc/sys/crypto/fips_enabled
1

Steps to Reproduce the Bug

  1. Create a FIPS cluster
  2. Install ODH 1.9.0 and CodeFlare v1.0.0-rc1 as usual
  3. Install the kfdefs as usual
  4. Launch the codeflare notebook as usual
  5. Run the guided-demos/2_basic_jobs.ipynb notebook - it works up to the cell that submits the job, then fails with the Issue with path: /tmp/torchx_workspacel83oit3q error (a sketch of the relevant notebook cells follows this list).
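For context, the notebook cells that lead up to the failing submission look roughly like the sketch below. This is a paraphrase of the guided demo, not a copy; the exact ClusterConfiguration parameter names can differ between SDK releases, and the resource values are illustrative.

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Request a small Ray cluster through MCAD (illustrative values)
cluster = Cluster(ClusterConfiguration(
    name="mnisttest",
    namespace="default",
    num_workers=2,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    num_gpus=0,
))
cluster.up()          # creates the AppWrapper / RayCluster
cluster.wait_ready()  # blocks until the Ray cluster is up

# The next cell, jobdef.submit(cluster), is where the FIPS error appears.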

What Have You Already Tried to Debug the Issue?

I tried it on non-FIPS and it worked fine. I also tried a second FIPS cluster to make sure it wasn't just a bad cluster.

Expected Behavior

I expected the job to be scheduled, run and complete successfully.

Screenshots, Console Output, Logs, etc.

More detail of the codeflare-notebook error message will be posted below.

Affected Releases

main

Additional Context


jbusche commented 10 months ago

Full message:

/opt/app-root/lib64/python3.8/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.jimfips.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
The Ray scheduler does not support port mapping.
Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     63 try:
---> 64     working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
     65 except ValueError:  # working_dir is not a directory

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:474, in get_uri_for_directory(directory, excludes)
    472     raise ValueError(f"directory {directory} must be an existing directory")
--> 474 hash_val = _hash_directory(directory, directory, _get_excludes(directory, excludes))
    476 return "{protocol}://{pkg_name}.zip".format(
    477     protocol=Protocol.GCS.value, pkg_name=RAY_PKG_PREFIX + hash_val.hex()
    478 )

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:175, in _hash_directory(root, relative_path, excludes, logger)
    174 excludes = [] if excludes is None else [excludes]
--> 175 _dir_travel(root, excludes, handler, logger=logger)
    176 return hash_val

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:128, in _dir_travel(path, excludes, handler, logger)
    127     logger.error(f"Issue with path: {path}")
--> 128     raise e
    129 if path.is_dir():

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:125, in _dir_travel(path, excludes, handler, logger)
    124 try:
--> 125     handler(path)
    126 except Exception as e:

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:152, in _hash_directory.<locals>.handler(path)
    151 def handler(path: Path):
--> 152     md5 = hashlib.md5()
    153     md5.update(str(path.relative_to(relative_path)).encode())

ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In [6], line 6
      1 jobdef = DDPJobDefinition(
      2     name="mnisttest",
      3     script="mnist.py",
      4     scheduler_args={"requirements": "requirements.txt"}
      5 )
----> 6 job = jobdef.submit(cluster)

File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:166, in DDPJobDefinition.submit(self, cluster)
    165 def submit(self, cluster: "Cluster" = None) -> "Job":
--> 166     return DDPJob(self, cluster)

File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:174, in DDPJob.__init__(self, job_definition, cluster)
    172 self.cluster = cluster
    173 if self.cluster:
--> 174     self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
    175 else:
    176     self._app_handle = torchx_runner.schedule(
    177         job_definition._dry_run_no_cluster()
    178     )

File /opt/app-root/lib64/python3.8/site-packages/torchx/runner/api.py:278, in Runner.schedule(self, dryrun_info)
    271 with log_event(
    272     "schedule",
    273     scheduler,
    274     app_image=app_image,
    275     runcfg=json.dumps(cfg) if cfg else None,
    276 ) as ctx:
    277     sched = self._scheduler(scheduler)
--> 278     app_id = sched.schedule(dryrun_info)
    279     app_handle = make_app_handle(scheduler, self._name, app_id)
    280     app = none_throws(dryrun_info._app)

File /opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py:239, in RayScheduler.schedule(self, dryrun_info)
    237 # 1. Submit Job via the Ray Job Submission API
    238 try:
--> 239     job_id: str = client.submit_job(
    240         submission_id=cfg.app_id,
    241         # we will pack, hash, zip, upload, register working_dir in GCS of ray cluster
    242         # and use it to configure your job execution.
    243         entrypoint="python3 ray_driver.py",
    244         runtime_env=runtime_env,
    245     )
    247 finally:
    248     if dirpath.startswith(tempfile.gettempdir()):

File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/job/sdk.py:203, in JobSubmissionClient.submit_job(self, entrypoint, job_id, runtime_env, metadata, submission_id, entrypoint_num_cpus, entrypoint_num_gpus, entrypoint_resources)
    200 metadata = metadata or {}
    201 metadata.update(self._default_metadata)
--> 203 self._upload_working_dir_if_needed(runtime_env)
    204 self._upload_py_modules_if_needed(runtime_env)
    206 # Run the RuntimeEnv constructor to parse local pip/conda requirements files.

File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py:398, in SubmissionClient._upload_working_dir_if_needed(self, runtime_env)
    390 def _upload_fn(working_dir, excludes, is_file=False):
    391     self._upload_package_if_needed(
    392         working_dir,
    393         include_parent_dir=False,
    394         excludes=excludes,
    395         is_file=is_file,
    396     )
--> 398 upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:68, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     66 package_path = Path(working_dir)
     67 if not package_path.exists() or package_path.suffix != ".zip":
---> 68     raise ValueError(
     69         f"directory {package_path} must be an existing "
     70         "directory or a zip package"
     71     )
     73 pkg_uri = get_uri_for_package(package_path)
     74 try:

ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package
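The root cause is visible in the middle of the traceback: Ray's packaging code hashes the TorchX working directory with hashlib.md5(), which OpenSSL rejects outright when FIPS mode is enforced. The upload is then skipped, and the later "must be an existing directory or a zip package" error is only a symptom. Below is a minimal sketch of the FIPS-tolerant hashing pattern; it illustrates the general fix, not necessarily the exact change in the patched Ray wheel mentioned in the next comment, and usedforsecurity=False requires Python 3.9+ (or a vendor Python that backports it).

import hashlib
import sys

def fips_safe_md5(data: bytes = b""):
    # Plain hashlib.md5() raises
    # "ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS"
    # on FIPS-enforcing hosts. Marking the digest as not used for security
    # (Python 3.9+) lets OpenSSL allow it for plain content hashing.
    if sys.version_info >= (3, 9):
        return hashlib.md5(data, usedforsecurity=False)
    # Older interpreters: fall back to a FIPS-approved digest.
    return hashlib.sha256(data)

digest = fips_safe_md5(b"some/relative/path").hexdigest()
print(digest)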
jbusche commented 9 months ago

@KPostOffice created a special ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl that, when installed in my CodeFlare Notebook running on the FIPS cluster, allowed the job to run successfully:

While on the FIPS cluster:

  1. On the CodeFlare notebook from the ODH dashboard, install Kevin's custom wheel file
    pip install ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl 
  2. Run the regular 2_basic_jobs guided demo; the job submission to Ray now succeeds without the /tmp path error:
    The Ray scheduler does not support port mapping.
    2023-10-20 19:20:56,766 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_2fd8353f5ce24329.zip.
    2023-10-20 19:20:56,767 INFO packaging.py:530 -- Creating a file package for local directory '/tmp/torchx_workspacem7zl9kwn'.
  3. The job completes as expected:
    AppStatus:
    msg: !!python/object/apply:ray.dashboard.modules.job.common.JobStatus
    - SUCCEEDED
    num_restarts: -1
    roles:
    - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 4
      structured_error_msg: <NONE>
    role: ray
    state: SUCCEEDED (4)
    structured_error_msg: <NONE>
    ui_url: null
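As a quick sanity check, this small probe (not part of the demo) shows from inside the notebook whether the Python environment will hit the md5 restriction before anything is submitted:

import hashlib

try:
    hashlib.md5(b"probe")
    print("md5 allowed - this environment should not hit the FIPS hashing error")
except ValueError as err:
    print("FIPS mode blocks md5:", err)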