jbusche opened 10 months ago
Full message:
/opt/app-root/lib64/python3.8/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.jimfips.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
The Ray scheduler does not support port mapping.
Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
63 try:
---> 64 working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
65 except ValueError: # working_dir is not a directory
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:474, in get_uri_for_directory(directory, excludes)
472 raise ValueError(f"directory {directory} must be an existing directory")
--> 474 hash_val = _hash_directory(directory, directory, _get_excludes(directory, excludes))
476 return "{protocol}://{pkg_name}.zip".format(
477 protocol=Protocol.GCS.value, pkg_name=RAY_PKG_PREFIX + hash_val.hex()
478 )
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:175, in _hash_directory(root, relative_path, excludes, logger)
174 excludes = [] if excludes is None else [excludes]
--> 175 _dir_travel(root, excludes, handler, logger=logger)
176 return hash_val
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:128, in _dir_travel(path, excludes, handler, logger)
127 logger.error(f"Issue with path: {path}")
--> 128 raise e
129 if path.is_dir():
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:125, in _dir_travel(path, excludes, handler, logger)
124 try:
--> 125 handler(path)
126 except Exception as e:
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:152, in _hash_directory.<locals>.handler(path)
151 def handler(path: Path):
--> 152 md5 = hashlib.md5()
153 md5.update(str(path.relative_to(relative_path)).encode())
ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In [6], line 6
1 jobdef = DDPJobDefinition(
2 name="mnisttest",
3 script="mnist.py",
4 scheduler_args={"requirements": "requirements.txt"}
5 )
----> 6 job = jobdef.submit(cluster)
File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:166, in DDPJobDefinition.submit(self, cluster)
165 def submit(self, cluster: "Cluster" = None) -> "Job":
--> 166 return DDPJob(self, cluster)
File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:174, in DDPJob.__init__(self, job_definition, cluster)
172 self.cluster = cluster
173 if self.cluster:
--> 174 self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
175 else:
176 self._app_handle = torchx_runner.schedule(
177 job_definition._dry_run_no_cluster()
178 )
File /opt/app-root/lib64/python3.8/site-packages/torchx/runner/api.py:278, in Runner.schedule(self, dryrun_info)
271 with log_event(
272 "schedule",
273 scheduler,
274 app_image=app_image,
275 runcfg=json.dumps(cfg) if cfg else None,
276 ) as ctx:
277 sched = self._scheduler(scheduler)
--> 278 app_id = sched.schedule(dryrun_info)
279 app_handle = make_app_handle(scheduler, self._name, app_id)
280 app = none_throws(dryrun_info._app)
File /opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py:239, in RayScheduler.schedule(self, dryrun_info)
237 # 1. Submit Job via the Ray Job Submission API
238 try:
--> 239 job_id: str = client.submit_job(
240 submission_id=cfg.app_id,
241 # we will pack, hash, zip, upload, register working_dir in GCS of ray cluster
242 # and use it to configure your job execution.
243 entrypoint="python3 ray_driver.py",
244 runtime_env=runtime_env,
245 )
247 finally:
248 if dirpath.startswith(tempfile.gettempdir()):
File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/job/sdk.py:203, in JobSubmissionClient.submit_job(self, entrypoint, job_id, runtime_env, metadata, submission_id, entrypoint_num_cpus, entrypoint_num_gpus, entrypoint_resources)
200 metadata = metadata or {}
201 metadata.update(self._default_metadata)
--> 203 self._upload_working_dir_if_needed(runtime_env)
204 self._upload_py_modules_if_needed(runtime_env)
206 # Run the RuntimeEnv constructor to parse local pip/conda requirements files.
File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py:398, in SubmissionClient._upload_working_dir_if_needed(self, runtime_env)
390 def _upload_fn(working_dir, excludes, is_file=False):
391 self._upload_package_if_needed(
392 working_dir,
393 include_parent_dir=False,
394 excludes=excludes,
395 is_file=is_file,
396 )
--> 398 upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:68, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
66 package_path = Path(working_dir)
67 if not package_path.exists() or package_path.suffix != ".zip":
---> 68 raise ValueError(
69 f"directory {package_path} must be an existing "
70 "directory or a zip package"
71 )
73 pkg_uri = get_uri_for_package(package_path)
74 try:
ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package
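The second ValueError is only a symptom; the root cause is the earlier `ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS`: Ray's `_hash_directory` fingerprints the working directory with `hashlib.md5()`, which FIPS-mode OpenSSL rejects. A minimal sketch of the usual workaround (the helper name is mine, not Ray's actual code): on Python 3.9+ (and some vendor-patched 3.8 builds) the digest can be marked non-security with `usedforsecurity=False`; otherwise a FIPS-approved digest such as SHA-256 works.

```python
import hashlib

def working_dir_digest(data: bytes) -> str:
    # MD5 here is only a content fingerprint, never a security primitive.
    # usedforsecurity=False (Python 3.9+, and some patched 3.8 builds)
    # tells OpenSSL exactly that, so FIPS mode permits the call.
    try:
        return hashlib.md5(data, usedforsecurity=False).hexdigest()
    except (TypeError, ValueError):
        # The interpreter doesn't accept the keyword (TypeError) or FIPS
        # still blocks MD5 (ValueError): fall back to FIPS-approved SHA-256.
        return hashlib.sha256(data).hexdigest()

print(working_dir_digest(b"mnist.py"))
```

Since the digest only dedupes uploaded packages in GCS, swapping the algorithm is safe as long as client and cluster agree on it.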
@KPostOffice built a patched ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl. With that wheel installed in my CodeFlare Notebook running on the FIPS cluster, the job succeeded:
While on the FIPS cluster:
pip install ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
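After installing the wheel, it is worth confirming which Ray build the notebook kernel actually resolves before resubmitting (a generic sanity check, nothing here is specific to the patched wheel):

```python
import importlib.metadata

try:
    # Report the Ray distribution the current interpreter will import.
    print("ray", importlib.metadata.version("ray"))
except importlib.metadata.PackageNotFoundError:
    print("ray is not installed in this environment")
```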
The Ray scheduler does not support port mapping.
2023-10-20 19:20:56,766 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_2fd8353f5ce24329.zip.
2023-10-20 19:20:56,767 INFO packaging.py:530 -- Creating a file package for local directory '/tmp/torchx_workspacem7zl9kwn'.
AppStatus:
msg: !!python/object/apply:ray.dashboard.modules.job.common.JobStatus
- SUCCEEDED
num_restarts: -1
roles:
- replicas:
- hostname: <NONE>
id: 0
role: ray
state: !!python/object/apply:torchx.specs.api.AppState
- 4
structured_error_msg: <NONE>
role: ray
state: SUCCEEDED (4)
structured_error_msg: <NONE>
ui_url: null
Describe the Bug
On non-FIPS, when you submit the guided-demos/2_basic_jobs DDPJobDefinition mnisttest, the job is scheduled as pending, then switches to running and then completes.
On a FIPS cluster, I'm seeing the following error instead (I'll post the entire output in a comment):
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Codeflare SDK:
MCAD: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Instascale: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Codeflare Operator: v1.0.0-rc.1
Other: OpenShift 4.12.22 with FIPS enabled. All master and worker nodes report FIPS enabled, for example:
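A typical way to confirm FIPS status on a node (the node name is a placeholder; the `/proc` flag is the kernel's own report):

```shell
# On a RHEL/CoreOS node, the kernel reports FIPS mode directly
# (1 = enabled, 0 = disabled; the file is absent on non-Linux hosts).
cat /proc/sys/crypto/fips_enabled 2>/dev/null || echo "fips_enabled flag not present"

# From outside the node, the same check via oc debug:
# oc debug node/<node-name> -- chroot /host cat /proc/sys/crypto/fips_enabled
```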
Steps to Reproduce the Bug
Submit the guided-demos/2_basic_jobs DDPJobDefinition mnisttest on a FIPS-enabled cluster; the submission fails with:
Issue with path: /tmp/torchx_workspacel83oit3q
What Have You Already Tried to Debug the Issue?
I tried it on non-FIPS and it worked fine. I also tried a second FIPS cluster to make sure it wasn't just a bad cluster.
Expected Behavior
I expected the job to be scheduled, run and complete successfully.
Screenshots, Console Output, Logs, etc.
More detail of the codeflare-notebook error message will be posted below.
Affected Releases
main