ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Subject: Ray Initialization Issue - Dashboard Failure and Kernel Crash #41521

Closed mkolyaei closed 3 months ago

mkolyaei commented 9 months ago

What happened + What you expected to happen

Dear Ray team,

When I initialize Ray with ray.init(local_mode=True), the Ray dashboard fails to start with return code 1. I expected Ray initialization to complete without hitting 'pydantic' errors and without crashing the kernel.

I appreciate your help on this matter. Regards, Mary

Versions / Dependencies

Package Version
absl-py 2.0.0
aiocache 0.12.2
aiofiles 23.2.1
aiohttp 3.9.1
aiohttp-cors 0.7.0
aiopubsub 3.0.0
aioredis 2.0.1
aiosignal 1.3.1
aiosmtplib 3.0.1
alembic 1.12.1
aniso8601 7.0.0
annotated-types 0.6.0
anyio 4.1.0
appnope 0.1.3
APScheduler 3.10.4
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
astunparse 1.6.3
async-lru 2.0.4
async-timeout 4.0.3
attrs 23.1.0
Babel 2.13.1
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 6.1.0
blessed 1.20.0
Bottleneck 1.3.7
cachetools 5.3.2
certifi 2023.11.17
cffi 1.16.0
chardet 3.0.4
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
colorful 0.5.5
colorlog 6.7.0
comm 0.2.0
contourpy 1.2.0
cssutils 2.9.0
cycler 0.12.1
dataframe-image 0.2.2
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
distlib 0.3.7
dm-tree 0.1.8
docopt 0.6.2
dotmap 1.3.30
entrypoints 0.4
exceptiongroup 1.2.0
executing 2.0.1
Farama-Notifications 0.0.4
fastjsonschema 2.19.0
filelock 3.13.1
flatbuffers 23.5.26
fonttools 4.45.1
fqdn 1.5.1
frozenlist 1.4.0
fsspec 2023.10.0
gast 0.5.4
google-api-core 2.14.0
google-auth 2.23.4
google-auth-oauthlib 1.1.0
google-pasta 0.2.0
googleapis-common-protos 1.61.0
gpustat 1.1.1
GPUtil 1.4.0
graphene 2.1.9
graphene-sqlalchemy 2.3.0
graphql-core 2.3.2
graphql-relay 2.0.1
graphql-server-core 2.0.0
graphql-ws 0.4.4
greenlet 3.0.1
grpcio 1.59.3
gunicorn 21.2.0
gym 0.26.2
gym-notices 0.0.8
gymnasium 0.28.1
gymnasium-notices 0.0.1
h11 0.8.1
h2 3.2.0
h5py 3.10.0
hiredis 2.2.3
hpack 3.0.0
html2image 2.0.4.3
html5tagger 1.3.0
httpcore 0.3.0
httptools 0.6.1
hyperframe 5.2.0
idna 2.10
imageio 2.33.0
importlib-metadata 6.8.0
importlib-resources 6.1.1
install 1.3.5
ipykernel 6.26.0
ipython 8.18.0
ipython-genutils 0.2.0
ipywidgets 8.1.1
isoduration 20.11.0
jax-jumpy 1.0.0
jaxtyping 0.2.23
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
json5 0.9.14
jsonpointer 2.4
jsonschema 4.20.0
jsonschema-specifications 2023.11.1
jupyter 1.0.0
jupyter_client 8.6.0
jupyter-console 6.6.3
jupyter_core 5.5.0
jupyter-events 0.9.0
jupyter-lsp 2.2.1
jupyter_server 2.10.1
jupyter_server_terminals 0.4.4
jupyterlab 4.0.9
jupyterlab_pygments 0.3.0
jupyterlab_server 2.25.2
jupyterlab-widgets 3.0.9
kaleido 0.2.1
keras 2.15.0
kiwisolver 1.4.5
lazy_loader 0.3
libclang 16.0.6
linear-operator 0.5.1
loguru 0.7.2
lxml 4.9.3
lz4 4.3.2
Mako 1.3.0
Markdown 3.5.1
markdown-it-py 3.0.0
markdown2 2.4.10
MarkupSafe 2.1.3
matplotlib 3.8.2
matplotlib-inline 0.1.6
mdurl 0.1.2
mistune 3.0.2
ml-dtypes 0.2.0
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
multipledispatch 1.0.0
mypy-extensions 1.0.0
nbclient 0.9.0
nbconvert 7.11.0
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 3.2.1
notebook 7.0.6
notebook_shim 0.2.3
numexpr 2.8.7
numpy 1.26.2
nvidia-ml-py 12.535.133
oauthlib 3.2.2
opencensus 0.11.3
opencensus-context 0.1.3
opencv-python 4.8.1.78
opt-einsum 3.3.0
optuna 3.4.0
overrides 7.4.0
packaging 23.2
pandas 2.1.3
pandocfilters 1.5.0
parso 0.8.3
passlib 1.7.4
pexpect 4.9.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.3.1
pipdeptree 2.13.1
platformdirs 4.0.0
plotly 5.18.0
prometheus-client 0.19.0
promise 2.3
prompt-toolkit 3.0.41
protobuf 4.23.4
psutil 5.9.6
ptyprocess 0.7.0
pure-eval 0.2.2
py-postgresql 1.3.0
py-spy 0.3.14
pyaml 23.9.7
pyarrow 14.0.1
pyasn1 0.5.1
pyasn1-modules 0.3.0
pycparser 2.21
pydantic 1.8.2
pydantic_core 2.14.5
Pygments 2.17.2
PyJWT 2.8.0
pyparsing 3.1.1
PyQt5 5.15.10
PyQt5-Qt5 5.15.11
PyQt5-sip 12.13.0
pyre-extensions 0.0.30
pyro-api 0.1.2
pyro-ppl 1.8.6
pytest-runner 6.0.0
python-dateutil 2.8.2
python-editor 1.0.4
python-json-logger 2.0.7
pytz 2023.3.post1
PyWavelets 1.5.0
PyYAML 6.0.1
pyzmq 25.1.1
qtconsole 5.5.1
QtPy 2.4.1
ray 2.8.0
redis 5.0.1
referencing 0.31.0
requests 2.31.0
requests-async 0.5.0
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3986 1.5.0
rfc3986-validator 0.1.1
rich 13.7.0
rpds-py 0.13.1
rsa 4.9
ruamel.yaml 0.18.5
ruamel.yaml.clib 0.2.8
Rx 1.6.3
sanic 23.6.0
sanic-compress 0.1.1
Sanic-Cors 2.2.0
Sanic-GraphQL 1.1.0
sanic-jwt 1.8.0
Sanic-Plugins-Framework 0.9.4.post1
sanic-routing 23.6.0
scikit-image 0.22.0
scipy 1.11.4
seaborn 0.13.0
Send2Trash 1.8.2
setuptools 69.0.2
Shimmy 1.3.0
singledispatch 3.7.0
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
soupsieve 2.5
SQLAlchemy 1.4.50
SQLAlchemy-Utils 0.41.1
stable-baselines3 2.2.1
stack-data 0.6.3
stripe 7.6.0
sympy 1.12
tabulate 0.9.0
tenacity 8.2.3
tensorboard 2.15.1
tensorboard-data-server 0.7.2
tensorboardX 2.6.2.2
tensorflow 2.15.0
tensorflow-estimator 2.15.0
tensorflow-io-gcs-filesystem 0.34.0
termcolor 2.3.0
terminado 0.18.0
threadpoolctl 3.2.0
tifffile 2023.9.26
tinycss2 1.2.1
tomli 2.0.1
torch 2.1.1
tornado 6.3.3
tqdm 4.66.1
tracerite 1.1.1
traitlets 5.13.0
typeguard 2.13.3
typer 0.9.0
types-python-dateutil 2.8.19.14
typing_extensions 4.8.0
typing-inspect 0.9.0
tzdata 2023.3
tzlocal 5.2
ujson 5.8.0
uri-template 1.3.0
urllib3 2.1.0
uvloop 0.19.0
virtualenv 20.24.7
wcwidth 0.2.12
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.4
websockets 12.0
Werkzeug 3.0.1
wheel 0.42.0
widgetsnbextension 4.0.9
wrapt 1.14.1
yarl 1.9.3
zipp 3.17.0

Reproduction script

import os

# number of episodes for RLlib agents
num_episodes_ray = 50000

# stop trials at least from this number of episodes
grace_period_ray = num_episodes_ray / 10

# dir for saving Ray results
ray_dir = 'ray_results'

# creating necessary dir (local_dir is not defined in this snippet)
if not os.path.exists(f"{local_dir+'/'+ray_dir}"):
    os.makedirs(f"{local_dir+'/'+ray_dir}")

import ray
from ray import tune
from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.algorithms.ppo.ppo import PPO as ppo

from ray.rllib.algorithms.ppo.ppo_learner import PPOLearnerHyperparameters

algorithms = {
    'PPO': PPOLearnerHyperparameters
}

from ray import air

config_PPO = PPOConfig()
config_PPO.framework("torch")
config_PPO.environment(env="SupplyChain")
config_PPO.log_level = "WARN"

# num_cpus, num_gpus and env are not defined in this snippet
config_PPO.rollouts(
    rollout_fragment_length=tune.grid_search([20, 200]),
    num_rollout_workers=num_cpus - 1,
    sample_async=False
)

config_PPO.resources(num_gpus=num_gpus)

# Set training parameters
config_PPO.training(
    gamma=0.99,
    grad_clip=tune.grid_search([None, 20.0]),
    train_batch_size=tune.grid_search([400, 4000]),
    lr=tune.grid_search([5e-3, 5e-4]),
    sgd_minibatch_size=tune.grid_search([64, 128])
)

config_PPO['model']['fcnet_hiddens'] = tune.grid_search([[64, 64], [128, 128]])

config_PPO['seed'] = 2023

# Set additional training parameters
config_PPO["num_sgd_iter"] = tune.grid_search([15, 30])
config_PPO["horizon"] = env.T - 1
config_PPO['evaluation_num_episodes'] = 1000

config_PPO["sgd_minibatch_size"] = tune.grid_search([64, 128])

print(config_PPO.to_dict())

trainer = config_PPO.build()

def train(algorithm, config, verbose,
          num_episodes_ray=num_episodes_ray,
          grace_period_ray=grace_period_ray,
          local_dir=local_dir,
          ray_dir=ray_dir):
    """
    Train an RLlib Agent.
    """
    # initializing Ray
    ray.shutdown()
    ray.init(log_to_driver=False)

    logger.debug(f"\n-- train --"
                 f"\nalgorithm is "
                 f"{algorithm}"
                 f"\nconfig is "
                 f"{config}")

    # https://docs.ray.io/en/latest/tune/api_docs/execution.html
    # https://docs.ray.io/en/master/tune/api_docs/schedulers.html#summary
    # https://docs.ray.io/en/master/tune/api_docs/analysis.html#id1
    analysis = tune.run(algorithm,
                        config=config,
                        metric='episode_reward_mean',
                        mode='max',
                        scheduler=ASHAScheduler(
                            time_attr='episodes_total',
                            max_t=num_episodes_ray,
                            grace_period=grace_period_ray,
                            reduction_factor=5),
                        checkpoint_freq=1,
                        keep_checkpoints_num=1,
                        checkpoint_score_attr='episode_reward_mean',
                        progress_reporter=tune.JupyterNotebookReporter(
                            overwrite=True),
                        verbose=verbose,
                        local_dir=os.getcwd()+'/'+local_dir+'/'+ray_dir)

    trial_dataframes = analysis.trial_dataframes
    best_result_df = analysis.best_result_df
    best_config = analysis.best_config
    best_checkpoint = analysis.best_checkpoint
    print(f"\ncheckpoint saved at {best_checkpoint}")

    # stopping Ray
    ray.shutdown()

    return trial_dataframes, best_result_df, best_config, best_checkpoint

def result_df_as_image(result_df, algorithm,
                       local_dir=local_dir,
                       plots_dir=plots_dir):
    """
    Visualize the (DataFrame) RLlib Agent's result as an image.
    """
    # creating necessary subdir and saving plot
    if not os.path.exists(f"{local_dir}/{plots_dir}/{algorithm}"):
        os.makedirs(f"{local_dir}/{plots_dir}/{algorithm}")
    dfi.export(result_df.iloc[:, np.r_[:3, 9]],
               f"{local_dir}/{plots_dir}/{algorithm}"
               f"/best_result_{algorithm}.png",
               table_conversion='matplotlib')

def calculate_training_time(result_df):
    """
    Calculate an RLlib Agent's training time (minutes).
    """
    return int(result_df.time_total_s[0]//60)

def calculate_training_episodes(result_df):
    """
    Calculate an RLlib Agent's training episodes (number).
    """
    return round(result_df.episodes_total[0], -3)

from ray.tune.schedulers import ASHAScheduler

def load_policy(algorithm, config, checkpoint):
    """
    Load an RLlib Agent policy.
    """
    # initializing Ray
    ray.shutdown()
    ray.init(log_to_driver=False)

    # loading policy
    trainer = algorithm(config=config)
    trainer.restore(f"{checkpoint}")
    policy = trainer.get_policy()

    # stopping Ray
    ray.shutdown()

    logger.debug(f"\n-- load_policy --"
                 f"\nalgorithm is "
                 f"{algorithm}"
                 f"\nconfig is "
                 f"{config}"
                 f"\ncheckpoint is "
                 f"{checkpoint}"
                 f"\ntrainer is "
                 f"{trainer}"
                 f"\npolicy is "
                 f"{policy}")

    return policy

def fix_best_checkpoint(checkpoint):
    """
    Fix an RLlib Agent's best checkpoint path.
    """
    # searching all checkpoints related to the best agent's result
    checkpoint_dir = checkpoint.rsplit('/', 2)[0]
    sub_dirs = [sub_dir for sub_dir in os.listdir(checkpoint_dir)
                if os.path.isdir(os.path.join(checkpoint_dir, sub_dir))]
    # finding the most recent checkpoint (the best one)
    sub_dirs.sort(reverse=True)

    # creating the fixed best checkpoint path
    fixed_checkpoint_dir = checkpoint_dir + '/' + sub_dirs[0] + '/'
    fixed_checkpoint_file = os.listdir(fixed_checkpoint_dir)[0].split('.')[0]
    best_checkpoint = fixed_checkpoint_dir + fixed_checkpoint_file

    logger.debug(f"\n-- fix_best_checkpoint --"
                 f"\nfixed_checkpoint_dir is "
                 f"{fixed_checkpoint_dir}"
                 f"\nfixed_checkpoint_file is "
                 f"{fixed_checkpoint_file}"
                 f"\nbest_checkpoint is "
                 f"{best_checkpoint}")

    return best_checkpoint

# training a PPO agent
ray.init(local_mode=True)
(results_PPO,
 best_result_PPO,
 best_config_PPO,
 checkpoint_PPO) = train(algorithms['PPO'], config_PPO, verbose)

2023-11-30 13:43:54,571 ERROR services.py:1329 -- Failed to start the dashboard , return code 1
2023-11-30 13:43:54,577 ERROR services.py:1354 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-11-30 13:43:54,602 ERROR services.py:1398 -- The last 20 lines of /tmp/ray/session_2023-11-30_13-43-49_981892_16350/logs/dashboard.log (it contains the error message from the dashboard):
  File "/Users/marri/opt/anaconda3/envs/RL/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/marri/opt/anaconda3/envs/RL/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/Users/marri/opt/anaconda3/envs/RL/lib/python3.9/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/Users/marri/opt/anaconda3/envs/RL/lib/python3.9/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/Users/marri/opt/anaconda3/envs/RL/lib/python3.9/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
    monkeypatch_pydantic_2_for_cloudpickle()
  File "/Users/marri/opt/anaconda3/envs/RL/lib/python3.9/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic' has no attribute '_internal'
2023-11-30 13:43:54,891 INFO worker.py:1673 -- Started a local Ray instance.
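
The traceback ends in an AttributeError on pydantic._internal, and the package list above shows pydantic 1.8.2 installed next to pydantic_core 2.14.5. A quick way to confirm that state in the affected environment is a small diagnostic like the sketch below. It is only an illustration: the helper names (installed_version, has_submodule, pydantic_report) are hypothetical and not part of Ray or pydantic, and it assumes that the _internal subpackage is present only in pydantic 2.x installs.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_submodule(package, submodule):
    """Return True if `package.submodule` can be located."""
    try:
        return importlib.util.find_spec(f"{package}.{submodule}") is not None
    except ModuleNotFoundError:
        # The parent package itself is not installed.
        return False


def pydantic_report():
    """Collect the facts relevant to the AttributeError in the log above."""
    return {
        "pydantic": installed_version("pydantic"),
        "pydantic_core": installed_version("pydantic_core"),
        # Assumption: only pydantic 2.x ships the `_internal` subpackage.
        "has_pydantic_internal": has_submodule("pydantic", "_internal"),
    }


if __name__ == "__main__":
    print(pydantic_report())
```

If the report shows a 1.x pydantic version together with an installed pydantic_core and has_pydantic_internal set to False, the environment matches the one in this report.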

Issue Severity

High: It blocks me from completing my task.

scottsun94 commented 9 months ago

@alanwguo do you know if it's a known issue given it's related to Pydantic?

anyscalesam commented 9 months ago

@alanwguo can you triage?

alanwguo commented 9 months ago

@mkolyaei, as a workaround, can you install a pydantic version that is >= 1.9 and < 2.5?

In the latest master, we've updated this pydantic compatibility logic, so starting in Ray 2.9.0 you should not run into this.
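
For anyone checking their own environment against the range suggested above, here is a minimal sketch. The function names are hypothetical, and the version parser is deliberately simplified: it compares only the leading numeric release components and ignores pre-release suffixes, so it is not a full PEP 440 implementation.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.8.2' into (1, 8, 2), keeping only leading numeric parts."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def pydantic_in_workaround_range(version):
    """True if the version is >= 1.9 and < 2.5 (the range from the comment above)."""
    return parse_version("1.9") <= parse_version(version) < parse_version("2.5")


def ray_has_compat_fix(version):
    """True if the Ray version is 2.9 or newer, where the fix is said to land."""
    return parse_version(version) >= parse_version("2.9")


if __name__ == "__main__":
    for dist, check in (("pydantic", pydantic_in_workaround_range),
                        ("ray", ray_has_compat_fix)):
        try:
            found = metadata.version(dist)
            print(dist, found, "OK" if check(found) else "outside suggested range")
        except metadata.PackageNotFoundError:
            print(dist, "is not installed")
    # To apply the workaround with pip:
    #   pip install "pydantic>=1.9,<2.5"
```

With the versions from the report, pydantic 1.8.2 falls below the suggested range and Ray 2.8.0 predates the fix, which is consistent with the failure described.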