Closed AmeerHajAli closed 3 years ago
Thank you so much for discovering these. I'll investigate further and might create child issues to track these individually after finding out the cause.
@krfricke, after running compact-regression-test.yaml, I am also getting:
Traceback (most recent call last):
File "/Users/ameerhajali/anaconda3/envs/ray/bin/rllib", line 8, in <module>
sys.exit(cli())
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/scripts.py", line 34, in cli
train.run(options, train_parser)
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/train.py", line 255, in run
concurrent=True)
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/tune/tune.py", line 624, in run_experiments
_remote=False))
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 81, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py", line 225, in get
res = self._get(obj_ref, op_timeout)
File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py", line 244, in _get
err = cloudpickle.loads(data.error)
ModuleNotFoundError: No module named 'tblib'
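The traceback above fails while unpickling the remote error because tblib cannot be imported in the local environment. A minimal sketch for checking that locally, assuming the missing package is the only problem (the helper name is made up for illustration):

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # tblib is the module named in the ModuleNotFoundError above.
    if is_module_available("tblib"):
        print("tblib is installed")
    else:
        print("tblib is missing; installing it (pip install tblib) may resolve the error")
```

Running this in the same conda environment that launches rllib shows whether the client side can deserialize remote exceptions at all.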
I can't repro 5 and 6. Does this come up immediately? (It ran for ~1.5 hours without any problems.) If it still comes up for you, can you post some local environment information (Python version and pip freeze -l) here?
(ray) ~/Desktop> pip freeze -l
aiobotocore==1.2.2
aiodataloader==0.2.0
aiofiles==0.5.0
aiohttp==3.7.4.post0
aiohttp-cors==0.7.0
aiohttp-middlewares==1.1.0
aioitertools==0.7.1
aiojobs==0.3.0
aiopg==1.2.0
aioredis==1.3.1
alabaster==0.7.12
alchemy-mock==0.4.3
alembic==1.5.2
aniso8601==7.0.0
anyio==2.2.0
anyscale==0.4.18
apipkg==1.5
appdirs==1.4.4
appnope==0.1.0
argon2==0.1.10
argon2-cffi==20.1.0
asgiref==3.3.1
astroid==2.5.6
async-exit-stack==1.0.1
async-generator==1.10
async-timeout==3.0.1
asyncache==0.1.1
asyncpg==0.21.0
asynctest==0.13.0
attrs==20.3.0
aws==0.2.5
aws-sam-translator==1.28.1
aws-xray-sdk==2.6.0
awscli==1.19.62
awspricing==2.0.3
Babel==2.9.0
backcall==0.2.0
backoff==1.10.0
bcrypt==3.1.7
beautifulsoup4==4.9.1
black==19.10b0
bleach==3.1.5
blessings==1.7
blis==0.7.4
boto==2.49.0
boto3==1.16.52
botocore==1.19.52
cachetools==4.2.0
caffeinate==0.1.0
catalogue==1.0.0
certifi==2020.12.5
cffi==1.14.4
cfgv==3.2.0
cfn-lint==0.39.0
chardet==3.0.4
click==7.1.2
cliff==3.6.0
cloudpickle==1.6.0
cmaes==0.7.1
cmd2==1.5.0
cmdstanpy==0.9.68
colorama==0.4.4
coloredlogs==15.0
colorful==0.5.4
colorlog==4.7.2
colorthief==0.2.1
commonmark==0.8.1
conda-pack==0.6.0
ConfigArgParse==1.4
convertdate==2.3.2
coverage==5.3.1
cryptography==3.3.1
cycler==0.10.0
cymem==2.0.5
Cython==0.29
dask==2021.4.0
databases==0.4.2
dataclasses==0.6
decorator==4.4.2
defusedxml==0.6.0
Deprecated==1.2.12
distlib==0.3.1
dm-tree==0.1.6
dnspython==2.1.0
docker==4.4.1
docspec==0.2.1
docspec-python==0.2.0
docutils==0.14
ecdsa==0.14.1
email-validator==1.1.2
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
ephem==3.7.7.1
execnet==1.7.1
expiringdict==1.1.4
fabric==2.5.0
fastapi==0.59.0
filelock==3.0.12
flake8==3.8.4
flake8-alfred==1.1.1
flake8-import-order==0.18.1
flake8-polyfill==1.0.2
flake8-quotes==3.2.0
Flask==1.1.4
Flask-BasicAuth==0.2.0
Flask-Cors==3.0.10
flask-pytest==0.0.5
Flask-RESTful==0.3.8
flatbuffers==1.12
freezegun==1.1.0
fsspec==2021.6.1
future==0.18.2
gensim==3.8.3
gevent==21.1.2
geventhttpclient==1.4.4
gitdb==4.0.5
GitPython==3.1.3
google==3.0.0
google-api-core==1.25.0
google-api-python-client==1.12.8
google-auth==1.24.0
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.2
google-cloud==0.34.0
google-cloud-billing==1.1.0
google-cloud-core==1.5.0
google-cloud-iam==2.0.0
google-cloud-resource-manager==0.30.3
googleapis-common-protos==1.52.0
gpustat==0.6.0
graphene==2.1.8
graphql-core==2.3.2
graphql-relay==2.0.1
greenlet==1.0.0
grimp==1.2.3
grpc-google-iam-v1==0.12.3
grpc-stubs==1.24.3
grpcio==1.35.0
grpcio-tools==1.35.0
gym==0.18.0
h11==0.9.0
hijri-converter==2.1.1
hiredis==2.0.0
holidays==0.11.1
httplib2==0.18.1
httptools==0.1.1
humanfriendly==9.1
hurry.filesize==0.9
identify==2.2.4
idna==2.10
imagesize==1.2.0
import-linter==1.2.1
importlib-metadata==4.0.1
iniconfig==1.1.1
invoke==1.4.1
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
iso8601==0.1.14
isort==5.8.0
itsdangerous==1.1.0
jedi==0.17.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==1.0.0
json5==0.9.5
jsondiff==1.2.0
jsonpatch==1.28
jsonpickle==1.4.1
jsonpointer==2.0
jsonschema==3.2.0
junit-xml==1.9
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyter-packaging==0.7.12
jupyter-server==1.4.1
jupyterlab==3.0.12
jupyterlab-server==2.3.0
kiwisolver==1.3.1
kopf==1.32.1
korean-lunar-calendar==0.2.1
kubernetes==17.17.0
kubernetes-asyncio==12.0.1
launchdarkly-server-sdk==6.13.1
lazy-object-proxy==1.6.0
libcst==0.3.16
libhoney==1.9.0
locket==0.2.1
locust==1.4.3
LunarCalendar==0.0.9
lz4==3.1.3
Mako==1.1.4
MarkupSafe==1.1.1
matplotlib==3.3.4
mccabe==0.6.1
mistune==0.8.4
mock==1.0.1
modin==0.10.0
more-itertools==8.7.0
moto==1.3.16
msgpack==1.0.2
multidict==5.1.0
murmurhash==1.0.5
mypy==0.790
mypy-extensions==0.4.3
nbclassic==0.2.6
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.5.1
nltk==3.6.2
nodeenv==1.6.0
notebook==6.0.3
npm==0.1.1
nr.collections==0.0.1
nr.databind.core==0.0.22
nr.databind.json==0.0.14
nr.fs==1.6.3
nr.interface==0.0.5
nr.metaclass==0.0.6
nr.parsing.date==0.6.1
nr.pylang.utils==0.0.4
nr.stream==0.0.5
nr.utils.re==0.1.1
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauth2client==3.0.0
oauthlib==3.1.0
onelogin==2.0.2
opencensus==0.7.12
opencensus-context==0.1.2
opencv-python-headless==4.3.0.36
opentelemetry-api==1.4.1
opentelemetry-exporter-otlp==0.17b0
opentelemetry-exporter-otlp-proto-grpc==1.4.1
opentelemetry-ext-asgi==0.11b0
opentelemetry-ext-asyncpg==0.11b0
opentelemetry-ext-botocore==0.11b0
opentelemetry-ext-honeycomb==0.5b0
opentelemetry-instrumentation==0.23b2
opentelemetry-instrumentation-asgi==0.17b0
opentelemetry-instrumentation-asyncpg==0.17b0
opentelemetry-instrumentation-botocore==0.17b0
opentelemetry-instrumentation-sqlalchemy==0.17b0
opentelemetry-instrumentation-starlette==0.17b0
opentelemetry-proto==1.4.1
opentelemetry-sdk==1.4.1
opentelemetry-semantic-conventions==0.23b2
optional-django==0.1.0
optuna==2.5.0
orjson==3.4.7
packaging==20.8
pandas==1.2.4
pandoc==1.0.2
pandocfilters==1.4.2
paramiko==2.7.1
parso==0.7.1
partd==1.1.0
pathspec==0.8.1
pbr==5.5.1
pep8-naming==0.11.1
pexpect==4.8.0
pickle5==0.0.11
pickleshare==0.7.5
Pillow==7.2.0
pip-tools==5.5.0
plac==1.1.3
plotly==4.14.3
pluggy==0.13.1
ply==3.11
postgres==3.0.0
pre-commit==2.12.1
preshed==3.0.5
prettytable==0.7.2
prometheus-client==0.10.1
promise==2.3
prompt-toolkit==3.0.6
prophet==1.0.1
proto-plus==1.13.0
protobuf==3.15.3
psutil==5.8.0
psycopg2-binary==2.8.6
psycopg2-pool==1.1
ptyprocess==0.6.0
py==1.10.0
py-spy==0.3.5
pyaml==20.4.0
pyarrow==3.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybase62==0.4.3
pycodestyle==2.6.0
pycparser==2.20
pydantic==1.8.1
pydata-sphinx-theme==0.4.3
pydoc-markdown==3.13.0
pydocstyle==5.0.2
pyflakes==2.2.0
PyGithub==1.55
pyglet==1.5.0
Pygments==2.3.1
PyJWT==2.1.0
pylama==7.7.1
pylint==2.8.2
PyMeeus==0.5.11
PyNaCl==1.4.0
pynput==1.7.3
pyobjc-core==7.3
pyobjc-framework-Cocoa==7.3
pyobjc-framework-Quartz==7.3
pyparsing==2.4.7
pyperclip==1.8.1
pyRFC3339==1.1
pyrsistent==0.17.3
pystan==2.19.1.1
pytest==6.2.1
pytest-aiohttp==0.3.0
pytest-asyncio==0.14.0
pytest-azurepipelines==0.8.0
pytest-cov==2.11.1
pytest-flask==1.0.0
pytest-forked==1.3.0
pytest-timeout==1.4.2
pytest-tornado==0.8.1
pytest-xdist==2.2.0
python-dateutil==2.8.1
python-editor==1.0.4
python-engineio==3.14.2
python-jose==3.2.0
python-json-logger==2.0.1
python-multipart==0.0.5
python-socketio==4.6.0
python3-wget==0.0.2b1
pytz==2020.5
PyYAML==5.4.1
pyzmq==19.0.2
ray==1.5.2
readthedocs-sphinx-ext==1.0.4
recommonmark==0.5.0
redis==3.5.0
regex==2021.4.4
requests==2.25.1
requests-oauthlib==1.3.0
responses==0.12.0
retrying==1.3.3
rsa==4.7
Rx==1.6.1
s3fs==2021.6.1
s3transfer==0.3.7
sacremoses==0.0.43
scalesec-gcp-workload-identity==1.0.7
scikit-learn==0.23.2
scikit-optimize==0.8.1
scipy==1.5.4
semver==2.13.0
Send2Trash==1.5.0
sentencepiece==0.1.95
sentry-sdk==1.1.0
setuptools-git==1.2
six==1.15.0
sklearn==0.0
smart-open==5.1.0
smmap==3.0.4
sniffio==1.2.0
snowballstemmer==2.0.0
soupsieve==2.0.1
spacy==2.3.5
Sphinx==3.0.4
sphinx-book-theme==0.0.39
sphinx-click==2.5.0
sphinx-copybutton==0.3.1
sphinx-gallery==0.8.2
sphinx-jsonschema==1.16.7
sphinx-tabs==2.0.1
sphinx-version-warning==1.1.2
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
sphinxcontrib-websupport==1.2.4
sphinxcontrib.yt==0.2.2
sphinxemoji==0.1.8
SQLAlchemy==1.4.0b1
sqlalchemy-stubs==0.4
srsly==1.0.5
sshpubkeys==3.1.0
starlette==0.13.4
statsd==3.3.0
stevedore==3.3.0
svgwrite==1.4.1
tabulate==0.8.7
tensorboardX==2.1
terminado==0.8.3
testfixtures==6.15.0
testpath==0.4.4
texthero==1.0.9
thinc==7.4.5
threadpoolctl==2.1.0
tokenizers==0.8.1rc2
toml==0.10.2
toolz==0.11.1
torch==1.7.1
torchvision==0.8.2
tornado==6.1
tqdm==4.56.0
traitlets==4.3.3
transformers==3.1.0
tune-sklearn==0.2.1
typed-ast==1.4.2
typer==0.3.2
typing-extensions==3.10.0.0
typing-inspect==0.6.0
ujson==3.2.0
Unidecode==1.2.0
uritemplate==3.0.1
urllib3==1.26.2
uvicorn==0.11.8
uvloop==0.14.0
virtualenv==20.4.4
vulture==2.3
wasabi==0.8.2
watchdog==1.0.2
wcwidth==0.1.9
webencodings==0.5.1
websocket-client==0.57.0
websockets==8.1
Werkzeug==1.0.1
wordcloud==1.8.1
wrapt==1.12.1
xgboost==1.4.2
xgboost-ray==0.1.1
xmltodict==0.12.0
yapf==0.23.0
yarl==1.6.3
yaspin==1.0.0
zipp==3.4.1
zope.event==4.5.0
zope.interface==5.3.0
Python 3.7. I think it is straightforward to repro if you run against a session in the product with the default cluster compute.
CC @wuisawesome, I think the placement groups are potentially leaking or not being cleaned up appropriately.
I believe this should be fixed in master. Please reopen if you see the issue again.
When I run RLlib on Ray 1.5.2:
1) The resource demands persist even after the application finishes. For example, I still see resource demands from the scheduler (for a few minutes) even after the job prints:
(pid=191) 2021-08-22 10:45:21,492 INFO tune.py:550 -- Total run time: 1095.71 seconds (1094.69 seconds for the tuning loop).
2) RLlib prints a lot of verbose resource output.
3) RLlib sometimes requests a lot of resources, and if the cluster cannot scale up to accommodate them, it ends up adding nodes, removing them for being idle, and hanging forever (e.g., it requests resources that would need 200 nodes, but the cluster can scale only to 10 nodes, so it keeps adding 10 nodes and removing them while the trials say “pending”).
4) I think we should have e2e tests of RLlib with GPUs. These might already exist, but for some reason I am not able to run, for example, either of the following (the cluster keeps adding and removing nodes as in issue 3):
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml
or
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/impala/atari-impala-large.yaml
5) When I run
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml
I get a lot of:
CC @wuisawesome
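To illustrate item 3, here is a rough sketch of the kind of up-front check that would flag a request the cluster can never satisfy. It is not how Ray or the autoscaler actually decides this; the function name, the per-node resource shape, and the numbers are assumptions, and it only compares aggregate totals (it ignores how bundles pack onto individual nodes):

```python
from typing import Dict


def fits_cluster(demand: Dict[str, float],
                 per_node: Dict[str, float],
                 max_nodes: int) -> bool:
    """Return True if the total demand fits within the cluster's maximum size.

    demand:    total resources requested, e.g. {"CPU": 800}
    per_node:  resources one node provides, e.g. {"CPU": 4} (assumed shape)
    max_nodes: the most nodes the cluster is allowed to scale to
    """
    return all(
        amount <= per_node.get(resource, 0.0) * max_nodes
        for resource, amount in demand.items()
    )


if __name__ == "__main__":
    # Hypothetical numbers: work sized for 200 four-CPU nodes on a
    # cluster that can scale only to 10 nodes.
    print(fits_cluster({"CPU": 200 * 4}, {"CPU": 4}, max_nodes=10))
```

If a check like this fails, the trials can never leave "pending", so failing fast with a clear message would be preferable to repeatedly adding and removing nodes.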
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".