superduper-io / superduper

Superduper: build end-2-end AI applications and templates using your existing data infrastructure and tools of choice
https://superduper.io
Apache License 2.0

[BUG]: Pytorch tests failing #989

Closed johko closed 1 year ago

johko commented 1 year ago

Contact Details [Optional]

johko@posteo.de

System Information

{ "cfg": { "data_backend": "mongodb://localhost:27017", "vector_search": "inmemory://", "artifact_store": null, "metadata_store": null, "cluster": { "distributed": false, "deserializers": [], "serializers": [], "dask_scheduler": "tcp://localhost:8786", "local": true, "backfill_batch_size": 100 }, "apis": { "retry": { "stop_after_attempt": 2, "wait_max": 10.0, "wait_min": 4.0, "wait_multiplier": 1.0 } }, "logging": { "level": "INFO", "type": "STDERR", "kwargs": {} }, "server": { "host": "127.0.0.1", "port": 3223, "protocol": "http" }, "downloads": { "hybrid": false, "root": "data/downloads" } }, "cwd": "/home/johannes/Projects/superduperdb", "git": { "branch": "('branch', '--show-current') failed with [Errno 2] No such file or directory: 'branch'", "commit": "('show', '-s', '--format=\"%h: %s\"') failed with [Errno 2] No such file or directory: 'show'" }, "hostname": "johannes-kolbe-hh-celebrate", "os_uname": [ "Linux", "johannes-kolbe-hh-celebrate", "5.15.0-84-generic", "#93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023", "x86_64" ], "package_versions": {}, "platform": { "platform": "Linux-5.15.0-84-generic-x86_64-with-glibc2.35", "python_version": "3.8.18" }, "startup_time": "2023-09-25 10:45:44.305842", "superduper_db_root": "/home/johannes/Projects/superduperdb", "sys": { "argv": [ "/home/johannes/Projects/superduperdb/superduperdb/main.py", "info" ], "path": [ "/home/johannes/Projects/superduperdb", "/usr/lib/python38.zip", "/usr/lib/python3.8", "/usr/lib/python3.8/lib-dynload", "/home/johannes/Projects/superduperdb/.venv/lib/python3.8/site-packages", "editable.superduperdb-0.0.7.finder.__path_hook__" ] } }

What happened?

When running make test, several tests in test_torch_utils.py fail. All of them are related to PyTorch device assignment (see log output below).

I'm not completely sure whether this only fails for me, but I think it is a general problem. I also have a branch ready that fixes these tests: https://github.com/johko/superduperdb/tree/fix_torch_device_tests and I can create a PR if you agree.

Steps to reproduce

  1. run make test from the terminal

Relevant log output

FAILED test/unittest/model/test_torch_utils.py::test_device_of_cuda - AssertionError: assert device(type='cuda', index=0) == 'cuda'
FAILED test/unittest/model/test_torch_utils.py::test_set_device_context_manager - AssertionError: assert device(type='cuda', index=0) == 'cuda'
FAILED test/unittest/model/test_torch_utils.py::test_to_device_tensor - AssertionError: assert device(type='cuda', index=0) == device(type='cuda')
FAILED test/unittest/model/test_torch_utils.py::test_to_device_nested_list - AssertionError: assert device(type='cuda', index=0) == device(type='cuda')
FAILED test/unittest/model/test_torch_utils.py::test_to_device_nested_dict - AssertionError: assert device(type='cuda', index=0) == device(type='cuda')
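
For anyone hitting the same failures: the assertions above compare a device that carries an index (cuda:0) against the bare string 'cuda' or an index-less device, and per the log those do not compare equal. One way to make such checks robust is to compare only the device type. The following is a minimal sketch under that assumption. The helper names device_type and same_device_type are hypothetical (they are not part of superduperdb or PyTorch), and this is not necessarily what the linked branch does. A SimpleNamespace stands in for torch.device so the snippet runs without torch or a GPU.

```python
from types import SimpleNamespace


def device_type(device):
    """Return the device type ('cuda', 'cpu', ...).

    Accepts either a string such as 'cuda' or 'cuda:0', or any object
    with a .type attribute (torch.device exposes .type and .index).
    """
    if isinstance(device, str):
        # 'cuda:0' -> 'cuda'; 'cpu' -> 'cpu'
        return device.split(":", 1)[0]
    return device.type


def same_device_type(actual, expected):
    """Compare two devices by type only, ignoring the device index."""
    return device_type(actual) == device_type(expected)


# Stand-ins for torch.device(type='cuda', index=0) and torch.device('cpu').
# These are assumptions for illustration, not real torch objects.
cuda0 = SimpleNamespace(type="cuda", index=0)
cpu = SimpleNamespace(type="cpu", index=None)

print(same_device_type(cuda0, "cuda"))
print(same_device_type(cuda0, "cuda:0"))
print(same_device_type(cuda0, cpu))
```

In a real test the stand-in would be replaced by the device of the tensor or model under test, for example the .device attribute of a tensor.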
rec commented 1 year ago

Oh, dear! So sorry to hear you are having issues. :-/

And thanks for preparing a branch.

We do actually have CI on this repository, though, and you can see that it's all been passing for a long time: https://github.com/SuperDuperDB/superduperdb/commits/main

And this code hasn't been touched since August 15.

It's very likely that there's some incompatibility between your installation and ours, somehow!!

I'm going to start by patching your code into my development environment and see if it works. My guess is it won't, because if it did, my existing tests would be failing. :-)

Then I'm going to iron out the exact differences between your installation and my environment, which is fairly close to our CI environment.

And then... we'll see how to fix it!

Hang tight there.

johko commented 1 year ago

Thanks for the quick response. I also wondered why it was failing, but couldn't find a reason in my setup. But I'd be happy to hear if you find anything.

rec commented 1 year ago

Well, fascinatingly enough, the change does appear to pass all the tests when patched into my development system!

And when not patched, too.

!

I think we should use your patch but let me do some more information gathering first:

Could you pls. post the results of pip freeze into this issue?

johko commented 1 year ago

Sure:

accelerate==0.23.0
aiohttp==3.8.5
aiosignal==1.3.1
alabaster==0.7.13
annotated-types==0.5.0
anthropic==0.3.11
anyio==3.7.1
asttokens==2.4.0
async-timeout==4.0.3
atpublic==3.1.2
attr==0.3.2
attrs==23.1.0
Babel==2.12.1
backcall==0.2.0
backoff==2.2.1
beautifulsoup4==4.12.2
bidict==0.22.1
black==23.9.1
bleach==6.0.0
blinker==1.6.2
boto3==1.28.53
boto3-stubs==1.28.53
botocore==1.31.53
botocore-stubs==1.31.53
build==1.0.3
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
cloudpickle==2.2.1
cmake==3.27.5
cohere==4.27
colorama==0.4.6
coverage==7.3.1
dask==2023.5.0
decorator==5.1.1
defusedxml==0.7.1
dek==1.2.0
dill==0.3.7
distributed==2023.5.0
distro==1.8.0
dnspython==2.4.2
docutils==0.20.1
duckdb==0.8.1
duckdb-engine==0.9.2
exceptiongroup==1.1.3
execnet==2.0.2
executing==1.2.0
fastapi==0.103.1
fastavro==1.8.2
fastjsonschema==2.18.0
fil==1.3.0
filelock==3.12.4
Flask==2.3.3
Flask-Cors==4.0.0
Flask-HTTPAuth==4.8.0
frozenlist==1.4.0
fsspec==2023.9.2
furo==2023.9.10
greenlet==2.0.2
h11==0.14.0
httpcore==0.18.0
httpx==0.25.0
huggingface-hub==0.17.2
ibis-framework==5.1.0
idna==3.4
imagesize==1.4.1
impall==1.4.0
importlib-metadata==6.8.0
importlib-resources==5.13.0
iniconfig==2.0.0
interrogate==1.5.0
ipython==8.12.2
isort==5.12.0
itsdangerous==2.1.2
jedi==0.19.0
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.3.2
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
jupyter_client==8.3.1
jupyter_core==5.3.1
jupyterlab-pygments==0.2.2
lancedb==0.1.16
libcst==1.0.1
lit==16.0.6
locket==1.0.0
lorem==0.1.1
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mdit-py-plugins==0.4.0
mdurl==0.1.2
mistune==3.0.1
mongomock==4.1.2
MonkeyType==23.3.0
mpmath==1.3.0
msgpack==1.0.6
multidict==6.0.4
multipledispatch==0.6.0
mypy==1.5.1
mypy-extensions==1.0.0
myst-parser==2.0.0
nbclient==0.8.0
nbconvert==7.8.0
nbformat==5.9.2
nbsphinx==0.9.3
nbsphinx-link==1.3.0
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openai==0.28.0
overrides==7.4.0
packaging==23.1
pandas==2.0.3
pandoc==2.3
pandocfilters==1.5.0
parso==0.8.3
parsy==2.1
partd==1.4.0
pathspec==0.11.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==10.0.1
pip-tools==7.3.0
pkgutil_resolve_name==1.3.10
platformdirs==3.10.0
pluggy==1.3.0
plumbum==1.8.2
ply==3.11
pooch==1.7.0
prompt-toolkit==3.0.39
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyarrow==11.0.0
pydantic==2.3.0
pydantic_core==2.6.3
Pygments==2.16.1
pylance==0.5.10
pymongo==4.5.0
pyproject_hooks==1.0.0
pytest==7.4.2
pytest-asyncio==0.21.1
pytest-cov==4.1.0
pytest-xdist==3.3.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.1
ratelimiter==1.2.0.post0
readerwriterlock==1.0.9
referencing==0.30.2
regex==2023.8.8
requests==2.31.0
retry==0.9.2
rich==13.5.3
rpds-py==0.10.3
ruff==0.0.291
s3transfer==0.6.2
safer==4.8.0
safetensors==0.3.3
scikit-learn==1.3.1
scipy==1.10.1
semver==3.0.1
sentinels==1.0.0
six==1.16.0
sniffio==1.3.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
soupsieve==2.5
Sphinx==7.1.2
sphinx-autodoc-typehints==1.24.0
sphinx-basic-ng==1.0.0b2
sphinx-copybutton==0.5.2
sphinxcontrib-applehelp==1.0.4
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.1
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-mermaid==0.9.2
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
SQLAlchemy==2.0.21
sqlalchemy-views==0.3.2
sqlglot==11.7.1
stack-data==0.6.2
starlette==0.27.0
-e git+https://github.com/johko/superduperdb@98f2769862325e0f15827f6aa504b309aa04a496#egg=superduperdb
sympy==1.12
tabulate==0.9.0
tblib==2.0.0
tdir==1.6.0
tenacity==8.2.3
threadpoolctl==3.2.0
tinycss2==1.2.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch==2.0.0
torchvision==0.15.1
tornado==6.3.3
tqdm==4.66.1
traitlets==5.10.0
transformers==4.33.2
triton==2.0.0
typer==0.9.0
types-awscrt==0.19.1
types-Pillow==10.0.0.3
types-requests==2.31.0.4
types-s3transfer==0.6.2
types-tqdm==4.66.0.2
types-urllib3==1.26.25.14
typing-inspect==0.9.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==1.26.16
vcrpy==5.1.0
wcwidth==0.2.6
webencodings==0.5.1
Werkzeug==2.3.7
wrapt==1.15.0
xmod==1.5.0
xxhash==3.3.0
yarl==1.9.2
zict==3.0.0
zipp==3.17.0
rec commented 1 year ago

By the way, I think you are right that this code could never have worked. :-/

I think that, for some reason, CUDA has not been enabled on any machine that has run these tests so far, which is an obvious gap. My machine at least has lots of GPU...

rec commented 1 year ago

Ah, yes, you can see that these lines are not executed on my machine.

[Screenshot 2023-09-25 at 12 00 00]

johko commented 1 year ago

Makes sense then :) thanks for checking and creating the PR, now I don't feel like I'm too stupid to run tests anymore :D