neurocard / neurocard

State-of-the-art neural cardinality estimators for join queries
Apache License 2.0

Ray problem: ray.tune.error.TuneError: ('Trials did not complete', [NeuroCard_44b5b_00000]) #4

Open Doris404 opened 2 years ago

Doris404 commented 2 years ago

I tried to fix it with the help of Google, but that turned out to be no use. Stack Overflow doesn't have the correct answer either (I tried changing the versions of some packages, which did not succeed), so I am opening this issue hoping to get some advice.

More details about my problem:

Linux 5.4.0-84-generic #94-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
CUDA version 11.4.20210728

The packages:

Name Version Build Channel

_libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 1_llvm conda-forge _pytorch_select 0.2 gpu_0
_tflow_select 2.3.0 mkl
absl-py 0.9.0 py37hc8dfbb8_1 conda-forge aiohttp 3.7.4.post0 py37h5e8e339_0 conda-forge aiohttp-cors 0.7.0 pypi_0 pypi aioredis 2.0.0 pypi_0 pypi argh 0.26.2 pypi_0 pypi arrow-cpp 0.11.1 py37h0e61e49_1004 conda-forge astor 0.8.1 pyh9f0ad1d_0 conda-forge async-timeout 3.0.1 py_1000 conda-forge attrs 21.2.0 pyhd8ed1ab_0 conda-forge beautifulsoup4 4.10.0 pypi_0 pypi blas 1.0 mkl conda-forge blessings 1.7 pypi_0 pypi blinker 1.4 py_1 conda-forge boost-cpp 1.68.0 h11c811c_1000 conda-forge brotlipy 0.7.0 py37h5e8e339_1001 conda-forge bzip2 1.0.8 h7f98852_4 conda-forge c-ares 1.17.2 h7f98852_0 conda-forge ca-certificates 2021.10.8 ha878542_0 conda-forge cached-property 1.5.2 hd8ed1ab_1 conda-forge cached_property 1.5.2 pyha770c72_1 conda-forge cachetools 4.2.2 pypi_0 pypi certifi 2021.5.30 pypi_0 pypi cffi 1.14.6 py37hc58025e_0 conda-forge chardet 4.0.0 py37h89c1867_1 conda-forge charset-normalizer 2.0.6 pypi_0 pypi click 8.0.1 py37h89c1867_0 conda-forge cloudpickle 2.0.0 pypi_0 pypi colorama 0.4.4 pypi_0 pypi colorful 0.5.4 pypi_0 pypi configparser 3.8.1 pypi_0 pypi cryptography 3.4.7 py37h5d9358c_0 conda-forge cudatoolkit 10.1.243 h036e899_9 conda-forge cudnn 7.6.5.32 hc0a50b0_1 conda-forge dataclasses 0.8 pyhc8e2a94_3 conda-forge decorator 5.1.0 pyhd8ed1ab_0 conda-forge docker-pycreds 0.4.0 pypi_0 pypi filelock 3.0.12 pypi_0 pypi funcsigs 1.0.2 pypi_0 pypi gast 0.2.2 py_0 conda-forge gitdb 4.0.7 pypi_0 pypi gitpython 1.0.0 pypi_0 pypi glog 0.3.1 pypi_0 pypi google 3.0.0 pypi_0 pypi google-api-core 1.31.3 pypi_0 pypi google-auth 1.35.0 pyh6c4a22f_0 conda-forge google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge google-pasta 0.2.0 pyh8c360ce_0 conda-forge googleapis-common-protos 1.53.0 pypi_0 pypi gpustat 0.4.1 pypi_0 pypi gql 0.3.0 py_0 conda-forge graphql-core 1.1 pypi_0 pypi grpcio 1.40.0 pypi_0 pypi h5py 3.3.0 nompi_py37ha3df211_100 conda-forge hdf5 1.10.6 nompi_h3c11f04_101 conda-forge icu 58.2 hf484d3e_1000 conda-forge idna 3.2 pypi_0 pypi importlib-metadata 4.8.1 
py37h89c1867_0 conda-forge iniconfig 1.1.1 pypi_0 pypi jsonschema 3.2.0 pypi_0 pypi keras-applications 1.0.8 py_1 conda-forge keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge krb5 1.16.4 h2fd8d38_0 conda-forge ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge libblas 3.9.0 8_mkl conda-forge libcblas 3.9.0 8_mkl conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libffi 3.3 h58526e2_2 conda-forge libgcc-ng 11.2.0 h1d223b6_9 conda-forge libgfortran-ng 7.5.0 h14aa051_19 conda-forge libgfortran4 7.5.0 h14aa051_19 conda-forge liblapack 3.9.0 8_mkl conda-forge libpq 11.5 hd9ab2ff_2 conda-forge libprotobuf 3.6.1 hdbcaa40_1001 conda-forge libstdcxx-ng 11.2.0 he4da1e4_9 conda-forge libzlib 1.2.11 h36c2ea0_1013 conda-forge llvm-openmp 12.0.1 h4bd325d_1 conda-forge mako 1.1.3 pyh9f0ad1d_0 conda-forge markdown 3.3.4 pyhd8ed1ab_0 conda-forge markupsafe 2.0.1 py37h5e8e339_0 conda-forge mkl 2020.4 h726a3e6_304 conda-forge mkl-service 2.3.0 py37h8f50634_2 conda-forge msgpack 1.0.2 pypi_0 pypi multidict 5.1.0 pypi_0 pypi ncurses 6.2 h58526e2_4 conda-forge networkx 2.4 py_1 conda-forge ninja 1.10.2 h4bd325d_1 conda-forge numpy 1.18.4 py37h8960a57_0 conda-forge nvidia-ml-py3 7.352.0 pypi_0 pypi oauthlib 3.1.1 pyhd8ed1ab_0 conda-forge opencensus 0.7.13 pypi_0 pypi opencensus-context 0.1.2 pypi_0 pypi openssl 1.1.1l h7f98852_0 conda-forge opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge packaging 21.0 pypi_0 pypi pandas 1.0.5 py37h0da4684_0 conda-forge parquet-cpp 1.5.1 3 conda-forge pathtools 0.1.2 pypi_0 pypi pillow 8.3.2 pypi_0 pypi pip 21.2.4 pyhd8ed1ab_0 conda-forge pipdeptree 2.1.0 pypi_0 pypi pluggy 1.0.0 pypi_0 pypi prometheus-client 0.11.0 pypi_0 pypi promise 2.3 py37h89c1867_4 conda-forge protobuf 3.17.3 pypi_0 pypi psutil 5.0.0 pypi_0 pypi psycopg2 2.8.4 py37h1ba5d50_0
py 1.10.0 pypi_0 pypi py-spy 0.3.9 pypi_0 pypi py4j 0.10.7 py_1 conda-forge pyarrow 0.11.1 py37hbbcf98d_1002 conda-forge pyasn1 0.4.8 py_0 conda-forge pyasn1-modules 0.2.8 pypi_0 pypi pycparser 2.20 pyh9f0ad1d_2 conda-forge pyjwt 2.2.0 pyhd8ed1ab_0 conda-forge pyopenssl 21.0.0 pyhd8ed1ab_0 conda-forge pyparsing 2.4.7 pypi_0 pypi pyrsistent 0.18.0 pypi_0 pypi pysocks 1.7.1 py37h89c1867_3 conda-forge pyspark 2.4.3 py_0 conda-forge pytest 6.2.5 pypi_0 pypi python 3.7.7 hcff3b4d_5
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-gflags 3.1.2 pypi_0 pypi python_abi 3.7 2_cp37m conda-forge pytorch 1.4.0 cuda101py37h02f0884_0
pytz 2021.3 pyhd8ed1ab_0 conda-forge pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge pyyaml 3.10 pypi_0 pypi ray 0.8.6 pypi_0 pypi readline 8.1 h46c0cb4_0 conda-forge redis 3.4.1 pypi_0 pypi requests 2.26.0 pypi_0 pypi requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge rsa 4.7.2 pyh44b312d_0 conda-forge rx 3.2.0 pyhd8ed1ab_0 conda-forge scipy 1.4.1 py37ha3d9a3c_3 conda-forge sentry-sdk 0.4.0 pypi_0 pypi setuptools 58.2.0 py37h89c1867_0 conda-forge shortuuid 0.5.0 pypi_0 pypi six 1.16.0 pyh6c4a22f_0 conda-forge smmap 4.0.0 pypi_0 pypi soupsieve 2.2.1 pypi_0 pypi sqlite 3.36.0 h9cd32fc_2 conda-forge subprocess32 3.5.4 pypi_0 pypi tabulate 0.8.7 pypi_0 pypi tensorboard 1.15.0 py37_0 conda-forge tensorboard-data-server 0.6.0 py37hf1a17b8_0 conda-forge tensorboard-plugin-wit 1.8.0 pyh44b312d_0 conda-forge tensorboardx 2.4 pypi_0 pypi tensorflow 1.15.0 mkl_py37h28c19af_0
tensorflow-base 1.15.0 mkl_py37he1670d9_0
tensorflow-estimator 1.15.1 pyh2649769_0
termcolor 1.1.0 py_2 conda-forge thrift-cpp 0.12.0 h0a07b25_1002 conda-forge tk 8.6.11 h27826a3_1 conda-forge toml 0.10.2 pypi_0 pypi torchvision 0.5.0 pypi_0 pypi typing-extensions 3.10.0.2 hd8ed1ab_0 conda-forge typing_extensions 3.10.0.2 pyha770c72_0 conda-forge urllib3 1.26.7 pyhd8ed1ab_0 conda-forge wandb 0.8.36 pypi_0 pypi watchdog 0.8.3 pypi_0 pypi werkzeug 0.16.1 py_0 conda-forge wheel 0.37.0 pyhd8ed1ab_1 conda-forge wrapt 1.13.1 py37h5e8e339_0 conda-forge xz 5.2.5 h516909a_1 conda-forge yapf 0.27.0 py_0 conda-forge yarl 1.6.3 pypi_0 pypi zipp 3.6.0 pyhd8ed1ab_0 conda-forge zlib 1.2.11 h36c2ea0_1013 conda-forge

The error.txt I get:

Failure # 1 (occurred at 2021-10-09_11-15-28)
Traceback (most recent call last):
  File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NeuroCard.train() (pid=3821066, ip=10.77.110.215)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 474, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 478, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trainable.py", line 245, in init
    self.setup(copy.deepcopy(self.config))
  File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trainable.py", line 769, in setup
    self._setup(config)
  File "run.py", line 508, in _setup
    loaded_tables)
  File "run.py", line 683, in MakeSamplerDatasetLoader
    load_samples=self._load_samples)
  File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/common.py", line 789, in init
    self._init_sampler()
  File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler.py", line 283, in _init_sampler
    self.add_full_join_fanouts)
  File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler.py", line 190, in init
    prepare_utils.prepare(join_spec)
  File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler_lib/prepare_utils.py", line 261, in prepare
    print(table, ray.get(jkg))
ray.exceptions.RayTaskError: ray::factorized_sampler_lib.prepare_utils.get_join_key_groups() (pid=3821015, ip=10.77.110.215)
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler_lib/prepare_utils.py", line 158, in get_join_key_groups
    jct = ray.get(jcts[table])
ray.exceptions.RayTaskError: ray::factorized_sampler_lib.prepare_utils.get_first_jct() (pid=3821012, ip=10.77.110.215)
  File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 464, in ray._raylet.execute_task
ray.exceptions.RayWorkerError: The worker died unexpectedly while executing this task.

concretevitamin commented 2 years ago

@Doris404 Is there any additional info in these files?

cat /tmp/ray/session_<timestamp>/logs/worker-<hash>.*

You may also try the README here to rebuild the rustlib to see if that's the issue.
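The log-collection step above can be sketched as a small script, assuming Ray's default temp root `/tmp/ray` (adjust `ROOT` if a custom temp dir was used when starting Ray):

```shell
# Dump stderr/stdout of every worker in the newest Ray session directory.
# Assumes Ray's default temp root /tmp/ray; override ROOT for a custom dir.
ROOT="${ROOT:-/tmp/ray}"
latest="$(ls -td "$ROOT"/session_* 2>/dev/null | head -n 1)"
if [ -n "$latest" ]; then
  for f in "$latest"/logs/worker-*.err "$latest"/logs/worker-*.out; do
    if [ -e "$f" ]; then
      echo "== $f =="
      cat "$f"
    fi
  done
else
  echo "no Ray session directories under $ROOT"
fi
```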

Doris404 commented 2 years ago

@concretevitamin As for additional info in those files: I ran the bash script seelogs.sh and got the output runseelogs. They are as follows:

seelogs.sh

cat worker-017cb5734d3e86412b9d5ef764d643043c02595b.err
cat worker-017cb5734d3e86412b9d5ef764d643043c02595b.out
cat worker-0216abc5c5194b60123b82dc6b95b7f9c99295ed.err
cat worker-0216abc5c5194b60123b82dc6b95b7f9c99295ed.out
cat worker-024c4fcf56bb0deae114970bd4122e6dd985fb36.err
cat worker-024c4fcf56bb0deae114970bd4122e6dd985fb36.out
cat worker-0727fe8e152f44f4c9b0999490acaedd2b724bba.err
cat worker-0727fe8e152f44f4c9b0999490acaedd2b724bba.out
cat worker-08c821e78b2c36ead28a531f5149c279630f651b.err
cat worker-08c821e78b2c36ead28a531f5149c279630f651b.out
cat worker-0cb47181397d4273b87854291335ef785bd06352.err
cat worker-0cb47181397d4273b87854291335ef785bd06352.out
cat worker-0dc0003534c956aed6369a242ebea7e8adfffb4c.err
cat worker-0dc0003534c956aed6369a242ebea7e8adfffb4c.out
cat worker-13a371ea8e333e49ce100b3aaa2c26e224cd81d7.err
cat worker-13a371ea8e333e49ce100b3aaa2c26e224cd81d7.out
cat worker-14d35afa48bce62a8334d4a61f6037e6d3a7bb74.err
cat worker-14d35afa48bce62a8334d4a61f6037e6d3a7bb74.out
cat worker-17b8a2bf1aaf58b8c87df17ea927a3887a8165f3.err
cat worker-17b8a2bf1aaf58b8c87df17ea927a3887a8165f3.out
cat worker-1918e1d886a78211357fb6caf078f56fc93a8161.err
cat worker-1918e1d886a78211357fb6caf078f56fc93a8161.out
cat worker-1a2c9c284db306363124f2dd55b9b775757c282c.err
cat worker-1a2c9c284db306363124f2dd55b9b775757c282c.out
cat worker-1ae61428dc1316e4840fdb1049150541932338be.err
cat worker-1ae61428dc1316e4840fdb1049150541932338be.out
cat worker-21cd66e011f4b8f1bbf3c0130338339b976acec8.err
cat worker-21cd66e011f4b8f1bbf3c0130338339b976acec8.out
cat worker-25fd269dc662d7e14ed5370b4a19f552624fbabf.err
cat worker-25fd269dc662d7e14ed5370b4a19f552624fbabf.out
cat worker-330f9ba9ea670062c40731b950f175dc35ed09fd.err
cat worker-330f9ba9ea670062c40731b950f175dc35ed09fd.out
cat worker-39d7f90bdc8238912efce03adf647893879fe85c.err
cat worker-39d7f90bdc8238912efce03adf647893879fe85c.out
cat worker-3a1159bc923874384ad107cee901edede632c0f8.err
cat worker-3a1159bc923874384ad107cee901edede632c0f8.out
cat worker-3ae34ba9b6bcfd366f9e7a27b9bfd42c605b50cc.err
cat worker-3ae34ba9b6bcfd366f9e7a27b9bfd42c605b50cc.out
cat worker-3e0353c141dd095426f4a9cedd0f375c94cd1251.err
cat worker-3e0353c141dd095426f4a9cedd0f375c94cd1251.out
cat worker-406eabc817c1a65534be67b775a5152d860162a4.err
cat worker-406eabc817c1a65534be67b775a5152d860162a4.out
cat worker-47d8033cf46008024c2f64205c36905ecee6b6d2.err
cat worker-47d8033cf46008024c2f64205c36905ecee6b6d2.out
cat worker-4cb92ea32449af46b111d981940c67786ccfe2b0.err
cat worker-4cb92ea32449af46b111d981940c67786ccfe2b0.out
cat worker-4d41763dd090f7d4c13d82e2a40c965ebc725601.err
cat worker-4d41763dd090f7d4c13d82e2a40c965ebc725601.out
cat worker-4e93c383ff5ddf3a7f2fc2cc2e8031c9c20cccc2.err
cat worker-4e93c383ff5ddf3a7f2fc2cc2e8031c9c20cccc2.out
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb-0100.err
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb-0100.out
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb.err
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb.out
cat worker-52da1fd761640880b3df93f8492f87dafd583bed.err
cat worker-52da1fd761640880b3df93f8492f87dafd583bed.out
cat worker-530d963a4fddc4c935a800e5eaa8f229f688cda8.err
cat worker-530d963a4fddc4c935a800e5eaa8f229f688cda8.out
cat worker-56ef3ab21a65034ed63d8f692f99e5040abcbaba.err
cat worker-56ef3ab21a65034ed63d8f692f99e5040abcbaba.out
cat worker-5c9df2067c226879e37bc8bb26b2990156e86c54.err
cat worker-5c9df2067c226879e37bc8bb26b2990156e86c54.out
cat worker-5f6f38534da04203da303512a6ea3ff09b136005.err
cat worker-5f6f38534da04203da303512a6ea3ff09b136005.out
cat worker-6b15ac1a779d34578850c11518836cc704a721a4.err
cat worker-6b15ac1a779d34578850c11518836cc704a721a4.out
cat worker-6bd5ad49c545b20b21ac5b1f08736d10b7c92da6.err
cat worker-6bd5ad49c545b20b21ac5b1f08736d10b7c92da6.out
cat worker-7041f912671bdc6f1615fcc18cd8e936d2f1abb4.err
cat worker-7041f912671bdc6f1615fcc18cd8e936d2f1abb4.out
cat worker-7105b7a78112d1f4c6c42754b4fef324f104ac37.err
cat worker-7105b7a78112d1f4c6c42754b4fef324f104ac37.out
cat worker-73bb256ba085fc55fa792ddc6c04a2cde89d8377.err
cat worker-73bb256ba085fc55fa792ddc6c04a2cde89d8377.out
cat worker-765463ac5ce664a0b84b12823f5c2ea33b5e1ede.err
cat worker-765463ac5ce664a0b84b12823f5c2ea33b5e1ede.out
cat worker-7a265f83923c987f42f7aa92bc8ab3c0db80eb60.err
cat worker-7a265f83923c987f42f7aa92bc8ab3c0db80eb60.out
cat worker-7ae3949ca714a904230aa8994a3c942f694fd7da.err
cat worker-7ae3949ca714a904230aa8994a3c942f694fd7da.out
cat worker-7b55165b111bcd694cf497dd0a5112e7761cd006.err
cat worker-7b55165b111bcd694cf497dd0a5112e7761cd006.out
cat worker-7e0370376c0fbef48f22ac499daef61f6c185c1a.err
cat worker-7e0370376c0fbef48f22ac499daef61f6c185c1a.out
cat worker-7ef53ea888e82f4c317b890102de8684c5316a29.err
cat worker-7ef53ea888e82f4c317b890102de8684c5316a29.out
cat worker-827bf3141f4effe18b3dd397d2f09ab979f9595c.err
cat worker-827bf3141f4effe18b3dd397d2f09ab979f9595c.out
cat worker-82861b89f3102a3ad5e5ce0f6665ce9776ee4ccb.err
cat worker-82861b89f3102a3ad5e5ce0f6665ce9776ee4ccb.out
cat worker-8528be8b2f16dac520548431a5f7b4b4a2af0a0f.err
cat worker-8528be8b2f16dac520548431a5f7b4b4a2af0a0f.out
cat worker-85660fd96be5df65c528496f70ccd7cdbc1554b0.err
cat worker-85660fd96be5df65c528496f70ccd7cdbc1554b0.out
cat worker-8ab59a67ed3e9624a01ece730b8e41381a014add.err
cat worker-8ab59a67ed3e9624a01ece730b8e41381a014add.out
cat worker-910b1d93c50f7dce552196ad7258e32ad5ab3e73.err
cat worker-910b1d93c50f7dce552196ad7258e32ad5ab3e73.out
cat worker-9912341fcf75ce4df093088a7e2a6af660b464fd.err
cat worker-9912341fcf75ce4df093088a7e2a6af660b464fd.out
cat worker-9baf4fcd347f2a9cf3c46d96f9485effdd155cdf.err
cat worker-9baf4fcd347f2a9cf3c46d96f9485effdd155cdf.out
cat worker-a0f5f5702a5865e4533fcffc579ed1b941c5f733.err
cat worker-a0f5f5702a5865e4533fcffc579ed1b941c5f733.out
cat worker-a6a2b6c55658d71c92f8387e837078931448578e.err
cat worker-a6a2b6c55658d71c92f8387e837078931448578e.out
cat worker-a6f8df020b9e0b8fc5a68f840f0bd25c31894fb4.err
cat worker-a6f8df020b9e0b8fc5a68f840f0bd25c31894fb4.out
cat worker-aa0917bc5c1273f210ee448d283184cf3ac7eda8.err
cat worker-aa0917bc5c1273f210ee448d283184cf3ac7eda8.out
cat worker-b572acc29082dd0bc2396a6fc5ea1e3ef3b72161.err
cat worker-b572acc29082dd0bc2396a6fc5ea1e3ef3b72161.out
cat worker-b67dde277d3c11099ac3aa33d846e4556638291c.err
cat worker-b67dde277d3c11099ac3aa33d846e4556638291c.out
cat worker-b683fe9a04935e5fe478b94f3faa922a44c471af.err
cat worker-b683fe9a04935e5fe478b94f3faa922a44c471af.out
cat worker-b769c2862108c953641e90a5533a767286d6e3ef.err
cat worker-b769c2862108c953641e90a5533a767286d6e3ef.out
cat worker-b925e3eedefd710c30e866038f1a20a087974fcb.err
cat worker-b925e3eedefd710c30e866038f1a20a087974fcb.out
cat worker-c57a5df310c3f9d8445e7a03052058e90c67e275.err
cat worker-c57a5df310c3f9d8445e7a03052058e90c67e275.out
cat worker-c5ef4e65a4ed0730f3b835d53976fcccf46217b2.err
cat worker-c5ef4e65a4ed0730f3b835d53976fcccf46217b2.out
cat worker-c8841e828f3ee383efb10bb0652c9823750352f4.err
cat worker-c8841e828f3ee383efb10bb0652c9823750352f4.out
cat worker-cb77666172979bf4e9e2fce6651e6f2c805b74f8.err
cat worker-cb77666172979bf4e9e2fce6651e6f2c805b74f8.out
cat worker-cf5d483ccd97f9a4efd18f2429f8d9ff3e4896de.err
cat worker-cf5d483ccd97f9a4efd18f2429f8d9ff3e4896de.out
cat worker-d1ef4a67485ef740ba99bc7f4454f4df52123fe3.err
cat worker-d1ef4a67485ef740ba99bc7f4454f4df52123fe3.out
cat worker-d7ade90c8b1dc3e0ed8a0d8d19d30971961a8b0a.err
cat worker-d7ade90c8b1dc3e0ed8a0d8d19d30971961a8b0a.out
cat worker-d80378791e3641e990a163682be02c98722b39c2.err
cat worker-d80378791e3641e990a163682be02c98722b39c2.out
cat worker-de658e84e3280288766fd67b0afbe5a40397f22a.err
cat worker-de658e84e3280288766fd67b0afbe5a40397f22a.out
cat worker-e25cc27447984c64f343f9d4354091c4b8d056c4.err
cat worker-e25cc27447984c64f343f9d4354091c4b8d056c4.out
cat worker-e3785a863b9f471ccbcf37c49678dea2eed2a4a6.err
cat worker-e3785a863b9f471ccbcf37c49678dea2eed2a4a6.out
cat worker-e637f257603bef20375eecd07ba55fb5b2b0834d.err
cat worker-e637f257603bef20375eecd07ba55fb5b2b0834d.out
cat worker-e7ad6d0602fbabdff0a4c86f0cf48c348b5127d1.err
cat worker-e7ad6d0602fbabdff0a4c86f0cf48c348b5127d1.out
cat worker-ea23ac181e9dd8f182a1dc191d881b6079b57e45.err
cat worker-ea23ac181e9dd8f182a1dc191d881b6079b57e45.out
cat worker-ea41e7d1f1beafafe0a2d782d822165505c09ec1.err
cat worker-ea41e7d1f1beafafe0a2d782d822165505c09ec1.out
cat worker-ea90cb69f7e1967ff1eeb4c25b0b10d13999f5f3.err
cat worker-ea90cb69f7e1967ff1eeb4c25b0b10d13999f5f3.out
cat worker-eefab545de60dd79b4918a7aaf8214d9422e7f88.err
cat worker-eefab545de60dd79b4918a7aaf8214d9422e7f88.out
cat worker-f1ddc43002669eedf53ea40558bc949c3c1e7b2e.err
cat worker-f1ddc43002669eedf53ea40558bc949c3c1e7b2e.out
cat worker-f371acc4e46b4b295431c7748ee3db44fc98b0a2.err
cat worker-f371acc4e46b4b295431c7748ee3db44fc98b0a2.out
cat worker-f72a41922ada55153579408f8d5eb39a03419f0f.err
cat worker-f72a41922ada55153579408f8d5eb39a03419f0f.out
cat worker-faa4483294313dfee4727fd0d2a11f8481254f40.err
cat worker-faa4483294313dfee4727fd0d2a11f8481254f40.out

runseelogs

Ray worker pid: 4061730
Ray worker pid: 4061730
Ray worker pid: 4061917
Ray worker pid: 4061917
Ray worker pid: 4061756
Ray worker pid: 4061756
Ray worker pid: 4061898
Ray worker pid: 4061898
Ray worker pid: 4061886
Ray worker pid: 4061886
Ray worker pid: 4061755
Ray worker pid: 4061755
Ray worker pid: 4061763
Ray worker pid: 4061763
Ray worker pid: 4061740
Ray worker pid: 4061740
Ray worker pid: 4061741
Ray worker pid: 4061741
Ray worker pid: 4061832
Ray worker pid: 4061832
Ray worker pid: 4061827
Ray worker pid: 4061827
Ray worker pid: 4061818
Ray worker pid: 4061818
Ray worker pid: 4061808
Ray worker pid: 4061808
Ray worker pid: 4061760
Ray worker pid: 4061760
Ray worker pid: 4061930
Ray worker pid: 4061930
Ray worker pid: 4061738
Ray worker pid: 4061738
Ray worker pid: 4061867
Ray worker pid: 4061867
Ray worker pid: 4061754
Ray worker pid: 4061754
Ray worker pid: 4061747
Ray worker pid: 4061747
Ray worker pid: 4061789
Ray worker pid: 4061789
Ray worker pid: 4061759
Ray worker pid: 4061759
Ray worker pid: 4061737
Ray worker pid: 4061737
Ray worker pid: 4061718
Ray worker pid: 4061718
Ray worker pid: 4061736
Ray worker pid: 4061736
Ray worker pid: 4061716
Ray worker pid: 4061716
Ray worker pid: 4061926
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: No credentials found.  Run "wandb login" to visualize your metrics
wandb: Tracking run with wandb version 0.8.36
wandb: Wandb version 0.12.4 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Run data is saved locally in wandb/run-20211012_045917-r4hynqbc

2021-10-12 04:59:32,363 ERROR worker.py:666 -- Calling ray.init() again after it has already been called.
I1012 04:59:36.770085 4061926 factorized_sampler.py:142] DataTableActor of `cast_info` is ready.
I1012 04:59:39.437844 4061926 factorized_sampler.py:142] DataTableActor of `movie_companies` is ready.
I1012 04:59:43.439739 4061926 factorized_sampler.py:142] DataTableActor of `movie_info` is ready.
I1012 04:59:45.324186 4061926 factorized_sampler.py:142] DataTableActor of `movie_keyword` is ready.
I1012 04:59:48.940280 4061926 factorized_sampler.py:142] DataTableActor of `title` is ready.
I1012 04:59:50.276457 4061926 factorized_sampler.py:142] DataTableActor of `movie_info_idx` is ready.
I1012 04:59:50.276656 4061926 data_utils.py:28] Loading cached join count table of `cast_info` from ./cache/job-light-a2be9f04/cast_info.jct
*** Aborted at 1634014790 (unix time) try "date -d @1634014790" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x28) received by PID 4061926 (TID 0x7ff44703d740) from PID 40; stack trace: ***
    @     0x7ff4473ae3c0 (unknown)
    @     0x7ff4473a4fc4 __GI___pthread_mutex_lock
    @     0x7fe43906a068 google::protobuf::internal::OnShutdownRun()
    @     0x7fe43907225f google::protobuf::internal::InitProtobufDefaults()
    @     0x7fe4390724a1 google::protobuf::internal::InitSCCImpl()
    @     0x7fe438ff4fbe protobuf_orc_5fproto_2eproto::InitDefaults()
    @     0x7fe438ff52aa protobuf_orc_5fproto_2eproto::AddDescriptorsImpl()
    @     0x7ff4473ab47f __pthread_once_slow
    @     0x7fe438ff580d protobuf_orc_5fproto_2eproto::AddDescriptors()
    @     0x7ff4473dbb8a (unknown)
    @     0x7ff4473dbc91 (unknown)
    @     0x7ff44730a915 _dl_catch_exception
    @     0x7ff4473e00bf (unknown)
    @     0x7ff44730a8b8 _dl_catch_exception
    @     0x7ff4473df5fa (unknown)
    @     0x7ff4471a234c (unknown)
    @     0x7ff44730a8b8 _dl_catch_exception
    @     0x7ff44730a983 _dl_catch_error
    @     0x7ff4471a2b59 (unknown)
    @     0x7ff4471a23da dlopen
    @     0x56288edb876d _PyImport_FindSharedFuncptr
    @     0x56288eddcc20 _PyImport_LoadDynamicModuleWithSpec
    @     0x56288eddce79 _imp_create_dynamic
    @     0x56288ece4b62 _PyMethodDef_RawFastCallDict
    @     0x56288ece4c81 _PyCFunction_FastCallDict
    @     0x56288ed802ed _PyEval_EvalFrameDefault
    @     0x56288ecc32b9 _PyEval_EvalCodeWithName
    @     0x56288ed13497 _PyFunction_FastCallKeywords
    @     0x56288ed7f229 _PyEval_EvalFrameDefault
    @     0x56288ed1320b _PyFunction_FastCallKeywords
    @     0x56288ed7ae70 _PyEval_EvalFrameDefault
    @     0x56288ed1320b _PyFunction_FastCallKeywords
wandb: Program ended successfully.
wandb: You can sync this run to the cloud by running: 
wandb: wandb sync wandb/run-20211012_045917-r4hynqbc
Ray worker pid: 4061926
NeuroCard config:
{'__cpu': 1,
 '__gpu': 1,
 '__run': 'test-job-light',
 '_load_samples': None,
 '_save_samples': None,
 'asserts': {'fact_psample_8000_median': 4,
             'fact_psample_8000_p99': 50,
             'train_bits': 80},
 'bs': 2048,
 'checkpoint_every_epoch': False,
 'checkpoint_to_load': None,
 'compute_test_loss': True,
 'constant_lr': None,
 'custom_lr_lambda': None,
 'cwd': '/home/liujw/deepice/code/ce/neurocard-master/neurocard',
 'dataset': 'imdb',
 'direct_io': True,
 'disable_learnable_unk': False,
 'dropout': 1,
 'embed_size': 32,
 'embs_tied': True,
 'epochs': 1,
 'epochs_per_iteration': 1,
 'eval_join_sampling': None,
 'eval_psamples': [8000],
 'factorize': True,
 'factorize_blacklist': None,
 'factorize_fanouts': False,
 'fc_hiddens': 128,
 'fixed_dropout_ratio': False,
 'force_query_cols': None,
 'grouped_dropout': True,
 'input_encoding': 'embed',
 'input_no_emb_if_leq': False,
 'join_clauses': None,
 'join_how': 'outer',
 'join_keys': {'cast_info': ['movie_id'],
               'movie_companies': ['movie_id'],
               'movie_info': ['movie_id'],
               'movie_info_idx': ['movie_id'],
               'movie_keyword': ['movie_id'],
               'title': ['id']},
 'join_name': 'job-light',
 'join_root': 'title',
 'join_tables': ['cast_info',
                 'movie_companies',
                 'movie_info',
                 'movie_keyword',
                 'title',
                 'movie_info_idx'],
 'label_smoothing': 0,
 'layers': 4,
 'loader_workers': 4,
 'lr_scheduler': 'OneCycleLR-0.28',
 'max_steps': 500,
 'num_dmol': 0,
 'num_eval_queries_at_checkpoint_load': 2000,
 'num_eval_queries_at_end': 70,
 'num_eval_queries_per_iteration': 70,
 'num_orderings': 1,
 'optimizer': 'adam',
 'order': None,
 'order_content_only': True,
 'order_indicators_at_front': False,
 'order_seed': None,
 'output_encoding': 'embed',
 'per_row_dropout': False,
 'queries_csv': './queries/job-light.csv',
 'query_filters': [5, 12],
 'residual': True,
 'resmade_drop_prob': 0.1,
 'sampler': 'factorized_sampler',
 'sampler_batch_size': 4096,
 'save_checkpoint_at_end': False,
 'seed': 0,
 'special_order_seed': 0,
 'special_orders': 0,
 'table_dropout': True,
 'transformer_args': {},
 'use_cols': 'simple',
 'use_data_parallel': False,
 'use_transformer': False,
 'warmups': 0.05,
 'word_size_bits': 11}
Training on Join(['cast_info', 'movie_companies', 'movie_info', 'movie_keyword', 'title', 'movie_info_idx'])
Loading cast_info
Loaded parsed Table from ./datasets/job/cast_info.movie_id-role_id.table
cast_info([Column(movie_id, distribution_size=2331601), Column(role_id, distribution_size=11)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36244344 entries, 0 to 36244343
Data columns (total 2 columns):
 #   Column    Dtype
---  ------    -----
 0   movie_id  int64
 1   role_id   int64
dtypes: int64(2)
memory usage: 553.0 MB
Loading movie_companies
Loaded parsed Table from ./datasets/job/movie_companies.company_id-company_type_id-movie_id.table
movie_companies([Column(company_id, distribution_size=234997), Column(company_type_id, distribution_size=2), Column(movie_id, distribution_size=1087236)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2609129 entries, 0 to 2609128
Data columns (total 3 columns):
 #   Column           Dtype
---  ------           -----
 0   company_id       int64
 1   company_type_id  int64
 2   movie_id         int64
dtypes: int64(3)
memory usage: 59.7 MB
Loading movie_info
Loaded parsed Table from ./datasets/job/movie_info.movie_id-info_type_id.table
movie_info([Column(movie_id, distribution_size=2468825), Column(info_type_id, distribution_size=71)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835720 entries, 0 to 14835719
Data columns (total 2 columns):
 #   Column        Dtype
---  ------        -----
 0   movie_id      int64
 1   info_type_id  int64
dtypes: int64(2)
memory usage: 226.4 MB
Loading movie_keyword
Loaded parsed Table from ./datasets/job/movie_keyword.movie_id-keyword_id.table
movie_keyword([Column(movie_id, distribution_size=476794), Column(keyword_id, distribution_size=134170)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4523930 entries, 0 to 4523929
Data columns (total 2 columns):
 #   Column      Dtype
---  ------      -----
 0   movie_id    int64
 1   keyword_id  int64
dtypes: int64(2)
memory usage: 69.0 MB
Loading title
Loaded parsed Table from ./datasets/job/title.id-kind_id-production_year.table
title([Column(id, distribution_size=2528312), Column(kind_id, distribution_size=7), Column(production_year, distribution_size=133)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2528312 entries, 0 to 2528311
Data columns (total 3 columns):
 #   Column           Dtype  
---  ------           -----  
 0   id               int64  
 1   kind_id          int64  
 2   production_year  float64
dtypes: float64(1), int64(2)
memory usage: 57.9 MB
Loading movie_info_idx
Loaded parsed Table from ./datasets/job/movie_info_idx.info_type_id-movie_id.table
movie_info_idx([Column(info_type_id, distribution_size=5), Column(movie_id, distribution_size=459925)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1380035 entries, 0 to 1380034
Data columns (total 2 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   info_type_id  1380035 non-null  int64
 1   movie_id      1380035 non-null  int64
dtypes: int64(2)
memory usage: 21.1 MB
Full outer join specified, inserting np.nan to all column domains
Ray worker pid: 4061926
Ray worker pid: 4061926
Ray worker pid: 4061757
Ray worker pid: 4061757
Ray worker pid: 4061911
Ray worker pid: 4061911
Ray worker pid: 4061729
Ray worker pid: 4061729
Ray worker pid: 4061795
Ray worker pid: 4061795
Ray worker pid: 4061937
Ray worker pid: 4061937
Ray worker pid: 4061864
Ray worker pid: 4061864
Ray worker pid: 4061748
Ray worker pid: 4061748
Ray worker pid: 4061745
Ray worker pid: 4061745
Ray worker pid: 4061727
Ray worker pid: 4061727
Ray worker pid: 4061897
Ray worker pid: 4061897
Ray worker pid: 4061743
Ray worker pid: 4061743
Ray worker pid: 4061934
Ray worker pid: 4061934
Ray worker pid: 4061761
Ray worker pid: 4061761
Ray worker pid: 4061719
Ray worker pid: 4061719
Ray worker pid: 4061830
Ray worker pid: 4061830
Ray worker pid: 4061723
Ray worker pid: 4061723
Ray worker pid: 4061770
Ray worker pid: 4061770
Ray worker pid: 4061825
Ray worker pid: 4061825
Ray worker pid: 4061921
Ray worker pid: 4061921
Ray worker pid: 4061835
Ray worker pid: 4061835
Ray worker pid: 4061732
Ray worker pid: 4061732
Ray worker pid: 4061746
Ray worker pid: 4061746
Ray worker pid: 4061924
Ray worker pid: 4061924
Ray worker pid: 4061721
Ray worker pid: 4061721
Ray worker pid: 4061742
Ray worker pid: 4061742
Ray worker pid: 4061822
Ray worker pid: 4061822
Ray worker pid: 4061734
Ray worker pid: 4061734
Ray worker pid: 4061733
Ray worker pid: 4061733
Ray worker pid: 4061735
Ray worker pid: 4061735
Ray worker pid: 4061753
Ray worker pid: 4061753
Ray worker pid: 4061762
Ray worker pid: 4061762
Ray worker pid: 4061758
Ray worker pid: 4061758
Ray worker pid: 4061722
Ray worker pid: 4061722
Ray worker pid: 4061807
Ray worker pid: 4061807
Ray worker pid: 4061739
Ray worker pid: 4061739
Ray worker pid: 4061749
Ray worker pid: 4061749
Ray worker pid: 4061892
Ray worker pid: 4061892
Ray worker pid: 4061774
Ray worker pid: 4061774
Ray worker pid: 4061731
Ray worker pid: 4061731
Ray worker pid: 4061744
Ray worker pid: 4061744
Ray worker pid: 4061725
Ray worker pid: 4061725
Ray worker pid: 4061750
Ray worker pid: 4061750
Ray worker pid: 4061720
Ray worker pid: 4061720
Ray worker pid: 4061797
Ray worker pid: 4061797
Ray worker pid: 4061717
Ray worker pid: 4061717
Ray worker pid: 4061764
Ray worker pid: 4061764
Ray worker pid: 4061728
Ray worker pid: 4061728
Ray worker pid: 4061912
Ray worker pid: 4061912
Ray worker pid: 4061851
Ray worker pid: 4061851
Ray worker pid: 4061752
Ray worker pid: 4061752
Ray worker pid: 4061920
Ray worker pid: 4061920
Ray worker pid: 4061765
Ray worker pid: 4061765
Ray worker pid: 4061804
Ray worker pid: 4061804
Ray worker pid: 4061767
Ray worker pid: 4061767

As for the rustlib issue: my rustc version is 1.55.0 and cargo is 1.55.0. It seems that I have no rustlib, and when I run bash build.sh it does not succeed. The error reported is as follows:

error: failed to parse manifest at `Cargo.toml`

Caused by:
  can't find library `rustlib`, rename file to `src/lib.rs` or specify lib.path

I'm new to this pre-packaged library and cannot find useful information about it on the internet. Where can I get rustlib? Maybe you can give me some clue that would help me solve this problem? 😃
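For context, this Cargo error generally means the manifest declares a library target whose source file Cargo cannot locate. A hypothetical minimal layout that avoids it (names below are illustrative, not NeuroCard's actual manifest):

```toml
# Illustrative Cargo.toml: Cargo looks for the library source at src/lib.rs
# by default; if the file lives elsewhere, lib.path must point at it.
[package]
name = "rustlib"        # hypothetical crate name
version = "0.1.0"
edition = "2018"

[lib]
path = "src/lib.rs"     # the default; set explicitly when the layout differs
```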

franklsf95 commented 2 years ago

The rustlib source code is in neurocard/neurocard/factorized_sampler_lib/pyext-rustlib/. Are you using the Nightly build of Rust? (See instructions here https://github.com/neurocard/neurocard/tree/master/neurocard/factorized_sampler_lib/pyext-rustlib)

I'm not sure what the cause could be here. Maybe you ran out of memory? How big is your machine RAM?

Doris404 commented 2 years ago

As for the memory: that is not the issue. As for Rust: it is difficult for me to install a Nightly build of Rust on my server (it cannot reach the host). Maybe there are other ways to set up the same environment for running NeuroCard; for example, could you build a Docker image that contains the environment?

franklsf95 commented 2 years ago

@Doris404 Installing Nightly Rust should be easy and does not require building Rust (https://rust-lang.github.io/rustup/concepts/channels.html). I can try to make a Docker image in the future, but probably not any time soon.
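For reference, installing nightly through rustup does not compile Rust from source; it downloads a prebuilt toolchain. A typical sequence (assuming rustup itself is already installed, per the channels doc linked above; the cd path is the rustlib location mentioned earlier in this thread):

```
# Install the prebuilt nightly toolchain.
rustup toolchain install nightly

# Pin nightly for the rustlib directory only, leaving the system default alone.
cd neurocard/neurocard/factorized_sampler_lib/pyext-rustlib
rustup override set nightly

# Verify: this should report a -nightly build.
rustc --version
```

If the server cannot reach the rustup hosts, the toolchain can also be downloaded on another machine and installed offline, though that is more involved.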

concretevitamin commented 2 years ago

@Doris404 what OS are you on? Hacked together a non-optimized, basic Dockerfile - can you try it out?

# Example usage (call from project root dir):
#   docker build -t neurocard-test .
#   docker run -it --rm --runtime=nvidia neurocard-test

FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime

RUN apt -y update --fix-missing && \
    apt -y install tree wget vim less build-essential python-setuptools python-dev && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

# Install NeuroCard dependencies.
RUN pip install \
      numpy==1.18.4 \
      pandas==1.0.5 \
      absl-py==0.9.0 \
      glog==0.3.1 \
      networkx==2.4 \
      ray[tune]==0.8.7 \
      tabulate==0.8.7 \
      scipy==1.4.1 \
      yapf==0.27.0 \
      mako==1.1.3 \
      pyspark==2.4.3 \
      wandb==0.8.36 \
      psycopg2 \
      pyarrow

WORKDIR /app
COPY . .
RUN cd neurocard && bash scripts/download_imdb.sh
CMD cd neurocard && bash

I've tested this on x86_64-unknown-linux-gnu.

Note that this doesn't rebuild the Rust lib. To check whether the Rust lib is the issue, you can set this option to fair_sampler.
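If it helps, switching samplers is just a config change. A hedged sketch of what such an override might look like (the dict shape and the 'sampler' key name are illustrative, not NeuroCard's exact experiments.py schema; check the experiment configs in your checkout):

```python
# Illustrative experiment-config override; the 'sampler' key name is an
# assumption -- check the actual config dict in neurocard's experiments.py.
BASE_CONFIG = {
    'dataset': 'imdb',
    'sampler': 'factorized_sampler',  # default path: uses the compiled Rust lib
}

# Fall back to the pure-Python sampler to rule out the Rust extension.
test_config = dict(BASE_CONFIG, sampler='fair_sampler')
print(test_config['sampler'])  # -> fair_sampler
```

If the run succeeds with fair_sampler but fails with factorized_sampler, the Rust extension is the likely culprit.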

Doris404 commented 2 years ago

I managed to install Rust on my Mac. Now I am stuck at the build step, where a "feature has been removed" error is reported. From searching online, it seems Rust has removed some features in recent updates. The Rust version on my Mac is rustc 1.58.0-nightly, which may differ from the required one. What Rust version does the environment require? (screenshot: build error, 2021-10-24)

concretevitamin commented 2 years ago

Applying this patch should make it compile. Tested with cargo 1.57.0-nightly (7fbbf4e8f 2021-10-19). It bumps up dependencies' versions and fixes some "unsafe" compilation errors.

BTW, we do not recommend running NeuroCard experiments on Mac or non-GPU machines.

Doris404 commented 2 years ago

This time I moved the environment to an x86 machine with a GPU and installed rustlib.so. When I run python run.py --run test-job-light, it still fails. The machine has 60 GB of memory.

(screenshots: error output, 2021-11-03)

concretevitamin commented 2 years ago

Check that wandb version matches the one specified in the environment yaml.
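One quick way to compare an installed version against a pin (the Dockerfile above pins wandb==0.8.36; the helper below is a generic sketch, not project code):

```python
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

def matches_pinned(pkg: str, pinned: str) -> bool:
    """Return True iff `pkg` is installed and its version equals `pinned`."""
    try:
        return version(pkg) == pinned
    except PackageNotFoundError:
        return False

# Example: check the wandb pin used in this thread.
print(matches_pinned("wandb", "0.8.36"))
```

Older wandb releases changed their logging API, so a version mismatch here can surface as an opaque trial failure under Ray Tune.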

Doris404 commented 2 years ago

I built the environment according to environment.yml. The versions match.

Doris404 commented 2 years ago

It was my fault; I fixed the problem in the end. Thanks a lot!

yuting-weng commented 2 years ago

Hello, could you tell us how you solved the problem?

WeChat098 commented 1 year ago

Hello, may I ask how you solved this problem?