intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

(raylet) socket.gaierror: [Errno -2] Name or service not known #8

Open xunaichao opened 2 years ago

xunaichao commented 2 years ago

When I run the TensorFlow 2 example from https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html, I get the following error:

############ Error:
(raylet) Traceback (most recent call last):
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 334, in <module>
(raylet)     raise e
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 323, in <module>
(raylet)     loop.run_until_complete(agent.run())
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/asyncio/base_events.py", line 568, in run_until_complete
(raylet)     return future.result()
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 138, in run
(raylet)     modules = self._load_modules()
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
(raylet)     c = cls(self)
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
(raylet)     self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/metrics_agent.py", line 76, in __init__
(raylet)     namespace="ray", port=metrics_export_port)))
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
(raylet)     options=option, gatherer=option.registry, collector=collector)
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 266, in __init__
(raylet)     self.serve_http()
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 321, in serve_http
(raylet)     port=self.options.port, addr=str(self.options.address))
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet)     infos = socket.getaddrinfo(address, port)
(raylet)   File "/usr/local/miniconda3/envs/zoo/lib/python3.7/socket.py", line 753, in getaddrinfo
(raylet)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
##############

Hosts file image
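The last frame of the traceback is socket.getaddrinfo(address, port), so the dashboard agent is failing a name lookup rather than anything specific to Ray. The traceback does not show which address was passed, so the names probed below (the container's hostname and localhost) are assumptions, and can_resolve is a hypothetical helper that is not part of Ray or prometheus_client. A minimal sketch for reproducing the lookup inside the container:

```python
import socket


def can_resolve(host, port=0):
    """Return True if getaddrinfo resolves `host`, False if it raises gaierror."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Assumed candidates: the traceback does not say which name failed.
    for name in ("127.0.0.1", "localhost", socket.gethostname()):
        status = "ok" if can_resolve(name) else "FAILED"
        print(f"{name}: {status}")
```

If the container's hostname is reported as FAILED, the hosts file is the first place to look, since that is where a name without a DNS entry would normally be mapped.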

After running the example, session files are generated in /tmp/ray/ of the system image

Runtime environment: a Docker deployment, with Miniconda used to install Analytics Zoo and Ray:

conda create -n zoo python=3.7
conda activate zoo
pip install --pre --upgrade analytics-zoo
pip install analytics-zoo[ray]
pip install tensorflow==2.3.0

conda list

Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.0.0 pypi_0 pypi
aiohttp 3.7.0 pypi_0 pypi
aiohttp-cors 0.7.0 pypi_0 pypi
aioredis 1.1.0 pypi_0 pypi
analytics-zoo 0.12.0b2022052501 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 3.0.1 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
bigdl 0.13.1.dev1 pypi_0 pypi
blessings 1.7 pypi_0 pypi
ca-certificates 2022.4.26 h06a4308_0
cachetools 5.1.0 pypi_0 pypi
certifi 2022.5.18.1 py37h06a4308_0
chardet 3.0.4 pypi_0 pypi
charset-normalizer 2.0.12 pypi_0 pypi
click 8.1.3 pypi_0 pypi
colorama 0.4.4 pypi_0 pypi
colorful 0.5.4 pypi_0 pypi
conda-pack 0.3.1 pypi_0 pypi
deprecated 1.2.13 pypi_0 pypi
filelock 3.7.0 pypi_0 pypi
gast 0.3.3 pypi_0 pypi
google-api-core 2.8.0 pypi_0 pypi
google-auth 2.6.6 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
googleapis-common-protos 1.56.1 pypi_0 pypi
gpustat 0.6.0 pypi_0 pypi
grpcio 1.46.3 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
hiredis 1.1.0 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.11.4 pypi_0 pypi
importlib-resources 5.7.1 pypi_0 pypi
jsonschema 4.5.1 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
libedit 3.1.20210910 h7f8727e_0
libffi 3.2.1 hf484d3e_1007
libgcc-ng 11.2.0 h1234567_0
libgomp 11.2.0 h1234567_0
libstdcxx-ng 11.2.0 h1234567_0
markdown 3.3.7 pypi_0 pypi
msgpack 1.0.3 pypi_0 pypi
multidict 6.0.2 pypi_0 pypi
ncurses 6.3 h7f8727e_2
numpy 1.18.5 pypi_0 pypi
nvidia-ml-py3 7.352.0 pypi_0 pypi
oauthlib 3.2.0 pypi_0 pypi
opencensus 0.9.0 pypi_0 pypi
opencensus-context 0.1.2 pypi_0 pypi
opencv-python 4.5.5.64 pypi_0 pypi
openssl 1.0.2u h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pip 21.2.2 py37h06a4308_0
prometheus-client 0.14.1 pypi_0 pypi
protobuf 3.20.1 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
py-spy 0.3.12 pypi_0 pypi
py4j 0.10.7 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.18.1 pypi_0 pypi
pyspark 2.4.6 pypi_0 pypi
python 3.7.0 h6e4f718_3
pyyaml 6.0 pypi_0 pypi
ray 1.2.0 pypi_0 pypi
readline 7.0 h7b6447c_5
redis 4.1.4 pypi_0 pypi
requests 2.27.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rsa 4.8 pypi_0 pypi
scipy 1.4.1 pypi_0 pypi
setproctitle 1.2.3 pypi_0 pypi
setuptools 61.2.0 py37h06a4308_0
six 1.16.0 pypi_0 pypi
sqlite 3.33.0 h62c20be_0
tensorboard 2.9.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tensorflow 2.3.0 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.11 h1ccaba5_1
typing-extensions 4.2.0 pypi_0 pypi
urllib3 1.26.9 pypi_0 pypi
werkzeug 2.1.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
wrapt 1.14.1 pypi_0 pypi
xz 5.2.5 h7f8727e_1
yarl 1.7.2 pypi_0 pypi
zipp 3.8.0 pypi_0 pypi
zlib 1.2.12 h7f8727e_2

--------------------

1. Check Python:

from zoo.util.utils import detect_python_location
detect_python_location()

image

2. Check the Ray installation:

/usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --head --include-dashboard ture --dashboard-host 172.27.0.2 --port 35413 --redis-password 123456 --num-cpus 1

image

/usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --address 172.27.0.2:35413 --redis-password 123456 --num-cpus 1

image

ray start --address='172.27.0.2:35413' --redis-password='0'

image

Related documents.zip

xunaichao commented 2 years ago

Please help solve it. Thank you

I'm going crazy

jason-dai commented 2 years ago

@xunaichao As mentioned in https://github.com/intel-analytics/analytics-zoo/blob/master/README.md, we have migrated the project to https://github.com/intel-analytics/bigdl; please try https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html instead.

hkvision commented 2 years ago

Hi @xunaichao

I checked the code and ran it on Google Colab, and I get this error as well. However, the error does not seem to interrupt the run: you can find the training and evaluation results in your log. The error seems to come from the Ray dashboard; I am not sure whether it is caused by the outdated Ray version.

As mentioned above, we highly recommend switching to the latest version of BigDL. I ran the same example with BigDL in Google Colab and there was no such error: https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html

xunaichao commented 2 years ago

@jason-dai @hkvision Thanks for your response. I have followed the instructions you gave: https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html. I am now running my yolov3.py and get an exception.

Run logs:

2022-06-01 10:01:10.069315: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-06-01 10:01:10.074183: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-01 10:01:10.074198: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Initializing orca context
Current pyspark location is : /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is: --driver-class-path /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.1.0-20220314.094552-2.jar:/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.0.0-jar-with-dependencies.jar:/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.0.0-jar-with-dependencies.jar pyspark-shell
2022-06-01 10:01:13 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-06-01 10:01:14,896 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,898 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,899 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,899 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-06-01 10:01:14 [Thread-4] INFO Engine$:121 - Auto detect executor number and executor cores number
22-06-01 10:01:14 [Thread-4] INFO Engine$:123 - Executor number is 1 and executor cores number is 4

User settings:

KMP_AFFINITY=granularity=fine,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=1

Effective settings:

KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=416
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_MWAIT_HINTS=0
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=8M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=104
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_USE_YIELD=1
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='1'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=8M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='noverbose,warnings,respect,granularity=fine,compact,1,0'

22-06-01 10:01:15 [Thread-4] INFO ThreadPool$:95 - Set mkl threads to 1 on thread 30
2022-06-01 10:01:15 WARN SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-06-01 10:01:15 [Thread-4] INFO Engine$:446 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Successfully got a SparkContext
2022-06-01 10:01:18,220 INFO services.py:1340 -- View the Ray dashboard at http://172.27.0.2:8265
2022-06-01 10:01:18,225 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
{'node_ip_address': '172.27.0.2', 'raylet_ip_address': '172.27.0.2', 'redis_address': '172.27.0.2:15812', 'object_store_address': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868/sockets/raylet', 'webui_url': '172.27.0.2:8265', 'session_dir': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868', 'metrics_export_port': 47074, 'node_id': 'a6dd76c71c04c32df5e009bc951165e1b0e85486a8a75d23fb5ab9ed'}
(Worker pid=1704437) 2022-06-01 10:01:19.629608: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
(Worker pid=1704437) 2022-06-01 10:01:19.634737: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/cv2/../../lib64:
(Worker pid=1704437) 2022-06-01 10:01:19.634753: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=1704437) WARNING:tensorflow:From /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/tf_runner.py:317: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
(Worker pid=1704437) Instructions for updating:
(Worker pid=1704437) use distribute.MultiWorkerMirroredStrategy instead
(Worker pid=1704437) 2022-06-01 10:01:21.270040: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/cv2/../../lib64:
(Worker pid=1704437) 2022-06-01 10:01:21.270095: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=1704437) 2022-06-01 10:01:21.270135: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (816d2073a24f): /proc/driver/nvidia/version does not exist
(Worker pid=1704437) 2022-06-01 10:01:21.271364: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
(Worker pid=1704437) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=1704437) 2022-06-01 10:01:21.297690: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 172.27.0.2:53169}
(Worker pid=1704437) 2022-06-01 10:01:21.297883: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 172.27.0.2:53169}
(Worker pid=1704437) 2022-06-01 10:01:21.299556: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://172.27.0.2:53169
(raylet) /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/dashboard/agent.py:152: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet)   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
(raylet) /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/dashboard/agent.py:152: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet)   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
Traceback (most recent call last):
  File "yolov3.py", line 656, in <module>
    main()
  File "yolov3.py", line 643, in main
    trainer = Estimator.from_keras(model_creator=model_creator)
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/estimator.py", line 69, in from_keras
    cpu_binding=cpu_binding)
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/ray_estimator.py", line 96, in __init__
    for i, worker in enumerate(self.remote_workers)])
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(Worker pid=1704437) 2022-06-01 10:01:27.086318: W tensorflow/core/util/tensor_slice_reader.cc:96] Could not open ./yolov3/yolov3.weights: DATA_LOSS: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
    return func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::Worker.setup_distributed() (pid=1704437, ip=172.27.0.2, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7faab3e7fcd0>)
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/tf_runner.py", line 321, in setup_distributed
    self.model = self.model_creator(self.config)
  File "yolov3.py", line 571, in model_creator
    model_pretrained.load_weights(options.weights)
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/h5py/_hl/files.py", line 170, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (file signature not found)
Stopping orca context

The code I used is attached here: yolov3.py.zip

conda list:

Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.0.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiohttp-cors 0.7.0 pypi_0 pypi
aioredis 1.3.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
anyio 3.6.1 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 4.0.1 pypi_0 pypi
asynctest 0.13.0 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
bigdl 2.1.0b202205302 pypi_0 pypi
bigdl-chronos 2.1.0b202205302 pypi_0 pypi
bigdl-core 2.1.0b20220321 pypi_0 pypi
bigdl-dllib 2.1.0b202205302 pypi_0 pypi
bigdl-friesian 2.1.0b202205302 pypi_0 pypi
bigdl-math 0.14.0.dev1 pypi_0 pypi
bigdl-nano 2.1.0b202205302 pypi_0 pypi
bigdl-orca 2.1.0b202205302 pypi_0 pypi
bigdl-serving 2.1.0b202205302 pypi_0 pypi
bigdl-tf 0.14.0.dev1 pypi_0 pypi
blessed 1.19.1 pypi_0 pypi
ca-certificates 2022.4.26 h06a4308_0
cachetools 5.2.0 pypi_0 pypi
certifi 2022.5.18.1 py37h06a4308_0
chardet 3.0.4 pypi_0 pypi
charset-normalizer 2.0.12 pypi_0 pypi
click 8.1.3 pypi_0 pypi
cloudpickle 2.1.0 pypi_0 pypi
colorful 0.5.4 pypi_0 pypi
conda-pack 0.3.1 pypi_0 pypi
deprecated 1.2.13 pypi_0 pypi
filelock 3.7.1 pypi_0 pypi
flatbuffers 1.12 pypi_0 pypi
frozenlist 1.3.0 pypi_0 pypi
fsspec 2022.5.0 pypi_0 pypi
future 0.18.2 pypi_0 pypi
gast 0.4.0 pypi_0 pypi
google-api-core 2.8.1 pypi_0 pypi
google-auth 2.6.6 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
googleapis-common-protos 1.56.2 pypi_0 pypi
gpustat 1.0.0b1 pypi_0 pypi
grpcio 1.46.3 pypi_0 pypi
h11 0.12.0 pypi_0 pypi
h5py 3.7.0 pypi_0 pypi
hiredis 2.0.0 pypi_0 pypi
httpcore 0.13.7 pypi_0 pypi
httpx 1.0.0b0 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.11.4 pypi_0 pypi
importlib-resources 5.7.1 pypi_0 pypi
intel-openmp 2022.1.0 pypi_0 pypi
joblib 1.1.0 pypi_0 pypi
jsonschema 4.5.1 pypi_0 pypi
kafka-python 2.0.2 pypi_0 pypi
keras 2.9.0 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libclang 14.0.1 pypi_0 pypi
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_0
libgomp 11.2.0 h1234567_0
libstdcxx-ng 11.2.0 h1234567_0
markdown 3.3.7 pypi_0 pypi
msgpack 1.0.3 pypi_0 pypi
multidict 4.7.6 pypi_0 pypi
ncurses 6.3 h7f8727e_2
numpy 1.21.6 pypi_0 pypi
nvidia-ml-py3 7.352.0 pypi_0 pypi
oauthlib 3.2.0 pypi_0 pypi
onnx 1.11.0 pypi_0 pypi
onnxruntime 1.11.1 pypi_0 pypi
opencensus 0.9.0 pypi_0 pypi
opencensus-context 0.1.2 pypi_0 pypi
opencv-python 4.5.5.64 pypi_0 pypi
opencv-python-headless 4.5.5.64 pypi_0 pypi
opencv-transforms 0.0.6 pypi_0 pypi
openssl 1.1.1o h7f8727e_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pandas 1.2.5 pypi_0 pypi
patsy 0.5.2 pypi_0 pypi
pillow 9.1.1 pypi_0 pypi
pip 21.2.2 py37h06a4308_0
prometheus-client 0.14.1 pypi_0 pypi
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
py-spy 0.3.12 pypi_0 pypi
py4j 0.10.7 pypi_0 pypi
pyarrow 8.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pydeprecate 0.3.1 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.18.1 pypi_0 pypi
pyspark 2.4.6 pypi_0 pypi
python 3.7.13 h12debd9_0
python-dateutil 2.8.2 pypi_0 pypi
pytorch-lightning 1.4.2 pypi_0 pypi
pyturbojpeg 1.6.6 pypi_0 pypi
pytz 2022.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 23.0.0 pypi_0 pypi
ray 1.9.2 pypi_0 pypi
readline 8.1.2 h7f8727e_1
redis 4.1.4 pypi_0 pypi
requests 2.27.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
rsa 4.8 pypi_0 pypi
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
setproctitle 1.2.3 pypi_0 pypi
setuptools 61.2.0 py37h06a4308_0
six 1.16.0 pypi_0 pypi
smart-open 6.0.0 pypi_0 pypi
sniffio 1.2.0 pypi_0 pypi
sqlite 3.38.3 hc218d9a_0
statsmodels 0.13.2 pypi_0 pypi
tensorboard 2.9.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tensorflow 2.9.1 pypi_0 pypi
tensorflow-estimator 2.9.0 pypi_0 pypi
tensorflow-io-gcs-filesystem 0.26.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.11 h1ccaba5_1
torch 1.9.0 pypi_0 pypi
torchmetrics 0.7.2 pypi_0 pypi
torchvision 0.10.0 pypi_0 pypi
tqdm 4.64.0 pypi_0 pypi
typing-extensions 4.2.0 pypi_0 pypi
urllib3 1.26.9 pypi_0 pypi
wcwidth 0.2.5 pypi_0 pypi
werkzeug 2.1.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
wrapt 1.14.1 pypi_0 pypi
xz 5.2.5 h7f8727e_1
yarl 1.7.2 pypi_0 pypi
zipp 3.8.0 pypi_0 pypi
zlib 1.2.12 h7f8727e_2

Thank you for your help!

shanyu-sys commented 2 years ago

It seems you are trying to load weights in the wrong format:

./yolov3/yolov3.weights: DATA_LOSS: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

You may need to convert the pre-trained Darknet weights first, as is done in the YOLOv3 example.

You can always refer to our YOLOv3 example in BigDL. Hope that helps.
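The log above contains two complaints about the same file: TensorFlow's checkpoint reader reports "not an sstable (bad magic number)" and h5py reports "file signature not found". Both are format checks, so a quick look at the first bytes of the file can confirm the mismatch before any conversion is attempted. The following is a minimal sketch using only the standard library; looks_like_hdf5 is a hypothetical helper, and it only checks for a signature at offset 0, which is an assumption that holds for files without an HDF5 user block. If h5py is installed, h5py.is_hdf5(path) performs a more complete check.

```python
import os
import tempfile

# The 8-byte signature that starts an HDF5 file with no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` begins with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


if __name__ == "__main__":
    # Demonstrate on two throwaway files rather than on real weights.
    with tempfile.TemporaryDirectory() as tmp:
        fake_h5 = os.path.join(tmp, "weights.h5")
        fake_raw = os.path.join(tmp, "weights.bin")
        with open(fake_h5, "wb") as f:
            f.write(HDF5_SIGNATURE + b"\x00" * 16)
        with open(fake_raw, "wb") as f:
            f.write(b"\x00" * 24)
        print(fake_h5, looks_like_hdf5(fake_h5))
        print(fake_raw, looks_like_hdf5(fake_raw))
```

Running the same check on ./yolov3/yolov3.weights would show whether the file is in a format h5py can open at all.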

shanyu-sys commented 2 years ago

May I ask whether you hit the same error with your plain TensorFlow code (without BigDL), i.e. in your tflocal mode?

xunaichao commented 2 years ago

We used the example at https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html to save the model, and changed the save module to:

4061654161657_ pic

We now get the .pb file successfully, but hit an exception when using the OpenVINO Model Optimizer to convert the model to IR format. The error looks like this:

Model Optimizer arguments: Common parameters:

Make sure that --input_model_is_text is provided for a model in text format. By default, a model is interpreted in binary format.
Framework error details: Error parsing message.
For more information please refer to Model Optimizer FAQ, question #43. (https://docs.openvinotoolkit.org/latest/openvino_docs_MO_DG_prepare_model_Model_Optimizer_FAQ.html?question=43#question-43)

Can you help us? Thank you very much! @yushan111 Thank you for the example, it helps a lot!

shanyu-sys commented 2 years ago

You can get a tf.keras model with est.get_model(), and then save it with the tf.saved_model API.

After that, how you use your TensorFlow model is up to you.

As for using OpenVINO to convert your TensorFlow model, you could open an issue in the OpenVINO project.
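One thing that may be worth checking before opening that issue: tf.saved_model.save writes a directory containing saved_model.pb (plus a variables/ subdirectory), not a single frozen-graph file, and Model Optimizer has separate --input_model and --saved_model_dir options for the two cases. The thread does not show the arguments that were passed, so whether this is related to the "Error parsing message" above is only a guess. The following is a minimal sketch, with classify_tf_export as a hypothetical helper, for telling the two kinds of artifact apart:

```python
import os
import tempfile


def classify_tf_export(path):
    """Classify `path` as 'saved_model_dir', 'saved_model_pb', 'pb_file' or 'unknown'."""
    if os.path.isdir(path):
        # A SavedModel is a directory with saved_model.pb at its top level.
        if os.path.isfile(os.path.join(path, "saved_model.pb")):
            return "saved_model_dir"
        return "unknown"
    if os.path.isfile(path) and path.endswith(".pb"):
        # saved_model.pb is the SavedModel protobuf, not a frozen GraphDef.
        if os.path.basename(path) == "saved_model.pb":
            return "saved_model_pb"
        return "pb_file"
    return "unknown"


if __name__ == "__main__":
    # Build a fake SavedModel layout in a temporary directory for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        export_dir = os.path.join(tmp, "export")
        os.makedirs(os.path.join(export_dir, "variables"))
        pb_path = os.path.join(export_dir, "saved_model.pb")
        open(pb_path, "wb").close()
        print(classify_tf_export(export_dir))
        print(classify_tf_export(pb_path))
```

A result of saved_model_pb would suggest pointing the converter at the enclosing directory rather than at the .pb file itself.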

xunaichao commented 2 years ago

@yushan111 thanks for your help