intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

An error occurred while trying to connect to the Java server #912

Closed kaiseu closed 3 years ago

kaiseu commented 4 years ago

When I try to run the object detection Jupyter demo under apps with the latest version, the error below occurs. Can anybody help with this? Thanks!

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:41803)
Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/pyspark/rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/work/spark-2.4.3/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:41803)
Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/pyspark/rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/work/spark-2.4.3/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/work/spark-2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
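The final `ConnectionRefusedError: [Errno 111]` in the log above means that nothing is accepting connections on the Py4J gateway port any more, which usually points at the JVM process having exited rather than at a network problem. A minimal sketch for checking this from Python, using only the standard library; `gateway_is_listening` is a hypothetical helper name, and the port is the one printed in the log (it changes on every run):

```python
import socket


def gateway_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
        return False


if __name__ == "__main__":
    # 41803 is the gateway port from the log above; adjust to your own run.
    print(gateway_is_listening("127.0.0.1", 41803))
```

If this returns False while the notebook kernel is still alive, the driver or executor logs are the place to look for the reason the JVM went away.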

hkvision commented 4 years ago

Hi @kaiseu

This is probably a memory issue. Are you using the video we provide? Could you try increasing the driver memory and executor memory and run it again? Thanks.
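The suggestion above comes down to two standard Spark properties, `spark.driver.memory` and `spark.executor.memory`. A small sketch of how they might be assembled and validated before being handed to Spark; `memory_conf` is a hypothetical helper, and the sizes are example values rather than recommendations from this thread:

```python
import re

# Spark accepts sizes such as "512m" or "8g" for these properties.
# This sketch deliberately accepts only that short form.
_SIZE = re.compile(r"^\d+[kmgt]$", re.IGNORECASE)


def memory_conf(driver: str, executor: str) -> dict:
    """Build Spark properties for driver and executor memory."""
    for value in (driver, executor):
        if not _SIZE.match(value):
            raise ValueError(f"not a Spark memory size: {value!r}")
    return {
        "spark.driver.memory": driver,
        "spark.executor.memory": executor,
    }


if __name__ == "__main__":
    # Example sizes only; choose values that fit the machine.
    for key, value in memory_conf("8g", "8g").items():
        print(f"--conf {key}={value}")
```

The printed pairs can be passed to `spark-submit`, or set on a `SparkConf` with `.set(key, value)`. Note that the Spark documentation warns that in client mode `spark.driver.memory` has to be in place before the driver JVM starts, so it may need to go on the command line or in an environment variable rather than in application code.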

magic20191 commented 4 years ago

I had the same problem with the example from the documentation. Environment: CentOS 7, memory: 4 GB, PyTorch 1.4.0 (CPU), zoo 0.9.0.dev0. PS: this is a virtual machine environment with only one node, and Zoo was installed with pip.

import torch
import torch.nn as nn
from bigdl.optim.optimizer import Adam
from zoo.common.nncontext import *
from zoo.pipeline.api.net.torch_net import TorchNet
from zoo.pipeline.api.net.torch_criterion import TorchCriterion
from zoo.pipeline.nnframes import *
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession


# Two-layer network: 2 inputs -> 4 hidden units -> 1 sigmoid output.
class SimpleTorchModel(nn.Module):
    def __init__(self):
        super(SimpleTorchModel, self).__init__()
        self.dense1 = nn.Linear(2, 4)
        self.dense2 = nn.Linear(4, 1)

    def forward(self, x):
        x = self.dense1(x)
        x = torch.sigmoid(self.dense2(x))
        return x


if __name__ == '__main__':
    sparkConf = init_spark_conf().setAppName("example_pytorch").setMaster('local[1]')
    sc = init_nncontext(sparkConf)
    spark = SparkSession \
        .builder \
        .getOrCreate()

    # Four training rows with a 2-dimensional feature vector and a label.
    df = spark.createDataFrame(
        [(Vectors.dense([2.0, 1.0]), 1.0),
         (Vectors.dense([1.0, 2.0]), 0.0),
         (Vectors.dense([2.0, 1.0]), 1.0),
         (Vectors.dense([1.0, 2.0]), 0.0)],
        ["features", "label"])

    torch_model = SimpleTorchModel()
    torch_criterion = nn.MSELoss()

    # Wrap the PyTorch model and loss for Analytics Zoo.
    az_model = TorchNet.from_pytorch(torch_model, [1, 2])
    az_criterion = TorchCriterion.from_pytorch(torch_criterion, [1, 1], [1, 1])

    classifier = NNClassifier(az_model, az_criterion) \
        .setBatchSize(4) \
        .setOptimMethod(Adam()) \
        .setLearningRate(0.01) \
        .setMaxEpoch(10)

    nnClassifierModel = classifier.fit(df)

    print("After training: ")
    res = nnClassifierModel.transform(df)
    res.show(10, False)

pyspark_submit_args is: --driver-class-path /root/anaconda3/lib/python3.6/site-packages/bigdl/share/lib/bigdl-0.10.0-jar-with-dependencies.jar:/root/anaconda3/lib/python3.6/site-packages/zoo/share/lib/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/anaconda3/lib/python3.6/site-packages/zoo/share/lib/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/anaconda3/lib/python3.6/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2020-08-19 00:01:41 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

User settings:

KMP_AFFINITY=granularity=fine,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=1

Effective settings:

KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=1
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='1'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='noverbose,warnings,respect,granularity=fine,compact,1,0'

cls.getname: com.intel.analytics.bigdl.python.api.Sample
BigDLBasePickler registering: bigdl.util.common Sample
cls.getname: com.intel.analytics.bigdl.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.util.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.python.api.JTensor
BigDLBasePickler registering: bigdl.util.common JTensor
cls.getname: com.intel.analytics.bigdl.python.api.JActivity
BigDLBasePickler registering: bigdl.util.common JActivity
creating: createTorchNet
creating: createTorchCriterion
creating: createSeqToTensor
creating: createScalarToTensor
creating: createFeatureLabelPreprocessing
creating: createNNClassifier
creating: createAdam
TorchNet loading in TorchNet[bed9d76]
loading libgomp-8bba0e50.so.1
loading libc10.so
loading libcaffe2.so
loading libtorch.so.1
loading libpytorch-engine.so
terminate called after throwing an instance of 'c10::Error'
  what(): [enforce fail at inline_container.cc:173] . file not found: model/model.json
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f4bf508b441 in /tmp/dlNativeLoader2178169977227902467libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const, int, char const, std::string const&, void const*) + 0x49 (0x7f4bf508b259 in /tmp/dlNativeLoader2178169977227902467libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::getFileID(std::string const&) + 0x52e (0x7f4bec339e5e in /tmp/dlNativeLoader1803792900704059554libcaffe2.so)
frame #3: caffe2::serialize::PyTorchStreamReader::getRecord(std::string const&) + 0x20 (0x7f4bec33a020 in /tmp/dlNativeLoader1803792900704059554libcaffe2.so)
frame #4: + 0xa7dc03 (0x7f4be9fdac03 in /tmp/dlNativeLoader5113198754580178383libtorch.so.1)
frame #5: torch::jit::load(std::unique_ptr<caffe2::serialize::ReadAdapterInterface, std::default_delete >, c10::optional, std::unordered_map<std::string, std::string, std::hash, std::equal_to, std::allocator<std::pair<std::string const, std::string> > >&) + 0x10d (0x7f4be9fdd0cd in /tmp/dlNativeLoader5113198754580178383libtorch.so.1)
frame #6: torch::jit::load(std::string const&, c10::optional, std::unordered_map<std::string, std::string, std::hash, std::equal_to, std::allocator<std::pair<std::string const, std::string> > >&) + 0x68 (0x7f4be9fdd1f8 in /tmp/dlNativeLoader5113198754580178383libtorch.so.1)
frame #7: Java_com_intel_analytics_zoo_pipeline_api_net_PytorchModel_loadModelNative + 0x9c (0x7f4be923c85b in /tmp/dlNativeLoader4490405095027738473libpytorch-engine.so)
frame #8: [0x7f4c29018667]

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "", line 22, in
  File "/root/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py", line 132, in fit
    return self._fit(dataset)
  File "/root/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 295, in _fit
    java_model = self._fit_java(dataset)
  File "/root/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 292, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/root/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o84.fit
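In the output above, the Py4J errors come after the native message `terminate called after throwing an instance of 'c10::Error'` (with `file not found: model/model.json`), so the Python-side `Answer from Java side is empty` looks like a consequence of the JVM process aborting rather than the first failure. A rough sketch for pulling the earliest such line out of a captured log; `first_fatal_line` is a hypothetical helper, and the marker list is an assumption for illustration, not an exhaustive catalogue:

```python
import re
from typing import Optional

# Messages that typically appear before the Py4J errors when the JVM
# process itself has died. Illustrative only; extend as needed.
FATAL_MARKERS = (
    r"terminate called after throwing",
    r"java\.lang\.OutOfMemoryError",
    r"A fatal error has been detected by the Java Runtime Environment",
)
_FATAL = re.compile("|".join(FATAL_MARKERS))


def first_fatal_line(log_text: str) -> Optional[str]:
    """Return the first log line matching a fatal marker, or None."""
    for line in log_text.splitlines():
        if _FATAL.search(line):
            return line.strip()
    return None


if __name__ == "__main__":
    sample = (
        "loading libpytorch-engine.so\n"
        "terminate called after throwing an instance of 'c10::Error'\n"
        "py4j.protocol.Py4JNetworkError: Answer from Java side is empty\n"
    )
    print(first_fatal_line(sample))
```

Reading the log from the first fatal line onward, instead of from the last Python traceback, tends to show what actually stopped the JVM.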

hkvision commented 4 years ago

You need to increase the memory. See here for how to set a larger memory size: https://analytics-zoo.github.io/master/#PythonUserGuide/run/#run-after-pip-install
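For a pip installation, one knob that Spark's launcher consults when no driver memory is given explicitly is the `SPARK_DRIVER_MEMORY` environment variable; it has to be set before the JVM is started. On a small machine such as the 4 GB VM mentioned earlier, the requested heap also has to fit into physical RAM. A sketch under those assumptions: `parse_size`, `fits_in_ram` and `physical_memory_bytes` are hypothetical helpers, `8g` is only an example value, and reading total RAM through `os.sysconf` is Linux-specific:

```python
import os

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_size(size: str) -> int:
    """Convert a JVM-style size such as '512m' or '8g' to bytes."""
    if len(size) < 2:
        raise ValueError(f"unsupported size: {size!r}")
    number, unit = size[:-1], size[-1].lower()
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported size: {size!r}")
    return int(number) * _UNITS[unit]


def fits_in_ram(requested: str, total_bytes: int) -> bool:
    """True if the requested heap is smaller than physical memory."""
    return parse_size(requested) < total_bytes


def physical_memory_bytes() -> int:
    """Total physical RAM; relies on Linux sysconf names."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    wanted = "8g"  # example value, not a recommendation
    if fits_in_ram(wanted, physical_memory_bytes()):
        # Must happen before init_nncontext() starts the JVM.
        os.environ["SPARK_DRIVER_MEMORY"] = wanted
    else:
        print(f"{wanted} does not fit in physical memory on this machine")
```

If the check fails, the VM itself needs more memory before a larger driver heap can help.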

helenlly commented 4 years ago

@kaiseu Thanks for your question. Please let us know if you have any more questions; otherwise we may go ahead and close this issue.

helenlly commented 3 years ago

@kaiseu We'll close the issue; you can reopen it if needed. Thanks.