tobegit3hub / simple_tensorflow_serving

Generic and easy-to-use serving service for machine learning models
https://stfs.readthedocs.io
Apache License 2.0

Unable to run on Kubernetes Cluster #42

Open randomthought opened 5 years ago

randomthought commented 5 years ago

First of all, thanks for the great work!

I am having difficulty getting simple_tensorflow_serving to work on a Kubernetes cluster. The problem seems to be related to H2O, but the logs are not descriptive enough for me to pinpoint the cause. The process hangs and keeps repeating the "Connection refused" error shown below.

01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO: H2O started in 2983ms
01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO:
01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO: Open H2O Flow in your web browser: http://10.0.0.41:54321
01-15 20:11:07.287 10.0.0.41:54321       180    main      INFO:
01-15 20:11:09.699 10.0.0.41:54321       180    FJ-126-3  INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]
2019-01-15 20:11:14 INFO     Try to get function from file: ./models/h2o_prostate_model/preprocess_function.marshal
2019-01-15 20:11:14 INFO     Try to get function from file: ./models/h2o_prostate_model/postprocess_function.marshal
2019-01-15 20:11:14 INFO     Try to initialize and connect the h2o server
Checking whether there is an H2O instance running at http://localhost:54321. connected.
Warning: Your H2O cluster version is too old (8 months and 27 days)! Please download and install the latest version from http://h2o.ai/download/
01-15 20:11:14.371 10.0.0.41:54321       180    #28758-13 INFO: POST /4/sessions, parms: {}
01-15 20:11:14.377 10.0.0.41:54321       180    #28758-13 INFO: Locking cloud to new members, because water.api.schemas4.SessionIdV4
01-15 20:11:14.414 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:14.717 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.020 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.322 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.625 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.928 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

tobegit3hub commented 5 years ago

Thanks for reporting.

Have you set up the H2O cluster to run with a single H2O instance? It looks like a network problem, but I'm not sure why it fails to connect to the localhost service.
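
One way to check the single-instance question from the log itself: the line "Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]" in the report above suggests the node joined a second H2O instance at 10.0.0.5, which is also the peer named in the later "Connection refused" lines. The sketch below is only a log-parsing helper, not part of simple_tensorflow_serving or H2O; the function names parse_cloud_line and unexpected_peers are hypothetical, and the regular expression assumes the log line format shown in this thread.

```python
import re

# Matches H2O's cluster-formation log line as it appears in this thread, e.g.
# "... INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]"
CLOUD_RE = re.compile(r"Cloud of size (\d+) formed \[(.*?)\]")


def parse_cloud_line(line):
    """Return (size, members) for a cloud-formation line, or None otherwise."""
    match = CLOUD_RE.search(line)
    if match is None:
        return None
    size = int(match.group(1))
    # Members are printed as "/ip:port"; drop the leading slash.
    members = [m.strip().lstrip("/") for m in match.group(2).split(",") if m.strip()]
    return size, members


def unexpected_peers(log_text, own_address):
    """List members other than own_address in the last cloud-formation line."""
    last = None
    for line in log_text.splitlines():
        parsed = parse_cloud_line(line)
        if parsed is not None:
            last = parsed
    if last is None:
        return []
    return [m for m in last[1] if m != own_address]
```

Calling unexpected_peers(log_text, "10.0.0.41:54321") on the log above would list any other node the instance clustered with. An empty list would mean the node is running alone, in which case the cause lies elsewhere.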

DivyaMereddy007 commented 4 years ago

I ran into the same issue. Error log:

11-19 14:40:18.737 10.237.73.201:54321 18656 #80:54323 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

Below is the config I tried:

conf$spark.executor.instances <- 171
spark.yarn.executor.memoryOverhead <- 2048
conf$spark.executor.memory <- "18g"
conf$spark.executor.cores <- 5

spark.yarn.driver.memoryOverhead <- 39936
conf$spark.driver.memory <- "57.6g"
conf$spark.driver.cores <- 5

conf$'sparklyr.shell.executor-memory' <- "32g"
conf$'sparklyr.shell.driver-memory' <- "32g"
conf$spark.yarn.am.memory <- "32g"
conf$spark.dynamicAllocation.enabled <- "false"
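
For both reports, it may help to check from the affected pod or driver host whether the peer named in the error accepts connections at all. java.net.ConnectException is what Java raises when a connection attempt is refused, so a plain TCP probe is a reasonable first approximation. The sketch below uses only the Python standard library; probe_tcp is a hypothetical helper name, and the host and port passed to it should be taken from your own logs.

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Return 'open', 'refused', or 'unreachable' for one TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError:
        # Covers timeouts, DNS failures, and routing errors.
        return "unreachable"
```

A "refused" result would point to the peer process not listening (or having exited), while "unreachable" would point more toward network policy or routing between the nodes.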