stenwt opened this issue 5 years ago
Can you check the hive metastore, hive server, and presto pod logs to see if there are any errors from them? You may also want to increase metering logging to debug by setting spec.reporting-operator.spec.config.logLevel to debug in your Metering CR.
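For reference, a minimal sketch of what that setting would look like in the Metering CR. The field path is taken from the comment above; the apiVersion, kind, and name are assumed to match the user's config shown later in this thread:

```yaml
apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  reporting-operator:
    spec:
      config:
        # raise reporting-operator logging from the default to debug
        logLevel: "debug"
```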
If I'm reading this right, reporting-operator is trying to connect to the service presto:8080, and that connection fails. If I rsh into the reporting-operator pod and curl presto:8080, I get an HTTP 301 response, which sounds like it's working. The failure may have been transient while the presto pod came up.
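That kind of reachability check can also be scripted rather than eyeballed with curl. The sketch below only does a raw TCP connect (no HTTP), and the `presto`/`8080` endpoint in the comment is an example target, not something this snippet hard-codes:

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and unresolvable hosts
        return False


# Run from inside the reporting-operator pod, this would probe the presto
# service the operator is complaining about, e.g. check_tcp("presto", 8080).
```

A `True` here only proves the service/endpoint plumbing is up; a 301 or other HTTP-level response still requires a real client like curl.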
There are several failures in the hive-server-0 log as it tries and fails to connect to hive-metastore, but those messages stop, likely once the metastore pod comes up, and the last message looks like a successful connection.
I note this in the hive-metastore-0 log:
19/02/04 15:14:12 [pool-7-thread-7]: WARN metastore.HiveMetaStore: Location: hdfs://hdfs-namenode-proxy:9820/operator_metering/storage/metering_health_check specified for non-external table:operator_health_check
That seems related to the health check that's failing from the reporting-operator.
It also looks like the metastore is successfully connected to postgres:
19/02/04 15:19:49 [pool-7-thread-25]: INFO metastore.ObjectStore: ObjectStore, initialize called
19/02/04 15:19:49 [pool-7-thread-25]: INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is POSTGRES
19/02/04 15:19:49 [pool-7-thread-25]: INFO metastore.ObjectStore: Initialized ObjectStore
This makes it sound like an issue with my config, but if that's the case, I'm not sure how to correct it.
$ oc get pods
NAME                                  READY   STATUS    RESTARTS   AGE
hdfs-datanode-0                       1/1     Running   0          30m
hdfs-namenode-0                       1/1     Running   0          30m
hive-metastore-0                      1/1     Running   0          30m
hive-server-0                         1/1     Running   0          30m
metering-operator-558d78b4cf-7jq29    2/2     Running   0          31m
postgresql-95-centos7-2-zjmsb         1/1     Running   0          1h
presto-coordinator-66b7fc44fc-xh9mg   1/1     Running   0          30m
reporting-operator-655889659d-979mn   0/1     Running   3          30m
$ oc get services
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
glusterfs-dynamic-hdfs-datanode-data-hdfs-datanode-0   ClusterIP   172.31.171.154   <none>        1/TCP                 59m
glusterfs-dynamic-hdfs-namenode-data-hdfs-namenode-0   ClusterIP   172.31.237.198   <none>        1/TCP                 59m
glusterfs-dynamic-hive-metastore-db-data               ClusterIP   172.31.57.27     <none>        1/TCP                 59m
glusterfs-dynamic-pvc-t8q8l                            ClusterIP   172.31.154.179   <none>        1/TCP                 1h
hdfs-datanode                                          ClusterIP   None             <none>        9866/TCP              31m
hdfs-datanode-web                                      ClusterIP   172.31.103.5     <none>        9864/TCP              31m
hdfs-namenode                                          ClusterIP   None             <none>        9820/TCP              31m
hdfs-namenode-proxy                                    ClusterIP   172.31.233.164   <none>        9820/TCP              31m
hdfs-namenode-web                                      ClusterIP   172.31.108.176   <none>        9870/TCP              31m
hive-metastore                                         ClusterIP   172.31.215.221   <none>        9083/TCP              31m
hive-server                                            ClusterIP   172.31.3.254     <none>        10000/TCP,10002/TCP   31m
postgresql-95-centos7                                  ClusterIP   172.31.231.134   <none>        5432/TCP              1h
presto                                                 ClusterIP   172.31.115.137   <none>        8080/TCP              31m
presto-metrics                                         ClusterIP   172.31.1.156     <none>        8082/TCP              31m
reporting-operator                                     ClusterIP   172.31.196.87    <none>        8080/TCP              31m
reporting-operator-metrics                             ClusterIP   172.31.7.50      <none>        8082/TCP              31m
in case that helps line up pods, ports, IPs, and services.
According to the logs, this is hitting a known issue where hive-server gets a "No route to host" error when talking to hive-metastore. Try deleting the hive-server-0 pod; it will automatically be re-created, and that typically allows it to connect. We're still investigating the underlying cause.
I deleted the hive-server-0 pod, and after a couple of restarts it came back, but it looks like it's still having issues talking to the metastore:
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO thrift.ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Created local directory: /tmp/hadoop/0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/0991ec47-aa95-47b0-9384-a12b51a3b555/_tmp_space.db
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.HiveSessionImpl: Operation log session directory is created: /tmp/hadoop/operation_logs/0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO service.CompositeService: Session opened, SessionHandle [0991ec47-aa95-47b0-9384-a12b51a3b555], current sessions:85
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO conf.HiveConf: Using the default value passed in for log id: 0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Updating thread name to 0991ec47-aa95-47b0-9384-a12b51a3b555 HiveServer2-Handler-Pool: Thread-456
19/02/05 14:11:49 [0991ec47-aa95-47b0-9384-a12b51a3b555 HiveServer2-Handler-Pool: Thread-456]: INFO operation.OperationManager: Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=ec67f250-4fa9-47e5-9d34-008e5104f81e]
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO thrift.ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Created local directory: /tmp/hadoop/ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/ea92f71c-52d6-4287-a0ba-f1f80b939d0a/_tmp_space.db
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.HiveSessionImpl: Operation log session directory is created: /tmp/hadoop/operation_logs/ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO service.CompositeService: Session opened, SessionHandle [ea92f71c-52d6-4287-a0ba-f1f80b939d0a], current sessions:86
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO conf.HiveConf: Using the default value passed in for log id: ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Updating thread name to ea92f71c-52d6-4287-a0ba-f1f80b939d0a HiveServer2-Handler-Pool: Thread-459
19/02/05 14:19:50 [ea92f71c-52d6-4287-a0ba-f1f80b939d0a HiveServer2-Handler-Pool: Thread-459]: INFO operation.OperationManager: Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=8793e0b9-5a38-40f2-bf5f-cfe5fe21a060]
19/02/05 14:21:20 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getTable
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1563) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1550) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1344) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127) ~[hive-exec-2.3.3.jar:2.3.3]
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_191]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_191]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) ~[hive-exec-2.3.3.jar:2.3.3]
at com.sun.proxy.$Proxy37.getTable(Unknown Source) ~[?:?]
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_191]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_191]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2330) ~[hive-exec-2.3.3.jar:2.3.3]
at com.sun.proxy.$Proxy37.getTable(Unknown Source) ~[?:?]
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getTable(BaseSemanticAnalyzer.java:1713) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getTable(BaseSemanticAnalyzer.java:1697) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:12014) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:11020) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11133) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:506) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:280) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:531) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) ~[hive-service-2.3.3.jar:2.3.3]
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[hive-exec-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_191]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_191]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_191]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_191]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_191]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_191]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_191]
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[hive-exec-2.3.3.jar:2.3.3]
... 46 more
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Closed a connection to metastore, current connections: 1
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Opened a connection to metastore, current connections: 2
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Connected to metastore.
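The pattern visible in that log tail (RetryingMetaStoreClient loses the connection, waits 1s, reconnects, and retries the call exactly once) can be sketched generically. The names below are illustrative, not Hive's actual API:

```python
import time


def call_with_reconnect(call, reconnect, retries=1, delay=1.0):
    """Invoke call(); on a connection error, reconnect and retry up to
    `retries` times, sleeping `delay` seconds before each reconnect."""
    attempt = 0
    while True:
        try:
            return call()
        except ConnectionError:
            if attempt >= retries:
                raise  # out of retries; surface the failure to the caller
            attempt += 1
            time.sleep(delay)
            reconnect()  # e.g. close the Thrift transport and reopen it
```

This is why a single read timeout shows up as one WARN followed by a successful "Connected to metastore." line: the failure only propagates if the reconnect attempt also fails.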
The hive-server pod was restarted Mon, 04 Feb 2019 16:16:32 -0600, and those errors are from just now, 5 Feb, 08:21 UTC-6, so these are not transient errors that sort themselves out while the operator brings up services, which is what I would have expected.
So now it's getting read timeouts. The timeout for hive-server talking to the metastore isn't configurable yet, but you can likely increase the amount of resources you give to hive-metastore.
https://github.com/operator-framework/operator-metering/blob/master/Documentation/tuning.md covers increasing resource limits.
I had applied my own tunings:
apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  reporting-operator:
    spec:
      resources:
        requests:
          memory: "300Mi"
          cpu: "50m"
        limits:
          memory: "300Mi"
          cpu: "100m"
  presto:
    spec:
      hive:
        config:
          dbConnectionURL: "jdbc:postgresql://postgresql-95-centos7:5432/metering"
          dbConnectionDriver: "org.postgresql.Driver"
          dbConnectionUsername: "metering"
          dbConnectionPassword: "my-password-string"
      presto:
        coordinator:
          resources:
            requests:
              memory: "2Gi"
              cpu: "2"
            limits:
              memory: "2Gi"
              cpu: "2"
        worker:
          resources:
            requests:
              memory: "2Gi"
              cpu: "2"
            limits:
              memory: "2Gi"
              cpu: "2"
Now I'm adding my postgres config parameters to the recommended resource limits, and setting datanode replicas to 1 since we only have one compute node available in our lab; otherwise I'm using it verbatim:
I tore down my metering project and re-created it with that config. I'm still getting timeouts from hive-metastore: hive-server.log
Is it worth it to embiggen hive-metastore even more?
You're currently not increasing the hive limits at all, only presto's.
I would suggest 1-2GB of memory for the hive metastore to start, and potentially higher if it seems to help. I'm also looking at making the timeout tunable.
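A sketch of what that might look like in the CR. Note that the exact location of the hive resource settings is an assumption here (placed under the same spec.presto.spec.hive path as the dbConnection settings shown earlier in this thread); check the tuning.md document linked above for the actual keys:

```yaml
apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  presto:
    spec:
      hive:
        # assumed location for metastore resource requests/limits;
        # verify the exact field path against tuning.md
        metastore:
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "2Gi"
              cpu: "1"
```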
Sorry, my last comment wasn't clear. I'm now using all the tuning values here:
except hdfs replicas is only 1, not 3, since I don't have the compute nodes needed to run 3.
At 1.5Gi for the hive metastore, I'm still getting the read errors shown in the log above.
As I suggested before, I would try increasing the value further to see if it helps. It could also be a networking issue; the default timeout seems to be 5 minutes, I believe (I'm trying to verify), so it seems strange to be getting a timeout like that between hive-server and hive-metastore. You may also want to increase hive-server memory to see if that helps.
Per the documentation, I am configuring a new install of metering to use a pgsql pod deployed to the same OpenShift project:
I then created this config:
Pointed to it with the METERING_CR_FILE environment variable, and ran hack/openshift-install.sh.
After the install runs, the reporting-operator starts up but never shows ready. The logs for that pod show:
I encountered this with release-0.11, then noticed 0.13 was released, and I replicated the problem with 0.13 after deleting and re-creating my project. The only difference is I don't see the timeout message in 0.13:
If I shell into the postgres pod, I do see that a schema was populated:
So I at least got my DB connection parameters right.