kube-reporting / metering-operator

The Metering Operator is responsible for collecting metrics and other information about what's happening in a Kubernetes cluster, and providing a way to create reports on the collected data.
Apache License 2.0

reporting operator not connecting to hive backed by pgsql #566

Open stenwt opened 5 years ago

stenwt commented 5 years ago

Per the documentation, I am configuring a new install of metering to use a pgsql pod deployed to the same OpenShift project:

$ oc get svc | grep postgres
postgresql-95-centos7                      ClusterIP   172.31.55.232    <none>        5432/TCP              5d

I then created this config:

apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  presto:
    spec:
      hive:
        config:
          dbConnectionURL: "jdbc:postgresql://postgresql-95-centos7:5432/metering"
          dbConnectionDriver: "org.postgresql.Driver"
          dbConnectionUsername: "metering"
          dbConnectionPassword: "my-password-string"

I pointed to it with the METERING_CR_FILE environment variable and ran hack/openshift-install.sh.
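
For reference, that invocation was roughly the following (the CR file path is just a placeholder):

$ export METERING_CR_FILE=/path/to/metering-custom.yaml
$ ./hack/openshift-install.sh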

After the install runs, the reporting-operator starts up but never shows ready. The logs for that pod show:

time="2019-01-16T15:49:45Z" level=info msg="adding ReportGenerationQuery pod-memory-request" app=metering
time="2019-01-16T15:49:45Z" level=info msg="adding ReportGenerationQuery pod-memory-request-raw" app=metering
time="2019-01-16T15:49:45Z" level=info msg="adding ReportGenerationQuery pod-memory-request-vs-node-memory-allocatable" app=metering
time="2019-01-16T15:49:45Z" level=info msg="adding ReportGenerationQuery pod-memory-usage" app=metering
time="2019-01-16T15:49:45Z" level=info msg="adding ReportGenerationQuery pod-memory-usage-raw" app=metering
time="2019-01-16T15:49:46Z" level=info msg="starting HTTP server" app=metering
time="2019-01-16T15:49:46Z" level=info msg="testing ability to write to Presto" app=metering
time="2019-01-16T15:49:46Z" level=info msg="HTTP API server listening with TLS on 127.0.0.1:8080" app=metering
time="2019-01-16T15:50:34Z" level=error msg="cannot create Presto table operator_health_check" app=metering component=testWriteToPresto error="timed out waiting for the condition"
time="2019-01-16T15:51:41Z" level=info msg="\"GET https://10.250.10.3:8080/ready HTTP/2.0\" from 10.250.10.1:56882 - 500 50B in 594.958µs" app=metering component=api
time="2019-01-16T15:52:41Z" level=info msg="\"GET https://10.250.10.3:8080/ready HTTP/2.0\" from 10.250.10.1:57172 - 500 50B in 49.564µs" app=metering component=api
time="2019-01-16T15:53:41Z" level=info msg="\"GET https://10.250.10.3:8080/ready HTTP/2.0\" from 10.250.10.1:57456 - 500 50B in 43.356µs" app=metering component=api
time="2019-01-16T15:54:41Z" level=info msg="\"GET https://10.250.10.3:8080/ready HTTP/2.0\" from 10.250.10.1:57754 - 500 50B in 33.625µs" app=metering component=api

I encountered this with release-0.11, then noticed 0.13 was released, and I reproduced the problem with 0.13 after deleting and re-creating my project. The only difference is that I don't see the timeout message in 0.13:

time="2019-01-16T16:13:06Z" level=info msg="testing ability to write to Presto" app=metering
time="2019-01-16T16:13:06Z" level=info msg="HTTP API server listening with TLS on 127.0.0.1:8080" app=metering
time="2019-01-16T16:14:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:33732 - 500 50B in 373.63µs" app=metering component=api
time="2019-01-16T16:15:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:34030 - 500 50B in 49.415µs" app=metering component=api
time="2019-01-16T16:16:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:34332 - 500 50B in 38.505µs" app=metering component=api
time="2019-01-16T16:17:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:34626 - 500 50B in 40.387µs" app=metering component=api
time="2019-01-16T16:18:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:34922 - 500 50B in 42.805µs" app=metering component=api
time="2019-01-16T16:19:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:35224 - 500 50B in 43.796µs" app=metering component=api
time="2019-01-16T16:20:26Z" level=info msg="\"GET https://10.250.10.25:8080/ready HTTP/2.0\" from 10.250.10.1:35530 - 500 50B in 43.522µs" app=metering component=api
time="2019-01-16T16:20:49Z" level=info msg="got signal terminated, performing shutdown"

If I shell into the postgres pod, I do see that a schema was populated:

postgres=# \c metering
You are now connected to database "metering" as user "postgres".
metering=# \dt
                   List of relations
 Schema |           Name            | Type  |  Owner
--------+---------------------------+-------+----------
 public | BUCKETING_COLS            | table | metering
 public | CDS                       | table | metering
 public | COLUMNS_V2                | table | metering
 public | DATABASE_PARAMS           | table | metering
 [snip]

So I at least got my DB connection parameters right.

chancez commented 5 years ago

Can you check the hive metastore, hive server, and presto pod logs to see if there are any errors from them? You may also want to increase metering logging to debug by setting spec.reporting-operator.spec.config.logLevel to debug in your Metering CR.
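
For example, merged into the existing Metering CR, that setting would look roughly like:

apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  reporting-operator:
    spec:
      config:
        logLevel: "debug"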

stenwt commented 5 years ago

reporting-logs-debug.txt

If I'm reading this right, the reporting-operator is trying to connect to the presto service on port 8080, and that connection fails. If I rsh into the reporting-operator pod and curl presto:8080, I get an HTTP 301 response, which suggests it's working; the earlier failure may have been transient while the presto pod came up.
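
(For reference, that check was roughly the following, substituting the current reporting-operator pod name:)

$ oc rsh <reporting-operator-pod> curl -i presto:8080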

There are several failures in the hive-server-0 log as it tries and fails to connect to hive-metastore, but those messages stop (likely once the metastore pod comes up), and the last message looks like a successful connection.

I note this in the hive-metastore-0 log:

19/02/04 15:14:12 [pool-7-thread-7]: WARN metastore.HiveMetaStore: Location: hdfs://hdfs-namenode-proxy:9820/operator_metering/storage/metering_health_check specified for non-external table:operator_health_check

That seems related to the health-check test that's failing from the reporting-operator.

It also looks like the metastore is successfully connected to postgres:

19/02/04 15:19:49 [pool-7-thread-25]: INFO metastore.ObjectStore: ObjectStore, initialize called
19/02/04 15:19:49 [pool-7-thread-25]: INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is POSTGRES
19/02/04 15:19:49 [pool-7-thread-25]: INFO metastore.ObjectStore: Initialized ObjectStore

This makes it sound like an issue with my config, but if that's the case, I'm not sure how to correct it.

stenwt commented 5 years ago

$ oc get pods
NAME                                  READY     STATUS    RESTARTS   AGE
hdfs-datanode-0                       1/1       Running   0          30m
hdfs-namenode-0                       1/1       Running   0          30m
hive-metastore-0                      1/1       Running   0          30m
hive-server-0                         1/1       Running   0          30m
metering-operator-558d78b4cf-7jq29    2/2       Running   0          31m
postgresql-95-centos7-2-zjmsb         1/1       Running   0          1h
presto-coordinator-66b7fc44fc-xh9mg   1/1       Running   0          30m
reporting-operator-655889659d-979mn   0/1       Running   3          30m
$ oc get services
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
glusterfs-dynamic-hdfs-datanode-data-hdfs-datanode-0   ClusterIP   172.31.171.154   <none>        1/TCP                 59m
glusterfs-dynamic-hdfs-namenode-data-hdfs-namenode-0   ClusterIP   172.31.237.198   <none>        1/TCP                 59m
glusterfs-dynamic-hive-metastore-db-data               ClusterIP   172.31.57.27     <none>        1/TCP                 59m
glusterfs-dynamic-pvc-t8q8l                            ClusterIP   172.31.154.179   <none>        1/TCP                 1h
hdfs-datanode                                          ClusterIP   None             <none>        9866/TCP              31m
hdfs-datanode-web                                      ClusterIP   172.31.103.5     <none>        9864/TCP              31m
hdfs-namenode                                          ClusterIP   None             <none>        9820/TCP              31m
hdfs-namenode-proxy                                    ClusterIP   172.31.233.164   <none>        9820/TCP              31m
hdfs-namenode-web                                      ClusterIP   172.31.108.176   <none>        9870/TCP              31m
hive-metastore                                         ClusterIP   172.31.215.221   <none>        9083/TCP              31m
hive-server                                            ClusterIP   172.31.3.254     <none>        10000/TCP,10002/TCP   31m
postgresql-95-centos7                                  ClusterIP   172.31.231.134   <none>        5432/TCP              1h
presto                                                 ClusterIP   172.31.115.137   <none>        8080/TCP              31m
presto-metrics                                         ClusterIP   172.31.1.156     <none>        8082/TCP              31m
reporting-operator                                     ClusterIP   172.31.196.87    <none>        8080/TCP              31m
reporting-operator-metrics                             ClusterIP   172.31.7.50      <none>        8082/TCP              31m

in case that helps line up pods, ports, IPs, and services.

chancez commented 5 years ago

According to the logs, this is hitting a known issue where hive-server gets a "No route to host" error when talking to hive-metastore. Try deleting the hive-server-0 pod; it will automatically be re-created, and that typically allows it to connect. We're still investigating the underlying cause.
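
That is an ordinary pod delete, e.g.:

$ oc delete pod hive-server-0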

stenwt commented 5 years ago

I deleted the hive-server-0 pod, and after a couple of restarts it came back, but it looks like it's still having issues talking to the metastore:

19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO thrift.ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Created local directory: /tmp/hadoop/0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/0991ec47-aa95-47b0-9384-a12b51a3b555/_tmp_space.db
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.HiveSessionImpl: Operation log session directory is created: /tmp/hadoop/operation_logs/0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO service.CompositeService: Session opened, SessionHandle [0991ec47-aa95-47b0-9384-a12b51a3b555], current sessions:85
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO conf.HiveConf: Using the default value passed in for log id: 0991ec47-aa95-47b0-9384-a12b51a3b555
19/02/05 14:11:49 [HiveServer2-Handler-Pool: Thread-456]: INFO session.SessionState: Updating thread name to 0991ec47-aa95-47b0-9384-a12b51a3b555 HiveServer2-Handler-Pool: Thread-456
19/02/05 14:11:49 [0991ec47-aa95-47b0-9384-a12b51a3b555 HiveServer2-Handler-Pool: Thread-456]: INFO operation.OperationManager: Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=ec67f250-4fa9-47e5-9d34-008e5104f81e]
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO thrift.ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Created local directory: /tmp/hadoop/ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/ea92f71c-52d6-4287-a0ba-f1f80b939d0a/_tmp_space.db
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.HiveSessionImpl: Operation log session directory is created: /tmp/hadoop/operation_logs/ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO service.CompositeService: Session opened, SessionHandle [ea92f71c-52d6-4287-a0ba-f1f80b939d0a], current sessions:86
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO conf.HiveConf: Using the default value passed in for log id: ea92f71c-52d6-4287-a0ba-f1f80b939d0a
19/02/05 14:19:50 [HiveServer2-Handler-Pool: Thread-459]: INFO session.SessionState: Updating thread name to ea92f71c-52d6-4287-a0ba-f1f80b939d0a HiveServer2-Handler-Pool: Thread-459
19/02/05 14:19:50 [ea92f71c-52d6-4287-a0ba-f1f80b939d0a HiveServer2-Handler-Pool: Thread-459]: INFO operation.OperationManager: Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=8793e0b9-5a38-40f2-bf5f-cfe5fe21a060]
19/02/05 14:21:20 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getTable
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1563) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1550) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1344) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127) ~[hive-exec-2.3.3.jar:2.3.3]
        at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_191]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_191]
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) ~[hive-exec-2.3.3.jar:2.3.3]
        at com.sun.proxy.$Proxy37.getTable(Unknown Source) ~[?:?]
        at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_191]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_191]
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2330) ~[hive-exec-2.3.3.jar:2.3.3]
        at com.sun.proxy.$Proxy37.getTable(Unknown Source) ~[?:?]
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getTable(BaseSemanticAnalyzer.java:1713) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getTable(BaseSemanticAnalyzer.java:1697) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:12014) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:11020) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11133) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:506) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:280) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:531) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[hive-exec-2.3.3.jar:2.3.3]
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) ~[hive-service-2.3.3.jar:2.3.3]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[hive-exec-2.3.3.jar:2.3.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_191]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_191]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_191]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_191]
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_191]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_191]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_191]
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[hive-exec-2.3.3.jar:2.3.3]
        ... 46 more
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Closed a connection to metastore, current connections: 1
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Opened a connection to metastore, current connections: 2
19/02/05 14:21:21 [1e6c027d-b0df-4f14-bbe5-c20747588eee HiveServer2-Handler-Pool: Thread-166]: INFO hive.metastore: Connected to metastore.

The hive-server pod was restarted Mon, 04 Feb 2019 16:16:32 -0600, and those errors are from just now, 5 Feb 08:21 UTC-6, so these aren't transient errors that sort themselves out as the operator brings up services, which is what I would have expected.

chancez commented 5 years ago

So now it's getting read timeouts. The timeout for hive-server talking to the metastore isn't configurable yet, but you can likely increase the resources you give to hive-metastore.

https://github.com/operator-framework/operator-metering/blob/master/Documentation/tuning.md covers increasing resource limits.

stenwt commented 5 years ago

I had applied my own tunings:

apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  reporting-operator:
    spec:
      resources:
        requests:
          memory: "300Mi"
          cpu: "50m"
        limits:
          memory: "300Mi"
          cpu: "100m"
  presto:
    spec:
      hive:
        config:
          dbConnectionURL: "jdbc:postgresql://postgresql-95-centos7:5432/metering"
          dbConnectionDriver: "org.postgresql.Driver"
          dbConnectionUsername: "metering"
          dbConnectionPassword: "my-password-string"
      presto:
        coordinator:
          resources:
            requests:
              memory: "2Gi"
              cpu: "2"
            limits:
              memory: "2Gi"
              cpu: "2"
        worker:
          resources:
            requests:
              memory: "2Gi"
              cpu: "2"
            limits:
              memory: "2Gi"
              cpu: "2"

Now I'm adding my postgres config parameters to the recommended-resource-limits config, and setting the datanode replica count to 1 since we only have one compute node available in our lab, but otherwise I'm using it verbatim:

https://raw.githubusercontent.com/operator-framework/operator-metering/master/manifests/metering-config/recommended-resource-limits.yaml
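
The replica override layered on top of that file amounts to the following; the exact key path is my best guess from the example configs, so treat it as approximate:

apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  hdfs:
    spec:
      datanode:
        replicas: 1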

I tore down my metering project and re-created it with that config. I'm still getting timeouts from hive-metastore: hive-server.log

Is it worth it to embiggen hive-metastore even more?

chancez commented 5 years ago

You're currently not increasing hive limits at all, only presto.

chancez commented 5 years ago

I would suggest 1-2GB of memory for the hive metastore to start, potentially higher depending on whether it seems to be helping. I'm also looking at making it possible to tune the timeout.
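
For illustration, that override would look something like the snippet below; the exact key path is an assumption on my part, so check tuning.md and recommended-resource-limits.yaml for the authoritative location in your release:

apiVersion: metering.openshift.io/v1alpha1
kind: Metering
metadata:
  name: "operator-metering"
spec:
  presto:
    spec:
      hive:
        metastore:
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "2Gi"
              cpu: "1"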

stenwt commented 5 years ago

Sorry, my last comment wasn't clear. I'm now using all the tuning values here:

https://raw.githubusercontent.com/operator-framework/operator-metering/master/manifests/metering-config/recommended-resource-limits.yaml

except that hdfs replicas is 1 instead of 3, since I don't have the compute nodes needed to run 3.

At 1.5Gi for hive metastore, I'm still getting the read errors in the log above.

chancez commented 5 years ago

As I suggested before, I would try increasing the value further to see if it helps. It could also be a networking issue: the default timeout seems to be 5 minutes, I believe (still verifying), so it seems strange to be getting a timeout like that between hive-server and hive-metastore. You may also want to increase hive-server memory to see if it helps.