pangeo-forge / pangeo-forge-cloud-federation

Infrastructure for running pangeo-forge across multiple bakeries
Apache License 2.0

Using yamlencode() to define flink-conf.yaml might be broken #13

Open thodson-usgs opened 7 months ago

thodson-usgs commented 7 months ago

I'm starting to suspect that the output of yamlencode() written to flink-conf.yaml is not being parsed correctly. When I look at the logs, values set by yamlencode() appear as "number": "12345", whereas other values appear as number: 12345.

In fact, both forms can occur in the same deployment, which makes me think they aren't equivalent.

yuvipanda commented 7 months ago

Numbers are the tricky bit, but usually yamlencode just passes these through - so "12345" gets encoded to "12345" while 12345 gets encoded to 12345. Do you have an example config to show?

The other option is to switch to jsonencode, because json is a subset of YAML and anything that accepts yaml should accept JSON (with the exception of the presence of hard tabs). This is actually generally what I do, because JSON isn't space significant while YAML is.
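
To make the pass-through point concrete, here is a small Python analogy using the standard-library json module. It is not Terraform itself, just a sketch of the same idea: a number is emitted bare, a string is emitted quoted, and keys are always quoted. The key name is borrowed from the logs in this thread; the port values are only illustrative.

```python
import json

# The same setting written once as a number and once as a string.
as_number = {"kubernetes.operator.metrics.reporter.prom.port": 9999}
as_string = {"kubernetes.operator.metrics.reporter.prom.port": "9999"}

encoded_number = json.dumps(as_number)
encoded_string = json.dumps(as_string)

print(encoded_number)
print(encoded_string)

# A real JSON (or YAML) parser strips the quotes again and keeps the types,
# so both forms round-trip cleanly.
assert json.loads(encoded_number) == as_number
assert json.loads(encoded_string) == as_string
```

The quoting is harmless as long as whatever reads the file is a real YAML or JSON parser. It only becomes a problem if the reader treats the quote characters as part of the key or value.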

thodson-usgs commented 7 months ago

I believe I've verified this while testing restart strategies, by viewing the job manager config in Flink's dashboard. If I pass the config through terraform, I get "restart-strategy.type": "exponential-delay" and "restart-strategy.exponential-delay.max-backoff": "20 min", and the restart fails.

Whereas if I pass the config through the recipe's config.py, I get restart-strategy.type: exponential-delay and restart-strategy.exponential-delay.max-backoff: 20 min, and the restart works.

If I pass both, both appear in the config, which may also indicate they are not equivalent.
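
One way both entries could coexist is if the file is read line by line rather than by a full YAML parser. The Python sketch below models that with a hypothetical reader (parse_line_based is a made-up name, and this is an assumption about the mechanism, not Flink's actual code): it splits each line on the first ": " and keeps both sides verbatim, quotes included.

```python
def parse_line_based(text):
    """Split each line on the first ': ' and keep both sides verbatim.

    Assumed, simplified model of a line-oriented config reader;
    it is NOT taken from Flink's source.
    """
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ": " not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(": ", 1)
        config[key.strip()] = value.strip()
    return config


quoted = '"restart-strategy.type": "exponential-delay"'
plain = "restart-strategy.type: exponential-delay"

# Both lines together: the quoted key is a different string, so two entries.
both = parse_line_based(quoted + "\n" + plain)
for key, value in both.items():
    print(repr(key), "->", repr(value))

# Quoted line alone: looking up the real option name finds nothing.
only_quoted = parse_line_based(quoted)
print(only_quoted.get("restart-strategy.type"))
```

Under this model the quoted key never matches the real option name, which would explain why the terraform-supplied restart strategy is ignored while the config.py one takes effect.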

I'll report more examples as I come across them, but I'm feeling confident that our terraform-flink-config is broken!

thodson-usgs commented 7 months ago

Also, in the job manager logs, all the quoted parameters appear in red. It's unclear what red indicates, but I would guess an error (color not shown here).

INFO  [] - Loading configuration property: blob.server.port, 6124
INFO  [] - Loading configuration property: kubernetes.jobmanager.annotations, flinkdeployment.flink.apache.org/generation:2
INFO  [] - Loading configuration property: kubernetes.jobmanager.replicas, 1
INFO  [] - Loading configuration property: "kubernetes.operator.metrics.reporter.prom.port", "9999"
INFO  [] - Loading configuration property: taskmanager.memory.task.off-heap.size, 256m
INFO  [] - Loading configuration property: jobmanager.rpc.address, gh-2dhytest-2dfeedstocks-2dgpcp-2dfrom-408fbc.default
INFO  [] - Loading configuration property: kubernetes.taskmanager.cpu, 1.0
INFO  [] - Loading configuration property: "prometheus.io/port", "9999"
INFO  [] - Loading configuration property: kubernetes.service-account, flink
INFO  [] - Loading configuration property: kubernetes.cluster-id, gh-2dhytest-2dfeedstocks-2dgpcp-2dfrom-408fbc
INFO  [] - Loading configuration property: "restart-strategy.exponential-delay.initial-backoff", "10 s"
INFO  [] - Loading configuration property: kubernetes.internal.taskmanager.replicas, 5
INFO  [] - Loading configuration property: taskmanager.memory.flink.size, 1536m
INFO  [] - Loading configuration property: kubernetes.container.image, flink:1.16
INFO  [] - Loading configuration property: parallelism.default, 1
INFO  [] - Loading configuration property: kubernetes.namespace, default
INFO  [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
INFO  [] - Loading configuration property: taskmanager.memory.jvm-overhead.max, 1024m
INFO  [] - Loading configuration property: kubernetes.rest-service.exposed.type, ClusterIP
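
Since the color is lost in the paste, a quick way to pick out the affected properties is to scan the log for keys wrapped in double quotes. A minimal Python sketch, assuming the "Loading configuration property: key, value" format shown above (find_quoted_properties is a hypothetical helper, and the sample is a subset of the log lines above):

```python
import re

# Matches the "Loading configuration property: key, value" lines.
PATTERN = re.compile(
    r"Loading configuration property: (?P<key>[^,]+), (?P<value>.*)$"
)


def find_quoted_properties(log_text):
    """Return (key, value) pairs whose key is wrapped in double quotes."""
    hits = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue
        key, value = match.group("key"), match.group("value")
        if key.startswith('"') and key.endswith('"'):
            hits.append((key, value))
    return hits


sample = """\
INFO  [] - Loading configuration property: blob.server.port, 6124
INFO  [] - Loading configuration property: "prometheus.io/port", "9999"
INFO  [] - Loading configuration property: "restart-strategy.exponential-delay.initial-backoff", "10 s"
INFO  [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
"""

for key, value in find_quoted_properties(sample):
    print(key, value)
```

Run against the full log, this should list exactly the properties that came through yamlencode().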

thodson-usgs commented 7 months ago

6 may close this. (...maybe not: I did another pull and I see this branch moved back to yamlencode.)