radanalyticsio / spark-operator

Operator for managing Spark clusters on Kubernetes and OpenShift.

downloadData leads to CrashLoopBackOff #266

Open · eosantigen opened this issue 5 years ago

eosantigen commented 5 years ago

Description:

I tried to use the following bit in my SparkCluster template. The Spark cluster deploys perfectly without it; however, when I add the downloadData block below, the init containers all fail into a CrashLoopBackOff.

downloadData:
- url: https://bitbucket.org/metiscybertech/configuration-templates/raw/dac5298e5a83cfaa9ac07dc50f56cd255130faa5/core-site.xml
  to: /tmp/

I followed the examples in the repository: the block is added on the final lines, starting in the same column as worker, i.e. indented two spaces from spec:.
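For reference, the placement looks roughly like this (a trimmed sketch of my spec, with everything except worker and downloadData elided):

apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: spark-cluster
spec:
  worker:
    instances: '2'
  downloadData:   # starts in the same column as worker
  - url: https://bitbucket.org/metiscybertech/configuration-templates/raw/dac5298e5a83cfaa9ac07dc50f56cd255130faa5/core-site.xml
    to: /tmp/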

Maybe it's worth mentioning that I also have mavenDependencies in the template.

Thanks.

elmiko commented 5 years ago

thanks for opening this @eosantigen

i can see that the link is good. are you able to grab the logs for the init container?

i'm curious if there is anything suspicious in there. downloadData and mavenDependencies shouldn't interfere with each other.

elmiko commented 5 years ago

one more thing: would it be possible to share the manifest you used to spawn the spark cluster? i could try to repeat your process and see if i can reproduce this bug.

eosantigen commented 5 years ago

Sure, here is the manifest.

apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: spark-cluster
  namespace: sparkop
spec:
  worker:
    instances: '2'
    resources:
      limits:
        memory: 4Gi
      requests:
        memory: 400Mi
  master:
    instances: '1'
    resources:
      limits:
        memory: 4Gi
      requests:
        memory: 400Mi
  sparkWebUI: "true"
  mavenDependencies:
  - org.apache.hadoop:hadoop-azure:2.7.1
  - org.apache.hadoop:hadoop-common:2.7.1
  - org.apache.hadoop:hadoop-client:2.7.1
  - org.apache.hadoop:hadoop-auth:2.7.1
  - org.apache.hadoop:hadoop-hdfs:2.7.1
  env:
  - name: HADOOP_CLASSPATH
    value: /tmp/jars/*
  - name: HADOOP_OPTIONAL_TOOLS
    value: hadoop-azure
  - name: HADOOP_CONF_DIR
    value: /tmp/
  downloadData:
  - url: https://bitbucket.org/metiscybertech/configuration-templates/raw/dac5298e5a83cfaa9ac07dc50f56cd255130faa5/core-site.xml
    to: /tmp/core-site.xml

The logs from the init container named "downloader" are:

wget: note: TLS certificate validation not implemented
wget: TLS error from peer (alert code 40): handshake failure
wget: error getting response: Connection reset by peer

For now, I have found a workaround: passing the directives included in core-site.xml in another way.
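In case it helps anyone: since core-site.xml only carries Hadoop properties, one possible way to avoid the download entirely (not necessarily exactly what I did) is to set the properties through the cluster spec, using the operator's sparkConfiguration field together with Spark's spark.hadoop.* prefix. A rough sketch with a placeholder property:

spec:
  sparkConfiguration:
  # any core-site.xml property can be prefixed with spark.hadoop. so that
  # Spark injects it into the Hadoop configuration; name and value below
  # are placeholders
  - name: spark.hadoop.fs.defaultFS
    value: wasb://mycontainer@myaccount.blob.core.windows.net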

elmiko commented 5 years ago

thanks for sharing this, i'll see if i can replicate the issue.

> For now, I have found a workaround: passing the directives included in core-site.xml in another way.

glad to hear you have a workaround =)

jkremser commented 5 years ago

we should add --no-check-certificate to the wget call and/or not fail the whole cluster deployment when the wget fails (or at least make this behavior configurable)
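for illustration, a per-entry switch on the CR could look something like this (the field name is made up, nothing like it is implemented yet):

downloadData:
- url: https://bitbucket.org/metiscybertech/configuration-templates/raw/dac5298e5a83cfaa9ac07dc50f56cd255130faa5/core-site.xml
  to: /tmp/core-site.xml
  insecure: true   # hypothetical flag that would add --no-check-certificate to the wget call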

elmiko commented 5 years ago

@jkremser that's what i was thinking after seeing the tls error; i just hadn't confirmed it yet.

i think adding a flag for insecure downloads is probably the best solution. there is something similar in the s2i tooling, and i like the idea of making it explicit.