Closed zmerlynn closed 6 years ago
cc @aronchick
@zmerlynn nice write up! WRT the separate glusterfs example, I agree and I like the suggestion of just moving to a single example that uses a ReadWriteMany PV claim. We only created the glusterfs example so that the community would have an example of how to connect the spark example with the volume plugins, but using a ReadWriteMany PV claim achieves the same goals in a more widely applicable manner.
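A minimal sketch of such a claim (the name and size here are made up; any `ReadWriteMany`-capable backend such as NFS or GlusterFS could satisfy it):

```yaml
# Hypothetical PVC requesting shared read-write storage. The example
# pods would then mount this claim instead of a gluster-specific volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```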
With respect to injecting configuration, we've taken the following approach for other images that need config that goes beyond something a user could provide via a few environment variables:
Use the source-to-image project (https://github.com/openshift/source-to-image) to author a "builder" image which takes configuration as source and produces a customized image. The user can then run that image. In addition, at startup time we copy the configuration from the image to a VOLUME path and configure the framework to use that as the configuration location, thus allowing the configuration to be edited dynamically by the user, assuming the framework tolerates such things.
You can see an example of the changes involved in this PR which adds s2i functionality to our existing jenkins image: https://github.com/openshift/jenkins/pull/36
There's a lot going on in that PR beyond the s2i enablement, but the main thing you need to do to implement the s2i builder spec is to provide an assemble script (which knows how to consume the "source" being provided by the user and put it where it belongs inside the image) and a run script (which knows how to run the framework when the image is started).
Users can then build customized images with their config injected by running "s2i build file:///some/dir/with/config kubernetes/spark:latest myorg/mycustomsparkimage"
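To make the assemble step concrete, here is a toy sketch of what an assemble script does, using temp directories in place of the real image paths (all paths and the config key here are hypothetical, not taken from the actual Spark image):

```shell
# Toy sketch of an s2i-style "assemble" step.
SPARK_CONF_DIR="$(mktemp -d)"   # stands in for the image's conf dir
S2I_SOURCE_DIR="$(mktemp -d)"   # where s2i would place the user's "source"

# Pretend the user supplied a config override as "source":
echo "spark.executor.memory 2g" > "$S2I_SOURCE_DIR/spark-defaults.conf"

# assemble: overlay the user's config onto the image defaults,
# letting user-provided keys win.
cp -R "$S2I_SOURCE_DIR/." "$SPARK_CONF_DIR/"

# The run script would later point Spark at $SPARK_CONF_DIR.
cat "$SPARK_CONF_DIR/spark-defaults.conf"
```

In the real flow, `s2i build` invokes the image's assemble script against the user's source directory and commits the result as a new image, which is what the command above produces.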
@bparees That project looks really neat. One of my biggest concerns about doing something like that, though, is that it's pretty heavy to build/push a new image. It may be naive, but I would really like to be able to ship composable units that can be configured so an end-user doesn't have to re-build the container.
The place I definitely see source-to-image working is actually the package manager workflow, since it seems to simplify that workflow fairly well as well.
Building new/custom images is definitely a trade-off. The cost is out-of-the-box time/effort; the payoff is you end up with an immutable image you can move between environments without risking losing configuration details. At a minimum, maybe it's something to consider enabling even if it's not the only way to allow config injection (i.e. allow both a lightweight runtime way to inject config, and provide s2i enablement for users that want a reproducible way to construct custom images with config baked in).
@bparees: Agreed. And it loops back on the config parameterization, because you (obviously) need to override the entire set of image names in the resources for Spark if you end up re-baking the configs as you suggest. I think the next level of something like source-to-image is one that's a little more k8s-aware: if we had a concept of "package source" where the docker images were built with source-to-image and the tagged image names stuffed into the `.yaml`, I'd actually be a little less ambivalent and more positive on the side of just rebuild/push.
@bparees: (And to be fair, that could be done with a minor amount of, say, Jinja templating right now if someone wanted to do a one-off solution until we could settle on something better. So I might be letting perfect get in the way of good enough. I'll think about it s'more.)
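As a sketch of that one-off approach, here is the idea using Python's stdlib `string.Template` as a lightweight stand-in for Jinja (the resource snippet and image name are made up for illustration):

```python
from string import Template

# A resource template with the image name parameterized instead of hardcoded.
rc_template = Template("""\
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-master
spec:
  template:
    spec:
      containers:
      - name: spark-master
        image: $spark_image
""")

# After an s2i (or docker) build+push, stuff the freshly tagged image
# name into the resource before kubectl create/apply.
rendered = rc_template.substitute(spark_image="myorg/mycustomsparkimage:v2")
print(rendered)
```

The point is just that one substitution step closes the loop between "rebuilt image" and "resources that reference it", which is the part the raw rebuild/push flow leaves manual.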
> if we had a concept of "package source" where the docker images were built with source-to-image and the tagged image names stuffed into the `.yaml`
There's definitely a local iteration flow that could be amped up. Build from a repo, push, test. Rapid rebuild of one of the images is the most painful part of a lot of these flows.
> There's definitely a local iteration flow that could be amped up. Build from a repo, push, test. Rapid rebuild of one of the images is the most painful part of a lot of these flows.
Yeah, and I suspect in a lot of cases you want to build/test in a private project and publish in a real project. At least, larger organizations probably want this option available (`google-containers`, I presume OpenShift).
There's also a sub-bullet here I didn't blow out yet: we need a separate repo for "official" images, because we probably need to start doing "central" builds on them that actually hook into the main build. E.g. we have a number of Makefiles that actually push to `gcr.io/google-containers`, but pretty much no way to send a PR for the image alone, get that approved, and have a central thing do source-to-image post-approval. So right now there's actually an awkward dance with any image changes for "official" apps where you put the PR up and either push prior to LGTM (thwarting `:latest`, which is fine, it's considered harmful), or race the submit queue.
(We can also discuss how to get images directed to other places besides `gcr.io/google-containers`, too.)
cc @kubernetes/sig-big-data
Hostname discussion: https://github.com/kubernetes/kubernetes/issues/260#issuecomment-133318903
cc @viglesiasce
What are the plans for this? Kafka, Spark, and Elastic are on my radar, besides Cassandra.
/cc @mattf @willb
@erictune @zmerlynn was this in plan/scope for v1.4? Should this be bumped out to v1.5?
IMHO we should close, as this should now be federated to the other repos. It no longer applies to master.
@timothysc Good call, push them to charts.
I'd like to get to the point where spark-submit and spark-shell know how to talk to your active Kubernetes cluster (not standalone mode), at which point you don't need an example or a Chart.
At any rate, it is not clear yet what repo or repos will host the spark-on-kubernetes code. This issue has a useful backlog for whoever does that work. So I am inclined to keep it open until someone can move these things to another backlog.
@foxish
Working on a detailed proposal for running spark natively. I'll be detailing our steps shortly in a new issue, but this rubric is helpful in evaluating what we need.
I also ran into an issue with Spark on Kubernetes with the executors having wrong IP Addresses which might be relevant here. https://github.com/kubernetes-incubator/application-images/issues/10
This is being upstreamed in Spark 2.3. The discussions can move to the Spark JIRA (https://issues.apache.org/jira/browse/SPARK-18278) following that.
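For context, the Spark 2.3 work lets `spark-submit` target a cluster directly via a `k8s://` master URL; the invocation looks roughly like this (the API server address and image name are placeholders you'd fill in for your cluster):

```shell
spark-submit \
  --master k8s://https://<apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

This is exactly the "no examples or Chart needed" direction discussed above: the driver and executors are created as pods by spark-submit itself.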
I have #16320/#16498 out to bring Spark up to existing standards for our examples. This bug is both a friction log of things I ran into while working on that PR, and a set of anticipated issues with Spark.
I organized this into two sections, Spark-specific-ish and Kubernetes-general. Some of these bullets have open issues and I'm documenting them as part of the vertical slice involved.
Please feel free to correct me if I messed up something obvious about Spark or Kubernetes. I was a Spark user in a bygone era, but it's moved quite quickly since then, so any expertise has rusted.
Spark specific

- … `examples/`, since it works over multiple versions, and isn't a teaching example.
- … `spark-defaults.conf`: we should probably just splat the existing configuration together, or compose them to allow the additional config to override KVs in the base image. See #6477, #4822.
- … `ReadWriteMany`, and instead of assuming gluster, we rewrite the example as either gluster or NFS and use a PV claim. But that's not the only example of a resource that needs to be parameterized / templatized in some fashion.
- … `SPARK_LOCAL_DIRS`, so it's only useful for certain workloads, namely in-memory-only workloads. This used to be the bread-and-butter of Spark because it was all it could do, but it proved rather limiting not to spill. However, using network or distributed storage for Spark intermediate storage is somewhat naive, but is the most convenient way on our system to configure a ReplicationController for storage. But for the short term, given that the replication controller itself is `<N>` wide, for anyone on GCE we could add a script to this directory to provision a set of `<N>` volumes and use a claim to resolve it.
- … `.persist()` and shuffle-spill, both of which are losable. The pod case is different than e.g. Netflix's Chaos Monkey with Spark tests, though, because the pod comes back with a different name.
- … `FILESYSTEM` mode (easiest) or (b) in multi-master backed by ZooKeeper (hardest) in order to withstand a restart with a running application. Similarly, the driver pod itself isn't protected at all right now, but before we protect that, we just want to look at the driver mode, how we want to support Spark app submission in the cluster, and the Zeppelin bullet below.

Kubernetes general
- … `spark-master` pod to a `spark-master` single-pod replication controller, the Spark master objected to the slaves because the master started with a hostname that didn't match the service name the slaves were contacting it on. In this case, the slaves connected to the DNS name `spark-master`, and `spark-master` saw messages for `spark-master` and said "nope, that ain't me, I'm `spark-master-a1b2d3`, you must have me confused for someone else". (c.f. #386)
- … `spark-master`, which resulted in `SPARK_MASTER_PORT`, which is an environment variable that the Spark `start-master.sh` script will happily pick up. Unfortunately, the script was expecting a single integer, not `tcp://<ip>:7070`. It would be nice if there was a way to disable service env variable injection. See #1768 (which seems to cover that possibility, maybe).
- … `.yaml`s, etc. Our best practices seem to suggest using hardcoded image tags (which makes sense), and I probably would've scripted it further if I had to go much longer. The process itself before presenting a PR is interesting, because prior to a PR you basically end up "hiding" the entire thing on a private project, then you end up displaying one "bump" to the world. Do we have any issues discussing how to iterate on "the package" (the set of `.yaml`, Dockerfiles, etc.)? As we get anywhere near packaging, that's going to be one killer feature: the ability to rapidly iterate on actually developing package blobs for k8s (the resources and images both).

cc @davidopp @timstclair @timothysc @wattsteve
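To illustrate the `SPARK_MASTER_PORT` collision described above: Kubernetes injects a docker-links-style env var for a Service named `spark-master`, and one shell workaround is to strip the prefix before the Spark scripts see it (the IP and port below are simulated values, not real injected ones):

```shell
# Simulate what Kubernetes would inject for a Service named "spark-master":
SPARK_MASTER_PORT="tcp://10.0.0.17:7077"

# start-master.sh expects a bare integer, so strip everything up to the
# last ":" using shell parameter expansion before Spark reads it:
SPARK_MASTER_PORT="${SPARK_MASTER_PORT##*:}"
echo "$SPARK_MASTER_PORT"   # prints just the port number
```

This is only a band-aid; the cleaner fixes are renaming the service so it doesn't collide with Spark's variable namespace, or the injection opt-out discussed in #1768.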