jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Use Mesos as resource manager #564

Open saketanisha opened 5 years ago

saketanisha commented 5 years ago

Feature request: Use Mesos as resource manager

saketanisha commented 5 years ago

I connected with @kevin-bates and @lresende on this topic in the apache/toree Gitter, and it seems this may be a good add-on to the current enterprise_gateway process proxy support.

Over the last week I spent some time understanding the Jupyter Gateway ecosystem and, based on the helpful pointers shared by Luciano and Kevin, have pulled together the set of tasks here.

I see that an official Python Mesos client is not available as of now (pypi / MESOS-946). So as a baby step I went ahead and mashed up the Mesos Python bindings and MesosRestServer.scala into a util client in my fork (commit). pymesos also seems like a choice; I tried it locally but couldn't get it to work out of the box (will try afresh sometime later today).

Before I proceed any further, I'd like input on the best route to plumb a Mesos Python REST client into the Enterprise Gateway codebase. I'm also seeking feedback on the task list.
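Until an official client appears on PyPI, a thin stdlib wrapper over the master's `/tasks` endpoint might be all EG needs. Here is a minimal sketch; the class name is made up, and the endpoint behavior and response shape should be verified against the target Mesos version:

```python
import json
import urllib.parse
import urllib.request


class MesosRestClient:
    """Minimal read-only client for the Mesos master's REST endpoints.

    Illustrative sketch only: endpoint paths and response shapes should
    be verified against the target Mesos version.
    """

    def __init__(self, master_url, timeout=10):
        self.master_url = master_url.rstrip("/")
        self.timeout = timeout

    def _url(self, path, params=None):
        """Build a full request URL, URL-encoding any query parameters."""
        url = self.master_url + path
        if params:
            url += "?" + urllib.parse.urlencode(params)
        return url

    def get_tasks(self, task_id=None, framework_id=None):
        """Fetch tasks, optionally filtered by task and/or framework id."""
        params = {}
        if task_id:
            params["task_id"] = task_id
        if framework_id:
            params["framework_id"] = framework_id
        with urllib.request.urlopen(self._url("/tasks", params),
                                    timeout=self.timeout) as resp:
            return json.load(resp).get("tasks", [])
```

Keeping this a standalone class (rather than inlining HTTP calls in the process proxy) would make it easy to later swap in pymesos or an official client behind the same interface.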

kevin-bates commented 5 years ago

@saketanisha - this is great! Let's use this issue to improve our documentation relative to writing other process proxies.

Your task list looks very complete. However, as you know, we must first investigate how we're going to communicate with the resource manager. I think we should also try to locate an existing or create a new external package that houses the Mesos API functionality we need. I agree that pymesos looks promising as well.

Last week, I spent a couple of hours trying to understand the Mesos REST API and architecture. I still don't quite understand the frameworks/executors/tasks relationships, but I'm hoping you can navigate that space. Here's what we need to be able to do using the API and submission.

  1. Most importantly, we need to be able to set a tag on the spark-submit call that contains the kernel ID. In YARN we have the ability to specify the application name. This is crucial for the next step.
  2. We need the ability to discover the submitted task using the kernel ID (or some other unique identifier). This usually amounts to fetching an entire set of things (executors? frameworks? not sure), locating the desired entity, then picking off the canonical identifier used in subsequent calls. In YARN, this is the application ID. In playing around with Mesos, I noticed there's some kind of submission ID; I'm not sure whether that's useful for the subsequent requests.
  3. Once we have the primary identifier used by Mesos (which should be persisted via the load_process_info stuff), we then need the ability to do the following:
     a. Get status - primarily to determine if it's running or terminated.
     b. Terminate - this is necessary as a fallback for when the normal message-based termination of the kernel does not work.
     c. Obtain the node information on which the kernel landed. This isn't completely necessary, but very helpful from a diagnostic standpoint.
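As a rough shape for step 2, discovery could fetch the full task list and match on a kernel-ID tag embedded in the task name, mirroring how the YARN process proxy matches on application name. The field names below are assumptions about the `/tasks` payload and would need checking against a real master:

```python
def find_task_by_kernel_id(tasks, kernel_id):
    """Locate the task whose name embeds the given kernel id.

    `tasks` is the list returned by the master's /tasks endpoint.
    Embedding the kernel id in the task name is an assumed convention,
    analogous to the application-name tagging used for YARN.
    """
    for task in tasks:
        if kernel_id in task.get("name", ""):
            return task
    return None


# Once found, the fields of interest for subsequent calls would be
# (assumed names):
#   task["id"]            -> canonical task id (status / kill requests)
#   task["framework_id"]  -> owning framework
#   task["slave_id"]      -> agent (node) the kernel landed on
```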

Once these areas of the API are determined, we can stitch them into the process proxy. If you wanted to use pure REST code in the process proxy, that's fine as well, but, yeah, leveraging the work that others have started and/or contributing to those areas is probably preferred.

Thank you for taking this on - it is much appreciated!

kevin-bates commented 5 years ago

@saketanisha - how is this coming? Is there anything you're stuck on? I think we'd like to keep the rest client module out of EG - similar to what we do with the yarn-api-client module - in case that helps.

saketanisha commented 5 years ago

@kevin-bates - I should be offering a PR by 03/31. In the case of YARN we had a standard ResourceManager provided out of the box. For the Mesos route I'm coming up with a custom REST implementation very specific to this use case. I'm still not clear on where to position this custom client (since it's not a fully baked Mesos REST client, I don't feel it's apt to offer it as a PR to the Apache Mesos project).

The API call may look something like the following (a Java version I'm leveraging for another project):

// Query the Mesos master's /tasks endpoint, filtered to a single task.
// httpUrl, okHttpClient, mapper and MEDIA_TYPE_JSON come from the
// enclosing client class.
HttpUrl requestUrl = httpUrl.newBuilder()
    .addPathSegments("tasks")
    .addQueryParameter("task_id", taskId)
    .addQueryParameter("framework_id", frameworkId)
    .build();

Request request = new Request.Builder().url(requestUrl)
    .addHeader("Accept", String.valueOf(MEDIA_TYPE_JSON))
    .addHeader("Content-Type", String.valueOf(MEDIA_TYPE_JSON))
    .build();

// Execute the call and deserialize the JSON response into a Tasks POJO.
try (Response response = okHttpClient.newCall(request).execute()) {
    if (!response.isSuccessful()) {
        throw convertException(response);
    }
    return mapper.readValue(response.body().string(), Tasks.class);
}

This returns the overall status of the submitted task; task.state is the field that reflects the current state (TASK_RUNNING, TASK_FINISHED, or TASK_STARTING).

When we submit a task to Mesos we get back the driverId (a.k.a. taskId in Mesos lingua franca). So, net, we can affirmatively and efficiently implement the process proxy health-check callback. The only open question is where to place this custom REST client.
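Such a health check then reduces to classifying the standard Mesos task states. A small sketch; treating staging/starting as "alive" and mimicking Popen.poll() semantics are assumptions about how the process proxy would consume this:

```python
# Mesos task states that indicate the kernel process is (or will be) alive.
ALIVE_STATES = {"TASK_STAGING", "TASK_STARTING", "TASK_RUNNING"}

# Terminal states: once here, the kernel is gone and EG should clean up.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED",
                   "TASK_KILLED", "TASK_LOST", "TASK_ERROR"}


def poll_result(task):
    """Mimic Popen.poll(): None while alive, the terminal state otherwise."""
    state = task.get("state")
    if state in ALIVE_STATES:
        return None
    return state  # terminal (or unknown) -> treat as exited
```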

Another approach I experimented with was:

Orchestrate the spark-mesos dispatcher on the Mesos cluster -> leverage the dispatcher REST APIs to submit the framework requests (RestSubmissionClient.createSubmission), then monitor the framework health via RestSubmissionClient.requestSubmissionStatus OR the Mesos /tasks?task_id=xxx&framework_id=yyy endpoint, i.e. no Toree in the flow.
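That dispatcher route could be sketched like this, hitting the same /v1/submissions endpoints that RestSubmissionClient talks to under the hood. The dispatcher host/port, the Spark version string, and the helper names are illustrative assumptions:

```python
import json
import urllib.request

# Assumed dispatcher address; the actual host/port depend on how the
# spark-mesos dispatcher is deployed.
DISPATCHER = "http://mesos-dispatcher:7077"


def create_submission_body(app_resource, main_class, spark_props):
    """Build the JSON body for POST /v1/submissions/create.

    Mirrors the CreateSubmissionRequest that RestSubmissionClient sends.
    """
    return {
        "action": "CreateSubmissionRequest",
        "appResource": app_resource,
        "mainClass": main_class,
        "appArgs": [],
        "sparkProperties": spark_props,
        "environmentVariables": {},
        "clientSparkVersion": "2.4.0",
    }


def submit(app_resource, main_class, spark_props):
    """POST the submission and return the dispatcher-assigned submissionId."""
    data = json.dumps(
        create_submission_body(app_resource, main_class, spark_props)).encode()
    req = urllib.request.Request(
        DISPATCHER + "/v1/submissions/create", data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["submissionId"]


def submission_status(submission_id):
    """GET /v1/submissions/status/<id> -- the dispatcher-side health check."""
    url = "%s/v1/submissions/status/%s" % (DISPATCHER, submission_id)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["driverState"]
```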

lresende commented 5 years ago

@saketanisha In the case of YARN we actually used an existing Python YARN client. If there isn't anything already available, and it's as simple as that, it can reside as a utility class.

saketanisha commented 5 years ago

This helps, @lresende - thanks for the note. For now I've hosted the Mesos REST APIs (specific to our use case) under a util placeholder. I'll finalize the first cut of the PR soon and get back here.

saketanisha commented 5 years ago

Can someone help me understand how we integration-test the k8s integration? For the Mesos integration I may need to spin up a single-node Mesos cluster inside Docker (I kind of know the mechanics, but I'm not sure where to host the Docker image - is there a project placeholder on Docker Hub I can push it to?).

kevin-bates commented 5 years ago

The integration tests, when invoked via the Makefile, run against the enterprise-gateway-demo image. However, each of the applicable pieces can be overridden via environment variables - the primary ones being GATEWAY_HOST, the various expected values, and overrides of the kernelspec names.

Here's a script I use to target various gateway hosts, utilizing different expected values and kernelspecs. The k8s stanza is associated with the k8s-omg target.

#!/bin/bash

cwd=`pwd`
(

target=yarn-eg

if [ $# -eq 1 ]; then
    target=$1
fi

cd /Users/kbates/repos/oss/jupyter/enterprise_gateway

export KERNEL_USERNAME=itest$$

if [ "$target" == "yarn-eg" ]; then
    echo "Using target $target ..."

    export PYTHON_KERNEL_LOCAL_NAME=python3

    # itests default to yarn kernelspecs, no need to set
    export EXPECTED_APPLICATION_ID="application_*"
    export EXPECTED_DEPLOY_MODE="cluster"
    export EXPECTED_SPARK_VERSION="2.3.*"
    export EXPECTED_SPARK_MASTER="yarn"

    export ITEST_HOSTNAME_PREFIX=yarn-eg
    export GATEWAY_HOST=yarn-eg-node-1.fyre.ibm.com:8888

elif [ "$target" == "elyra-omg" ]; then
    echo "Using target $target ..."

    # itests default to yarn kernelspecs, no need to set
    export EXPECTED_APPLICATION_ID="application_*"
    export EXPECTED_DEPLOY_MODE="cluster"
    export EXPECTED_SPARK_VERSION="2.1.1.*"
    export EXPECTED_SPARK_MASTER="yarn"

    export ITEST_HOSTNAME_PREFIX=elyra-omg
    export GATEWAY_HOST=elyra-omg-node-1.fyre.ibm.com:8888

elif [ "$target" == "k8s-omg" ]; then
    echo "Using target $target ..."

    export PYTHON_KERNEL_LOCAL_NAME=python_kubernetes
    export R_KERNEL_LOCAL_NAME=R_kubernetes
    export SCALA_KERNEL_LOCAL_NAME=scala_kubernetes

    export PYTHON_KERNEL_CLUSTER_NAME=spark_python_kubernetes
    export R_KERNEL_CLUSTER_NAME=spark_R_kubernetes
    export SCALA_KERNEL_CLUSTER_NAME=spark_scala_kubernetes

    export PYTHON_KERNEL_CLIENT_NAME=spark_python_kubernetes
    export R_KERNEL_CLIENT_NAME=spark_R_kubernetes
    export SCALA_KERNEL_CLIENT_NAME=spark_scala_kubernetes

    export EXPECTED_APPLICATION_ID="spark-application-*"
    export EXPECTED_DEPLOY_MODE="client"
    export EXPECTED_SPARK_VERSION="2.4.*"
    export EXPECTED_SPARK_MASTER="k8s:*"

    export ITEST_HOSTNAME_PREFIX=${KERNEL_USERNAME}
    export GATEWAY_HOST=k8s-omg-node-1.fyre.ibm.com/gateway
else
    echo "Unrecognized target: $target"
    exit 1
fi

#-------------  TESTS
#TESTS=enterprise_gateway.itests.test_scala_kernel:TestScalaKernelCluster.test_get_hostname
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_get_hostname
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_restart
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_hello_world
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_get_resource_manager
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_get_spark_version
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_get_application_id
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster.test_run_pi_example
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelCluster
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelLocal
#TESTS=enterprise_gateway.itests.test_python_kernel:TestPythonKernelLocal.test_restart
TESTS=enterprise_gateway.itests.test_r_kernel:TestRKernelCluster.test_restart

nosetests -v $TESTS

cd $cwd
)

I just then uncomment the kind of test I want to run against that target.

Regarding a dockerhub location, the project uses the elyra organization. Please pass along your docker hub ID, and I'll give you sufficient privileges to administer the mesos-based image.

saketanisha commented 5 years ago

Thank you @kevin-bates for the information.

My docker-hub account is saketanisha.

lresende commented 5 years ago

> Can someone help me understand how we integration-test the k8s integration? For the Mesos integration I may need to spin up a single-node Mesos cluster inside Docker (I kind of know the mechanics, but I'm not sure where to host the Docker image - is there a project placeholder on Docker Hub I can push it to?).

As a placeholder while this feature is in development, I would encourage you to use your own Docker Hub account (e.g. saketanisha/mesos-demo), and we will push to elyra once it gets merged into master.