jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters
Other

Support connecting to an existing Livy session #286

Closed hegand closed 7 years ago

hegand commented 7 years ago

It would be nice if there were a command to connect to an existing Livy session.

For example, connecting to the Livy session with id=4 and kind=pyspark and naming it pyspark-test: %spark connect -id 4 -l python -s pyspark-test -u http://host:8998 -a u -p

If the session does not exist or is not a pyspark session, an error message should be shown; otherwise, connect to that Livy session.
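A minimal sketch of the existence/kind check such a command could perform, using Livy's GET /sessions/{id} REST endpoint (session objects carry "id", "kind", and "state" fields). The helper names `fetch_session` and `validate_session` are made up here for illustration; they are not part of sparkmagic.

```python
import json
from urllib.request import urlopen
from urllib.error import HTTPError

def validate_session(session, expected_kind="pyspark"):
    """Return an error message, or None if the session is usable."""
    if session is None:
        return "session does not exist"
    if session.get("kind") != expected_kind:
        return "session %s is not a %s session" % (session.get("id"), expected_kind)
    return None

def fetch_session(base_url, session_id):
    """Fetch a Livy session's JSON, or None if Livy returns 404."""
    try:
        with urlopen("%s/sessions/%d" % (base_url, session_id)) as resp:
            return json.load(resp)
    except HTTPError as e:
        if e.code == 404:
            return None
        raise

# e.g. validate_session(fetch_session("http://host:8998", 4))
print(validate_session({"id": 4, "kind": "pyspark", "state": "idle"}))  # None
print(validate_session({"id": 4, "kind": "spark"}))  # session 4 is not a pyspark session
```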

aggFTW commented 7 years ago

We did not add this because of open questions about who owns managing the lifetime of that session. Can you describe your scenario, please?

hegand commented 7 years ago

Thanks for the answer. I am just thinking about this use case: let's assume I use my local Jupyter notebook to connect with sparkmagic to a remote Livy server (which is configured to run in yarn-cluster mode). I leave the office and plan to work from home. When I arrive home, I would like to reconnect to the Spark/Livy session that I used before and continue working where I left off (with, for example, all the cached dataframes/RDDs).

aggFTW commented 7 years ago

And you are using two different computers, one at work and one at home?

The thing about that is that you could regain access to the Livy context, but not to the Python kernel context...

hegand commented 7 years ago

I (we) use the same notebook at home. What I would like (actually, what the data scientists in my group would like) to do is continue working where I left off, with all the cached data and defined variables (on the Spark side, of course). Is this a valid use case? What is your opinion about it?

msftristew commented 7 years ago

RE: ownership, Livy sessions only properly support one owner at a time, so it seems sensible that connecting to an existing Livy session would take management ownership of that session.

Livy doesn't properly support "locking" of resources or anything like that, so users would be responsible for making sure two notebooks don't try to work with the same session at the same time or things won't work. That seems like a sensible restriction though.

aggFTW commented 7 years ago

> so users would be responsible for making sure two notebooks don't try to work with the same session at the same time or things won't work

Would your users try to work with data on the session at the same time? Livy sessions can only execute one statement at a time for now.

Would they know what the other user did previously (names of RDDs, tables, etc)? Could one user accidentally over-write some variable?

What @msftristew and I are trying to say is that users would have to do a lot of coordination to make that scenario work.

hegand commented 7 years ago

Thanks for the answers, @aggFTW and @msftristew. The concerns you pointed out are fully valid, so until Livy can lock a session to a single user, this option won't work safely.

edoardovivo commented 7 years ago

Hello, sorry to comment on a closed issue but I would like your opinion on the following scenario:

I have created a new Spark session in a Jupyter notebook, cached some dataframes, and performed some kind of analysis. Now I want to perform a different analysis, but on essentially the same dataset. Since it is a different analysis, it would go in a different notebook, but I don't want to create another Spark session, because it takes up resources and I have already cached most of the data I need.

So what I would like to do is connect to the existing Livy session and keep working. In this case there would be no locking issue, since I am the owner of both notebooks, right? Thank you in advance!

aggFTW commented 7 years ago

We feel that the ifs and buts around how to use this would lead to a bad UX for other sparkmagic users. What you want to do makes sense, but I think the right place to add this support is Livy, not sparkmagic.

You can see some of these discussions here: https://issues.cloudera.org/browse/LIVY-194?jql=text%20~%20%22shared%22

Tagar commented 5 years ago

On Livy side this feature is tracked at https://jira.apache.org/jira/browse/LIVY-395

wyhhyw123 commented 1 year ago

An easy way to share an existing Livy session among notebooks is to monkey-patch the post_session method of sparkmagic.livyclientlib.livyreliablehttpclient.LivyReliableHttpClient. Detailed code follows.

def post_session(self, properties):
    # Reuse the previous Livy session id if that session still exists,
    # enabling a shared Spark session among notebooks.
    import json
    import os
    import stat

    # These two variables could be set in another Python file:
    # SHARE_NOTEBOOKS toggles reuse of the cached Livy session;
    # CACHE_LIVY_SESSION_PATH is where the previous session info is saved.
    # (expanduser is needed because "~" is not expanded automatically.)
    SHARE_NOTEBOOKS = True
    CACHE_LIVY_SESSION_PATH = os.path.expanduser("~/livy_resp.json")

    if not SHARE_NOTEBOOKS:
        return self._http_client.post("/sessions", [201], properties).json()

    # Load the cached session response, if any.
    resp = {}
    if os.path.exists(CACHE_LIVY_SESSION_PATH):
        with open(CACHE_LIVY_SESSION_PATH, "r") as file:
            resp = json.load(file)
    session_id = resp.get("id")

    # Check whether the cached session is still alive on the Livy server.
    session_exists = False
    if session_id is not None:
        try:
            existing = self.get_session(session_id)
            if isinstance(existing, dict) and existing.get("state", "") in ("idle", "starting"):
                session_exists = True
        except Exception:
            print("Found no active or starting Livy session; skipping session reuse")

    if not session_exists:
        # Create a new session and cache it with owner-only read/write permissions.
        resp = self._http_client.post("/sessions", [201], properties).json()
        with os.fdopen(os.open(CACHE_LIVY_SESSION_PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC,
                               stat.S_IWUSR | stat.S_IRUSR), "w") as file:
            file.write(json.dumps(resp))
    return resp
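For completeness, here is a minimal sketch of how such a monkey patch could be applied. The real target would be sparkmagic's `LivyReliableHttpClient` (shown only in the comment); `FakeClient` below is a stand-in so the pattern is self-contained.

```python
# Illustrative only: FakeClient stands in for sparkmagic's LivyReliableHttpClient.
# The real patch would be applied in a startup file or a notebook cell, before
# any %spark magic creates a session:
#   from sparkmagic.livyclientlib.livyreliablehttpclient import LivyReliableHttpClient
#   LivyReliableHttpClient.post_session = post_session  # the function defined above

class FakeClient:
    def post_session(self, properties):
        return {"id": 1, "state": "starting"}

def patched_post_session(self, properties):
    # The cached-session reuse logic from the snippet above would run here;
    # this stub just shows that existing instances pick up the patched method.
    return {"id": 42, "state": "idle"}

FakeClient.post_session = patched_post_session

print(FakeClient().post_session({"kind": "pyspark"}))  # {'id': 42, 'state': 'idle'}
```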