jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Send data from local to Spark when using IPython kernel #673

Open nickto opened 3 years ago

nickto commented 3 years ago

Is your feature request related to a problem? Please describe. It is not possible to send data from the local kernel to Spark when using the IPython kernel (as opposed to the PySpark kernel).

Describe the solution you'd like Functionality similar to %%send_to_spark in the PySpark kernel, but for the IPython kernel.
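For reference, the existing magic in the PySpark kernel is invoked roughly like this (flag names from memory; check %%help in your sparkmagic version, since the exact options may differ):

%%send_to_spark -i my_local_variable -t str -n my_remote_variable

Here -i names the local variable to ship, -t its type (str or df), and -n the name it should get inside the Livy session.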

clayms commented 3 years ago

Same question.

How do I send local python variables to the Livy PySpark session when running an IPython kernel?

clayms commented 3 years ago

The IPython magics are defined in this file https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/magics/remotesparkmagics.py

It does not look like a send_to_spark subcommand has been implemented there.

How would we add the following to remotesparkmagics.py?

https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/kernels/kernelmagics.py#L177

which refers to:
https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/magics/sparkmagicsbase.py#L51

My first guess is that it would be added as a subcommand to %%spark in remotesparkmagics.py, with an additional @argument under @magic_arguments, perhaps @argument("-v", "--variable", type=str, default=None, help="Local variable to send to the remote PySpark session.").

Then the subcommand would be handled in another elif block, e.g. elif subcommand == "send_to_spark":, similar to the following existing subcommand (a rough sketch follows the link below):

https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/magics/remotesparkmagics.py#L160
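For what it's worth, here is a rough, untested sketch of what that might look like in remotesparkmagics.py. It mirrors the -i/-t/-n/-m flags of the existing %%send_to_spark magic rather than the single -v suggested above (a type and a target name are also needed), and it assumes do_send_to_spark (from sparkmagicsbase.py, linked above) can be called directly, since RemoteSparkMagics inherits from SparkMagicBase just like KernelMagics does. Flag names, defaults, and the argument order are all guesses to be checked against the linked code:

# New arguments on RemoteSparkMagics.spark(), mirroring KernelMagics.send_to_spark:
# @argument("-i", "--input", type=str, default=None,
#           help="Local variable to send to the remote PySpark session.")
# @argument("-t", "--vartype", type=str, default="str",
#           help="Type of the local variable: str or df.")
# @argument("-n", "--varname", type=str, default=None,
#           help="Name the variable should get in the remote session.")
# @argument("-m", "--maxrows", type=int, default=2500,
#           help="Maximum number of dataframe rows to send.")

# ... and in the subcommand dispatch, alongside the existing elif blocks:
elif subcommand == "send_to_spark":
    # do_send_to_spark lives in SparkMagicBase (sparkmagicsbase.py, linked above);
    # verify the exact signature there before relying on this argument order.
    self.do_send_to_spark(
        cell, args.input, args.vartype, args.varname, args.maxrows, args.session
    )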

clayms commented 3 years ago

Until one of us completes a pull request with the fix, you can use the workaround below.

It's not as nice as a magic, but it works.

import json, requests

# Livy endpoint (replace with your Livy server's address)
host = 'http://000.000.000.000:8998'
headers = {'Content-Type': 'application/json'}

# Find the id of an existing Livy session (here: the first one returned)
sessions_url = f"{host}/sessions"
r1 = requests.get(sessions_url, headers=headers)
session_id = r1.json().get('sessions')[0].get('id')
statements_url = f"{sessions_url}/{session_id}/statements"

# Local variable to send to the remote session
my_var = "test string to send"

# Build a PySpark assignment statement and submit it to the session
var_name = "my_var"
var_val = repr(my_var)
pyspark_code = f"{var_name} = {var_val}"

r2 = requests.post(statements_url, data=json.dumps({'code': pyspark_code}), headers=headers)
r2.json()

Then check from a %%spark cell.

%%spark
my_var

output:

'test string to send'

Whereas before sending it via the POST request, the output would have been:

An error was encountered:
name 'my_var' is not defined
Traceback (most recent call last):
NameError: name 'my_var' is not defined
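
One more note on the workaround: the POST only queues the statement, so r2.json() usually comes back in a waiting or running state. If you want to confirm the assignment actually ran, you can poll the statement until Livy reports a terminal state (a small sketch reusing statements_url, headers, and r2 from above):

import time

statement_id = r2.json()['id']
while True:
    statement = requests.get(f"{statements_url}/{statement_id}", headers=headers).json()
    if statement['state'] in ('available', 'error', 'cancelled'):
        break
    time.sleep(1)

# 'output' holds the execution result, or the traceback if the statement failed
print(statement.get('output'))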

lanisyutin commented 2 years ago

Any solutions for sending pandas dataframe?

jonathansymonsdv commented 1 year ago

Any solutions for sending pandas dataframe?

Clunky and I'm not sure it'd work, but if you can send a string, you should be able to convert the dataframe to JSON, send the JSON output as a string, and then reverse the process on the other side. Again, it's clunky and you wouldn't want to do it for anything of substance; a rough sketch is below.
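
A minimal, untested sketch of that idea over the same Livy REST endpoint as the workaround above. It assumes pandas is installed on the Spark driver and that the Livy session exposes the usual spark SparkSession; my_pdf and my_sdf are just placeholder names:

import json, requests
import pandas as pd

# Same Livy statements endpoint as in the workaround above
host = 'http://000.000.000.000:8998'
headers = {'Content-Type': 'application/json'}
session_id = requests.get(f"{host}/sessions", headers=headers).json()['sessions'][0]['id']
statements_url = f"{host}/sessions/{session_id}/statements"

local_df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Serialize the dataframe to a JSON string locally, then rebuild it remotely.
# 'spark' below refers to the SparkSession that Livy creates in the session.
pyspark_code = "\n".join([
    "import json",
    "import pandas as pd",
    f"_records = json.loads({local_df.to_json(orient='records')!r})",
    "my_pdf = pd.DataFrame(_records)",
    "my_sdf = spark.createDataFrame(my_pdf)",
])

r = requests.post(statements_url, data=json.dumps({"code": pyspark_code}), headers=headers)
r.json()

For anything larger, writing the data to shared storage (e.g. S3 or HDFS) and reading it from Spark would be the saner route.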