n-riesco / ijavascript

IJavascript is a javascript kernel for the Jupyter notebook
Other
2.19k stars 185 forks source link

Asynchronous Writes #58

Closed rgbkrk closed 8 years ago

rgbkrk commented 8 years ago

When a user is working with the notebook, if a user has background code that writes to stdout or stderr, the output should stay with the cell that invoked it.

The simplest example I could come up with is

var interval = setInterval(()=> {console.log('from cell 1')}, 500)

in the first cell, followed by

var interval2 = setInterval(()=> {console.log('from cell 2')}, 500)

in the second cell. Here's what the output looks like right now:

screenshot 2016-01-23 11 27 31

Another strangeness I noticed - if I start doing tab completion somewhere else in the code it will stop the setIntervals from occurring. This was surprising up until I realized that completions have to remain in scope (code in the background could have side effects).

rgbkrk commented 8 years ago

I can tab complete while the node console is running, so maybe it would be worth it for us to check that out.

> interval.from cell 2
from cell 1
from cell 2
from cell 1

interval.__defineGetter__      interval.__defineSetter__      interval.__lookupGetter__      interval.__lookupSetter__      interval.__proto__             interval.constructor           interval.hasOwnProperty        interval.isPrototypeOf
interval.propertyIsEnumerable  interval.toLocaleString        interval.toString              interval.valueOf

interval.close                 interval.ref                   interval.unref

interval._called               interval._idleNext             interval._idlePrev             interval._idleStart            interval._idleTimeout          interval._onTimeout            interval._repeat               interval.domain

> interval.from cell 2
from cell 1

interval.__defineGetter__      interval.__defineSetter__      interval.__lookupGetter__      interval.__lookupSetter__      interval.__proto__             interval.constructor           interval.hasOwnProperty        interval.isPrototypeOf
interval.propertyIsEnumerable  interval.toLocaleString        interval.toString              interval.valueOf

interval.close                 interval.ref                   interval.unref

interval._called               interval._idleNext             interval._idlePrev             interval._idleStart            interval._idleTimeout          interval._onTimeout            interval._repeat               interval.domain

> interval.from cell 2
from cell 1
from cell 2
from cell 1
from cell 2
n-riesco commented 8 years ago

Note to myself: Last time we discussed this problem, Kyle mentioned we could use msg_id.

rgbkrk commented 8 years ago

Oh ha, I didn't realize I already said that.

Yeah, basically (and without my knowledge of the code base here), I'd assume for an async kernel like this, every run has its own context for what process.stdout and process.stderr is (mapping to particular parent_header.msg_id).

rgbkrk commented 8 years ago

I'll certainly dive back into this as I dive deep on getting this integrated as a first class kernel in nteract.

n-riesco commented 8 years ago

On 25/01/16 02:11, Kyle Kelley wrote:

Oh ha, I didn't realize I already said that.

Did I misunderstand? Doesn't the frontend use parent_header.msg_id to determine what cell a stream.stdout message belongs to?

Yeah, basically (and without my knowledge of the code base here), I'd assume for an async kernel like this, every run has its own context for what |process.stdout| and |process.stderr| is (mapping to particular |parent_header.msg_id|.

I already realised that. My first, naive thoughts were to run each cell within a context that redefines process.stdout. This wouldn't work for any functions defined outside the cell context (e.g. console.log).

Let me write down my current thoughts here:

All the above points make me think that the solution is not the creation of multiple cell contexts. Instead, I think IJavascript needs to ensure that process.stdout and process.stderr have been flushed onto stream.stdout and stream.stderr before executing the code in the next cell.

rgbkrk commented 8 years ago

Whoa, just noticed I missed out on responding here

Did I misunderstand? Doesn't the frontend use parent_header.msg_id to determine what cell a stream.stdout message belongs to?

Yeah, the frontend relies on parent_header.msg_id to determine what cell the display data, execute result, streams, and errors will end up in.

minrk commented 8 years ago

Some relevant details of the spec:

  1. output is associated with a given cell by parent_header.msg_id
  2. the status: idle message is used to indicate that the output from a given cell is 'complete', and handlers can be cleaned up.
  3. the handling of output associated with a given parent produced after status: idle is technically undefined
  4. ...however, given that async output can and does get produced after status: idle, the current behavior is to associate it with the most recent cell
  5. There is no restriction about multiple outstanding requests. It is AOK for two cells to be running concurrently, and if they keep their parent's in order, everything should be fine.

So general recommendations for kernels:

  1. do your best to ensure that output is complete before sending the idle message, but it's not always feasible and that's okay
  2. long-running async output (e.g. setInterval(console.log...) should be considered detached from its cell if the cell's idle message isn't being held
n-riesco commented 8 years ago

@minrk and @rgbkrk Thanks for all the info.

This info makes me think that if we make msg_id available to the IJavascript session (e.g. as $$msg_id$$), or alternatively, we make available a stream that pipes its data using the right msg_id (e.g. as $$stdout$$), we could have multiple long-running async functions associated with multiple cells. For example:

In[1]: (function() {
    var currentStdout = $$stdout$$;
    setInterval(
        function() { currentStdout.write("1 second\n"); },
        1000
    );
})();
Out[1]: 1 second
1 second
[...]

In[2]: (function() {
    var currentStdout = $$stdout$$;
    setInterval(
        function() { currentStdout.write("1 minute\n"); },
        60000
    );
})();
Out[2]: 1 minute
n-riesco commented 8 years ago

This is issue is fixed in NEL@0.4.0 (which I will be publishing soon after I clean up and secure a few bits and bobs).

See the async-stdio branch, here and here.

In the end, I've wrapped console, process.stdout and process.stderr to ensure they direct their output to the right cell. The following code should now work as expected:

In[1]:
(function(console) {
    setInterval(
        function() { console.log("1"); },
        1000
    );
})(console);
$$async$$ = true;

Out[1]:
1
1
1
[...]

In[2]:
(function(console) {
    setInterval(
        function() { console.log("2"); },
        1000
    );
})(console);
$$async$$ = true;

Out[2]:
2
2
2
[...]
rgbkrk commented 8 years ago

That's awesome!

n-riesco commented 8 years ago

I've made a pre-release that fixes this issue. I will advertise the changes when I make the release.

neil-s commented 8 years ago

Async writes still don't seem to be working for me. Is there something special I need to do, since the first half of http://n-riesco.github.io/ijavascript/doc/async.ipynb.html states that it should just work.

EDIT: Removed names from code and made a gist: https://gist.github.com/neil-s/abf76e50f7e8588ab7f7 I have a readStream from a csv file, being read by the csvtojson package. The bulk of the work is done in an event handler provided by csvtojson that is called with each record parsed. In there, I'm doing some console.logs and console.errors, but that output never shows up. I know that the code is running in the background because I keep running a different cell which just outputs the variable representing the number of records parsed so far, and that keeps incrementing correctly.

Using v 5.0.11-beta.0

n-riesco commented 8 years ago

@neil-s

I've opened issue #64 to update the documentation.

In the meantime, I'll try to describe the usage here.


Synchronous vs asynchronous execution

By asynchronous code, I mean code that doesn't run immediately (i.e. code in callback functions). For example, this is asynchronous code:

function sayHello() {
    console.log("Hello, World!");
}
setTimeout(sayHello, 1000);

whereas this is not:

function sayHello() {
    console.log("Hello, World!");
}
sayHello();

The output of asynchronous code

When code run asynchronously and uses global.console, process.stdout or process.stderr, IJavascript doesn't know which cell to send the output to.

A solution to this issue is that the asynchronous makes a copy of the global.console, process.stdout or process.stderr associated with a cell. There are several ways to achieve this:

Make a copy into a local variable

I don't recommend this way, because is prone to error (each cell launching asynchronous code would require its own copy of global.console, process.stdout or process.stderr):

$console = console;
function sayHello() {
    $console.log("Hello, World!");
}
setTimeout(sayHello, 1000);

Use a function environment

This is a pattern often used in Javascript code:

var sayHello = (function(console) {
    return function sayHello() {
        console.log("Hello, World!");
    };
})(console);
setTimeout(sayHello, 1000);

The trick in this pattern is that (function(console) { ... })(console); creates a copy of console that is only accessible within the function body.

A nice feature of this pattern is that the asynchronous code runs unmodified (unlike the pattern in the previous section, where the local variable had to be renamed $console, the name of the local variable in this pattern can be console).

Use a block-scoped let (requires ES6)

If the Node.js version in your system supports ES6, you can use let to define a block-scoped local variable named console:

{
    let console = global.console;
    function sayHello() {
        console.log("Hello, World!");
    }
    setTimeout(sayHello, 1000);
}

Coming back to your code in https://gist.githubusercontent.com/neil-s/abf76e50f7e8588ab7f7/raw/37d5990944b505ee9dcbf8ddcabd051d60871e8a/issue.ipynb , I suggest the following change:

var insertedRecords = 0;

var insertRecords = (function(console) {
    return function(db) {
        fs.createReadStream("./some_file.csv").pipe(converter);

        converter.on("record_parsed", function(jsonRecord) {
            try {
                mongoRecord = db.collection('some_collection')
                    .update({
                            "some_key": parsedRecord["some_key"]
                        },
                        parsedRecord, {
                            upsert: true
                        }
                    );
            } catch (e) {
                console.error("Error with record: " + jsonRecord.some_key);
                console.error(e);
            }

            insertedRecords += 1;
            if (insertedRecords % 100 === 0) {
                console.log("Inserted " + insertedRecords);
            }

        });

        converter.on("end_parsed", function(jsonObject) {
            console.log("Finished parsing");
        });
    };
})(console);

try {
    insertRecords(db);
} catch (e) {
    console.error(e);
}