jupyterlab / jupyterlab-hdf5

Open and explore HDF5 files in JupyterLab. Can handle very large (TB) sized files, and datasets of any dimensionality
BSD 3-Clause "New" or "Revised" License
115 stars 27 forks source link

Produce valid JSON in the presence of NaN or INF values #22

Closed JonjonHays closed 3 years ago

JonjonHays commented 4 years ago

Currently the backend server extension is using Python's native JSON package to serialize the payload. Python's JSON library will produce invalid JSON in the presence of NaN or Inf values. The JSON library will add values equivalent to Javascript's NaN and Inf values, but unfortunately these values are not considered valid JSON, and we fail on the frontend when parsing the JSON payload. While this could be handled on the frontend, I think it makes more sense to handle this server-side. Unfortunately, I wasn't able to find a solution for this without bringing in a third party dependency. The simplejson library provides an ignore_nan flag as input to its dump function that encodes NaN values as None/null. While Javascript data rendering and plotting libraries typically prefer null values, Python libraries (e.g, Matplotlib) tend to prefer NaN values, and may produce errors in the presence of null values. Is it likely this extension will ever be called from a non-Javascript context? If not, it might make sense to simply encode NaN values as null by default. However, if it is likely that this extension will be called from a non-Javascript context, we could add an ignoreNan flag to the query parameters. In my PR for n-dim support, I added an implementation of the former. Here's a discussion on this topic:

https://stackoverflow.com/questions/15228651/how-to-parse-json-string-containing-nan-in-node-js

telamonian commented 4 years ago

Erg. I'd classify this as a major issue. I'm gonna do a quick whip-around of low-level javascript experts and see what they think

kgryte commented 4 years ago

The standard approach for dealing with custom values in serialized JSON-like blobs in JavaScript is to use revivers. For example, supposing that Python inserts NaN and Infinity into a JSON blob, thus generating something like

var str = '{"a":NaN,"b":Infinity,"c":[NaN,Infinity,-Infinity]}';

then you'd create the following reviver function

function reviver( key, value ) {
    if ( value === '***NaN***' ) {
        return NaN;
    }
    if ( value === '***Infinity***' ) {
        return Infinity;
    }
    if ( value === '***-Infinity***' ) {
        return -Infinity;
    }
    return value;
}

which can then be applied as follows:

var tmp = str.replace( /(NaN|-?Infinity)/g, '***$1***' );
var obj = JSON.parse( tmp, reviver );
// returns { a: NaN, b: Infinity, c: [ NaN, Infinity, -Infinity ] }

thus, behaving as expected.

Two caveats:

  1. You should expect a perf hint on the frontend in JupyterLab, as applying a custom reviver can be relatively expensive for larger JSON blobs.
  2. If your serialized JSON blob can contain the character sequences NaN and Infinity in places other than values, such as in a string of text, you'll need to improve the regular expression provided to replace.

The other alternative is to use a custom JavaScript JSON parser which can handle NaN and Infinity, which is straightforward based on Crockford's work.

Hope this helps.

kgryte commented 4 years ago

One additional addendum:

One can avoid the string replace if the serialization code properly encodes special values, such as NaN and Infinity. For example, in Python, one can provide a parse_constant callback.

Doing this work upon serialization, while not obviating the need for a custom reviver, should improve performance upon parsing the blob in JavaScript and circumvents tricky edge cases such as character sequences appearing in places other than as JSON values (e.g., such as in the middle of a string).

telamonian commented 4 years ago

Thanks @kgryte for the detailed info.

I feel like I missed a beat somewhere. If we ignore the possibility of NaN or Infinity occurring within normal strings (which I think will be safe for this particular application), why can't the reviver just be:

function reviver( key, value ) {
    if ( value === 'NaN' ) {
        return NaN;
    }
    if ( value === 'Infinity' ) {
        return Infinity;
    }
    if ( value === '-Infinity' ) {
        return -Infinity;
    }
    return value;
}

?

Or is this exactly what you were referring to in your addendum?

kgryte commented 4 years ago

Because value arguments are deserialized values. If you attempt to parse a JSON-like string with bare NaN and Infinity values, you should get a parse error if a parser conforms to the JSON-specification. E.g.,

JSON.parse('{"a":NaN,"b":Infinity}')
// throws SyntaxError: Unexpected token N in JSON at position 5

If the serialization logic happens to encode NaN and Infinity as string values, e.g.,

var str = '{"a":"NaN","b":"Infinity"}';

then, yes, you can simplify the reviver implementation to what you describe.

kgryte commented 4 years ago

I chose to replace with '***$1***', but this could be anything to serve as a sentinel, including just '$1'.

telamonian commented 3 years ago

This issue should be handled by the work in #41, so I'm closing this