Open polinaeterna opened 2 months ago
So apparently it's just that orjson serializes float("nan") as null, so it doesn't differentiate between NaN and null:
>>> orjson.dumps([float("nan"), None])
b'[null,null]'
and there is no option to force it to do otherwise. For comparison, json.dumps() does serialize NaNs as a dedicated value (NaN), but orjson is strictly JSON-conformant here.
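For reference, the standard-library side of this comparison can be checked directly. This is a minimal sketch that uses only json (orjson is left out so the snippet has no third-party dependency); the variable names are illustrative.

```python
import json
import math

values = [float("nan"), None]

# The stdlib encoder emits the non-standard token NaN,
# so NaN and null stay distinguishable in the output.
encoded = json.dumps(values)
print(encoded)  # [NaN, null]

# The stdlib decoder accepts that token and gives back a float NaN.
decoded = json.loads(encoded)
print(math.isnan(decoded[0]), decoded[1] is None)  # True True

# With allow_nan=False the stdlib encoder refuses NaN instead of emitting it.
try:
    json.dumps(values, allow_nan=False)
    strict_error = None
except ValueError as err:
    strict_error = err
print(type(strict_error).__name__)  # ValueError
```

Note that the NaN token is not valid JSON, so a strict parser on the client side may reject the first output.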
I don't see an easy solution here, do you have any ideas @huggingface/dataset-viewer ?
it's not possible to override this behavior here?
I am afraid the approach above will not work...
Note that float("nan") is an instance of float, which is a type supported by orjson. Supported types are not passed through the default function...
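Since a default hook never sees floats, one possible workaround is to rewrite the values before they reach the serializer. The following is only a sketch under assumptions: the helper tag_nans and the string sentinel "NaN" are hypothetical (they come from neither orjson nor this codebase), and stdlib json stands in for the serializer.

```python
import json
import math
from typing import Any

NAN_SENTINEL = "NaN"  # hypothetical marker; any agreed-upon value would do


def tag_nans(value: Any) -> Any:
    """Recursively replace float NaN with a sentinel; leave None untouched."""
    if isinstance(value, float) and math.isnan(value):
        return NAN_SENTINEL
    if isinstance(value, dict):
        return {key: tag_nans(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [tag_nans(item) for item in value]
    return value


payload = tag_nans([float("nan"), None, 1.5, {"x": float("nan")}])

# allow_nan=False confirms that no raw NaN is left in the payload.
serialized = json.dumps(payload, allow_nan=False)
print(serialized)  # ["NaN", null, 1.5, {"x": "NaN"}]
```

The trade-off is an extra pass over the data, and a string sentinel is ambiguous for columns that can legitimately contain the text "NaN".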
yes, i didn't manage to make it work. i think it's not possible and this is intentional, this is from orjson's readme:
has strict JSON conformance in not supporting Nan/Infinity/-Infinity
should we use ujson instead of orjson as in datasets?
Also, in pyarrow doc: https://arrow.apache.org/docs/python/data.html#none-values-and-nan-handling
None values and NAN handling
As mentioned in the above section, the Python object None is always converted to an Arrow null element on the conversion to pyarrow.Array. For the float NaN value which is either represented by the Python object float('nan') or numpy.nan we normally convert it to a valid float value during the conversion. If an integer input is supplied to pyarrow.array that contains np.nan, ValueError is raised.
To handle better compatibility with Pandas, we support interpreting NaN values as null elements. This is enabled automatically on all from_pandas function and can be enabled on the other conversion functions by passing from_pandas=True as a function parameter.
Currently, we don't do this: we display and return null in the response in both cases. From the discussion in https://github.com/huggingface/dataset-viewer/pull/2797, it was agreed that it's important to let users know how to correctly treat data with these values. This would require:
nan values are somehow replaced with null.
nan_count, for other columns rename nan_count to null_count :/// (my bad with the original naming)
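To illustrate the distinction the two counters would capture, here is a minimal sketch in plain Python. The function name count_missing is hypothetical, and the returned keys only mirror the nan_count / null_count names from the discussion; this is an assumption about the intended semantics, not the dataset-viewer implementation.

```python
import math
from typing import Any, Dict, Iterable


def count_missing(values: Iterable[Any]) -> Dict[str, int]:
    """Count None (null) and float NaN separately."""
    null_count = 0
    nan_count = 0
    for value in values:
        if value is None:
            null_count += 1
        elif isinstance(value, float) and math.isnan(value):
            nan_count += 1
    return {"null_count": null_count, "nan_count": nan_count}


counts = count_missing([1.0, float("nan"), None, None, 2.5])
print(counts)  # {'null_count': 2, 'nan_count': 1}
```

For a non-float column nan_count would always be 0 under this definition, which is consistent with reporting only null_count there.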