Differentiate between `NaN` and `null` in the viewer

huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub

https://huggingface.co/docs/datasets-server

Apache License 2.0


Differentiate between `NaN` and `null` in the viewer #2828

Open polinaeterna opened 2 months ago

polinaeterna commented 2 months ago

Currently, we don't do this: we display and return null in the response in both cases. From the discussion in https://github.com/huggingface/dataset-viewer/pull/2797, it was agreed that it's important to let users know how to treat data with these values correctly. This would require:

[ ] Change how we transform Parquet into the /first-rows and /rows responses. I haven't figured out where exactly, but apparently NaN values are somehow replaced with null.
[ ] Change the response structure and field names in /statistics: for float columns, add a nan_count field; for other columns, rename nan_count to null_count :/// (my bad with the original naming)
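
As a rough illustration of the second item, the two counts for a float column could be computed separately. This is a minimal sketch in plain Python, not the actual /statistics implementation: the helper name `count_missing` is hypothetical, and only the `null_count` and `nan_count` keys follow the field names proposed above.

```python
import math
from typing import Optional, Sequence


def count_missing(values: Sequence[Optional[float]]) -> dict:
    """Count nulls (None) and NaNs separately for a float column."""
    null_count = sum(1 for v in values if v is None)
    nan_count = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    return {"null_count": null_count, "nan_count": nan_count}


column = [1.5, None, float("nan"), 2.0, None]
print(count_missing(column))  # {'null_count': 2, 'nan_count': 1}
```

For non-float columns NaN cannot occur, so only `null_count` would be meaningful there.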

polinaeterna commented 1 month ago

So apparently it's just that orjson serializes float("nan") as null so it doesn't differentiate between NaN and null:

>>> import orjson
>>> orjson.dumps([float("nan"), None])
b'[null,null]'

and there is no option to make it do otherwise. By comparison, json.dumps() does serialize NaN as a dedicated value, but orjson is strictly JSON-conformant here.
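
For reference, here is what the standard library does with the same input. This is only a small sketch of the `json` module's behavior, using its documented `allow_nan` parameter:

```python
import json

values = [float("nan"), None]

# By default, json emits the token NaN, so NaN and None
# remain distinguishable in the serialized output.
print(json.dumps(values))  # [NaN, null]

# With allow_nan=False, json refuses to serialize NaN at all.
try:
    json.dumps(values, allow_nan=False)
except ValueError as err:
    print("rejected:", err)
```

Note that the bare `NaN` token is not part of the JSON specification, so strict parsers (for example `JSON.parse` in JavaScript) reject the default output.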

I don't see an easy solution here, do you have any ideas @huggingface/dataset-viewer ?

severo commented 1 month ago

Isn't it possible to override this behavior here?

https://github.com/huggingface/dataset-viewer/blob/b2c7c3665f7c428510991e877355db00f230071f/libs/libcommon/src/libcommon/utils.py#L24-L32

albertvillanova commented 1 month ago

I am afraid the approach above will not work...

Note that float("nan") is an instance of float, which is a type natively supported by orjson. Supported types are not passed through the default function...
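
Since the default function is never called for floats, one conceivable workaround is to rewrite NaN values before serialization. The sketch below is an assumption, not something agreed on in this thread: the helper `encode_nans` and the `"NaN"` string sentinel are hypothetical, and the standard `json` module stands in for orjson so the example has no third-party dependency.

```python
import json
import math
from typing import Any

NAN_SENTINEL = "NaN"  # hypothetical marker; clients would have to agree on it


def encode_nans(obj: Any) -> Any:
    """Recursively replace float NaN with a sentinel, leaving None untouched."""
    if isinstance(obj, float) and math.isnan(obj):
        return NAN_SENTINEL
    if isinstance(obj, dict):
        return {key: encode_nans(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [encode_nans(item) for item in obj]
    return obj


row = {"score": float("nan"), "label": None, "values": [1.0, float("nan")]}
# allow_nan=False checks that no NaN is left in the payload.
print(json.dumps(encode_nans(row), allow_nan=False))
# {"score": "NaN", "label": null, "values": [1.0, "NaN"]}
```

The trade-off is an extra pass over every row, and a string sentinel is only unambiguous if clients interpret it according to the column type.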

polinaeterna commented 1 month ago

yes, i didn't manage to make it work. i think it's not possible and this is intentional, this is from orjson's readme:

has strict JSON conformance in not supporting Nan/Infinity/-Infinity

severo commented 1 month ago

should we use ujson instead of orjson as in datasets?

severo commented 1 month ago

Also, in pyarrow doc: https://arrow.apache.org/docs/python/data.html#none-values-and-nan-handling

None values and NAN handling

As mentioned in the above section, the Python object None is always converted to an Arrow null element on the conversion to pyarrow.Array. For the float NaN value which is either represented by the Python object float('nan') or numpy.nan we normally convert it to a valid float value during the conversion. If an integer input is supplied to pyarrow.array that contains np.nan, ValueError is raised.

To handle better compatibility with Pandas, we support interpreting NaN values as null elements. This is enabled automatically on all from_pandas function and can be enabled on the other conversion functions by passing from_pandas=True as a function parameter.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.