TypeError when using nullable pandas types in DataFrame

DorianCzichotzki commented 1 year ago

itables fails to display DataFrames that use nullable pandas types such as Int64.

To Reproduce

Running the following code results in the TypeError.

import pandas as pd
import itables

df = pd.DataFrame([1,2,pd.NA], dtype="Int64")
itables.show(df)

TypeError                                 Traceback (most recent call last)
Input In [5], in <cell line: 5>()
      2 import itables
      4 df = pd.DataFrame([1,2,pd.NA], dtype="Int64")
----> 5 itables.show(df)

File ~/Workspace/smfpy/smfpy-client/.dev_venv/lib/python3.9/site-packages/itables/javascript.py:343, in show(df, **kwargs)
    341 def show(df=None, **kwargs):
    342     """Show a dataframe"""
--> 343     html = to_html_datatable(df, connected=_CONNECTED, **kwargs)
    344     display(HTML(html))

File ~/Workspace/smfpy/smfpy-client/.dev_venv/lib/python3.9/site-packages/itables/javascript.py:335, in to_html_datatable(df, tableId, connected, **kwargs)
    333 # Export the table data to JSON and include this in the HTML
    334 data = _formatted_values(df.reset_index() if showIndex else df)
--> 335 dt_data = json.dumps(data)
    336 output = replace_value(output, "const data = [];", f"const data = {dt_data};")
    338 return output

File ~/.pyenv/versions/3.9.9/lib/python3.9/json/__init__.py:231, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    226 # cached encoder
    227 if (not skipkeys and ensure_ascii and
    228     check_circular and allow_nan and
    229     cls is None and indent is None and separators is None and
    230     default is None and not sort_keys and not kw):
--> 231     return _default_encoder.encode(obj)
    232 if cls is None:
    233     cls = JSONEncoder

File ~/.pyenv/versions/3.9.9/lib/python3.9/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/.pyenv/versions/3.9.9/lib/python3.9/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

File ~/.pyenv/versions/3.9.9/lib/python3.9/json/encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type NAType is not JSON serializable

Maybe the issue is in _formatted_values. Nullable strings are handled there by calling astype('str') on them. I can try to open a PR, but I am not sure what the best way to represent pd.NA values is (None, "<NA>" as for strings, or something else?).

mwouts commented 1 year ago

Hi @DorianCzichotzki, thank you for opening this issue, that's an interesting question! And yes, it would be great to add support for these nullable types!

I do not know at the moment what the best way to represent these NA values is, but a little trial and error is likely to tell us. I would first try a float NaN, and then None if NaN does not work. The value needs to pass through the JSON encoding (both None and NaN should pass), and then we'll need to check that the resulting JSON value makes sense for https://datatables.net (i.e. test that sorting works as expected).
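As a quick check of the JSON step only, here is a minimal sketch using just the standard library. It shows what the default encoder emits for the two candidates; it says nothing about how datatables sorts the result. Note that NaN is accepted by Python's encoder by default but is not part of strict JSON.

```python
import json

# Candidate placeholders for a missing value, as discussed above.
candidates = {"nan": float("nan"), "none": None}

# Encode a small column containing each candidate with the default encoder.
encoded = {name: json.dumps([1, 2, value]) for name, value in candidates.items()}

for name, text in encoded.items():
    print(name, text)

# With allow_nan=False the encoder refuses NaN, because NaN is not strict JSON.
try:
    json.dumps([1, 2, float("nan")], allow_nan=False)
    strict_nan_ok = True
except ValueError:
    strict_nan_ok = False
```

Both candidates get past json.dumps with default settings, so the choice between them comes down to how the table behaves in the browser.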

It would be great if you could propose a PR that adds support for all the nullable types. Maybe the best way to make sure these new types work, and keep working in the future, is to add examples with nullable types just after the int and float table examples? These examples are then tested and shown in the documentation: https://github.com/mwouts/itables/blob/e04df1eca13ff32672861ffb44773c045428494b/itables/sample_dfs.py#L46-L56
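For illustration, sample frames with nullable types could look roughly like the sketch below. The helper name get_nullable_samples and the dictionary keys are hypothetical and are not taken from itables' sample_dfs.py; only the pandas calls are real.

```python
import pandas as pd


def get_nullable_samples():
    """Hypothetical helper (not part of itables): small frames with nullable dtypes."""
    return {
        "nullable_int": pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")}),
        "nullable_bool": pd.DataFrame(
            {"a": pd.array([True, None, False], dtype="boolean")}
        ),
    }


samples = get_nullable_samples()
for name, df in samples.items():
    # Print the dtype and the number of missing values in each sample frame.
    print(name, df["a"].dtype, int(df["a"].isna().sum()))
```

Each frame holds exactly one missing value, so a test that renders them would exercise the pd.NA code path.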

mwouts commented 1 year ago

Hi @DorianCzichotzki, I have added support and tests for nullable bool and int in itables==1.3.0 (I just had to replace pd.NA with None before exporting the data to JSON).
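The idea of replacing pd.NA with None before the JSON export can be sketched as follows. This is not the actual itables code; rows_with_none is a hypothetical helper written only to illustrate the approach on the DataFrame from the original report.

```python
import json

import pandas as pd


def rows_with_none(df):
    """Hypothetical helper: rows as plain Python lists, with pd.NA replaced by None."""
    # astype(object) converts nullable columns to Python objects plus pd.NA.
    raw_rows = df.astype(object).values.tolist()
    return [[None if value is pd.NA else value for value in row] for row in raw_rows]


df = pd.DataFrame([1, 2, pd.NA], dtype="Int64")

# json.dumps no longer raises, because None is encoded as null.
dt_data = json.dumps(rows_with_none(df))
print(dt_data)
```

The missing entry is exported as a JSON null, which is consistent with the empty cell described below.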

Missing values are represented by an empty cell, see for instance the sample dataframes in the documentation.

Thanks for reporting the issue!

mwouts / itables

TypeError when using nullable pandas types in DataFrame #98