influxdata / influxdb-client-python

InfluxDB 2.0 python client
https://influxdb-client.readthedocs.io/en/stable/
MIT License

WriteApi.write does not support pandas' nullable integer #590

Closed yannsartori closed 7 months ago

yannsartori commented 1 year ago

Specifications

If you have a dataframe with Pandas' nullable integer as one of the column datatypes, and a row includes a pd.NA value, you get the following traceback:

Traceback (most recent call last):
    write_api.write(
  File "venv/lib/python3.9/site-packages/influxdb_client/client/write_api.py", line 366, in write
    return self._write_batching(bucket, org, record,
  File "venv/lib/python3.9/site-packages/influxdb_client/client/write_api.py", line 469, in _write_batching
    serializer.serialize(chunk_idx),
  File "venv/lib/python3.9/site-packages/influxdb_client/client/write/dataframe_serializer.py", line 270, in serialize
    return list(lp)
  File "venv/lib/python3.9/site-packages/influxdb_client/client/write/dataframe_serializer.py", line 268, in <genexpr>
    lp = (re.sub('^(( |[^ ])* ),([a-zA-Z0-9])(.*)', '\\1\\3\\4', self.f(p))
  File "venv/lib/python3.9/site-packages/influxdb_client/client/write/dataframe_serializer.py", line 269, in <lambda>
    for p in filter(lambda x: _any_not_nan(x, self.field_indexes), _itertuples(chunk)))
  File "venv/lib/python3.9/site-packages/influxdb_client/client/write/dataframe_serializer.py", line 27, in _any_not_nan
    return any(map(lambda x: _not_nan(p[x]), indexes))
  File "pandas/_libs/missing.pyx", line 388, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

However, if you change your column datatype to a float (which has a native NaN encoding), it works.
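A minimal sketch of why the two dtypes behave differently, assuming only that the serializer ends up evaluating a comparison result in a boolean context (which is what the last frames of the traceback suggest). The variable names are illustrative only.

```python
import math
import pandas as pd

# A float NaN compared with itself gives an ordinary Python bool.
nan_result = math.nan == math.nan

# pd.NA compared with itself gives pd.NA again, not a bool.
na_result = pd.NA == pd.NA

# Using that result where a bool is required (as any() does) raises.
try:
    bool(na_result)
    raised = False
except TypeError:
    raised = True

print(nan_result, na_result, raised)
```

So a self-comparison check is safe for float NaN but not for the pd.NA values stored in an Int64 column.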

Code sample to reproduce problem

import pandas as pd

df = pd.DataFrame({"x": [1, pd.NA], "time": [0, 1]}).astype({"x": "Int64"})
with get_client() as client:
    with client.write_api() as write_api:
        write_api.write(BUCKET, record=df, data_frame_measurement_name="test", data_frame_timestamp_column="time")

Expected behavior

I would anticipate that this behaves the same as if it were a float. My current workaround is to use floats.
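A sketch of that workaround, casting the nullable column to float64 before handing the frame to the client. The name df_float is made up for illustration, and the write call itself is left out because it needs a live client.

```python
import math
import pandas as pd

df = pd.DataFrame({"x": [1, pd.NA], "time": [0, 1]}).astype({"x": "Int64"})

# Casting Int64 to float64 turns pd.NA into a regular float NaN.
df_float = df.astype({"x": "float64"})

print(df_float.dtypes)
print(math.isnan(df_float["x"].iloc[1]))
```

The trade-off is that the values are no longer integers, and integers larger than 2**53 cannot be represented exactly as float64.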

If this is too complicated to fix, or a fix would incur a significant slowdown for other users, I think that at minimum raising a cleaner exception would be reasonable.

Actual behavior

I get the exception shown in the traceback under Specifications above (TypeError: boolean value of NA is ambiguous).

Additional info

My knee-jerk reaction: I saw that in client/write/dataframe_serializer.py there is a function:

def _not_nan(x):
    return x == x

which I think can just be

def _not_nan(x):
    from ...extras import pd
    # pd.isna() is True for missing values, so negate it for "not NaN".
    return not pd.isna(x)
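A standalone sketch of a pd.isna-based check, to see how it treats the different kinds of missing value. Note that pd.isna returns True for missing values, so a "not missing" check has to negate it. The helper names not_missing and any_not_missing are hypothetical, are not the library's functions, and assume each cell is a scalar.

```python
import math
import pandas as pd

def not_missing(x):
    # pd.isna handles None, float NaN and pd.NA uniformly for scalars.
    return not pd.isna(x)

def any_not_missing(row, indexes):
    # True if at least one of the selected cells holds a real value.
    return any(not_missing(row[i]) for i in indexes)

print(not_missing(1), not_missing(math.nan), not_missing(pd.NA))
print(any_not_missing((0, pd.NA, 5), [1, 2]))
print(any_not_missing((0, pd.NA, math.nan), [1, 2]))
```

Whether calling pd.isna per cell is fast enough for large frames is something I have not measured.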

However, I saw this block of code:

                if null_columns[index]:
                    key_value = f"""{{
                            '' if {val_format} == '' or type({val_format}) == float and math.isnan({val_format}) else
                            f',{key_format}={{str({val_format}).translate(_ESCAPE_STRING)}}'
                        }}"""

which looks pretty crazy, and I am not sure how the data would look at that point?
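To get a feel for it, here is the condition from that template pulled out into a plain function. This assumes a pd.NA value would reach the expression unchanged, which I have not verified. The function name is_blank is made up for illustration.

```python
import math
import pandas as pd

def is_blank(value):
    # Same condition as in the quoted template, with the placeholder
    # replaced by an ordinary argument.
    return value == '' or type(value) == float and math.isnan(value)

print(is_blank(''), is_blank(math.nan), is_blank('a'))

# pd.NA == '' evaluates to pd.NA, and `or` then needs its truth value.
try:
    is_blank(pd.NA)
    blank_raised = False
except TypeError:
    blank_raised = True

print(blank_raised)
```

So under that assumption this block would hit the same TypeError, and changing _not_nan alone might not be enough.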

ianog-eng commented 8 months ago

I have exactly the same issue