pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Type conversions are skipped in 'to_dict' on single column dataframes #21256

Open hodossy opened 6 years ago

hodossy commented 6 years ago

Code to reproduce the error:

import pandas as pd
from datetime import datetime

dfs = {
    'full_df': pd.DataFrame([
        {'int': 1, 'date': datetime.now(), 'str': 'foo', 'float': 1.0, 'bool': True},
    ]),
    'int_df': pd.DataFrame([
        {'int': 1},
    ]),
    'date_df': pd.DataFrame([
        {'date': datetime.now()},
    ]),
    'str_df': pd.DataFrame([
        {'str': 'foo'},
    ]),
    'float_df': pd.DataFrame([
        {'float': 1.0},
    ]),
    'bool_df': pd.DataFrame([
        {'bool': True},
    ])
}

for name, frame in dfs.items():
    print('Types in ' + name)
    for k, v in frame.to_dict('records')[0].items():
        print(type(v))

Output:

Types in full_df
<class 'bool'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'float'>
<class 'int'>
<class 'str'>
Types in int_df
<class 'numpy.int64'>
Types in date_df
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Types in str_df
<class 'str'>
Types in float_df
<class 'numpy.float64'>
Types in bool_df
<class 'numpy.bool_'>

Problem description

One would expect to_dict() to return native Python types, or at least to treat columns of the same dtype consistently. Instead, as shown above, the result depends on the frame: type conversion appears to be skipped when the DataFrame has only a single column.

Expected Output

Native Python types wherever possible for the int, float, bool and str columns and, if possible, a Python datetime object instead of pandas.Timestamp.
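A minimal sketch of how this expectation could be checked, assuming "native" means exactly the built-in int, float, bool, str, datetime.datetime and None. The helper name non_native_fields and the NATIVE_TYPES tuple are illustrative only and are not part of pandas. The exact-type comparison is deliberate: pandas.Timestamp subclasses datetime.datetime and numpy.float64 subclasses float, so an isinstance check would not flag them.

```python
import datetime as dt

import pandas as pd

# Assumption: these are the types the report treats as "native".
NATIVE_TYPES = (int, float, bool, str, dt.datetime, type(None))


def non_native_fields(frame):
    """Map column name -> type name for non-native values in the first record.

    Hypothetical helper for illustration; expects a frame with at least one row.
    """
    record = frame.to_dict('records')[0]
    return {
        key: type(value).__name__
        for key, value in record.items()
        # Compare exact types so subclasses such as Timestamp are reported.
        if type(value) not in NATIVE_TYPES
    }


if __name__ == '__main__':
    frame = pd.DataFrame([{'int': 1, 'date': dt.datetime.now(), 'str': 'foo'}])
    # Which columns are reported depends on the installed pandas version.
    print(non_native_fields(frame))
```

An empty dict would mean every value in the first record already has one of the listed types.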

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: None
pip: 9.0.1
setuptools: 38.2.4
Cython: None
numpy: 1.14.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.7.4
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.2
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
hodossy commented 5 years ago

I have a temporary workaround until this is fixed:

import pandas as pd


class NativeDict(dict):
    """
        Helper class to ensure that only native types are in the dicts produced by
        :func:`to_dict() <pandas.DataFrame.to_dict>`

        .. note::

            Needed until `#21256 <https://github.com/pandas-dev/pandas/issues/21256>`_ is resolved.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(((k, self.convert_if_needed(v)) for row in args for k, v in row), **kwargs)

    @staticmethod
    def convert_if_needed(value):
        """
            Converts `value` to native python type.

            .. warning::

                Only :class:`Timestamp <pandas.Timestamp>` and numpy :class:`dtypes <numpy.dtype>` are converted.
        """
        if pd.isnull(value):
            return None
        if isinstance(value, pd.Timestamp):
            return value.to_pydatetime()
        if hasattr(value, 'dtype'):
            # Map numpy dtype kinds to builtins; 'b' covers numpy.bool_.
            mapper = {'i': int, 'u': int, 'f': float, 'b': bool}
            _type = mapper.get(value.dtype.kind, lambda x: x)
            return _type(value)
        return value

This also replaces NaN and NaT values with Python's native None. Please note that its only intended use is as the into argument of to_dict(); I have not tested it elsewhere. It can be used like so:

df.to_dict(orient='records', into=NativeDict)
arw2019 commented 3 years ago

This is fixed on 1.2 master. Running the OP:


In [3]: import pandas as pd 
   ...: from datetime import datetime 
   ...:  
   ...: dfs = { 
   ...:     'full_df': pd.DataFrame([ 
   ...:         {'int': 1, 'date': datetime.now(), 'str': 'foo', 'float': 1.0, 'bool': True}, 
   ...:     ]), 
   ...:     'int_df': pd.DataFrame([ 
   ...:         {'int': 1}, 
   ...:     ]), 
   ...:     'date_df': pd.DataFrame([ 
   ...:         {'date': datetime.now()}, 
   ...:     ]), 
   ...:     'str_df': pd.DataFrame([ 
   ...:         {'str': 'foo'}, 
   ...:     ]), 
   ...:     'float_df': pd.DataFrame([ 
   ...:         {'float': 1.0}, 
   ...:     ]), 
   ...:     'bool_df': pd.DataFrame([ 
   ...:         {'bool': True}, 
   ...:     ]) 
   ...: } 
   ...:  
   ...: for name, frame in dfs.items(): 
   ...:     print('Types in ' + name) 
   ...:     for k, v in frame.to_dict('records')[0].items(): 
   ...:         print(type(v)) 
   ...:                                                                                                 
Types in full_df
<class 'int'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'str'>
<class 'float'>
<class 'bool'>
Types in int_df
<class 'int'>
Types in date_df
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Types in str_df
<class 'str'>
Types in float_df
<class 'float'>
Types in bool_df
<class 'bool'>
hodossy commented 3 years ago

Hello! Thanks for fixing the integers, but it seems that date values are still returned as the internal pandas type. Would it be possible to convert them to a native type as well?
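Until that changes, the conversion can be done after the fact. The following is only a sketch of one way to post-process the records, not something pandas provides; the function name records_with_native_datetimes is made up for illustration. It relies on Timestamp.to_pydatetime(), and it leaves NaT untouched because NaT is not a Timestamp instance.

```python
import datetime as dt

import pandas as pd


def records_with_native_datetimes(frame):
    """Return to_dict('records') output with Timestamps replaced by datetime.

    Hypothetical helper for illustration. Values that are not pandas.Timestamp
    (including NaT) are passed through unchanged.
    """
    return [
        {
            key: value.to_pydatetime() if isinstance(value, pd.Timestamp) else value
            for key, value in record.items()
        }
        for record in frame.to_dict('records')
    ]


if __name__ == '__main__':
    frame = pd.DataFrame([{'date': dt.datetime(2021, 2, 3, 4, 5, 6), 'str': 'foo'}])
    for key, value in records_with_native_datetimes(frame)[0].items():
        print(key, type(value))
```

Note that datetime.datetime only has microsecond resolution, so any nanosecond component of a Timestamp is dropped by to_pydatetime().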

arw2019 commented 3 years ago

Do we want to reopen this?

xref https://github.com/pandas-dev/pandas/pull/37648#discussion_r571652150 I think we're not gonna act here but it does keep coming up