pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org

read_sql should convert Numeric to decimals #20525

Open · postelrich opened this issue 6 years ago

postelrich commented 6 years ago

pd.read_sql converts columns of type sqlalchemy.Numeric to floats, as the selection below appears to suggest. This seems counterintuitive, since in SQL the NUMERIC and DECIMAL types are interchangeable. Also, the NUMERIC type has an asdecimal option that defaults to True (http://docs.sqlalchemy.org/en/latest/core/type_basics.html#sqlalchemy.types.Numeric). I would expect sqlalchemy.Numeric columns to be converted to decimal.Decimal.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1187-L1190
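
A minimal reproduction sketch of the behavior described above, assuming an in-memory SQLite engine (table and column names are illustrative; SQLite stores NUMERIC lossily as floating point, so SQLAlchemy emits a warning here, but the resulting dtype is the point):

```python
import decimal

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
meta = sa.MetaData()
prices = sa.Table("prices", meta, sa.Column("amount", sa.Numeric(10, 2)))
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(prices.insert(), [{"amount": decimal.Decimal("1.10")}])

# SQLAlchemy's Numeric (asdecimal=True) hands back decimal.Decimal objects,
# but read_sql's default coerce_float=True turns them into float64.
df = pd.read_sql("SELECT amount FROM prices", engine)
print(df["amount"].dtype)  # float64
```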

postelrich commented 6 years ago

Passing coerce_float=False skips the conversion, so maybe this is a non-issue.

jreback commented 6 years ago

decimal.Decimal is not a first-class type, so we generally don't do things like this. But I suppose it could be done if a SQL dtype is unambiguously a decimal type.

jorisvandenbossche commented 6 years ago

For the reason that @jreback mentions (decimals cannot be stored natively in a pandas dataframe, only as object dtype), I don't think we want to change this by default. But it could be a good idea to at least let the user specify explicitly that they want to keep the decimal objects. That would be a welcome contribution, I think.
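
For illustration, a tiny sketch of that point (not from the thread): decimals survive in a pandas Series only as Python objects.

```python
import decimal

import pandas as pd

s = pd.Series([decimal.Decimal("1.10"), decimal.Decimal("2.20")])
print(s.dtype)  # object: pandas has no native decimal dtype
print(s.sum())  # Decimal('3.30'): arithmetic stays exact, but runs elementwise
```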

grofte commented 5 years ago

There is a very old, closed issue that raises some good points about dtypes:
https://github.com/pandas-dev/pandas/issues/6798

If read_sql accepted a parameter for specifying the dtypes of the resulting table, that could help people a lot.

Alternatively, add an example to the documentation where:
- read_sql is called with a chunksize,
- the first chunk is read into a dataframe,
- the dtypes are specified,
- the rest of the data is appended (see the sketch below).

It would be very useful for large tables. Casting to categories and narrower dtypes saves so much memory, which in turn decreases data-processing run-time significantly.
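
A rough sketch of that chunked pattern, assuming an open connection `conn`; the table name and dtype mapping are illustrative:

```python
import pandas as pd

# With chunksize, read_sql returns an iterator of DataFrames.
chunks = pd.read_sql("SELECT * FROM big_table", conn, chunksize=100_000)

# Read the first chunk and downcast it to memory-friendly dtypes.
dtypes = {"status": "category", "quantity": "int32"}
first = next(chunks).astype(dtypes)

# Cast each remaining chunk the same way before appending.
df = pd.concat([first] + [chunk.astype(dtypes) for chunk in chunks],
               ignore_index=True)
```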

TheFou commented 3 years ago

> Passing coerce_float=False skips the conversion, so maybe this is a non-issue.

I just did a test, and whether I pass coerce_float=True or False, NUMERIC columns in SQL tables get typed to float64.


I don't understand the logic here. In pd.read_csv(), it is possible to use converters= to load the data accurately. When exporting with to_sql(), one can use the dtype= parameter to ensure the Decimal type gets converted to NUMERIC in the database, once again keeping decimal accuracy. But there is no way to ensure that the data exported previously can be imported back while keeping decimal accuracy? Honestly, this makes no sense to me.
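
A sketch of the asymmetry described above, assuming a prices.csv file with an "amount" column (file, table, and column names are illustrative):

```python
import decimal

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")  # illustrative engine

# Reading a CSV: converters= loads the column as exact Decimal values.
df = pd.read_csv("prices.csv", converters={"amount": decimal.Decimal})

# Writing to SQL: dtype= maps the column to NUMERIC in the database.
df.to_sql("prices", engine, dtype={"amount": sa.Numeric(12, 4)}, index=False)

# Reading back: there is no equivalent knob, so the column returns as float64.
df2 = pd.read_sql("SELECT * FROM prices", engine)
```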

Failing workaround: if I initialize the target dataframe of the import from the DB using an emptied copy of an existing, correctly typed dataframe, the import resets the dtypes of the target dataframe. So it seems there is literally no way to keep accuracy when importing decimal data from a DB using pandas directly.
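
For clarity, a sketch of that failing workaround (all names are hypothetical); it fails because read_sql always builds a brand-new frame, so the template's dtypes never apply:

```python
import pandas as pd

# Hypothetical: typed_df is an existing frame whose "amount" column holds
# Decimal objects; an emptied copy preserves those dtypes.
template = typed_df.iloc[0:0]

# But read_sql constructs a brand-new frame, so the assignment simply
# replaces the template, and "amount" comes back as float64 regardless.
template = pd.read_sql("SELECT * FROM prices", engine)
```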

> For the reason that @jreback mentions (decimals cannot be stored natively in a pandas dataframe, only as object dtype), I don't think we want to change this by default. But it could be a good idea to at least let the user specify explicitly that they want to keep the decimal objects. That would be a welcome contribution, I think.

I'm currently learning Python, so I'm far from having a good enough level to do it (otherwise I would). But I really think this is as important as the counterpart parameters mentioned above, which have already been implemented, and it should be bumped up priority-wise. It is not only "a good idea"; it is needed for consistency in data accuracy.


Apart from that, thanks to all the developers contributing to this framework, which is a wonder even with these small quirks here and there ^^

felipead commented 2 years ago

👍 Please upvote this, I am having a very hard time with Pandas right now because of this issue.

jbrockmendel commented 1 year ago

@mroeschke possibly viable with pyarrow decimal?

mroeschke commented 1 year ago

Yeah, in theory this should be better supported with pyarrow decimal. The sqlalchemy layer will need to be changed, though, because I think it immediately converts decimals to floats.
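
For context, a minimal sketch of the pyarrow-backed decimal dtype in recent pandas (assumes pandas >= 2.0 with pyarrow installed); whether read_sql could target this dtype is exactly what is being discussed here:

```python
import decimal

import pandas as pd
import pyarrow as pa

# A decimal128 column held natively via the Arrow extension dtype.
s = pd.Series(
    [decimal.Decimal("1.10"), decimal.Decimal("2.20")],
    dtype=pd.ArrowDtype(pa.decimal128(10, 2)),
)
print(s.dtype)  # decimal128(10, 2)[pyarrow]
```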