Open · postelrich opened 6 years ago
`coerce_float=False` skips the conversion so maybe a non-issue.
`decimal.Decimal` is not a first class type, so we generally don't do things like this. But I suppose it could be done if a SQL dtype is unambiguously a decimal type.
For the reason that @jreback mentions (decimals cannot be stored natively in a pandas dataframe, only as object dtype), I don't think we want to change this by default. But it could be a good idea to at least enable the user to specify explicitly that they want to keep the decimal objects. That would be a welcome contribution, I think.
There is a very old, closed issue that raises some good points about dtypes:
https://github.com/pandas-dev/pandas/issues/6798
If `read_sql` accepted a parameter for specifying the dtypes of the resulting table, that would help people a lot.
Alternatively, add an example to the documentation (see the sketch below) where:
- `read_sql` is called with a `chunksize`
- the first chunk is read into a dataframe
- the dtypes are specified
- the rest of the data is appended
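A minimal sketch of that pattern, assuming an existing table and a SQLAlchemy connection (the table name, DSN, and dtype map below are all made up for illustration):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///example.db")   # placeholder connection

# Stream the table in chunks instead of loading it all at once.
chunks = pd.read_sql("SELECT * FROM trades", con=engine, chunksize=50_000)
dtypes = {"qty": "int32", "venue": "category"}

# Read the first chunk and pin the dtypes down explicitly...
df = next(chunks).astype(dtypes)

# ...then cast each remaining chunk the same way before appending.
df = pd.concat([df, *(chunk.astype(dtypes) for chunk in chunks)],
               ignore_index=True)
```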
It would be very useful for large tables: casting to categories and smaller-width dtypes saves so much memory, which in turn decreases data-processing run time significantly.
> `coerce_float=False` skips the conversion so maybe a non-issue.
I just did a test, and whether I use `coerce_float=True` or `False`, `NUMERIC` columns in SQL tables get typed to `float64`.
I don't understand the logic here.
In `pd.read_csv()`, it is possible to use `converters=` to get data loaded accurately.
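For example, this keeps full decimal precision on the way in (inline CSV used just for the demo):

```python
import pandas as pd
from decimal import Decimal
from io import StringIO

csv = StringIO("amount\n0.10\n0.20\n")

# Route the column through decimal.Decimal instead of the float parser.
df = pd.read_csv(csv, converters={"amount": Decimal})

print(df["amount"].dtype)   # object -- each value is a decimal.Decimal
print(df["amount"].sum())   # Decimal('0.30'), exact
```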
When exporting with `to_sql()`, one can use the `dtype=` parameter to ensure the `Decimal` type gets stored as `NUMERIC` in the database, once again keeping decimal accuracy.
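And on the way out, continuing the frame from the snippet above (the engine and the NUMERIC precision/scale are placeholder choices):

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///example.db")   # placeholder connection

# Map the Decimal column to a NUMERIC column so the database keeps exact values.
df.to_sql("amounts", engine, if_exists="replace", index=False,
          dtype={"amount": sa.Numeric(18, 6)})
```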
But there is no way to ensure that the data exported this way can be imported back while keeping decimal accuracy?
Honestly, this makes no sense to me.
Failed workaround: if I initialize the target dataframe of the import with an emptied copy of an existing, correctly typed dataframe, the import resets the dtypes of the target dataframe anyway. So it seems there is literally no option to keep accuracy when importing decimal data from a DB using pandas directly.
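Roughly what that attempt looks like, with a stand-in for the `read_sql` result (the frames here are hypothetical): the Decimal precision is already gone by the time `read_sql` hands back `float64` values, so nothing downstream can recover it:

```python
import pandas as pd
from decimal import Decimal

# A correctly typed frame (object dtype holding Decimals), emptied as a template.
typed_df = pd.DataFrame({"amount": [Decimal("0.10")]})
template = typed_df.iloc[0:0]

# Stand-in for what read_sql actually returns: the NUMERIC column as float64.
loaded = pd.DataFrame({"amount": [0.1, 0.2]})

# Concatenating onto the template does not restore Decimals; the exact
# values were lost when the column was coerced to float64.
out = pd.concat([template, loaded], ignore_index=True)
print(out["amount"].map(type).unique())   # floats, not Decimals
```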
> For the reason that @jreback mentions (decimals cannot be stored natively in a pandas dataframe, only as object dtype), I don't think we want to change this by default. But it could be a good idea to at least enable the user to specify explicitly that they want to keep the decimal objects. That would be a welcome contribution, I think.
I'm currently learning Python, so I'm far from having a good enough level to do it myself (otherwise I would). But I really think this is as important as the counterpart parameters mentioned above, which have already been implemented, and it should be bumped up priority-wise. Not only is it "a good idea", it is needed for consistency in data accuracy.
Apart from that, thanks to all the developers contributing to this framework, which is a wonder even with these small quirks here and there ^^
👍 Please upvote this; I am having a very hard time with pandas right now because of this issue.
@mroeschke possibly viable with pyarrow decimal?
Yeah, this in theory should be better supported with pyarrow decimal. The SQLAlchemy layer will need to be changed, though, because I think it immediately does the conversion of decimal to float.
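For reference, pyarrow's decimal128 already round-trips through a pandas column via `ArrowDtype` (pandas >= 2.0), so in principle the missing piece is the `read_sql`/SQLAlchemy plumbing rather than storage:

```python
import pandas as pd
import pyarrow as pa
from decimal import Decimal

arr = pa.array([Decimal("0.10"), Decimal("0.20")], type=pa.decimal128(10, 2))

# An Arrow-backed column keeps the values as exact decimals.
s = pd.Series(pd.arrays.ArrowExtensionArray(arr))
print(s.dtype)   # decimal128(10, 2)[pyarrow]
print(s[0])      # Decimal('0.10'), exact
```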
`pd.read_sql` converts columns of type `sqlalchemy.Numeric` to floats, as the selection below appears to suggest. This seems counterintuitive, since in SQL the `NUMERIC` and `DECIMAL` types are interchangeable. Also, the `NUMERIC` type has an option `asdecimal` that defaults to true (http://docs.sqlalchemy.org/en/latest/core/type_basics.html#sqlalchemy.types.Numeric). I would expect that `sqlalchemy.Numeric` types get converted to `decimal.Decimal`.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1187-L1190
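A minimal reproduction sketch of this (the DSN and table are placeholders; with most DBAPI drivers the NUMERIC column arrives as `decimal.Decimal` before pandas coerces it):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN

with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE t (x NUMERIC(10, 2))"))
    conn.execute(sa.text("INSERT INTO t VALUES (0.10)"))

# The driver yields decimal.Decimal for x, but read_sql's coerce_float
# step turns the column into float64 anyway.
df = pd.read_sql("SELECT x FROM t", engine)
print(df["x"].dtype)   # float64
```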