pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks source link

BUG: ExcelWriter set_column with num_format datetime doesn't work (but it works for xlsxwriter) #55196

Open lrisch opened 1 year ago

lrisch commented 1 year ago

Pandas version checks

Reproducible Example

from datetime import datetime
import pandas as pd

writer = pd.ExcelWriter('datetime_bug.xlsx', engine='xlsxwriter')
workbook = writer.book

# datetime from pandas is not converted with the specified num_format
data = pd.DataFrame([[0.05, datetime(year=2020, month=1, day=1)]], columns=["percentage", "datetime_date"])

data.to_excel(writer, sheet_name="test_sheet", index=False)

percentage_format = workbook.add_format({'num_format':'0.00%'})
date_format = workbook.add_format({'num_format':'dd.mm.yyyy'})

writer.sheets["test_sheet"].set_column('A:A', width=None, cell_format=percentage_format)
writer.sheets["test_sheet"].set_column('B:B', width=None, cell_format=date_format)

workbook.close()

Issue Description

I created an Excel sheet using pd.to_excel(). The data contains one column with a float number and one column with a datetime.datetime object.

Then, I want to set the column type using the num_format parameter in set_column.

This works for the float value, which I want in the percentage format. However, it does not work for the datetime value. The value is in the wrong format "01.01.2020 00:00:00" (German date). The value in the Excel file is in the same wrong format, whether I run set_column() for the date column or not.

I have tried to circle in the problem by trying other variations, which turns out that they do work. However, those would be only workarounds for our use case.

  1. Using the parameter datetime_format='dd.mm.yyyy' in the pd.ExcelWriter constructor puts the date into the right format.

  2. Instead of writing the sheet from a DataFrame, using the xlsxwriter functions works fine:

    
    import xlsxwriter
    from datetime import datetime

workbook = xlsxwriter.Workbook('datetime_bug.xlsx') worksheet = workbook.add_worksheet()

using the xlsxwriter functions works as intended

worksheet.write('A1', 0.05) worksheet.write('B1', datetime(year=2020, month=1, day=1))

percentage_format = workbook.add_format({'num_format':'0.00%'}) date_format = workbook.add_format({'num_format':'dd.mm.yyyy'})

worksheet.set_column('A:A', width=None, cell_format=percentage_format) worksheet.set_column('B:B', width=None, cell_format=date_format)

workbook.close()



### Expected Behavior

The value is in the format "01.01.2020" when using `set_column` with `pd.ExcelWriter`.

### Installed Versions

<details>
INSTALLED VERSIONS
------------------
commit              : ba1cccd19da778f0c3a7d6a885685da16a072870
python              : 3.9.11.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 4.18.0-477.21.1.el8_8.x86_64
Version             : #1 SMP Thu Jul 20 08:38:27 EDT 2023
machine             : x86_64
processor           : 
byteorder           : little
LC_ALL              : None
LANG                : C.UTF-8
LOCALE              : en_US.UTF-8

pandas              : 2.1.0
numpy               : 1.26.0
pytz                : 2023.3
dateutil            : 2.8.2
setuptools          : 65.7.0
pip                 : 22.0.4
Cython              : None
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : 1.3.3
lxml.etree          : None
html5lib            : 1.1
pymysql             : None
psycopg2            : None
jinja2              : 3.1.2
IPython             : 8.5.0
pandas_datareader   : None
bs4                 : 4.12.2
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : None
gcsfs               : None
matplotlib          : 3.5.1
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : 3.0.9
pandas_gbq          : None
pyarrow             : 3.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : 1.7.3
sqlalchemy          : None
tables              : None
tabulate            : 0.8.7
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.3
qtpy                : None
pyqt5               : None
</details>
rhshadrach commented 1 year ago

set_column does not change the format; it changes the default format:

The cell_format parameter will be applied to any cells in the column that don’t have a format.

Since pandas is setting the format for date cells, it has no effect. As far as I can tell there is no easy way to overwrite an existing format with xlsxwriter. The best resolution I see for this within pandas is to use datetime_format as you mentioned.

Pavanmahaveer7 commented 1 year ago

So here if we are looking to change any format of date column, by using strftime function we could make the possible changes to the date values.

rhshadrach commented 1 year ago

@Pavanmahaveer7 - I think you are saying to modify the format prior calling pd.write_excel. Yes, that is an option, but then I believe the Excel file will not have the Date datatype.

lrisch commented 1 year ago

@rhshadrach Okay thanks for that information, I didn't know it only changed the default format. In that case we will keep using datetime_format 👍

rhshadrach commented 1 year ago

Currently the default value of datetime_formatstr is None, which gives you the format "YYYY-MM-DD HH:MM:SS". We could instead make the default value be ""YYYY-MM-DD HH:MM:SS" and then if None is passed, apply no formatting. It would allow you to then apply your own formatting with xlsxwriter. However, without applying any formatting I think the values would then appear as integers in the Excel sheet - and this would perhaps cause confusion.

I still think a better resolution is to have a way to override the existing format via xlsxwriter.

lrisch commented 1 year ago

@rhshadrach Im sorry, I dont really understand your last message. As I understand it, I cannot use set_column because pandas is overwriting whatever I set as a format there. Therefore I have to use datetime_format (which is fine, it would have been nicer to be able to use set_column since we use it for the other datatypes to, but it is ok as a workaround). But I dont understand, why do you want to change the default of datetime_format? And what do you mean by this: "I still think a better resolution is to have a way to override the existing format via xlsxwriter."

rhshadrach commented 1 year ago

As I understand it, I cannot use set_column because pandas is overwriting whatever I set as a format there.

Depending on interpretation, I think this is not correct - in particular pandas is not overwriting. pandas sets the format as "YYYY-MM-DD HH:MM:SS" first and then you are calling set_column. The issue is that calling set_column does nothing because it only modifies the default format. Modifying the default format has no impact because pandas has set a (non-default) format.

If users could tell pandas to not set any format for datetime (my proposal above - which I do not like but it is an option), then changing the default format would have an impact and so your call to set_column would have the desired effect.

It seems better to me if xlsxwriter gave you the ability to modify the format on an entire column, and not just the default format. I believe xlsxwriter has no such function.

rhshadrach commented 1 year ago

Perhaps my use of "default format" is confusing and I should be saying "no format" instead.

rhshadrach commented 9 months ago

Looking into this again, I'm now understanding that Excel has cell, row, and column formats. The set_column sets the column format, but priority is given to any cell in the column that has a cell format set.

Also, ExcelWriter does have date_format and datetime_format arguments that can be set!

rhshadrach commented 9 months ago

@jmcnamara - I was wondering if you could confirm my understanding here is correct. I believe this comment is self-contained, so hopefully you don't have to wade through the above comments :smile:.

pandas writes dates and datetimes by applying a cell format to particular cells. This makes xlxswriter's set_column (and set_row) ineffective on such cells - perhaps surprisingly until you understand how Excel has different formatters for rows, columns, and cells.

I was wondering if pandas should do something different here - but it appears there is no "default date format" in Excel. XlsxWriter does have a default_date_format, but it appears this is implemented by setting the cell format, similar to pandas:

https://github.com/jmcnamara/XlsxWriter/blob/cedf448ad2139ae3fb88dd618b0d78d67a39f2f5/xlsxwriter/worksheet.py#L1102-L1107

So users wanting to change the format of date and datetime columns that already have a format applied just need to loop over the cells in a column.

jmcnamara commented 9 months ago

I was wondering if you could confirm my understanding here is correct.

Yes that is correct. To reiterate: Pandas applies a cell format when writing a datetime and that cell format can't/won't be overwritten by a row or column format.

@rhshadrach I think the proposal to have change the behaviour of datetime_format=None and date_format=None in pd.ExcelWriter() so that they don't apply a default format is a good one. That would allow the user to apply a format via set_column(). In most cases applying the number format via datetime_format and date_format is sufficient for most users but sometimes they need to also apply font or border formatting as well and then the ability to turn off the Pandas formatting and use set_column() instead would be useful.

For reference here are the XlsxWriter docs where I try to point people in the right direction on Formatting of the Dataframe output.

rhshadrach commented 9 months ago

Thanks @jmcnamara!

@rhshadrach I think the proposal to have change the behaviour of datetime_format=None and date_format=None in pd.ExcelWriter() so that they don't apply a default format is a good one.

Makes sense - but I would think just not as the default. I'll put up a PR for this and improve some of the docs.