tfussell / xlnt

:bar_chart: Cross-platform user-friendly xlsx library for C++11+
Other
1.5k stars 423 forks source link

Reading date formats #208

Closed chris-b1 closed 7 years ago

chris-b1 commented 7 years ago

The python libraries I'm familiar with will automatically sniff formats and set cells as a date type if so. Would you have any objection to doing the same here? Otherwise date columns come through as floats on the arrow side. Change is something like this, possibly would want to cache the format parsing.

Side note, Excel seems to look at formats case-insensitively, I have a file with formats like 'YYYY-MM', spec seems to specify lowercase, but I think it would be nice to do the same, otherwise get an error on them.

diff --git a/source/cell/cell.cpp b/source/cell/cell.cpp
index 82743ae..2d66c6a 100644
--- a/source/cell/cell.cpp
+++ b/source/cell/cell.cpp
@@ -695,6 +695,10 @@ void cell::format(const class format new_format)
     {
         format().d_->references -= format().d_->references > 0 ? 1 : 0;
     }
+
+    if (new_format.number_format().is_date_format()) {
+        data_type(cell::type::date);
+    }

     ++new_format.d_->references;
     d_->format_ = new_format.d_;
tfussell commented 7 years ago

I have been putting off trying to figure out how to make the cell type and date number formats agree. This seems like a good approach, but I will have to see how other libraries/applications handle this. I'll report my findings here.

tfussell commented 7 years ago

I wanted to note here that I've just implemented upper-case date/time number format support in b2adee9fd315172066e380fcb6f8ddcb43bb59d7 since LibreOffice uses those by default.

tfussell commented 7 years ago

There's a nice discussion about the two types of dates (serial and ISO) in the js-xlsx repo: https://github.com/SheetJS/js-xlsx/issues/126

tfussell commented 7 years ago

I don't really feel comfortable changing the internal type of a cell based on its format because my goal since the beginning of this project has been perfect round-tripping of XLSX files. This would result in a file that is read from the filesystem and then written back having different cell types. I think it should be up to library users to check if the number format is a date format as you indicated in your example code. In the case of xlntpyarrow, I should be able to check if the cell type is a date or if the number format is a date format to infer the correct pyarrow.Field for the column. I'll try that now.

paulharris commented 7 years ago

Whatever you can do to make dates sane would be appreciated.

The main reason I want to read XLS files is because opening CSV files with dates in Excel is always an epic fail, thanks to the "helpful" and somewhat random format guessing.

On 13 September 2017 at 23:45, Thomas Fussell notifications@github.com wrote:

I don't really feel comfortable changing the internal type of a cell based on its format because my goal since the beginning of this project has been perfect round-tripping of XLSX files. This would result in a file that is read from the filesystem and then written back having different cell types. I think it should be up to library users to check if the number format is a date format as you indicated in your example code. In the case of xlntpyarrow, I should be able to check if the cell type is a date or if the number format is a date format to infer the correct pyarrow.Field for the column. I'll try that now.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tfussell/xlnt/issues/208#issuecomment-329210078, or mute the thread https://github.com/notifications/unsubscribe-auth/ABkgS7Id1mA3yMFP9PChN5s5mr8gaemFks5sh_iDgaJpZM4PDlAC .

chris-b1 commented 7 years ago

Makes sense, I didn't know xml spec had a true date type - I was thinking your date type was simply an internal representation and it would always be numeric on serialization. So, I agree on not changing the type.

That said, still seems like it might be nice to provide some way to work with numeric dates as if they are dates that isn't entirely on the user, but for my purpose certainly enough to do that in the arrow logic. Thanks!

tfussell commented 7 years ago

I think this is basically resolved so I'm going to close this issue. Please open a new issue if you have suggestions for further improvements to date handling.