tidyverse / readxl

Read excel files (.xls and .xlsx) into R 🖇
https://readxl.tidyverse.org
Other
730 stars 194 forks source link

read_excel() converts date to number #716

Open xinyongtian opened 2 years ago

xinyongtian commented 2 years ago

test1.xlsx image

I try to covert test1.xlsx to .csv file by using read_excel(), then write.csv() The excel file test1.xlsx contains date column (DOB) with some typo. After read, the correct DOB date become number (in character type). The read_excel() should at least keep DOB in original text format. If I use MS excel to save as .csv the date format is preserved.

library(readxl)
#> Warning: package 'readxl' was built under R version 4.1.3
df=read_excel('c:/test1.xlsx')
print(df)
#> # A tibble: 2 x 2
#>   NAME  DOB     
#>   <chr> <chr>   
#> 1 Jack  17995   
#> 2 Tom   12/37/38
write.csv(df,'test1.csv',row.names = F)
print(read.csv('test1.csv'))
#>   NAME      DOB
#> 1 Jack    17995
#> 2  Tom 12/37/38
matthewjnield commented 1 year ago

I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your Excel file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by read_excel(). Any of the above that you can provide will help me give a more accurate explanation of your output.

I created an imitation of your file by typing in the values that I see above by hand. I observed the following, and am making the assumption that these things are true for your file as well:

Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), read_excel() guessed that the entire column should be treated as character (text) data. This behavior avoids data loss; if it had guessed differently and applied a date type, cell B3 would have to be NA (a blank/missing value), and information that you had before would not have made it into your data.frame.

Cell B2 is imported as "17995" because that is what is actually stored in the Excel spreadsheet. When Excel interpreted what you typed in as a date, it implicitly converted it to the number 17995, which is the number of days between December 30, 1899 and the date entered (this is Microsoft's convention for storing date data). With the Date type applied, Excel knows to display the value as a date, while still storing the number "under the hood".

This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html

Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R:

library(readxl)
library(dplyr)
library(stringr)
library(lubridate)

# a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately
convert_excel_dates <-
    function(x){
        case_when(
            str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x),   # handles values imported as text values in the format "MM/DD/YYYY"
            str_detect(x, "^[0-9]{5}$")                         ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")),  # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)
            TRUE                                                ~ NA_Date_  # default case, no applicable date format, returns a missing date value
        )
    }

read_excel("test1.xlsx", col_types = "text") |>  # instruct read_excel() to simply import all columns as text, so that I can make tricky type conversions myself
    mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))
#> # A tibble: 3 × 3
#>   NAME   DOB      DOB_Converted_to_Date  
#>   <chr>  <chr>    <date>    
#> 1 Jack   17995    1949-04-07
#> 2 Tom    12/37/38 NA        
#> 3 George 02/17/80 1980-02-17

Handling of additional date formats can be added to case_when() as needed. I added one more row for this example to show that a value that is imported as text but still follows the "MM/DD/YY" format representing a valid date still gets converted correctly. This can happen if someone types in a valid date like "02/17/80" when the cell's Excel data type is set to Text, and Excel makes no attempt to interpret the value as a Date or number.

xinyongtian commented 1 year ago

Thank you for the explanation and walk around. I wonder how MS excel keeps original character format for B2 ("4/7/1949") when saving test1.xslx as .csv file. "4/7/1949" should be stored somewhere in excel. It would be nice for R to retrieve it ("4/7/1949") when R realizes column B is character.

On Sun, Jan 8, 2023, 8:46 AM Matt Nield @.***> wrote:

I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by read_excel(). Any of the above that you can provide will help me give a more accurate explanation of your output.

I created an imitation of your file by typing in the values that I see above by hand and observed the following; I am making the assumption that these things are true for your file as well:

  • Cell B2 was interpreted by Excel as a valid date (April 7, 1949) and was automatically assigned the Date type
  • Cell B3 could not be interpreted by Excel as anything other than just text, since it does not represent a valid date or numeric value, and so was kept as the default General type

Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), read_excel() guessed that the entire column should be treated as character (text) data. This behavior avoids data loss; if it had guessed differently and applied a date type, cell B3 would have to be NA (a blank/missing value), and information that you had before would not make it into your data.frame.

This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html

Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R:

library(readxl)

library(dplyr)

library(stringr)

library(lubridate)

a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately

convert_excel_dates <-

function(x){

    case_when(

        str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x),   # handles values imported as text values in the format "MM/DD/YYYY"

        str_detect(x, "^[0-9]{5}$")                         ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")),  # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)

        TRUE                                                ~ NA_Date_  # default case, no applicable date format, returns a missing date value

    )

}

read_excel("test1.xlsx", col_types = "text") |> # instruct read_exce() to simply import all columns as text, so that I can make tricky type conversions myself

mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))

> # A tibble: 3 × 3

> NAME DOB DOB_Converted_to_Date

>

> 1 Jack 17995 1949-04-07

> 2 Tom 12/37/38 NA

> 3 George 02/17/80 1980-02-17

Handling of additional date formats can be added to case_when() as needed. I added one more row for this example to show that a value that is imported as text but still follows the "MM/DD/YY" format but actually represents a valid date still gets converted correctly. This can happen if someone types in a valid date like "02/17/80" when the cell's Excel data type is set to Text, and Excel makes no attempt to interpret the value as a Date or number.

— Reply to this email directly, view it on GitHub https://github.com/tidyverse/readxl/issues/716#issuecomment-1374853170, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC32KONZP6AFA36LCQX5RV3WRLHNBANCNFSM6AAAAAAR427YVI . You are receiving this because you authored the thread.Message ID: @.***>

iamio87 commented 7 months ago

read_excel() converts dates to numbers, because it's not reading the metadata in the Excel file.

read_excel() does properly interpret numbers and texts, but dates are not yet implemented. The comments in the code suggest that the maintainers reasonably believed that Excel used the ISO 8601 format, but it does not.

Excel Spreadsheets store Date and Time information in a numeric format. For example, “1/1/2024 15:30” is stored as the numeric value “45292.645833333336”. CleanShot 2024-03-28 at 10 28 07@2x

The whole number portion (xxxxxx.0) represents the number of days since the Excel epoch, which begins on December 30, 1899. (They intended it to be January 1, 1900, but they calculated the days wrong in the first version of Excel).

The decimal portion (0.xxxxx) represents the time of day. This applies even when the time of day is the only information in the cell. A cell that only had time information “15:30” would still be stored as “0.645833333336” in the Excel file.

To understand the basic data in an Excel spreadsheet, the parser must consult with 3 separate locations.

CleanShot 2024-03-28 at 10 52 31@2x

The worksheet xml files contain all numeric data, and references for text and date information for each worksheet in the Excel file. The text values for all worksheets are stored in sharedstrings.xml. Dates, times, currency, and other numeric-like formatting for all worksheets is stored in the styles.xml.

In order to properly parse dates or formats like currency, read_excel would have to reference the following data structures in the styles.xml:

Each xf::numFmtId corresponds to a different format type. For example:

An example cell in the worksheet XML might look like this:

45369

The s="1" attribute uses a 0-base index to indicate which cellXfs -> xf node controls the formatting for that cell. Assuming the example stylesheet above, s="1" would indicate the second node in - - would apply. So it would be a date formatted as MM/DD/YY.

Submitting a patch is outside of my scope of expertise, but I did want to provide an explanation for why read_excel() casts Dates to Numbers as a resource for possible future development.