nacnudus / tidyxl

Read untidy Excel files in R https://nacnudus.github.io/tidyxl/
https://nacnudus.github.io/tidyxl/
Other
246 stars 20 forks source link

Encoding issue when using cell references on Windows 10 #64

Closed patperu closed 3 years ago

patperu commented 3 years ago

Hi,

I'm having a problem with the encoding (german umlaute) on Windows 10 when using a cell reference. The test file contains two sheets, where "Sheet2" has references to "Sheet1". On Windows the conversion works for "Sheet1" but fails for "Sheet2". This problem does not occur when using Ubuntu.

Thanks for any tip and for this great package!! Patrick

On Windows it fails

library(tidyxl)
#> Warning: Paket 'tidyxl' wurde unter R Version 3.6.3 erstellt

s1 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet1')
s2 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet2_with_cell_ref_to_Sheet1')

s1[, c("sheet", "address", "character")]
#>    sheet address                            character
#> 1 Sheet1      B3                                    A
#> 2 Sheet1      B4             Einführung oder Änderung
#> 3 Sheet1      B5                            Kapazität
#> 4 Sheet1      B6                          Bauüberhang
#> 5 Sheet1      B7 (Zeilenumbrüche mittels Alt + Enter)
s2[, c("sheet", "address", "character")]
#>                            sheet address                             character
#> 1 Sheet2_with_cell_ref_to_Sheet1      B4            Einführung oder Änderung
#> 2 Sheet2_with_cell_ref_to_Sheet1      B5                            Kapazität
#> 3 Sheet2_with_cell_ref_to_Sheet1      B6                          Bauüberhang
#> 4 Sheet2_with_cell_ref_to_Sheet1      B7 (Zeilenumbrüche mittels Alt + Enter)
#> 5 Sheet2_with_cell_ref_to_Sheet1      B9                               Zuzüge

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyxl_1.0.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.5.0
#>  [5] yaml_2.2.1      Rcpp_1.0.5      stringi_1.5.3   rmarkdown_2.5  
#>  [9] highr_0.8       knitr_1.30      stringr_1.4.0   xfun_0.19      
#> [13] digest_0.6.27   rlang_0.4.8     evaluate_0.14

On Ubuntu it works

library(tidyxl)

s1 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet1')
s2 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet2_with_cell_ref_to_Sheet1')

s1[, c("sheet", "address", "character")]
#> # A tibble: 5 x 3
#>   sheet  address character                           
#>   <chr>  <chr>   <chr>                               
#> 1 Sheet1 B3      A                                   
#> 2 Sheet1 B4      Einführung oder Änderung            
#> 3 Sheet1 B5      Kapazität                           
#> 4 Sheet1 B6      Bauüberhang                         
#> 5 Sheet1 B7      (Zeilenumbrüche mittels Alt + Enter)
s2[, c("sheet", "address", "character")]
#> # A tibble: 5 x 3
#>   sheet                          address character                           
#>   <chr>                          <chr>   <chr>                               
#> 1 Sheet2_with_cell_ref_to_Sheet1 B4      Einführung oder Änderung            
#> 2 Sheet2_with_cell_ref_to_Sheet1 B5      Kapazität                           
#> 3 Sheet2_with_cell_ref_to_Sheet1 B6      Bauüberhang                         
#> 4 Sheet2_with_cell_ref_to_Sheet1 B7      (Zeilenumbrüche mittels Alt + Enter)
#> 5 Sheet2_with_cell_ref_to_Sheet1 B9      Zuzüge

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
#>  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
#>  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyxl_1.0.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5           rstudioapi_0.13      knitr_1.30          
#>  [4] magrittr_2.0.1       rlang_0.4.8          fansi_0.4.1         
#>  [7] stringr_1.4.0        styler_1.3.2         highr_0.8           
#> [10] tools_3.6.3          xfun_0.19            utf8_1.1.4          
#> [13] cli_2.2.0            htmltools_0.5.0.9002 ellipsis_0.3.1      
#> [16] assertthat_0.2.1     yaml_2.2.1           digest_0.6.27       
#> [19] tibble_3.0.4         lifecycle_0.2.0.9000 crayon_1.3.4        
#> [22] purrr_0.3.4          vctrs_0.3.5          fs_1.5.0            
#> [25] glue_1.4.2           evaluate_0.14        rmarkdown_2.5       
#> [28] reprex_0.3.0.9001    stringi_1.5.3        compiler_3.6.3      
#> [31] pillar_1.4.7         backports_1.2.0      pkgconfig_2.0.3

Created on 2020-11-24 by the reprex package (v0.3.0)

Windows and Excel version

Edition Windows 10 Pro
Version 20H2
Betriebssystembuild 19042.630

Microsoft Excel 2016 (16.0.5083.100)

Test file 'german_umlaute_utf8.xlsx'

german_umlaute_utf8.xlsx

nacnudus commented 3 years ago

Hi, sorry you've had this problem, and thank you for providing a detailed report. I don't have access to a Windows machine, but I will see what I can do.

patperu commented 3 years ago

Hi, thanks for looking into it! Let me know, if you need more information or if I should test something.

nacnudus commented 3 years ago

This might be related to something in a readxl issue https://github.com/tidyverse/readxl/issues/125, which found that the problem was to do with R itself on Windows, rather than anything in the package. It might be worth trying some of the suggestions in that issue.

Unfortunately, I don't think it's a problem that can be fixed in tidyxl, because tidyxl uses the same code as readxl to parse strings, and readxl is believed to be correct. Sorry I can't be more help.

patperu commented 3 years ago

Hi! Thanks for finding an explanation! I didn't realise that readxl and tidyxl are using the same library. I'll check the readxl issue. Right now I found a simple solution with Encoding(). Calling Encoding on the character and formula column fix the issue.

Encoding(s2$character) <- "UTF-8"
Encoding(s2$formula) <- "UTF-8"

Now it works:

library(tidyxl)

s1 <- tidyxl::xlsx_cells("H:/test/german_umlaute_utf8.xlsx", sheets = 'Sheet1')
s2 <- tidyxl::xlsx_cells("H:/test/german_umlaute_utf8.xlsx", sheets = 'Sheet2_with_cell_ref_to_Sheet1')

s1[, c("sheet", "address", "character")]
#>    sheet address                            character
#> 1 Sheet1      B3                                    A
#> 2 Sheet1      B4             Einführung oder Änderung
#> 3 Sheet1      B5                            Kapazität
#> 4 Sheet1      B6                          Bauüberhang
#> 5 Sheet1      B7 (Zeilenumbrüche mittels Alt + Enter)
s2[, c("sheet", "address", "character", "formula")]
#>                            sheet address                             character
#> 1 Sheet2_with_cell_ref_to_Sheet1      B4            Einführung oder Änderung
#> 2 Sheet2_with_cell_ref_to_Sheet1      B5                            Kapazität
#> 3 Sheet2_with_cell_ref_to_Sheet1      B6                          Bauüberhang
#> 4 Sheet2_with_cell_ref_to_Sheet1      B7 (Zeilenumbrüche mittels Alt + Enter)
#> 5 Sheet2_with_cell_ref_to_Sheet1      B9                               Zuzüge
#>                            formula
#> 1                        Sheet1!B4
#> 2                        Sheet1!B5
#> 3                        Sheet1!B6
#> 4                        Sheet1!B7
#> 5 IF(Sheet1!B3="A","Zuzüge","d2")

Encoding(s2$character) <- "UTF-8"
Encoding(s2$formula) <- "UTF-8"

# Encoding fixed
s2[, c("sheet", "address", "character", "formula")]
#>                            sheet address                            character
#> 1 Sheet2_with_cell_ref_to_Sheet1      B4             Einführung oder Änderung
#> 2 Sheet2_with_cell_ref_to_Sheet1      B5                            Kapazität
#> 3 Sheet2_with_cell_ref_to_Sheet1      B6                          Bauüberhang
#> 4 Sheet2_with_cell_ref_to_Sheet1      B7 (Zeilenumbrüche mittels Alt + Enter)
#> 5 Sheet2_with_cell_ref_to_Sheet1      B9                               Zuzüge
#>                           formula
#> 1                       Sheet1!B4
#> 2                       Sheet1!B5
#> 3                       Sheet1!B6
#> 4                       Sheet1!B7
#> 5 IF(Sheet1!B3="A","Zuzüge","d2")

Created on 2020-12-09 by the reprex package (v0.3.0)

nacnudus commented 3 years ago

I think this is now fixed. If not, please reopen the issue.

patperu commented 3 years ago

Hi @nacnudus

thanks a lot for working on this issue! It is fixed for the character column, but unfortunately still exists with the formula column (row 10): You can see the difference between Windows and Linux.

PS: Sorry, I can't reopen the issue.

Windows

library(tidyxl)

packageVersion('tidyxl')
#> [1] '1.0.7.9000'
Sys.info()['sysname']
#>   sysname 
#> "Windows"

x <- xlsx_cells("german_umlaute_utf8.xlsx")

x[, c("sheet", "character", "formula")]
#>                             sheet                            character                          formula
#> 1                          Sheet1                                    A                             <NA>
#> 2                          Sheet1             Einführung oder Änderung                             <NA>
#> 3                          Sheet1                            Kapazität                             <NA>
#> 4                          Sheet1                          Bauüberhang                             <NA>
#> 5                          Sheet1 (Zeilenumbrüche mittels Alt + Enter)                             <NA>
#> 6  Sheet2_with_cell_ref_to_Sheet1             Einführung oder Änderung                        Sheet1!B4
#> 7  Sheet2_with_cell_ref_to_Sheet1                            Kapazität                        Sheet1!B5
#> 8  Sheet2_with_cell_ref_to_Sheet1                          Bauüberhang                        Sheet1!B6
#> 9  Sheet2_with_cell_ref_to_Sheet1 (Zeilenumbrüche mittels Alt + Enter)                        Sheet1!B7
#> 10 Sheet2_with_cell_ref_to_Sheet1                               Zuzüge IF(Sheet1!B3="A","Zuzüge","d2")

testthat::expect_equal(Encoding(x$character[-1]), rep("UTF-8", 9))
testthat::expect_equal(Encoding(x$formula[-1]), rep("UTF-8", 9))
#> Error: Encoding(x$formula[-1]) not equal to rep("UTF-8", 9).
#> 9/9 mismatches
#> x[1]: "unknown"
#> y[1]: "UTF-8"
#> 
#> x[2]: "unknown"
#> y[2]: "UTF-8"
#> 
#> x[3]: "unknown"
#> y[3]: "UTF-8"
#> 
#> x[4]: "unknown"
#> y[4]: "UTF-8"
#> 
#> x[5]: "unknown"
#> y[5]: "UTF-8"

Created on 2021-01-05 by the reprex package (v0.3.0)

Linux

library(tidyxl)
library(testthat)

packageVersion('tidyxl')
#> [1] '1.0.7.9000'
Sys.info()['sysname']
#> sysname 
#> "Linux"

x <- xlsx_cells("german_umlaute_utf8.xlsx")

x[, c("sheet", "character", "formula")]
#> # A tibble: 10 x 3
#>    sheet                          character                            formula                                
#>    <chr>                          <chr>                                <chr>                                  
#>  1 Sheet1                         A                                     <NA>                                  
#>  2 Sheet1                         Einführung oder Änderung              <NA>                                  
#>  3 Sheet1                         Kapazität                             <NA>                                  
#>  4 Sheet1                         Bauüberhang                           <NA>                                  
#>  5 Sheet1                         (Zeilenumbrüche mittels Alt + Enter)  <NA>                                  
#>  6 Sheet2_with_cell_ref_to_Sheet1 Einführung oder Änderung             "Sheet1!B4"                            
#>  7 Sheet2_with_cell_ref_to_Sheet1 Kapazität                            "Sheet1!B5"                            
#>  8 Sheet2_with_cell_ref_to_Sheet1 Bauüberhang                          "Sheet1!B6"                            
#>  9 Sheet2_with_cell_ref_to_Sheet1 (Zeilenumbrüche mittels Alt + Enter) "Sheet1!B7"                            
#> 10 Sheet2_with_cell_ref_to_Sheet1 Zuzüge                               "IF(Sheet1!B3=\"A\",\"Zuzüge\",\"d2\")"

testthat::expect_equal(Encoding(x$character[-1]), rep("UTF-8", 9))
testthat::expect_equal(Encoding(x$formula[-1]), rep("UTF-8", 9))
#> Error: Encoding(x$formula[-1]) not equal to rep("UTF-8", 9).
#> 9/9 mismatches
#> x[1]: "unknown"
#> y[1]: "UTF-8"
#> 
#> x[2]: "unknown"
#> y[2]: "UTF-8"
#> 
#> x[3]: "unknown"
#> y[3]: "UTF-8"
#> 
#> x[4]: "unknown"
#> y[4]: "UTF-8"
#> 
#> x[5]: "unknown"
#> y[5]: "UTF-8"

Created on 2021-01-05 by the reprex package (v0.3.0.9001)

nacnudus commented 3 years ago

Sorry about that. With the file you provided, the comment column still won't have UTF-8 encoding, because there aren't any comments. I've added some to the file for the sake of a test.

patperu commented 3 years ago

Hi @nacnudus,

thank you! Yes, I never touched the comment column - only character and formula. I created a new testfile utf8-cities.xlsx to test more languages and to simplify the structure.

AFAICS, at least the four columns sheet, character, comment and formula need the UTF-8 encoding.

For character and comment this is the case, but unfortunately not for the sheet and formula column.

Best Patrick


library(tidyxl)
# Version
# tidyxl* 1.0.7.9000 2021-01-06 [1] Github (nacnudus/tidyxl@8a1ceac)

Sys.info()['sysname']
#>   sysname 
#> "Windows"

x <- xlsx_cells("utf8-cities.xlsx")

x[, c("sheet", "address", "character", "comment", "formula")]
#>     sheet address  character            comment       formula
#> 1 Städte      A1      Nærbø   Norwegian, Nærbø     "Nærbø"
#> 2 Städte      A2       Köln       German, Köln       "Köln"
#> 3 Städte      A3       Chéu       French, Chéu       "Chéu"
#> 4 Städte      A4     España      Spain, España     "España"
#> 5 Städte      A5 Klitmøller Danish, Klitmøller "Klitmøller"

testthat::expect_equal(Encoding(x$sheet), rep("UTF-8", 5))
#> Error: Encoding(x$sheet) not equal to rep("UTF-8", 5).
#> 5/5 mismatches
#> x[1]: "unknown"
#> y[1]: "UTF-8"
#> 
#> x[2]: "unknown"
#> y[2]: "UTF-8"
#> 
#> x[3]: "unknown"
#> y[3]: "UTF-8"
#> 
#> x[4]: "unknown"
#> y[4]: "UTF-8"
#> 
#> x[5]: "unknown"
#> y[5]: "UTF-8"

testthat::expect_equal(Encoding(x$character), rep("UTF-8", 5))

testthat::expect_equal(Encoding(x$formula), rep("UTF-8", 5))
#> Error: Encoding(x$formula) not equal to rep("UTF-8", 5).
#> 5/5 mismatches
#> x[1]: "unknown"
#> y[1]: "UTF-8"
#> 
#> x[2]: "unknown"
#> y[2]: "UTF-8"
#> 
#> x[3]: "unknown"
#> y[3]: "UTF-8"
#> 
#> x[4]: "unknown"
#> y[4]: "UTF-8"
#> 
#> x[5]: "unknown"
#> y[5]: "UTF-8"

testthat::expect_equal(Encoding(x$comment), rep("UTF-8", 5))

Created on 2021-01-10 by the reprex package (v0.3.0)

utf8-cities.xlsx

nacnudus commented 3 years ago

Okay, this time .... this time I think it's finally fixed. Thank you for taking the trouble to write a minimal test file.