ropensci / git2rdata

An R package for storing and retrieving data.frames in git repositories.
https://ropensci.github.io/git2rdata/
GNU General Public License v3.0

Data hashes seem to differ between Windows and Linux #49

Closed florisvdh closed 5 years ago

florisvdh commented 5 years ago

This issue uses the reprex from issue #47.

I do not get those errors, but my output on Linux is always:

3e6fbe383532f4312bd0f5c9f30976f64d00e9cc e5e6ed33018f669308297f2f3d66512b3fa8c1b6 
                     "../data/df_vc.tsv"                      "../data/df_vc.yml" 

This data_hash (stored in the yml file) differs from the one generated on Windows.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.1

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=nl_BE.UTF-8 LC_NUMERIC=C LC_TIME=nl_BE.UTF-8
 [4] LC_COLLATE=nl_BE.UTF-8 LC_MONETARY=nl_BE.UTF-8 LC_MESSAGES=nl_BE.UTF-8
 [7] LC_PAPER=nl_BE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] git2rdata_0.1

loaded via a namespace (and not attached):
[1] drat_0.1.5 compiler_3.6.1 assertthat_0.2.1 tools_3.6.1 yaml_2.2.0
[6] git2r_0.26.1 packrat_0.5.0 fortunes_1.5-4

ThierryO commented 5 years ago

I can reproduce this. It seems that git2r::hashfile() yields different output on Linux and Windows.

filename <- tempfile("os-bug")
writeLines(
  c("x\ty", "1\t1", "2\t2", "3\t3", "4\t4", "5\t5", "6\t6", "7\t7", 
    "8\t8", "9\t9", "10\t10", "11\t11", "12\t12", "13\t13", "14\t14", 
    "15\t15", "16\t16", "17\t17", "18\t18", "19\t19", "20\t20", "21\t21", 
    "22\t22", "23\t23", "24\t24", "25\t25", "26\t26"),
  filename
)
git2r::hashfile(filename)

Output:

Session info on Windows

```r
─ Session info ──────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.5.2 (2018-12-20)
 os       Windows >= 8 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  Dutch_Belgium.1252
 ctype    Dutch_Belgium.1252
 tz       Europe/Paris
 date     2019-08-14

- Packages -------------------------------------------------------------------
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.3)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.3)
 drat          0.1.4   2017-12-16 [1] CRAN (R 3.5.3)
 fortunes      1.5-4   2016-12-29 [1] CRAN (R 3.5.2)
 git2r       * 0.25.2  2019-03-19 [1] CRAN (R 3.5.3)
 rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.5.3)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.3)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.3)
 yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.2)

[1] C:/R/library
[2] C:/Program Files/R/R-3.5.2/library
```

Session info on Linux

```r
─ Session info ──────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Ubuntu 18.04.3 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language nl:en
 collate  nl_NL.UTF-8
 ctype    nl_NL.UTF-8
 tz       Europe/Brussels
 date     2019-08-14

─ Packages ──────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 cli           1.1.0   2019-03-19 [2] CRAN (R 3.5.3)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.5.3)
 drat          0.1.5   2019-03-28 [1] CRAN (R 3.6.0)
 fortunes      1.5-4   2016-12-29 [1] CRAN (R 3.6.0)
 git2r         0.26.1  2019-06-29 [1] CRAN (R 3.6.0)
 packrat       0.5.0   2018-11-14 [1] CRAN (R 3.6.0)
 rstudioapi    0.10    2019-03-19 [2] CRAN (R 3.5.3)
 sessioninfo   1.1.1   2018-11-05 [2] CRAN (R 3.5.3)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.5.3)

[1] /home/thierry_onkelinx/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
```

ThierryO commented 5 years ago

According to @stewid, the difference in hash is due to the difference in line endings on Linux and Windows (ropensci/git2r#397).

Below is a reprex using write.table() on Linux.

library(git2r)
x <- 1:26 # same integer vector as seq(1:26)
y <- letters
df <- data.frame(x, y, stringsAsFactors = FALSE)
filename <- tempfile("os-bug")

# unix style line endings
write.table(
  x = df, file = filename, append = FALSE, quote = FALSE,
  sep = "\t", eol = "\n", na = "NA", dec = ".", row.names = FALSE,
  col.names = TRUE, fileEncoding = "UTF-8"
)
hashfile(filename) # "50aabdcd96bd742fdcc41edcc6b3efdf8e63f498"

# windows style line endings
write.table(
  x = df, file = filename, append = FALSE, quote = FALSE,
  sep = "\t", eol = "\r\n", na = "NA", dec = ".", row.names = FALSE,
  col.names = TRUE, fileEncoding = "UTF-8"
)
hashfile(filename) # "1783ed10fa5035a3963abf4202f42fe6ca88f046"
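
To make the cause visible, the following sketch writes the same lines twice and compares the raw bytes. It is an illustration only: `write_with_eol()` is a hypothetical helper, and base R's `tools::md5sum()` is used as a stand-in for `git2r::hashfile()` so that the example runs without git2r. The checksum values therefore will not match the hashes quoted above.

```r
# Hypothetical helper: write the same lines with a chosen line ending.
# The connection is opened in binary mode ("wb") so that R does not
# translate "\n" into "\r\n" on Windows.
write_with_eol <- function(lines, eol) {
  path <- tempfile("eol-demo")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(lines, con, sep = eol)
  path
}

# header line plus 26 data rows, as in the reprex above
lines <- c("x\ty", paste(1:26, letters, sep = "\t"))
unix_file <- write_with_eol(lines, "\n")
windows_file <- write_with_eol(lines, "\r\n")

unix_bytes <- readBin(unix_file, what = "raw", n = file.size(unix_file))
windows_bytes <- readBin(windows_file, what = "raw", n = file.size(windows_file))

# number of carriage return bytes (0x0d) in each file
cr_unix <- sum(unix_bytes == as.raw(0x0d))
cr_windows <- sum(windows_bytes == as.raw(0x0d))

# one extra byte per line in the Windows-style file
size_difference <- length(windows_bytes) - length(unix_bytes)

# any checksum computed on the file bytes differs between the two files
same_checksum <- unname(tools::md5sum(unix_file)) ==
  unname(tools::md5sum(windows_file))
```

The two files hold identical tabular data, yet the Windows-style file carries one extra carriage return per line, which is enough to change any hash computed on the file bytes.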
ThierryO commented 5 years ago

@florisvdh and @w-jan, can you check whether PR #53 solves this issue? Install it with remotes::install_github("ropensci/git2rdata@datahash").

florisvdh commented 5 years ago

I haven't checked Windows yet, but on Linux I now get a different hash than before. Is this expected?

library(git2rdata)
x <- 1:26 # same integer vector as seq(1:26)
y <- letters
df <- data.frame(x, y)
write_vc(df, "df_vc", sorting = c("x"), strict = FALSE)
# b2658819ed189ec4496b4b25c55404f7d0918b6a 3514e919bcca45b232268c650a04db36a18aa6b5
#                              "df_vc.tsv"                              "df_vc.yml"
ThierryO commented 5 years ago

Yes, this is possible. The hashes are now calculated from the content rather than from the file.
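
As a rough illustration of what hashing the content instead of the file can mean, the sketch below strips carriage returns before computing a checksum. This is not the git2rdata implementation from PR #53: `write_bin_lines()` and `content_checksum()` are hypothetical helpers, base R's `tools::md5sum()` stands in for the hash function, and dropping every carriage return byte is a deliberately crude normalisation.

```r
# Hypothetical helper: write lines in binary mode with a chosen line ending.
write_bin_lines <- function(lines, eol) {
  path <- tempfile("content-demo")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(lines, con, sep = eol)
  path
}

# Hypothetical helper, NOT the git2rdata implementation: checksum of the
# file content after removing every carriage return byte (0x0d).
content_checksum <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  normalised <- bytes[bytes != as.raw(0x0d)]
  tmp <- tempfile("normalised")
  on.exit(unlink(tmp))
  writeBin(normalised, tmp) # writeBin() writes the raw bytes unchanged
  unname(tools::md5sum(tmp))
}

lines <- c("x\ty", "1\ta", "2\tb")
lf_file <- write_bin_lines(lines, "\n")
crlf_file <- write_bin_lines(lines, "\r\n")

# checksums of the files as stored on disk differ
file_level_equal <- unname(tools::md5sum(lf_file)) ==
  unname(tools::md5sum(crlf_file))

# checksums of the normalised content agree
content_level_equal <- content_checksum(lf_file) == content_checksum(crlf_file)
```

With this kind of normalisation the checksum no longer depends on the line-ending convention of the machine that wrote the file. It also means the value differs from a hash of the raw file, so a change from earlier hashes is to be expected.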

florisvdh commented 5 years ago

I checked on Windows and it produces the same data hash. Good work! I think it's OK to close the issue.

See some further comments in PR #53.