tidyverse / haven

Read SPSS, Stata and SAS files from R
https://haven.tidyverse.org

dbl+lbl by default while int+lbl would be way more efficient in terms of memory usage (could explain #651) #677

Open maxecharel opened 2 years ago

maxecharel commented 2 years ago

Brief description of the problem

When I load a huge .dta file with read_dta, it takes far more memory than it does on disk (around 6 GB on disk, more than 16 GB of RAM once loaded into R).

I realized that at least part of the problem comes from the fact that read_dta stores haven_labelled variables as dbl+lbl (i.e. using the double internal storage type), while it would be far less memory-consuming to store them as integer.

Example

library(haven)

df <- read_dta('/path/to/two/variables/version/of/my/file.dta')

str(df)
## tibble [2,431,845 × 2] (S3: tbl_df/tbl/data.frame)
##  $ var1: dbl+lbl [1:2431845] NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
##    ..@ label       : chr "Q1"
##    ..@ format.stata: chr "%8.0g"
##    ..@ labels      : Named num [1:4] 1 2 3 4
##    .. ..- attr(*, "names")= chr [1:4] "Yes" "No" "(DK)" "(Refused)"
##  $ var2: dbl+lbl [1:2431845] NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
##    ..@ label       : chr "Q2"
##    ..@ format.stata: chr "%8.0g"
##    ..@ labels      : Named num [1:4] 1 2 3 4
##    .. ..- attr(*, "names")= chr [1:4] "Yes" "No" "(DK)" "(Refused)"

## ==> "%8.0g" is the default format for Stata 'byte' and 'int', yet the columns are dbl+lbl, so the R storage type is double

## Impact in terms of object size in memory?
object.size(df)
#38913416 bytes

storage.mode(df$var1) <- 'integer'
storage.mode(df$var2) <- 'integer'

object.size(df)
#19458664 bytes
## ==> half the size when stored as integer (and no loss of attributes)

gorcha commented 2 years ago

Hi @maxecharel,

Thanks for the feature request! haven defaults to double for all numeric variables for simplicity, and Stata is the only one of the stats packages that actually has an internal integer type (SPSS and SAS store all numeric variables as doubles).

This change makes sense to me though and should be relatively straightforward.

@hadley, any thoughts on this? The benefit seems clear but this is a breaking change (albeit unlikely to have a major impact) since some variables that previously read in from Stata as double would be integer instead.

maxecharel commented 2 years ago

Hi @gorcha, thanks for the feedback and for considering implementing this change! To what extent do you think this would also explain the issue described in #651?

hadley commented 2 years ago

@gorcha it's only going to affect a relatively small number of folks, and I suspect threading the logic through to the right place will be moderately complicated, so I'd only do it if you're really keen to tackle it.

@maxecharel can you please use lobstr::obj_size()? object.size() has a few bugs that lead to inaccurate reporting of sizes.

maxecharel commented 2 years ago

@hadley sure, here it is:

library(haven)
df <- read_dta('/path/to/two/variables/version/of/my/file.dta')

lobstr::obj_size(df)
## 38,912,520 B

storage.mode(df$var1) <- 'integer'
storage.mode(df$var2) <- 'integer'

lobstr::obj_size(df)
## 19,457,768 B
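
The halving is what the per-element sizes predict: an R double takes 8 bytes and an R integer takes 4. Below is a minimal base-R sketch of that arithmetic, using the row count from the str() output above. The all-NA vectors are stand-ins for the real columns, and object.size() is used on plain vectors only so that the sketch has no package dependencies.

```r
# Row count taken from the str() output in this issue.
n_rows <- 2431845L

# Raw payload for two columns: 8 bytes per double, 4 bytes per integer.
payload_dbl <- 2 * n_rows * 8
payload_int <- 2 * n_rows * 4

# Stand-in vectors of the same length (all NA, like the preview above).
x_dbl <- rep(NA_real_, n_rows)
x_int <- rep(NA_integer_, n_rows)
size_dbl <- as.numeric(object.size(x_dbl))
size_int <- as.numeric(object.size(x_int))

# storage.mode<- changes the storage type but keeps the attributes.
y <- structure(c(1, 2, NA), label = "Q1", labels = c(Yes = 1, No = 2))
storage.mode(y) <- "integer"

print(c(payload_dbl = payload_dbl, payload_int = payload_int))
print(c(size_dbl = size_dbl, size_int = size_int))
print(attributes(y))
```

The two payload figures fall within a few kilobytes of the sizes reported above; the remainder is vector headers and attributes.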

While I do not expect such a feature to affect a huge number of people (not a truly educated guess on my part, though), I really think that it would constitute an additional strong argument in favor of using R (including in statistical offices). If it is actually linked to issue #651, it is an even cooler feature to have. But I am aware that my opinion is certainly biased, especially since you are the ones who are going to do the work. For what it's worth, if I can somehow facilitate the process, do not hesitate to tell me.

gorcha commented 2 years ago

Thanks @hadley, I'll have a look at what changes are needed.

@maxecharel, this is related to, but not the same as, #651.

Reading and writing files are quite different (both the code base and requirements), so they need to be considered as separate requests.

umutatasever1990 commented 2 years ago

If this is updated and double-precision variables in SPSS datasets are read in R as integers, this causes an issue with the dplyr::filter function. The function cannot read it as a number and fails to filter cases. The solution to this problem is to convert it back to double (as.numeric function). But this solution is cumbersome and unnecessary, as the variable format was double-precision anyway.

gorcha commented 2 years ago

Hi @umutatasever1990, this would not affect SPSS file reading. Unlike SPSS and SAS (which store all numeric variables as double precision), Stata can store variables as one of a few integer types as well as double precision/floating point.
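
On the filtering concern itself, here is a minimal base-R sketch of how integer values compare against ordinary numeric literals. It does not call dplyr, and the `codes` vector is made up for illustration. As far as I know, filter() on an in-memory data frame evaluates the same comparison operators, but that is an assumption not tested here.

```r
# Made-up integer codes, as a stand-in for an integer-typed column.
codes <- c(1L, 2L, 3L, 2L, NA)

# `==` coerces the integer vector to double before comparing,
# so a plain numeric literal such as 2 matches the integer value 2L.
keep <- !is.na(codes) & codes == 2
matched <- codes[keep]

# Strict identity, by contrast, does distinguish the two types.
same_type <- identical(2L, 2)

print(matched)
print(same_type)
```

Code that relies on identical() or on explicit type checks is where a switch from double to integer could show up.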

haven currently reads all numeric data from Stata files into double variables in R. The request here is to use an integer variable in R only when the corresponding variable in the Stata file is an integer, so it wouldn't have any effect on SPSS and SAS.
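
Until something like this exists in haven, a post-hoc conversion is one possible workaround. The sketch below is an assumption-laden illustration, not haven's API: `as_integer_if_whole()` is a hypothetical helper, and the data frame is a small made-up stand-in for the result of read_dta(). It switches a column to integer storage only when every non-missing value is a whole number inside R's integer range.

```r
# Hypothetical helper (not part of haven): use integer storage only when lossless.
as_integer_if_whole <- function(x) {
  if (!is.double(x)) return(x)
  vals <- x[!is.na(x)]
  lossless <- all(vals == trunc(vals)) &&
    all(abs(vals) <= .Machine$integer.max)
  # storage.mode<- keeps attributes such as "label" and "labels".
  if (lossless) storage.mode(x) <- "integer"
  x
}

# Made-up stand-in for an imported data set.
df <- data.frame(var1 = c(1, 2, NA, 4), var2 = c(0.5, 1, 2, NA))
attr(df$var1, "label") <- "Q1"

# Convert column by column; non-whole columns are left as double.
for (nm in names(df)) {
  df[[nm]] <- as_integer_if_whole(df[[nm]])
}

print(vapply(df, typeof, character(1)))
```

Two caveats, both untested here: real haven_labelled columns record their type in the class vector and keep a `labels` attribute that would probably need the same conversion, and Stata's extended missing values are represented as tagged NAs in doubles, so integer storage would likely collapse them into plain NA.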