Open maxecharel opened 2 years ago
Hi @maxecharel,
Thanks for the feature request! haven defaults to double for all numeric variables for simplicity, and Stata is the only one of the stats packages that actually has an internal integer type (SPSS and SAS store all numeric variables as doubles).
This change makes sense to me though and should be relatively straightforward.
@hadley, any thoughts on this? The benefit seems clear but this is a breaking change (albeit unlikely to have a major impact) since some variables that previously read in from Stata as double would be integer instead.
Hi @gorcha, thanks for the feedback and for considering implementing this change! To what extent do you think this would also explain the issue described in #651?
@gorcha it's only going to affect a relatively small number of folks, and I suspect threading the logic through to the right place will be moderately complicated, so I'd only do it if you're really keen to tackle it.
@maxecharel can you please use `lobstr::obj_size()`? `object.size()` has a few bugs that lead to inaccurate reporting of sizes.
@hadley sure, here it is:
library(haven)
df <- read_dta('/path/to/two/variables/version/of/my/file.dta')
lobstr::obj_size(df)
## 38,912,520 B
storage.mode(df$var1) <- 'integer'
storage.mode(df$var2) <- 'integer'
lobstr::obj_size(df)
## 19,457,768 B
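The roughly twofold drop above is consistent with how R stores atomic vectors: a double takes 8 bytes per element and an integer takes 4. A minimal sketch on plain vectors, assuming the lobstr package is installed (the vector length and variable names are arbitrary choices for illustration):

```r
# Compare the memory footprint of the same values stored as double vs integer.
n <- 1e6
x_dbl <- rep(1, n)    # double storage: 8 bytes per element
x_int <- rep(1L, n)   # integer storage: 4 bytes per element

size_dbl <- as.numeric(lobstr::obj_size(x_dbl))
size_int <- as.numeric(lobstr::obj_size(x_int))

# The ratio is close to 2; it is not exactly 2 because each vector
# also carries a small fixed header.
size_dbl / size_int
```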
While I do not expect such a feature to affect a crazy number of people (not a truly educated guess from my side though), I really think that it would constitute an additional strong argument in favor of using R (including in statistical offices). If it is actually linked to issue #651 it is an even cooler feature to have. But I am aware that my opinion is certainly biased, especially since you are the ones who are going to do the work. For what it's worth, if I can somehow facilitate the process do not hesitate to tell me.
Thanks @hadley, I'll have a look at what changes are needed.
@maxecharel, this is related to, but not the same as, #651.
Reading and writing files are quite different (both the code base and requirements), so they need to be considered as separate requests.
If this is updated and double-precision variables in SPSS datasets are read into R as integers, this causes an issue with the `dplyr::filter` function: it cannot read the variable as a number and fails to filter cases. The workaround is to convert the variable back to double (with `as.numeric`), but that is cumbersome and unnecessary, since the variable was stored as double precision in the first place.
Hi @umutatasever1990, this would not affect SPSS file reading. Unlike SPSS and SAS (which store all numeric variables as double precision), Stata can store variables as one of a few integer types as well as double precision/floating point.
haven currently reads all numeric data from Stata files into double variables in R. The request here is to use an integer variable in R only when the corresponding variable in the Stata file is an integer, so it wouldn't have any effect on SPSS and SAS.
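In the meantime, the manual `storage.mode()` conversion shown earlier in the thread can be generalised on the user side. The following is only a sketch in base R: `shrink_whole_doubles` is a hypothetical helper, not part of haven, and it decides from the values actually present rather than from the storage type declared in the Stata file, which is what the request here is about. It also leaves attributes untouched, so for labelled columns a `labels` attribute would stay double and may need converting separately.

```r
# Hypothetical helper (not part of haven): switch double columns that hold
# only whole numbers within R's integer range to integer storage.
shrink_whole_doubles <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (!is.double(x)) next           # skip integer, character, etc.
    vals <- x[!is.na(x)]              # missing values are allowed
    fits <- all(is.finite(vals)) &&
      all(vals == trunc(vals)) &&
      all(abs(vals) <= .Machine$integer.max)
    if (fits) {
      # storage.mode<- changes the underlying type and keeps attributes
      storage.mode(df[[nm]]) <- "integer"
    }
  }
  df
}

# Toy data standing in for a file read with read_dta()
df <- data.frame(id = c(1, 2, 3), weight = c(1.5, 2, NA), flag = c(0, 1, NA))
out <- shrink_whole_doubles(df)
vapply(out, typeof, character(1))
# id and flag become integer; weight stays double
```

Checking the values costs one extra pass over each column, which is the trade-off of doing this after import instead of at read time.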
Brief description of the problem
When I tried to load a huge .dta file with `read_dta`, it inflated like crazy when imported into R (around 6GB on disk, > 16GB of RAM used when loaded into R). I realized that at least part of the problem comes from the fact that `read_dta` stores `haven_labelled` variables as `dbl+lbl` (i.e. using the `double` internal storage type), while it would be way less memory-consuming to store them as `integer` type.
Example