sjewo / readstata13

Package to read the Stata 13 (and newer) file format into a R data.frame
https://sjewo.github.io/readstata13/
GNU General Public License v2.0

Problem when importing factor variables with SOME missing labels #79

Open luisvalenzuelar opened 2 years ago

luisvalenzuelar commented 2 years ago

I have a problem when importing labels from Stata to R. The problem seems to occur because the variable has SOME missing labels. read.dta13 seems to assign values without labels first and put those with labels at the end. This seriously affects the consistency of the data (var = 3 in Stata is not var = 3 in R!). All the details, with images and data sample, are here.

The "solution" is to not import labels, using the convert.factors = FALSE option. But this is not a real solution to the problem. One would like to keep the available labels.

Seems to be a serious problem. I wonder whether the problem is in the package itself or somewhere else.

JanMarvin commented 2 years ago

Hi @raspatan , does the generate.factors option solve your issue? Without a minimal reproducible example it's hard to guess what you are looking for. R simply has no built-in type similar to Stata's (half) labeled variables. If you want numerical values to be identical, simply don't use factors. If you want factors, don't expect numerical values to be identical. I'd say that's a won't fix from our end. It's been like this for a very long time and is beyond our control.

If I understand correctly, there is a tidyverse package that aims at labelled vectors and @sjewo was looking into it in #73 . The problem with this is, I don't want any tidyverse dependency.

luisvalenzuelar commented 2 years ago

Hi @JanMarvin

No, generate.factors does not solve the problem. I know I didn't offer a reproducible example but the Stata data is in the link.

I guess what I'm after is a third way. With convert.factors=FALSE, you get the Stata numeric values. With convert.factors=TRUE, you get the Stata factors, but with different values (starting at 1). generate.factors makes no difference to the latter. Perhaps an option like import.factors=TRUE would be ideal (but to me this should be the default, with the option of creating new values).

For an example, run this in Stata:

webuse auto
keep foreign // variable is 0 for domestic, 1 for foreign
save test, replace

Now import the above in R with the different options. Either you get numeric variable with 0 and 1, or you get factor variable with 1 and 2. The third way would be to allow for factor but using values 0 and 1, as in Stata.

Personally, I don't see why a goal of the package should not be to reproduce exactly the characteristics of variables in Stata. Currently, it forces us to choose between factors and numeric values. Without a warning that factors are recreated by R, I see this as very problematic. Fortunately I became aware of the problem. Otherwise my analysis would have led to wrong results.

JanMarvin commented 2 years ago

Hi @raspatan , the issue is that you cannot assume that when you import something from one statistical package to another, it provides all the same functionality. For this package it is similar with Stata's support for variable labels or dataset labels. But it is true for every other conversion from one package to another, with SAS, SPSS or even something like Excel. There are always compromises one has to make, and assuming that something that works in one software will work identically in another might be misleading.

Regarding the factors, the numerical value of a factor in R is an index beginning at 1. Factors are not just labeled numerics. Therefore what you suggested above is simply not possible in R and these R internals haven't changed for a long long time. I do not say that they are the best, but they have been in place since I assume the development of S. Of course the world has changed a lot and there are valid reasons why people nowadays like to use packages such as the tidyverse which replaces R internals with tibbles, new vector objects, new date vectors or labelled column vectors. Though this is just another cup of tea and not ours to mess with. This package was written to support base R and started as a drop in replacement for the foreign functions read.dta() and write.dta().
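
To illustrate the point about factor codes, here is a minimal base R sketch. The vector `stata_values` is a made-up stand-in for a variable like `foreign` (0 = Domestic, 1 = Foreign); it is not read from a .dta file.

```r
# Hypothetical Stata-style values: 0 = Domestic, 1 = Foreign
stata_values <- c(0, 1, 1, 0)

# Converting to a factor keeps the labels but not the original codes
f <- factor(stata_values, levels = c(0, 1), labels = c("Domestic", "Foreign"))

levels(f)      # "Domestic" "Foreign"
as.integer(f)  # 1 2 2 1: positions in levels(f), not the Stata values 0 and 1
```

The integer behind each factor element is its position in `levels(f)`, which is why a factor cannot carry the codes 0 and 1.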

If you do not want to have value labels, we provide everything you need (I'm not a fan of factors myself, they are mostly a nuisance to work with and I prefer plain old numerics and characters). I have used the plain object, there might be helper functions we provide:

> auto <- readstata13::read.dta13("http://www.stata-press.com/data/r16/auto.dta", convert.factors = FALSE)
> 
> table(auto$foreign)

 0  1 
52 22 
> 
> lab_name <- which(names(auto) == "foreign")
> val_label <- attr(auto, "val.labels")[lab_name]
> lab_table <- attr(auto, "label.table")[[val_label]]
> lab_table
Domestic  Foreign 
       0        1 
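
Building on the label table above, one possible way to keep the original numeric values and still have the labels at hand is to look them up into a separate character column. This is only a base R sketch, not a helper provided by the package; `x` and `lab_table` below are hypothetical stand-ins for a partially labelled variable and its label table.

```r
# Hypothetical partially labelled variable: only 1 and 3 carry a label
x <- c(1, 2, 3, 3, NA)
lab_table <- c(low = 1, high = 3)

# Look up the label for each value; unlabelled or missing values become NA
x_label <- names(lab_table)[match(x, lab_table)]

# The numeric codes stay untouched, the labels sit next to them
data.frame(x, x_label)
```

With this approach x == 3 still means the same thing as in Stata, and the rows without a label are easy to spot because `x_label` is NA there.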

PS: When I checked the issue yesterday, I remembered that the link was pointing at a different SO post. PPS: Maybe it was not meant that way, but it sounds like you are trying to blame us for either your lack of R experience or the way R behaves.

luisvalenzuelar commented 2 years ago

"you cannot assume that when you import something from one statistical package to another, that it provides all the same functionality"

Yes, this is true. I think I got used to things working fine in the past. I just started to use factors in R.

I'm not an expert in R or related languages, so I cannot really comment on the complexity of the issue. But I take your word for it.

And sorry for sounding aggressive. It was not my intention. It is of course up to the user to check that things work, but I still suggest you add a warning or message somewhere (perhaps in the help file) to make sure people are aware of the differences between Stata and R factors. Just my opinion.

JanMarvin commented 2 years ago

Well, don't worry about it, I guess we can always improve the documentation. But writing documentation is not the fun part of development :smile: