tidyverse / haven

Read SPSS, Stata and SAS files from R
https://haven.tidyverse.org

Option to skip reading label and labels #706

Closed victorhartman closed 1 year ago

victorhartman commented 1 year ago

I am currently working with many datasets for which I do not use the variable labels or the labelled values. To improve read performance, would it be possible to skip the label metadata on read? I assume user-defined missing values would then no longer be converted to NA? And would skipping the labels actually improve performance?

Now I do this for each dataset (in a loop), which seems wasteful.

  df <- read_sav("df.sav")
  df <- df |> 
    zap_label() |> 
    zap_labels()

So I'd like something like df <- read_sav("df.sav", skip_labels = TRUE)
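In the meantime, the per-dataset loop described above could be wrapped in a small helper. This is only a sketch: `read_sav_plain()` and the `"data"` directory are hypothetical names, not part of haven, and the helper strips variable and value labels only (other attributes added on read are left alone).

```r
library(haven)

# Hypothetical helper (not part of haven): read one .sav file,
# then drop the variable labels and the value labels.
read_sav_plain <- function(path, ...) {
  df <- read_sav(path, ...)
  zap_labels(zap_label(df))
}

# "data" is an assumed directory name; adjust to where the .sav files live.
paths <- list.files("data", pattern = "\\.sav$", full.names = TRUE)

# One plain data frame per file, named after the file it came from.
datasets <- lapply(paths, read_sav_plain)
names(datasets) <- basename(paths)
```

Extra arguments such as `user_na` are passed through `...` to `read_sav()`.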

gorcha commented 1 year ago

Hi @VictorHartman,

Reading file metadata is a relatively tiny part of the reading process, so skipping it wouldn't give any noticeable performance improvement. The zap_*() functions already provide the means to remove metadata as needed, and when they remove metadata they only modify attributes, so they should also have a very minimal performance impact.
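As a rough illustration of the attribute-only point, the sketch below builds a small labelled column in memory (the column `q1` and its labels are made up for the example) and compares it before and after zapping.

```r
library(haven)

# Made-up example data: one labelled column with a variable label.
df <- tibble::tibble(
  q1 = labelled(c(1, 2, 1), labels = c(Yes = 1, No = 2), label = "Question 1")
)

# Before: the metadata lives in attributes on the column.
names(attributes(df$q1))

# Remove the variable label, then the value labels.
plain <- zap_labels(zap_label(df))

# After: the label attributes are gone.
attr(plain$q1, "label")
attr(plain$q1, "labels")

# The underlying values are the same as before zapping.
all(as.vector(unclass(df$q1)) == as.vector(plain$q1))
```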

One exception to this is the conversion of user-defined missing values to NA: since this modifies values in the vector itself, it can have a noticeable impact on performance. Converting user-defined missing values to NA in this way is the default behaviour of read_sav() and zap_labels(), but if you'd prefer to keep the original values rather than converting them to NA, you can use the user_na argument of both functions:

df <- read_sav("df.sav", user_na = TRUE)
df <- df |> 
  zap_label() |> 
  zap_labels(user_na = TRUE)
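To see what `user_na` changes without needing a file, here is a minimal in-memory sketch. The vector and the choice of 99 as the user-defined missing value are assumptions for illustration only.

```r
library(haven)

# Made-up SPSS-style vector: 99 is declared as a user-defined missing value.
x <- labelled_spss(
  c(1, 2, 99),
  labels = c(Yes = 1, No = 2, Refused = 99),
  na_values = 99
)

# Default (user_na = FALSE): values declared as missing are converted to NA.
converted <- zap_labels(x)
is.na(converted)

# user_na = TRUE: the original value 99 is kept as an ordinary number.
kept <- zap_labels(x, user_na = TRUE)
as.vector(kept)
```

The second form leaves the values untouched, so only attributes are removed.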