tidyverse / haven

Read SPSS, Stata and SAS files from R
https://haven.tidyverse.org
Other
423 stars 115 forks source link

filter rows while loading .dta file #675

Closed maxecharel closed 2 years ago

maxecharel commented 2 years ago

I have a huge .dta file to open, something that is around 7GB (more than 2000 variables, and a huge number of rows). The laptop I use to open this file has 16GB of RAM, and a swap partition of 8GB. When I try to load this file using haven::read_dta, both RAM and swap memory are overwhelmed and the .R process ends up crashing.

I doubt this can be qualified as a bug, this is why I take this opportunity to introduce a feature request: would it be possible to filter the rows while opening the file?

Something like

df <- read_dta(file = '/path/to/file.dta', row_filter = <filtering condition>)
gorcha commented 2 years ago

Duplicate of #194

gorcha commented 2 years ago

Hi @maxecharel, thanks for the request - there's an open request for a "chunked reader" feature along these lines so I've marked this as a duplicate.

In the mean time, it is possible to read in a few lines at a time using the skip and n_max arguments, although this can be a bit slow because of the overhead of the read_dta() function parsing metadata for every chunk read from the file.

Here's a very simple implementation that can be used by passing a filtering function as the "callback" argument:

library(haven)
library(dplyr)

read_dta_callback <- function(..., callback, chunk_size = 1000) {
  skip <- 0
  out <- tibble()

  while (TRUE) {
    chunk <- read_dta(..., skip = skip, n_max = chunk_size)
    if (nrow(chunk) == 0) break

    out <- bind_rows(out, callback(chunk))
    skip <- skip + chunk_size
  }

  out
}

path <- system.file("examples", "iris.dta", package = "haven")

my_filter <- function(data) {
  data %>% filter(sepallength >= 5.5, sepallength <= 6.5)
}

read_dta_callback(path, callback = my_filter, chunk_size = 10)
#> # A tibble: 68 × 5
#>    sepallength sepalwidth petallength petalwidth species   
#>          <dbl>      <dbl>       <dbl>      <dbl> <chr>     
#>  1        5.80       4           1.20      0.200 setosa    
#>  2        5.70       4.40        1.5       0.400 setosa    
#>  3        5.70       3.80        1.70      0.300 setosa    
#>  4        5.5        4.20        1.40      0.200 setosa    
#>  5        5.5        3.5         1.30      0.200 setosa    
#>  6        6.40       3.20        4.5       1.5   versicolor
#>  7        5.5        2.30        4         1.30  versicolor
#>  8        6.5        2.80        4.60      1.5   versicolor
#>  9        5.70       2.80        4.5       1.30  versicolor
#> 10        6.30       3.30        4.70      1.60  versicolor
#> # … with 58 more rows

Created on 2022-04-11 by the reprex package (v2.0.1)

maxecharel commented 2 years ago

Nice, thx a lot.