Closed maxecharel closed 2 years ago
Duplicate of #194
Hi @maxecharel, thanks for the request - there's an open request for a "chunked reader" feature along these lines so I've marked this as a duplicate.
In the mean time, it is possible to read in a few lines at a time using the skip
and n_max
arguments, although this can be a bit slow because of the overhead of the read_dta()
function parsing metadata for every chunk read from the file.
Here's a very simple implementation that can be used by passing a filtering function as the "callback" argument:
library(haven)
library(dplyr)
read_dta_callback <- function(..., callback, chunk_size = 1000) {
skip <- 0
out <- tibble()
while (TRUE) {
chunk <- read_dta(..., skip = skip, n_max = chunk_size)
if (nrow(chunk) == 0) break
out <- bind_rows(out, callback(chunk))
skip <- skip + chunk_size
}
out
}
path <- system.file("examples", "iris.dta", package = "haven")
my_filter <- function(data) {
data %>% filter(sepallength >= 5.5, sepallength <= 6.5)
}
read_dta_callback(path, callback = my_filter, chunk_size = 10)
#> # A tibble: 68 × 5
#> sepallength sepalwidth petallength petalwidth species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.80 4 1.20 0.200 setosa
#> 2 5.70 4.40 1.5 0.400 setosa
#> 3 5.70 3.80 1.70 0.300 setosa
#> 4 5.5 4.20 1.40 0.200 setosa
#> 5 5.5 3.5 1.30 0.200 setosa
#> 6 6.40 3.20 4.5 1.5 versicolor
#> 7 5.5 2.30 4 1.30 versicolor
#> 8 6.5 2.80 4.60 1.5 versicolor
#> 9 5.70 2.80 4.5 1.30 versicolor
#> 10 6.30 3.30 4.70 1.60 versicolor
#> # … with 58 more rows
Created on 2022-04-11 by the reprex package (v2.0.1)
Nice, thx a lot.
I have a huge .dta file to open, something that is around 7GB (more than 2000 variables, and a huge number of rows). The laptop I use to open this file has 16GB of RAM, and a swap partition of 8GB. When I try to load this file using haven::read_dta, both RAM and swap memory are overwhelmed and the .R process ends up crashing.
I doubt this can be qualified as a bug, this is why I take this opportunity to introduce a feature request: would it be possible to filter the rows while opening the file?
Something like