Open muschellij2 opened 1 year ago
Hi @muschellij2, thanks for the feature request.
Adding progress bars would take a fair amount of work, since it requires callbacks from the ReadStat code, so I'm unlikely to look at this in the short term. I'll keep this issue open in case the opportunity presents itself, though, and I'm always happy to review a PR 🙂
Is it possible to get the number of rows from an xpt/sas7bdat/dta file so that we can create a progress bar ourselves?
I think the function below may work as a quick way to determine the number of rows; I can then wrap the whole read in a progress bar using https://cran.r-project.org/web/packages/progress/index.html.
library(haven)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
get_haven_nrow = function(
    file, skip = 1e10,
    n_check = 1000L,
    max_iter = 1000L,
    func = read_dta) {
  # Bisect on `skip`: a read that returns zero rows means `skip` is at or
  # past the end of the file, so the row count is the smallest such `skip`.
  x = func(
    file,
    n_max = n_check,
    skip = skip
  )
  min_value = 0
  max_value = 1e30
  skip_table = dplyr::tibble(
    skip = skip, nr = nrow(x),
    max = max_value, min = min_value,
    n_check = n_check
  )
  i = 1
  while (TRUE) {
    # Local increment; `<<-` would never update the counter used below.
    i = i + 1
    if (i >= max_iter) {
      warning("Ending early, ", i, " iterations")
      break
    }
    x = func(
      file,
      n_max = n_check,
      skip = skip
    )
    if (nrow(x) == 0) {
      # Nothing left to read: the true row count is at most `skip`.
      max_value = skip
    } else {
      # Rows remain: the true row count is greater than `skip`.
      min_value = skip
    }
    diff = (max_value - min_value)
    n_check = min(n_check, diff)
    skip = round(min_value + diff / 2)
    # Prepend the newest step so the final bounds sit in row 1.
    skip_table = dplyr::bind_rows(
      dplyr::tibble(
        skip = skip, nr = nrow(x),
        max = max_value, min = min_value,
        n_check = n_check
      ),
      skip_table
    )
    if (diff <= 1) {
      break
    }
  }
  list(
    file = file,
    skip_table = skip_table,
    nrow = skip_table$max[1]
  )
}
dta_url = "https://stats.idre.ucla.edu/stat/stata/dae/binary.dta"
file = tempfile(fileext = ".dta")
download.file(dta_url, file, mode = "wb")
res = get_haven_nrow(file)
res$nrow
#> [1] 400
res
#> $file
#> [1] "/var/folders/1s/wrtqcpxn685_zk570bnx9_rr0000gr/T//Rtmp83As8L/filefd201957e81d.dta"
#>
#> $skip_table
#> # A tibble: 35 × 5
#> skip nr max min n_check
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 400 1 400 399 1
#> 2 399 2 400 398 2
#> 3 398 4 400 396 4
#> 4 396 0 400 391 9
#> 5 400 9 410 391 19
#> 6 391 0 410 372 38
#> 7 410 28 447 372 75
#> 8 372 0 447 298 149
#> 9 447 102 596 298 298
#> 10 298 0 596 0 596
#> # ℹ 25 more rows
#>
#> $nrow
#> [1] 400
sas_url = "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/binary.sas7bdat"
sas_file = tempfile(fileext = ".sas7bdat")
download.file(sas_url, sas_file, mode = "wb")
res = get_haven_nrow(sas_file, func = read_sas)
res$nrow
#> [1] 400
res
#> $file
#> [1] "/var/folders/1s/wrtqcpxn685_zk570bnx9_rr0000gr/T//Rtmp83As8L/filefd205a47d4ea.sas7bdat"
#>
#> $skip_table
#> # A tibble: 35 × 5
#> skip nr max min n_check
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 400 1 400 399 1
#> 2 399 2 400 398 2
#> 3 398 4 400 396 4
#> 4 396 0 400 391 9
#> 5 400 9 410 391 19
#> 6 391 0 410 372 38
#> 7 410 28 447 372 75
#> 8 372 0 447 298 149
#> 9 447 102 596 298 298
#> 10 298 0 596 0 596
#> # ℹ 25 more rows
#>
#> $nrow
#> [1] 400
Created on 2024-11-05 with reprex v2.1.1
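Once the row count is known, one way to get a progress bar is to read the file in chunks with the `skip` and `n_max` arguments and tick the bar after each chunk. The following is only a rough sketch under assumptions: `read_with_progress`, `n_rows`, and `chunk_size` are hypothetical names that are not part of haven, and the code needs the haven, progress, and dplyr packages installed. Each call may have to scan past the skipped rows again, so chunked reading could well be slower overall than a single read.

```r
# Hypothetical helper (not part of haven): read `file` in chunks of
# `chunk_size` rows and advance a progress bar after each chunk.
# `n_rows` is the total row count, e.g. get_haven_nrow(file)$nrow.
read_with_progress = function(file, n_rows, chunk_size = 100000L,
                              func = haven::read_dta) {
  pb = progress::progress_bar$new(
    format = "reading [:bar] :percent eta: :eta",
    total = n_rows
  )
  chunks = list()
  skip = 0
  while (skip < n_rows) {
    # Never ask for more rows than remain according to `n_rows`.
    n_max = min(chunk_size, n_rows - skip)
    chunk = func(file, skip = skip, n_max = n_max)
    # Stop if the file turns out to be shorter than `n_rows`.
    if (nrow(chunk) == 0) break
    chunks[[length(chunks) + 1]] = chunk
    skip = skip + nrow(chunk)
    pb$tick(nrow(chunk))
  }
  # Stack the chunks back into a single data frame.
  dplyr::bind_rows(chunks)
}
```

Usage would look something like `read_with_progress(file, n_rows = res$nrow, chunk_size = 100)` for the `.dta` file above, or with `func = haven::read_sas` or `func = haven::read_xpt` for the other formats, since those readers also accept `skip` and `n_max`. The bar is typically only drawn in interactive sessions.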
Is it possible to have a progress bar for reading in files, specifically SAS XPORT files?
Here's an example of reading in a small file (6 MB), where everything works fine, but it would be helpful, if possible, to have a progress bar for larger files (such as https://wwwn.cdc.gov/Nchs/Nhanes/2011-2012/PAXMIN_G.XPT, which is 7.6 GB).
Created on 2023-08-15 with reprex v2.0.2