tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

read_tsv() causes R to crash on some files #447

Closed BrianOB closed 2 years ago

BrianOB commented 2 years ago

I'm updating a three-year-old analysis that downloads and reads six U.S. Bureau of Labor Statistics files. The script reads the first four files without a problem but crashes R when it tries to read either of the last two, sm_state or sm_supersector. This happens in both RStudio and the R GUI. Inspecting the two files in a text editor doesn't reveal any special characters.

When I tried to create the reprex, it gave me this message: "This reprex appears to crash R. See standard output and standard error for more details."

The standard output and standard error simply show that the script correctly downloaded the six files. I've inserted the relevant code below.


# packages
library(tidyverse)

# target path
path_target =  "C:/Users/Brian/Documents/Projects/metro_analysis/bls_data/"

# file list
file_list = c('sm.data.0.current','sm.area','sm.data_type','sm.industry','sm.state','sm.supersector','sm.txt')

# download data
dl_main_path = 'https://download.bls.gov/pub/time.series/sm/'

file_list %>% 
  map(~ download.file(url=paste0(dl_main_path,.x),
                      destfile=paste0(path_target,gsub('\\.','_',.x))))

# read data
employment_raw <- read_tsv(paste0(path_target,'sm_data_0_current'), trim_ws=T)
code_area <- read_tsv(paste0(path_target,'sm_area'), trim_ws=T)
code_data_type <- read_tsv(paste0(path_target,'sm_data_type'), trim_ws=T)
code_industry <- read_tsv(paste0(path_target,'sm_industry'), trim_ws=T)
code_state <- read_tsv(paste0(path_target,'sm_state'), trim_ws=T)
code_supersector <- read_tsv(paste0(path_target,'sm_supersector'), trim_ws=T)
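
A text editor can hide things like carriage returns or a missing final newline, so a byte-level check may be a useful complement to the visual inspection described above. The following is a minimal base R sketch; `inspect_bytes()` is a hypothetical helper and the demo file contents are made up. It only summarises the bytes and does not diagnose the crash itself.

```r
# Hypothetical helper (not part of readr or vroom): summarise the raw bytes
# of a file so that characters a text editor may hide become visible.
inspect_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  ints <- as.integer(bytes)
  list(
    n_bytes = length(ints),
    n_non_ascii = sum(ints > 127L),   # bytes outside the ASCII range
    n_nul = sum(ints == 0L),          # embedded NUL bytes
    n_cr = sum(ints == 13L),          # carriage returns (Windows line endings)
    ends_with_newline = length(ints) > 0 && ints[length(ints)] == 10L
  )
}

# Stand-in file with made-up contents: one CRLF line ending, no final newline
demo_file <- tempfile()
writeBin(charToRaw("state_code\tstate_name\r\n01\tAlabama"), demo_file)

res <- inspect_bytes(demo_file)
str(res)
```

To check the real files, the same call could be pointed at `paste0(path_target, 'sm_state')` and `paste0(path_target, 'sm_supersector')`.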

jennybc commented 2 years ago

We have recently fixed several segfaults in vroom, which is what read_tsv() now calls by default under the hood.

With the development version of vroom, these fixes appear to resolve your two problematic examples:

code_state <- read_tsv(paste0(path_target,'sm_state'), trim_ws=T)
#> Rows: 55 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): state_code, state_name
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
code_supersector <- read_tsv(paste0(path_target,'sm_supersector'), trim_ws=T)
#> Rows: 22 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): supersector_code, supersector_name
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In the meantime, you can either read them by explicitly using the "first edition" parsing engine of readr or by installing the dev version of vroom.

The "first edition parsing" approach looks like this:

with_edition(1, code_state <- read_tsv(paste0(path_target,'sm_state'), trim_ws=T))
#> 
#> ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> cols(
#>   state_code = col_character(),
#>   state_name = col_character()
#> )
with_edition(1, code_supersector <- read_tsv(paste0(path_target,'sm_supersector'), trim_ws=T))
#> 
#> ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> cols(
#>   supersector_code = col_character(),
#>   supersector_name = col_character()
#> )
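
If several files need this treatment, readr also provides `local_edition()`, which switches the edition for the duration of the calling function. Below is a minimal sketch, assuming readr 2.0 or later is installed; `read_lookup_first_edition()` is a hypothetical wrapper, and the demo file is a stand-in for the downloaded BLS files.

```r
library(readr)

# Hypothetical wrapper: read one lookup table with the first-edition parser.
read_lookup_first_edition <- function(path) {
  local_edition(1)  # reverts automatically when this function exits
  read_tsv(path, trim_ws = TRUE,
           col_types = cols(.default = col_character()))
}

# Stand-in file with illustrative contents
demo_path <- tempfile(fileext = ".tsv")
writeLines(c("state_code\tstate_name", "01\tAlabama", "02\tAlaska"), demo_path)

demo <- read_lookup_first_edition(demo_path)
print(demo)
```

Supplying `col_types` is optional; it is used here only to keep the codes as character and to silence the column specification message.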

We have medium-term plans to release a new version of vroom, so the fix will reach CRAN in due course.
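
For the second option, a sketch of installing the development version might look like the following. It assumes the pak package and the `tidyverse/vroom` repository named at the top of this page. The install lines are commented out so the snippet can be sourced safely. `is_dev_version()` is a hypothetical helper that relies on the tidyverse convention of giving development versions a fourth component such as `.9000`; the version strings passed to it are made up for illustration.

```r
# One-time install of the development version (commented out so that
# sourcing this sketch does not change your library):
# install.packages("pak")
# pak::pak("tidyverse/vroom")

# Hypothetical helper, not part of vroom or readr: TRUE when the version
# has more than three components (e.g. x.y.z.9000).
is_dev_version <- function(version) {
  length(unclass(package_version(version))[[1]]) > 3
}

is_dev_version("1.2.3")       # three components, release-style
is_dev_version("1.2.3.9000")  # four components, development-style

# After restarting R, check what is actually installed:
if (requireNamespace("vroom", quietly = TRUE)) {
  print(utils::packageVersion("vroom"))
}
```

Restart R after installing so that the new vroom build is the one loaded when read_tsv() runs.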