tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Support reading single line literal datasets #798

Closed: richierocks closed this issue 5 years ago

richierocks commented 6 years ago

If a literal dataset has only one line, readr treats it as a path to a file rather than as literal data, and an error is thrown. This affects text formats with no header row, because a dataset that has a header row always has at least two lines.
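The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not readr's actual source: the function name classify_input is hypothetical, and the rule it encodes (a character vector counts as literal data only if it has more than one element or contains a newline) is an assumption inferred from the errors reported in this issue.

```r
# Hypothetical helper, not part of readr. It mimics how a character input
# appears to be classified, based on the behaviour reported in this issue.
classify_input <- function(x) {
  stopifnot(is.character(x))
  if (length(x) > 1 || any(grepl("\n", x, fixed = TRUE))) {
    "literal"  # several lines, or an embedded newline
  } else {
    "path"     # a single string without a newline looks like a file name
  }
}

classify_input(c("1a", "2b"))  # "literal"
classify_input("3c")           # "path", hence the "does not exist" error
classify_input("3c\n")         # "literal"
```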

My use case is that I have a lot of fixed width datasets extracted from a PDF document using pdftools::pdf_text(). Some of these datasets have only 1 row.

Reproducible examples

Fixed width format example.

library(readr)
library(purrr)
all_dataset_lines <- list(
  dataset_lines1 = c("1a", "2b"),
  dataset_lines2 = c("3c")
)
column_widths <- fwf_positions(start = c(1, 2), end = c(1, 2))
map(all_dataset_lines, read_fwf, column_widths)
## Error: '3c' does not exist in current working directory ('/path/to/wd').

This would also affect SQL Server-style CSV files that don't include a header row.

all_dataset_lines <- list(
  dataset_lines1 = c("1,a", "2,b"),
  dataset_lines2 = c("3,c")
)
map(all_dataset_lines, read_csv, col_names = FALSE)
## Error: '3,c' does not exist in current working directory ('/path/to/wd')

Ideas for a fix

If you could specify that the input is literal data, then datasource() could handle it accordingly.

For example, we could define

as.literal_data <- function(x) {
  class(x) <- "literal_data"
  x
}

then in datasource(), you could have a logical block like

if(inherits(file, "literal_data")) {
  datasource_string(paste(file, collapse = "\n"), skip, comment)
}
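To make the proposal concrete, here is a minimal, self-contained sketch of the dispatch. It does not touch readr: toy_datasource is a hypothetical stand-in for datasource() that only reports which branch it would take, and the fallback rule for unmarked input is the same assumption as in the issue description (more than one element, or an embedded newline, means literal data).

```r
# Hypothetical marker class, as proposed above.
as.literal_data <- function(x) {
  class(x) <- "literal_data"
  x
}

# Toy stand-in for datasource(); it only reports what it would do.
toy_datasource <- function(file) {
  if (inherits(file, "literal_data")) {
    # Explicitly marked input is always literal, even if it is a single line.
    list(type = "string", text = paste(file, collapse = "\n"))
  } else if (length(file) > 1 || any(grepl("\n", file, fixed = TRUE))) {
    # Assumed existing heuristic for unmarked input.
    list(type = "string", text = paste(file, collapse = "\n"))
  } else {
    list(type = "file", path = file)
  }
}

toy_datasource("3c")$type                   # "file"
toy_datasource(as.literal_data("3c"))$type  # "string"
```

The point of the marker class is that the caller states the intent explicitly, so the single-line case no longer depends on guessing from the content of the string.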

Then the datasets could be parsed using

library(purrr)
# compose() would pass column_widths to as.literal_data(), so use a lambda instead
map(all_dataset_lines, ~ read_fwf(as.literal_data(.x), column_widths))

or

map(all_dataset_lines, ~ read_csv(as.literal_data(.x), col_names = FALSE))

jimhester commented 6 years ago

A simple workaround is to append a newline character to any dataset that consists of a single line, so that it is recognized as literal data.

library(readr)
all_dataset_lines <- list(
  dataset_lines1 = c("1a", "2b"),
  dataset_lines2 = c("3c")
)
single_lines <- lengths(all_dataset_lines) == 1

all_dataset_lines[single_lines] <- paste0(all_dataset_lines[single_lines], "\n")

column_widths <- fwf_positions(start = c(1, 2), end = c(1, 2))
lapply(all_dataset_lines, read_fwf, column_widths)
#> $dataset_lines1
#> # A tibble: 2 x 2
#>      X1 X2   
#>   <dbl> <chr>
#> 1    1. a    
#> 2    2. b    
#> 
#> $dataset_lines2
#> # A tibble: 1 x 2
#>      X1 X2   
#>   <dbl> <chr>
#> 1    3. c

Created on 2018-02-19 by the reprex package (v0.2.0).
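The same workaround can be wrapped in a small helper so that it applies to one dataset at a time. This is only a sketch in base R: ensure_literal is a hypothetical name, and it assumes that a trailing newline is enough for the input to be treated as literal data, as the output above suggests.

```r
# Hypothetical helper wrapping the workaround: append a newline to a
# single-line dataset that does not already contain one.
ensure_literal <- function(lines) {
  stopifnot(is.character(lines))
  if (length(lines) == 1 && !grepl("\n", lines, fixed = TRUE)) {
    paste0(lines, "\n")
  } else {
    lines
  }
}

all_dataset_lines <- list(
  dataset_lines1 = c("1a", "2b"),
  dataset_lines2 = c("3c")
)
fixed <- lapply(all_dataset_lines, ensure_literal)
fixed$dataset_lines2  # "3c\n"
# With readr loaded, each element of `fixed` could then be passed to
# read_fwf() together with column_widths, as in the workaround.
```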

Also, unfortunately, your examples are not reproducible. I would encourage you to use reprex, or at least to try running your examples in a new session started with R --vanilla.

library(readr)
all_dataset_lines <- list(
  dataset_lines1 = c("1a", "2b"),
  dataset_lines2 = c("3c")
)
column_widths <- fwf_positions(start = c(1, 2), end = c(1, 2))
map(all_dataset_lines, read_fwf, column_widths)
#> Error in map(all_dataset_lines, read_fwf, column_widths): could not find function "map"

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/