tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Lines read and skip lines use different evaluation in read_lines #1500

Open pepijn-devries opened 1 year ago

pepijn-devries commented 1 year ago

Thanks for your work on readr! It's most helpful, but I did come across the following problem.

I have a very large ASCII file which is too large to load entirely into memory. Therefore, I use read_lines to read it in chunks using the skip and n_max arguments, process the chunks and write the results to a file. It turned out that a specific line in the file was read twice. First I assumed that this was an error in the ASCII file, but after some testing it turned out that read_lines had read the same line twice.

It turns out that the skip argument counts lines (to be skipped) differently from the actual reading algorithm. I've prepared the following reprex by simplifying my case:

First prepare a text file with some nasty UTF8 characters:

library(readr)

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)

Next, let's read from the file, 5 lines at a time:

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  print(problems(lines))
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}

Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line:

result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
duplicated(result_check[result_check != ""])

It turns out that the line starting with 014 is read twice. I suspect that "\U000d" (carriage return) is treated as a line break while reading the file, but not when counting the number of lines to be skipped. This causes the same line to be read twice. Is this intended (in which case it should be documented), or not (in which case, can it be fixed)?

This is my sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_2.1.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.9.0     dbplyr_2.3.2     cellranger_1.1.0 compiler_4.1.1   tools_4.1.1      digest_0.6.31   
 [8] bit_4.0.4        tibble_3.2.1     jsonlite_1.8.4   evaluate_0.21    RSQLite_2.2.8    memoise_2.0.1    lifecycle_1.0.3 
[15] lattice_0.20-44  pkgconfig_2.0.3  rlang_1.1.0      DBI_1.1.3        cli_3.4.1        rstudioapi_0.13  parallel_4.1.1  
[22] fastmap_1.1.0    withr_2.5.0      dplyr_1.1.2      httr_1.4.6       stringr_1.5.0    xml2_1.3.2       hms_1.1.2       
[29] generics_0.1.3   vctrs_0.6.2      rappdirs_0.3.3   tidyselect_1.2.0 bit64_4.0.5      grid_4.1.1       glue_1.6.2      
[36] R6_2.5.1         fansi_1.0.3      readxl_1.3.1     vroom_1.6.0      tzdb_0.1.2       blob_1.2.3       magrittr_2.0.3  
[43] ellipsis_0.3.2   leaps_3.1        rvest_1.0.1      utf8_1.2.2       stringi_1.7.6    cachem_1.0.6     crayon_1.5.2    

hadley commented 1 year ago

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

pepijn-devries commented 1 year ago

Here's the reprex created with the reprex package. Hopefully this is more helpful...

library(readr)
#> Warning: package 'readr' was built under R version 4.1.3

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  ## print(problems(lines)) ## commented out for brevity
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

## Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line
result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
table(duplicated(result_check[result_check != ""]))
#> 
#> FALSE  TRUE 
#>   127     1

## It turns out that the line starting with '014' is read twice.

Created on 2023-08-01 with reprex v2.0.2

hadley commented 1 year ago

I can't replicate it:

library(readr)

lines <- paste0(
  sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
  unlist(lapply(as.raw(1:255), rawToChar))  ## generate some nasty UTF8 characters
)
path <- tempfile()
writeLines(lines, path)

chunk_size <- 5
skips <- c(0, seq_len(length(lines) %/% chunk_size) * chunk_size)
chunks <- lapply(skips, \(skip) read_lines(path, skip = skip, n_max = chunk_size))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

id <- stringr::str_sub(unlist(chunks), 1, 3)
id[duplicated(id)]
#> character(0)

Created on 2023-08-01 with reprex v2.0.2

But I'm suspicious that your example is just tripping up on byte 13 ("\r"), which is the carriage return, and I see you are on Windows.

pepijn-devries commented 1 year ago

You are right, after your comment I tried running my reprex on a Linux machine:

R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

There I don't get the issue described here, so it seems to be a Windows-specific issue. For my specific case I have created a workaround. But I was hoping that readr would provide a platform-independent solution for reading files, i.e. produce the same result on each platform for the same file. If this is not possible then c'est la vie, but maybe this issue can be documented or produce a warning?
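
For readers hitting the same problem, one possible workaround (a sketch only, not necessarily the one used here) is to avoid skip altogether and read chunks through an open base R connection, so the file position carries over between chunks and no lines have to be re-counted. The names read_in_chunks and process_chunk are hypothetical, and the small test file stands in for the real data:

```r
# Sketch: chunked reading via an open connection (base R only).
# `read_in_chunks` and `process_chunk` are hypothetical names, not readr API.
read_in_chunks <- function(path, chunk_size, process_chunk) {
  con <- file(path, open = "rb")  # binary mode: no newline translation on open
  on.exit(close(con))
  out <- list()
  repeat {
    # readLines() continues from the current position of the connection
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    out[[length(out) + 1]] <- process_chunk(lines)
  }
  out
}

# Small stand-in file with 12 identifiable lines
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%03i", 1:12), path)

chunks <- read_in_chunks(path, chunk_size = 5, process_chunk = identity)
seen <- unlist(chunks)
anyDuplicated(seen)  # 0 means no line was returned twice
```

Note that readLines() accepts LF, CRLF and CR as line terminators, so a lone "\r" would still split a line here; the point of the sketch is only that skipping and reading can no longer disagree.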

hadley commented 1 year ago

We can look into it, but would you mind having a go at a simpler reprex? I'm pretty sure the problem is related to having one line that uses \r\n where all the other lines use \n.
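
A quick way to check for mixed line endings like this is to count the terminators in the raw bytes of the file. The following is a minimal base R sketch; count_line_endings is a hypothetical helper, and the sample file is written with writeBin() so that no newline translation happens on any platform:

```r
# Sketch: count CRLF, lone LF and lone CR terminators in a file (base R only).
# `count_line_endings` is a hypothetical helper name.
count_line_endings <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  is_cr <- bytes == as.raw(0x0d)
  is_lf <- bytes == as.raw(0x0a)
  next_is_lf <- c(is_lf[-1], FALSE)  # is the following byte a LF?
  prev_is_cr <- c(FALSE, is_cr[-n])  # is the preceding byte a CR?
  c(
    crlf    = sum(is_cr & next_is_lf),
    lf_only = sum(is_lf & !prev_is_cr),
    cr_only = sum(is_cr & !next_is_lf)
  )
}

# Sample file with deliberately mixed endings, written byte for byte
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("001\n002\r\n003\n004\r"), path)

counts <- count_line_endings(path)
counts
#>    crlf lf_only cr_only
#>       1       2       1
```

More than one non-zero count suggests the file mixes conventions, which is the situation suspected above.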

pepijn-devries commented 1 year ago

Sure, I will have a look whether I can pinpoint further what causes the issue. This might take some time...

pepijn-devries commented 1 year ago

It was easier to simplify the issue than I thought. You are right that \r is triggering funny behaviour on Windows:

library(readr)
#> Warning: package 'readr' was built under R version 4.1.3

text <- "001\n002\n\r003\n004"

read_lines(text, skip = 0)
#> [1] "001"   "002"   "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002"   "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"

Created on 2023-08-01 with reprex v2.0.2

On Windows, reading from the text comes to a halt when using skip = 2 in the reprex and just returns an empty string. On Linux the code above behaves as expected. On Linux \r is read as a separate line, whereas on Windows it is treated as part of the line containing 003.

Preferably, the same text is interpreted the same on each platform, or the user should be able to indicate which characters should be interpreted as a line feed.
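
To make the expected consistency concrete, here is a minimal base R sketch in which the line terminator is chosen explicitly and skipping is done on the already-split lines, so skipping and reading cannot disagree. This is not how readr works internally; split_on and take_lines are hypothetical helpers:

```r
# Sketch: explicit line terminator plus skip/n_max applied to the split result.
# `split_on` and `take_lines` are hypothetical helpers, not readr functions.
split_on <- function(text, eol = "\n") {
  strsplit(text, eol, fixed = TRUE)[[1]]
}

take_lines <- function(lines, skip = 0, n_max = Inf) {
  idx <- seq_along(lines)
  lines[idx > skip & idx <= skip + n_max]
}

text <- "001\n002\n\r003\n004"
lines <- split_on(text, eol = "\n")

take_lines(lines, skip = 2)
#> [1] "\r003" "004"
take_lines(lines, skip = 2, n_max = 1)
#> [1] "\r003"
```

With only "\n" treated as a terminator, skip = k always drops exactly the first k elements of the skip = 0 result, which is the property the reprex shows being violated.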

hadley commented 1 year ago

I get the same behaviour on my mac, so it's great that we have a platform independent reprex (possibly because you're no longer saving the string to disk, which can do weird things to newlines).

library(readr)

text <- "001\n002\n\r003\n004"

read_lines(text, skip = 0)
#> [1] "001"   "002"   "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002"   "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"

Created on 2023-08-01 with reprex v2.0.2

pepijn-devries commented 3 weeks ago

Is there any progress to report on this bug? It seems to be still present in the latest release...

hadley commented 3 weeks ago

@pepijn-devries if there was progress, you can assume it would be reported here...