Open pepijn-devries opened 1 year ago
Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.
Here's the reprex created with the reprex package. Hopefully this is more helpful...
library(readr)
#> Warning: package 'readr' was built under R version 4.1.3
dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255), ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n" ## add line end
  )
## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")
## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)
chunk_size <- 5
lines_read <- 0
result <- character(0)
repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  ## print(problems(lines)) ## commented out for brevity
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
## Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line
result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
table(duplicated(result_check[result_check != ""]))
#>
#> FALSE TRUE
#> 127 1
## It turns out that the line starting with '014' is read twice.
Created on 2023-08-01 with reprex v2.0.2
I can't replicate it:
library(readr)
lines <- paste0(
  sprintf("%03i", 1:255), ## add as identifier at the beginning of the line
  unlist(lapply(as.raw(1:255), rawToChar)) ## generate some nasty UTF8 characters
)
path <- tempfile()
writeLines(lines, path)
chunk_size <- 5
skips <- c(0, seq_len(length(lines) %/% chunk_size) * chunk_size)
chunks <- lapply(skips, \(skip) read_lines(path, skip = skip, n_max = chunk_size))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
id <- stringr::str_sub(unlist(chunks), 1, 3)
id[duplicated(id)]
#> character(0)
Created on 2023-08-01 with reprex v2.0.2
But I suspect that your example is just tripping up on byte 13 ("\r", the carriage return, which sits on the line with identifier 013), and I see you are on Windows.
You are right, after your comment I tried running my reprex on a Linux machine:
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
There I don't get the issue described here. So it seems to be a Windows-specific issue. For my specific case I have created a work-around. But I was hoping that readr would provide a platform-independent solution for reading files, i.e. produce the same result on each platform for the same file. If this is not possible, then c'est la vie, but maybe this issue can be documented or produce a warning?
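One possible way to read a file in chunks without relying on skip is sketched below. It is not necessarily the work-around mentioned above: it assumes that base R's readLines on a single open connection is an acceptable substitute for read_lines, and read_in_chunks is a hypothetical helper name. Because the connection remembers its position, no separate line count is needed. Note that readLines accepts LF, CRLF and CR as line endings, so a stray "\r" would still split a line in two.

```r
## Hypothetical helper (not part of readr): read a text file chunk by chunk
## from one open connection, so that no `skip` offset has to be computed.
read_in_chunks <- function(path, chunk_size = 5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  result <- character(0)
  repeat {
    ## each call continues where the previous one stopped
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    result <- c(result, lines)
  }
  result
}

## small demonstration file with 12 numbered lines
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%03i", 1:12), path)
ids <- read_in_chunks(path, chunk_size = 5)
anyDuplicated(ids)  # 0 means no line was read twice
```

In a real pipeline the body of the loop would process and write out each chunk instead of accumulating everything in memory.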
We can look into it, but would you mind having a go at a simpler reprex? I'm pretty sure the problem is related to having one line that uses \r\n where all the other lines use \n.
Sure, I will have a look whether I can pinpoint further what causes the issue. This might take some time...
It was easier to simplify the issue than I thought. You are right that \r is triggering funny behaviour on Windows:
library(readr)
#> Warning: package 'readr' was built under R version 4.1.3
text <- "001\n002\n\r003\n004"
read_lines(text, skip = 0)
#> [1] "001" "002" "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002" "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"
Created on 2023-08-01 with reprex v2.0.2
On Windows, reading from the text comes to a halt when using skip = 2 in the reprex and just returns an empty string. On Linux the code above behaves as expected: \r is read as a separate line, whereas on Windows it is considered part of the same line as 003.
Preferably, the same text is interpreted the same on each platform, or the user should be able to indicate which characters should be interpreted as a line feed.
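To illustrate why the definition of a line break matters here, the sketch below counts the lines of the same string under two conventions, using only base R. It illustrates the suspected mismatch and is not readr's actual implementation; the variable names are made up for this example.

```r
text <- "001\n002\n\r003\n004"

## Convention 1: only "\n" ends a line
lf_only <- strsplit(text, "\n", fixed = TRUE)[[1]]

## Convention 2: "\r\n", "\n" and a lone "\r" all end a line
any_eol <- strsplit(text, "\r\n|\n|\r")[[1]]

length(lf_only)  # 4 lines, the third one is "\r003"
length(any_eol)  # 5 lines, the third one is empty
```

If skipping counted lines under one convention while reading split them under the other, the offsets would drift by one after the stray \r, which would be consistent with a line being returned twice.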
I get the same behaviour on my mac, so it's great that we have a platform independent reprex (possibly because you're no longer saving the string to disk, which can do weird things to newlines).
library(readr)
text <- "001\n002\n\r003\n004"
read_lines(text, skip = 0)
#> [1] "001" "002" "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002" "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"
Created on 2023-08-01 with reprex v2.0.2
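On the remark that saving a string to disk can change its newlines: one way to check what actually ends up in a file is to look at the raw bytes. The sketch below uses base R only and writes through a binary-mode connection so that no newline translation happens on any platform; the file and variable names are just for illustration.

```r
text <- "001\n002\n\r003\n004"

path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")     # binary mode: bytes are written as-is
writeLines(text, con, sep = "")    # sep = "" avoids appending an extra line end
close(con)

## read the file back as raw bytes and count the line-ending characters
bytes <- readBin(path, what = "raw", n = file.size(path))
n_cr <- sum(bytes == as.raw(0x0d))  # carriage returns
n_lf <- sum(bytes == as.raw(0x0a))  # line feeds
c(cr = n_cr, lf = n_lf)
```

Running the same check on a file written with a text-mode writeLines call on Windows would show whether extra carriage returns were added on the way to disk.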
Is there any progress to report on this bug? It seems to be still present in the latest release...
@pepijn-devries if there was progress, you can assume it would be reported here...
Thanks for your work on readr! It's most helpful, but I did come across the following problem.
I have a very large ASCII file which is too large to load entirely into memory. Therefore, I use read_lines to read it in chunks using the skip and n_max arguments, process the chunks and write the results to a file. It turned out that a specific line in the file was read twice. First I assumed that this was an error in the ASCII file, but after some testing it turned out that read_lines had read the same line twice.
It turns out that the skip argument uses a different way of evaluating the number of lines (to be skipped) than the actual reading algorithm. I've prepared the following reprex by simplifying my case:
First prepare a text file with some nasty UTF8 characters:
Next, let's read from the file, 5 lines at a time:
Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line:
It turns out that the line starting with 014 is read twice. I suspect that "\U000d" is treated as a line feed while reading the file, but not when counting the number of lines to be skipped. This causes the same line to be read twice. Is this intended (then this should be documented), or not (can this be fixed)?
This is my sessionInfo()