tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

negative length vectors are not allowed [again] #663

Closed randomgambit closed 6 years ago

randomgambit commented 7 years ago

Hello!

Coming from https://github.com/tidyverse/readr/issues/403. I use the latest version of dplyr, but I get

Error in .Call("readr_read_connection_", PACKAGE = "readr", con, chunk_size) : 
  negative length vectors are not allowed

when loading a large zipped csv. I cannot post the data, but I am happy to provide you with any info useful to find out what's going on. Some ideas

  1. The variable names contain spaces and square brackets (if that matters), e.g. Date[L] or Exch Time.
  2. For some variables there are many, many missing values at the beginning of the data.

Thanks!

jerryysw commented 7 years ago

I've been trying to troubleshoot a very similar issue to @randomgambit's for 2-3 days now, and have visited #403 and related threads a few times. Like them, I am working with tick-by-tick data from a major exchange and don't think I am allowed to post the data.

I am using read_delim_chunked with an associated callback that does nothing more than format a date column and filter for specific dates, and I get the exact same error message. My data is 4 GB for a year's worth, and a sample day has roughly 113,000 rows.

The only thing I could think of doing so far was to first simply take the first 500k lines, and write it back to a .csv.gz format. My code works when pointed at this smaller test file, but produces the above error when run on the larger input.
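For reference, the chunked-read-plus-callback pattern described above looks roughly like this. This is a minimal, self-contained sketch: the file, column names (trade_date, price), and dates are made up for illustration and are not the reporter's actual data.

```r
library(readr)
library(dplyr)

# Write a tiny stand-in CSV so the sketch is runnable end to end.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(trade_date = c("2017-01-03", "2017-01-04", "2017-01-05"),
                     price = c(10.1, 10.2, 10.3),
                     stringsAsFactors = FALSE), tmp)

keep_dates <- as.Date(c("2017-01-03", "2017-01-04"))

# The callback formats the date column and filters to the wanted dates;
# DataFrameCallback accumulates the surviving rows of every chunk.
cb <- DataFrameCallback$new(function(chunk, pos) {
  chunk %>%
    mutate(trade_date = as.Date(trade_date)) %>%
    filter(trade_date %in% keep_dates)
})

ticks <- read_delim_chunked(tmp, callback = cb, delim = ",", chunk_size = 1)
nrow(ticks)
```

On the real multi-gigabyte input the same pattern applies; only chunk_size would be larger.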

jimhester commented 7 years ago

Having the full traceback() would be nice, but I cannot reproduce the error, so unless you are able to provide synthetic data that exhibits it we won't be able to find the cause.

RS-eco commented 7 years ago

I have the same problem when using read_delim_chunked with an associated callback, just like @jerryysw. traceback() gives me this:

5: .Call("readr_read_tokens_chunked_", PACKAGE = "readr", sourceSpec, 
       callback, chunkSize, tokenizerSpec, colSpecs, colNames, locale_, 
       progress)
4: read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, 
       col_names, locale_, progress)
3: read_tokens_chunked(ds, callback = callback, chunk_size = chunk_size, 
       tokenizer, spec$cols, names(spec$cols), locale_ = locale, 
       progress = progress)
2: read_delimited_chunked(file, callback = callback, chunk_size = chunk_size, 
       tokenizer, col_names = col_names, col_types = col_types, 
       locale = locale, skip = skip, comment = comment, guess_max = guess_max, 
       progress = progress)
1: read_delim_chunked(file = gbif_file, skip = 1000, callback = append_to_sqlite, 
       delim = ",", chunk_size = 1e+05, col_names = colnames_gbif, 
       col_types = cols(gbifid = col_integer(), datasetkey = col_integer(), 
           occurrenceid = col_character(), kingdom = col_character(), 
           phylum = col_character(), class = col_character(), order = col_character(), 
           family = col_character(), genus = col_character(), species = col_character(), 
           countrycode = col_character(), locality = col_character(), 
           decimallatitude = col_double(), decimallongitude = col_double(), 
           coordinateuncertaintyinmeters = col_double(), coordinateprecision = col_double(), 
           elevation = col_double(), elevationaccuracy = col_double(), 
           depth = col_double(), depthaccuracy = col_double(), eventdate = col_date(format = ""), 
           day = col_integer(), month = col_integer(), year = col_integer(), 
           taxonkey = col_integer(), specieskey = col_integer(), 
           basisofrecord = col_character(), typestatus = col_character(), 
           issue = col_character()))

Any suggestions what is going wrong here?

djvanderlaan commented 7 years ago

Somebody reported this error on r-devel (http://r.789695.n4.nabble.com/readLines-segfaults-on-large-file-amp-question-on-how-to-work-around-td4745206.html#a4745218). I have generated an example that reproduces this error.

Generate a file with one long line (more than 2^31 characters):

l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE), collapse="")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
  writeLines(l, con, sep = "")
}
close(con)

Reading that with read_file generates the error:

library(readr)
read_file("test.txt")

So, it seems that one cause of this error message is character strings longer than maxint, probably an integer overflow.
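The arithmetic behind that claim can be checked directly: the generated line is 1e6 characters written 2500 times with no separator, so the file is a single 2.5e9-character line, past R's per-string limit.

```r
# 2500 repetitions of a 1e6-character string, concatenated with sep = "",
# yield one line of 2.5e9 characters -- beyond the 2^31 - 1 limit.
total_chars <- 2500 * 1e6
total_chars > .Machine$integer.max
```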

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.3   Rcpp_0.12.12   rlang_0.1.2  

dtburk commented 6 years ago

Here's another reproducible example matching the situation in which I ran into this error (a fixed-width file with about 35 million lines):

library(readr)
library(magrittr)

N_LINES <- 3.5e7

gzf <- gzfile("fwf.dat.gz", "w")

sample(0:9, 70, replace = TRUE) %>%
  paste0(collapse = "") %>%
  rep(N_LINES) %>%
  write_lines(path = gzf)

close(gzf)

read_fwf("fwf.dat.gz", 
         fwf_widths(
           rep(7, 10)
         )
)

Some additional notes:

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5 readr_1.1.1 

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.4   Rcpp_0.12.12   rlang_0.1.2 

AHoerner commented 6 years ago

I can offer a semi-reproducible example here. IPUMS-CPS makes a rationalized, harmonized version of the Current Population Survey available for free, but subject to a non-commercial use license which includes the requirement that you not pass the data on. So I cannot send you the relevant data set, but you can download it yourself.

If you download a file with the person-level technical and core demographic variables, in the .dta format, then PER_TECH_DEMO <- as_factor(read_dta("cps_00128.dta.gz")) generates the error message: "Error in readconnection(con) : negative length vectors are not allowed." (Your filename, assigned by IPUMS, will be different.) However, if you download an otherwise identical file omitting the replicate weight variables in the person:technical part of the file, then haven handles it without further difficulty.

Note that although the replicate weights appear on the download page as a single variable, they are actually 160 9-digit variables, and substantially expand any download containing them.

Hope that is helpful. --andrewH

jimhester commented 6 years ago

At least in @djvanderlaan's case this is a limitation in the size of R's character strings. Each string can be at most 2^31-1 bytes long. This means you cannot read a file larger than 2^31-1 bytes with read_file(), as it reads the full file into a single character string.

https://github.com/tidyverse/readr/commit/6aad83b657598d2a4197407c04e0858bbf776de2 now uses the same error message R uses for this case, which shows the problem more clearly:

'R character strings are limited to 2^31-1 bytes'
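For files past that limit, one workaround (assuming the goal is to process the contents rather than hold them in one string) is chunked line reading, which never builds a single giant string. A minimal sketch on a tiny stand-in file:

```r
library(readr)

# Tiny stand-in file; in practice this pattern sidesteps the 2^31 - 1
# single-string limit because each chunk of lines is a separate vector.
tmp <- tempfile()
writeLines(rep("some text", 5), tmp)

n_chars <- 0
read_lines_chunked(tmp, SideEffectChunkCallback$new(function(lines, pos) {
  # Accumulate a running total instead of keeping the text around.
  n_chars <<- n_chars + sum(nchar(lines))
}), chunk_size = 2)
n_chars
```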

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/