tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

negative length vectors are not allowed [again] #663

Closed randomgambit closed 6 years ago

randomgambit commented 7 years ago

Hello!

Coming from https://github.com/tidyverse/readr/issues/403. I use the latest version of dplyr, but I get

Error in .Call("readr_read_connection_", PACKAGE = "readr", con, chunk_size) : 
  negative length vectors are not allowed

when loading a large zipped csv. I cannot post the data, but I am happy to provide you with any info useful to find out what's going on. Some ideas

  1. The variable names contain spaces and square brackets (if that matters), e.g. Date[L] or Exch Time.
  2. For some variables there are many, many missing values at the beginning of the data.

Thanks!

jerryysw commented 7 years ago

I've been trying to troubleshoot a very similar issue to @randomgambit's for 2-3 days now, and have visited #403 and related threads a few times. Like them, I am working with tick-by-tick data from a major exchange and don't think I am allowed to post the data.

I am using read_delim_chunked with an associated callback that does nothing more than format a date column and filter for specific dates, and I get the exact same error message. My data is 4 GB for a year's worth, and a sample day has roughly 113,000 rows.

The only thing I could think of doing so far was to first simply take the first 500k lines, and write it back to a .csv.gz format. My code works when pointed at this smaller test file, but produces the above error when run on the larger input.
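For reference, the chunked-read-plus-callback pattern described above looks roughly like this. This is a minimal, self-contained sketch: the file, column names (trade_date, price), and dates are made up for illustration and are not the reporter's actual data.

```r
library(readr)
library(dplyr)

# Write a tiny stand-in CSV so the sketch is runnable end to end.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(trade_date = c("2017-01-03", "2017-01-04", "2017-01-05"),
                     price = c(10.1, 10.2, 10.3),
                     stringsAsFactors = FALSE), tmp)

keep_dates <- as.Date(c("2017-01-03", "2017-01-04"))

# The callback formats the date column and filters to the wanted dates;
# DataFrameCallback accumulates the surviving rows of every chunk.
cb <- DataFrameCallback$new(function(chunk, pos) {
  chunk %>%
    mutate(trade_date = as.Date(trade_date)) %>%
    filter(trade_date %in% keep_dates)
})

ticks <- read_delim_chunked(tmp, callback = cb, delim = ",", chunk_size = 1)
nrow(ticks)
```

On the real multi-gigabyte input the same pattern applies; only chunk_size would be larger.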

jimhester commented 7 years ago

Having the full traceback() would be nice, but I cannot reproduce the error, so unless you are able to provide synthetic data that exhibits it we won't be able to find the cause.

RS-eco commented 7 years ago

I have the same problem when using read_delim_chunked with an associated callback, just like @jerryysw. traceback() gives me this:

5: .Call("readr_read_tokens_chunked_", PACKAGE = "readr", sourceSpec, 
       callback, chunkSize, tokenizerSpec, colSpecs, colNames, locale_, 
       progress)
4: read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, 
       col_names, locale_, progress)
3: read_tokens_chunked(ds, callback = callback, chunk_size = chunk_size, 
       tokenizer, spec$cols, names(spec$cols), locale_ = locale, 
       progress = progress)
2: read_delimited_chunked(file, callback = callback, chunk_size = chunk_size, 
       tokenizer, col_names = col_names, col_types = col_types, 
       locale = locale, skip = skip, comment = comment, guess_max = guess_max, 
       progress = progress)
1: read_delim_chunked(file = gbif_file, skip = 1000, callback = append_to_sqlite, 
       delim = ",", chunk_size = 1e+05, col_names = colnames_gbif, 
       col_types = cols(gbifid = col_integer(), datasetkey = col_integer(), 
           occurrenceid = col_character(), kingdom = col_character(), 
           phylum = col_character(), class = col_character(), order = col_character(), 
           family = col_character(), genus = col_character(), species = col_character(), 
           countrycode = col_character(), locality = col_character(), 
           decimallatitude = col_double(), decimallongitude = col_double(), 
           coordinateuncertaintyinmeters = col_double(), coordinateprecision = col_double(), 
           elevation = col_double(), elevationaccuracy = col_double(), 
           depth = col_double(), depthaccuracy = col_double(), eventdate = col_date(format = ""), 
           day = col_integer(), month = col_integer(), year = col_integer(), 
           taxonkey = col_integer(), specieskey = col_integer(), 
           basisofrecord = col_character(), typestatus = col_character(), 
           issue = col_character()))

Any suggestions what is going wrong here?

djvanderlaan commented 7 years ago

Somebody reported this error on r-devel (http://r.789695.n4.nabble.com/readLines-segfaults-on-large-file-amp-question-on-how-to-work-around-td4745206.html#a4745218). I have generated an example that reproduces this error.

Generate a file with one long line (more than 2^31 characters):

l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE), collapse="")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
  writeLines(l, con, sep = "")
}
close(con)

Reading that with read_file generates the error:

library(readr)
read_file("test.txt")

So, it seems that one cause of this error message is character strings longer than maxint, probably an integer overflow.
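The arithmetic behind that claim can be checked directly: the generated line is 1e6 characters written 2500 times with no separator, so the file is a single 2.5e9-character line, past R's per-string limit.

```r
# 2500 repetitions of a 1e6-character string, concatenated with sep = "",
# yield one line of 2.5e9 characters -- beyond the 2^31 - 1 limit.
total_chars <- 2500 * 1e6
total_chars > .Machine$integer.max
```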

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.3   Rcpp_0.12.12   rlang_0.1.2  

dtburk commented 6 years ago

Here's another reproducible example matching the situation in which I ran into this error (a fixed-width file with about 35 million lines):

library(readr)
library(magrittr)

N_LINES <- 3.5e7

gzf <- gzfile("fwf.dat.gz", "w")

sample(0:9, 70, replace = TRUE) %>%
  paste0(collapse = "") %>%
  rep(N_LINES) %>%
  write_lines(path = gzf)

close(gzf)

read_fwf("fwf.dat.gz", 
         fwf_widths(
           rep(7, 10)
         )
)

Some additional notes:

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5 readr_1.1.1 

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.4   Rcpp_0.12.12   rlang_0.1.2 

AHoerner commented 6 years ago

I can offer a semi-reproducible example here. IPUMS-CPS makes a rationalized, harmonized version of the Current Population Survey available for free, but subject to a non-commercial use license which includes the requirement that you not pass the data on. So I cannot send you the relevant data set, but you can download it yourself.

If you download a file with the person-level technical and core demographic variables, in the .dta format, then PER_TECH_DEMO <- as_factor(read_dta("cps_00128.dta.gz")) generates the error message: "Error in readconnection(con) : negative length vectors are not allowed." (Your filename, assigned by IPUMS, will be different.) However, if you download an otherwise identical file omitting the replicate weight variables in the person:technical part of the file, then haven handles it without further difficulty.

Note that although the replicate weights appear on the download page as a single variable, they are actually 160 9-digit variables, and substantially expand any download containing them.

Hope that is helpful. --andrewH

jimhester commented 6 years ago

At least in @djvanderlaan's case this is a limitation in the size of R's character strings. Each string can be at most 2^31-1 bytes long. This means you cannot read a file larger than 2^31-1 bytes with read_file(), as it reads the full file into a single character string.

https://github.com/tidyverse/readr/commit/6aad83b657598d2a4197407c04e0858bbf776de2 now uses the same error message R uses for this case, which shows the problem more clearly:

'R character strings are limited to 2^31-1 bytes'
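For files past that limit, one workaround (assuming the goal is to process the contents rather than hold them in one string) is chunked line reading, which never builds a single giant string. A minimal sketch on a tiny stand-in file:

```r
library(readr)

# Tiny stand-in file; in practice this pattern sidesteps the 2^31 - 1
# single-string limit because each chunk of lines is a separate vector.
tmp <- tempfile()
writeLines(rep("some text", 5), tmp)

n_chars <- 0
read_lines_chunked(tmp, SideEffectChunkCallback$new(function(lines, pos) {
  # Accumulate a running total instead of keeping the text around.
  n_chars <<- n_chars + sum(nchar(lines))
}), chunk_size = 2)
n_chars
```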

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/