tpq / propr

[OFFICIAL] [CoDA] An R package to calculate proportionality and other measures for compositional data
65 stars 8 forks source link

Error in labRcpp(ncol(lr)) : negative length vectors are not allowed #13

Open grayfall opened 4 years ago

grayfall commented 4 years ago

Greetings. I'm experiencing what seems to be a size-dependent bug in propr. Here is a reproducible example

library(dplyr)
library(propr)
n.samples <- 80
n.features <- 47000
feature.counts <- rpois(n.features*n.samples, lambda=10000) %>% matrix(nrow=n.samples, ncol=n.features)

phs <- propr(feature.counts, metric='phs')

The traceback doesn't show anything particularly useful:

Error in labRcpp(ncol(lr)) : negative length vectors are not allowed

2. labRcpp(ncol(lr))
1. propr(feature.counts, metric = "phs")

Reducing the number of features to 46000 (and anything below that) "solves" the issue.

tpq commented 4 years ago

Hey! Thanks for your interest in propr!

Hm, my first guess is that you have run out of RAM. Rcpp will try to make an object size n.features * n.features, which takes up a lot of space.

Usually this gives an error like "Error: cannot allocate vector of size 3.8 Gb". When I call "labRcpp(47000)" directly... I get the same error as you.

Perhaps see here -- https://github.com/alyssafrazee/polyester/issues/41#issuecomment-249474607 -- indeed, this may be a generic way that R says "no more memory!"

tpq commented 4 years ago

If you have some programming skills, I got a very basic script that divides up the propr analysis into a few chunks, then glues those chunks back together at the end. It was intended to be used for parallelization, but could be used to reduce RAM overhead too. I think you'd just need to replace the foreach() call with a basic for-loop.

propr.dopar.zip

grayfall commented 4 years ago

@tpq Thanks for replying. I doubt this is a memory issue. I've been running this on a machine with 0.5TB of RAM, and memory consumption never went past 100GB.

tpq commented 4 years ago

Hm. labRcpp is a very simple function, written in C++ via Rcpp. Perhaps something about C++ not having enough bits for the integer. Interestingly, log(46000, base = 2) is about 46,000 -- so it looks like C++ might use 32 bits for the integer by default.

I'm not sure if I can fix this quickly, but I will have a deeper think after the weekend.

// Function to label a half matrix
// [[Rcpp::export]]
List labRcpp(int nfeats){

  int llt = nfeats * (nfeats - 1) / 2;
  Rcpp::IntegerVector partner(llt);
  Rcpp::IntegerVector pair(llt);
  int counter = 0;

  for(int i = 1; i < nfeats; i++){
    for(int j = 0; j < i; j++){
      partner(counter) = i + 1;
      pair(counter) = j + 1;
      counter += 1;
    }
  }

  return List::create(
    _["Partner"] = partner,
    _["Pair"] = pair
  );
}
grayfall commented 4 years ago

@tpq Yes, I thought this might be an overflow somewhere in the cpp code, though 47k aren't enough to overflow a 32 bit signed integer, if we are looking at

int llt = nfeats * (nfeats - 1) / 2;. 

Thanks for taking care of this issue.

Update I was wrong. While the result should fit into a 32-bit int, intermediate values can overflow making the end result negative.

grayfall commented 4 years ago

@tpq seems like one can get the same error message whilst trying to create a vector with length exceeding a 31-bit number https://stackoverflow.com/a/48676389/3846213 and https://stackoverflow.com/a/5234293/3846213.

grayfall commented 4 years ago

@tpq Sorry for spamming so hard. Here I've tested the out-of-memory assumption by writing a function that should've worked with ncol = 47000. The function does nothing but allocating an integer vector. Assuming 64-bit integers, we are dealing with mere ~8GB of RAM. Yet, we still get the same error. I am not exactly sure, why this is happening, though. Given ncol = 47000, ncol * (ncol - 1) / 2 is two times smaller than the 31 uint vector-size limit.

> library(Rcpp)
>
> cppFunction("
+ IntegerVector test(int size) {
+     int veclen = size * (size - 1) / 2;
+     IntegerVector vec(veclen);
+     return vec;
+ }
+ ")
> vec <- test(47000)
Error in test(47000) : negative length vectors are not allowed
tpq commented 4 years ago

No need to apologize -- thanks heaps for the reproducible example!! I'm actually a bit of a C++ newb, but I'm thinking that if we could replace with long int, it might work. It looks like it helps with the test function. I can try something similar with labRcpp

library(Rcpp)
cppFunction("IntegerVector test(long int size) {
int veclen = size * (size - 1) / 2;
IntegerVector vec(veclen);
return vec;
}
")

vec <- test(47000)
grayfall commented 4 years ago

@tpq I went over the codebase and refactored all integer arithmetic operations to avoid overflows in intermediate results. This allows us to avoid 64-bit integers. You should still add a warning in propr (and other related functions) to let users know there is an upper-limit on the number of features, because due to R's limitation, indexing doesn't work on vanilla vectors and matrices with more than 2^31 - 1 entries.

grayfall commented 4 years ago

I have retracted my PR, because the overflow fix brought out another problem with integer division (https://github.com/tpq/propr/issues/15). At this point the only sensible thing I can think of is to switch the argument type in labRcpp (and other related functions) to a 64-bit int.

grayfall commented 4 years ago

@tpq I am under the impression, that making the package compatible with the number of features I'm interested in will be a lot more complicated than changing a couple of 32-bit arguments to 64-bits. I've found other places that can overflow. For example,

std::vector<int> indexPairs(NumericMatrix & X,
                            const String op = "==",
                            const double ref = 0){

  // Index pairs by operator and reference
  std::vector<int> index;
  int nfeats = X.nrow();
  for(int i = 1; i < nfeats; i++){
    for(int j = 0; j < i; j++){

      if(op == "==" || op == "="){
        if(X(i, j) == ref){
          index.push_back(j * nfeats + i + 1);
        }
      }else if(op == ">"){
        if(X(i, j) > ref){
          index.push_back(j * nfeats + i + 1);
        }
      }else if(op == ">="){
        if(X(i, j) >= ref){
          index.push_back(j * nfeats + i + 1);
        }
      }else if(op == "<"){
        if(X(i, j) < ref){
          index.push_back(j * nfeats + i + 1);
        }
      }else if(op == "<="){
        if(X(i, j) <= ref){
          index.push_back(j * nfeats + i + 1);
        }
      }else if(op == "!="){
        if(X(i, j) != ref){
          index.push_back(j * nfeats + i + 1);
        }

      }else if(op == "all"){

        index.push_back(j * nfeats + i + 1);

      }else{

        stop("Operator not found.");
      }
    }
  }

Here j * nfeats can easily overflow a 32-bit int with large enough nfeats.

tpq commented 4 years ago

I assume it should be possible to define index as a long int too?

std::vector<long int> index;

I'll try to have a closer look at this soon -- sorry, it's been a busy week!

grayfall commented 4 years ago

@tpq I guess there are many other places like this one. And I'm not sure how this will play out in the end. I've already tried changing all ints to int64_t, but the package doesn't compile that way.

grayfall commented 4 years ago

@tpq my understanding is that R has a very complicated relationship with 64-bit integers. There is no native 64-bit integer type, and R integer vectors are indexed by 32-bit signed ints, which limits their effective size (you can still have larger vectors, but there are indexing problems). I believe that's the reason, why a simple swap of integer type doesn't work on a package level.