privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

The result is NA #172

Closed Knight1995 closed 6 months ago

Knight1995 commented 8 months ago

Thanks for the great work! I am trying to do some simple calculations on an FBM object. I want to get a result for each of the 3774 rows, but the result is NA. Could you please tell me what the problem is? Thanks!

privefl commented 8 months ago

You should start from the warning messages. X.sub is probably shorter than K1[, 2]. Also, ind spans the column indices by default in big_apply(), but here you're using it for the rows.
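
To illustrate the point about `ind`, here is a minimal sketch on a toy FBM (the 5 x 3 matrix and its values are made up for illustration, not taken from the issue). By default `big_apply()` hands `a.FUN` blocks of column indices; passing `ind = rows_along(X)` makes it hand over blocks of row indices instead, and `a.FUN` must then index rows.

```r
library(bigstatsr)

# Toy FBM with 5 rows and 3 columns (illustrative values only).
X <- FBM(5, 3, init = 1:15)

# Default: `ind` spans the columns, so a.FUN indexes columns.
col_sums <- big_apply(
  X,
  a.FUN = function(X, ind) colSums(X[, ind, drop = FALSE]),
  a.combine = 'c'
)

# Row-wise: pass `ind = rows_along(X)` and index rows inside a.FUN.
row_sums <- big_apply(
  X,
  a.FUN = function(X, ind) rowSums(X[ind, , drop = FALSE]),
  a.combine = 'c',
  ind = rows_along(X)
)

length(col_sums)  # one value per column
length(row_sums)  # one value per row
```

In both calls `ind` is a block of indices rather than a single index, so `a.FUN` should return one value per element of `ind` if the combined result is meant to have one entry per row or column.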

Knight1995 commented 8 months ago

Thanks for your quick reply! Actually, I have edited my code to calculate each row's result, but the result is still NA.

colmeans <- big_apply(X1, ind = rows_along(X), function(X, ind) {
  X.sub <- X[ind, 1]

  K1 <- map_dfr(unique(X[, 1]), function(i) {
    S1 <- mean(Y[which(X[, 1] == i), 1])
    data.frame(Value = S1, clu = i)
  })

  a <- K1[which(K1[, 2] == X.sub), 1]
  b <- min(K1[which(K1[, 2] != X.sub), 1])
  si = (b - a) / max(b, a)
  return(si)
}, a.combine = 'c')

privefl commented 8 months ago

Yes, cf. my first comment.

Knight1995 commented 8 months ago

Sorry to bother you again. When I test with a single number (ind = 1), my code works. But when I put the code into big_apply, the results are NA. What is the problem? Does this R code not work inside big_apply? Thanks.

privefl commented 8 months ago

No, your code doesn't work when using ind <- 1. It is just that X.sub is of length 1 and gets automatically recycled to match the size of K1[, 2]. Which is probably not what you want.
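
The recycling behaviour described here can be reproduced in base R. In this sketch `clu` is a hypothetical stand-in for `K1[, 2]`, and `x_one` / `x_block` stand in for `X.sub` when `ind` is a single index and when it is a block of indices.

```r
# Hypothetical stand-in for K1[, 2]: one label per unique cluster.
clu <- c(1, 2, 3)

# Length-1 case (ind = 1): the single value is silently recycled, so the
# comparison "works" without any warning.
x_one <- 2
cmp_one <- clu == x_one

# Block case: x_block has one entry per row in the block. Its length is not
# a multiple of length(clu), so R recycles `clu` and emits a warning.
x_block <- c(2, 2, 1, 3)
msg <- tryCatch(clu == x_block, warning = function(w) conditionMessage(w))

# The result has the length of the longer vector, and its elements no longer
# line up with the clusters.
cmp_block <- suppressWarnings(clu == x_block)
length(cmp_block)
```

The element-wise comparison pairs `clu[1]` with `x_block[1]`, `clu[2]` with `x_block[2]`, and so on, which is not the same as asking which cluster each row belongs to.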

privefl commented 8 months ago

You need to think about what you are trying to achieve here. If I had to guess, I would say that you need to subset K1[ind, 2].

Knight1995 commented 8 months ago

Thanks for your reply. To track down the problem, I tried a simple test. I thought the cause might be that I had not passed in one of the two variables, Y, so there was no result. But after I rewrote the code following your multivariate example (https://privefl.github.io/bigstatsr/articles/big-apply.html), there is still no output, which is very weird. Could you give me some suggestions? Thanks.

privefl commented 8 months ago

Knight1995 commented 8 months ago

'ind' is the row number; mean(Y[-ind, ]) removes that row from the matrix and computes the mean of the remaining matrix. I attached the output of summary(Y).
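
One thing worth checking with this approach: inside `big_apply()`, `ind` is a block of indices, not a single row number, so `Y[-ind, ]` drops the whole block at once. The base R sketch below uses a made-up 4 x 3 matrix `Y` and a hypothetical block `ind_block` to show the difference between that and one leave-one-out mean per row.

```r
# Toy matrix (4 rows x 3 columns), illustrative values only.
Y <- matrix(1:12, nrow = 4)

# Single index: leaves out row 2 only.
ind_one <- 2
mean_without_row2 <- mean(Y[-ind_one, ])

# Block of indices, as a.FUN would receive it: rows 1 to 3 are all removed
# together, so a single value comes back for the whole block.
ind_block <- 1:3
mean_without_block <- mean(Y[-ind_block, ])

# One leave-one-out mean per row of the block: loop over the block.
loo_means <- vapply(ind_block, function(i) mean(Y[-i, ]), numeric(1))
length(loo_means)  # same length as ind_block
```

Returning a vector with one entry per element of `ind` is what lets `a.combine = 'c'` assemble a result with one value per row.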

privefl commented 8 months ago

Knight1995 commented 8 months ago

Yes, I think I understand what you mean. I tested the simple example above to learn, step by step, how to write the a.FUN for big_apply. My original R code is below. Because the matrix is too big and the code runs too slowly, I want to implement this computation with big_apply. cluster_info and dist are the original matrices; their row names and number of rows are the same.

K3<-future_map_dfr(seq(ncol(cluster_info)),function(Y){
  K2<-map_dfr(seq(nrow(cluster_info)),function(index){
    x <-cluster_info[,Y]
    dist2 <- as.data.frame(cbind(x,dist))[-index,]

    K1<-map_dfr(unique(x),function(i){
      d<-mean(dist2[which(dist2$x==i),index+1])  
      #d<-sum(dist2[ which(dist2$x==i),index+1])/length( which(dist2$x==i))

      data.frame(Value=d,clu=i) 
    })

    si <- (min(K1[K1$clu!=x[index],]$Value)-K1[K1$clu==x[index],]$Value)/max(min(K1[K1$clu!=x[index],]$Value),K1[K1$clu==x[index],]$Value)
    if(is.na(si)){
      data.frame(cluster=x[index],sil_width=0) 
    }else{
      data.frame(cluster=x[index],sil_width=si) 
    }

  })
  data.frame(Resolution=colnames(cluster_info)[Y],silhouette_score=mean(K2$sil_width))
})

privefl commented 6 months ago

I don't get what you're trying to achieve here; sorry I cannot help.
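
For readers trying to follow the original R code above: judging by the names `sil_width` and `silhouette_score`, it appears to compute a mean silhouette width for each column of `cluster_info`. The following base R sketch is one possible vectorized reading of that inner loop, not a confirmed equivalent and not a `big_apply()` solution. It assumes `D` is a full square distance matrix with a zero diagonal and `x` is one column of cluster labels; `mean_silhouette`, `x` and `D` are hypothetical names.

```r
# Mean silhouette width for labels `x` and an n x n distance matrix `D`.
# Assumes D[i, i] == 0, so leaving a point out of its own cluster only
# changes the cluster size, not the distance sum.
mean_silhouette <- function(x, D) {
  n <- length(x)
  clusters <- unique(x)
  # M[i, k] is 1 if point i belongs to cluster k, else 0.
  M <- outer(x, clusters, "==") * 1
  # S[i, k]: sum of distances from point i to all points of cluster k.
  S <- D %*% M
  # Cluster sizes, minus 1 in the point's own cluster (the point itself).
  counts <- matrix(colSums(M), n, length(clusters), byrow = TRUE) - M
  avg <- S / counts                      # NaN (0/0) for singleton clusters
  own <- cbind(seq_len(n), match(x, clusters))
  a <- avg[own]                          # mean distance within own cluster
  avg[own] <- Inf
  b <- apply(avg, 1, min)                # nearest other cluster
  si <- (b - a) / pmax(a, b)
  si[is.na(si)] <- 0                     # same fallback as the original code
  mean(si)
}

# Toy usage: four points on a line, two obvious clusters.
D <- as.matrix(dist(c(0, 1, 10, 11)))
x <- c(1, 1, 2, 2)
score <- mean_silhouette(x, D)
```

The design choice here is to replace the per-row data frame rebuilds with a single matrix product per labelling, which is where most of the time in the nested `map_dfr()` version is likely spent; whether it fits in memory depends on the size of `dist`.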