malucalle / selbal

selbal: selection of balances for microbial signatures

error: Response and predictor must be vectors of the same length #38

Open nbat64 opened 2 months ago

nbat64 commented 2 months ago

Hello,

I would like to try selbal on my dataset. However, when I run selbal.cv() with a single dichotomous variable as y, I get the following error at step 1: task 1 failed - "Response and predictor must be vectors of the same length."

My command is:

xRaw <- mifilterraw_selbal_MonoNeg[,1:821]
print(dim(xRaw))
yRaw <- as.factor(mifilterraw_selbal_MonoNeg[,822]) #metadata selected TypeResult
print(length(yRaw))
length(yRaw) == nrow(xRaw)

mifilterraw.selbal_MonoNeg <- selbal::selbal.cv(x = xRaw, y = yRaw, n.fold = 5, n.iter = 10,
                                                logit.acc = "AUC", zero.rep = "bayes")

I have checked my x read-count matrix and my y response vector, and they have the same size (length(y) == nrow(x)). I get many warnings about columns with too many zeros/unobserved values. Could that explain the issue?

I thank you in advance for the help.

Regards

Nicolas

IkaStat commented 2 months ago

Hi!

It seems to be an issue with the roc function (line 1943 of Selbal_Functions.R). Here is a post that discusses the error message you are seeing.

It is difficult to know exactly what is going on with your data, but the model may be generating "NA" values for the predictions (as explained in the post).

What can you do?

First, I would try running the code again, ignoring those rows whose values are mainly zeros, and check whether that works. Then come back and tell us if it worked for you.
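A minimal sketch of that filtering in base R, using toy data in place of the real table. The 0.8 threshold is an assumption taken from the warning text, and the object names (x_filt, y_filt) are hypothetical. The key point is that the same row index is applied to both x and y so they stay the same length.

```r
# Toy data standing in for the real inputs (hypothetical values).
x <- matrix(1:60, nrow = 6,
            dimnames = list(paste0("sample", 1:6), paste0("OTU", 1:10)))
x[2, ] <- c(5, rep(0, 9))   # sample2 now has 90% zeros
y <- factor(c("Negative", "MonoInfec", "Negative",
              "MonoInfec", "Negative", "MonoInfec"))

# Proportion of zero (or missing) counts in each row (sample).
zero_frac <- rowMeans(x == 0 | is.na(x))

# Keep samples at or below the threshold (0.8 is assumed, see above).
keep <- zero_frac <= 0.8

# Apply the SAME index to x and y so they remain aligned.
x_filt <- x[keep, , drop = FALSE]
y_filt <- droplevels(y[keep])

stopifnot(nrow(x_filt) == length(y_filt))
```

The filtered objects could then be passed to selbal.cv() in place of the originals.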

Let's see if we can solve your problem step by step!

nbat64 commented 1 month ago

Hi @IkaStat

Thanks for your answer. I am looking at the post you sent. However, I don't have any missing data in my response variable (y), which is dichotomous ("Negative", "MonoInfec").

Most of the row (OTU) have more than 80% of zeros/unobserved values and are in fact deleted by selbal. Here is the full error message:

# Starting the cross - validation procedure . . .Error in { : 
  task 1 failed - "Response and predictor must be vectors of the same length."
In addition: Warning messages:
1: In cmultRepl(x, suppress.print = T) :
  Column no. 1 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 2 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 3 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 4 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 5 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 6 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 7 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 8 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 9 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 10 containing >80% zeros/unobserved values deleted (see arguments z.warning and  [… truncated]
2: In cmultRepl(x, suppress.print = T) :
  Row no. 28 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 40 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 78 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 89 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 143 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 161 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 189 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 218 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 225 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 233 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row  [… truncated]
3: In e$fun(obj, substitute(ex), parent.frame(), e$data) :
  already exporting variable(s): logit.acc

IkaStat commented 1 month ago

The rows must correspond to the samples and the columns to the variables, so that could be the origin of the issue. If your rows are the OTUs and your columns are your samples, you have to use the transposed matrix.
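As a small illustration with a made-up table (the object names and values here are hypothetical), transposing with base R's t() turns an OTU-by-sample matrix into the sample-by-OTU layout described above:

```r
# Hypothetical OTU-by-sample table: 4 OTUs (rows) x 3 samples (columns).
otu_by_sample <- matrix(1:12, nrow = 4,
                        dimnames = list(paste0("OTU", 1:4),
                                        paste0("sample", 1:3)))
y <- factor(c("Negative", "MonoInfec", "Negative"))

# Transpose so that samples are in rows and OTUs in columns.
x <- t(otu_by_sample)

# After transposing, the number of rows must match the length of y.
stopifnot(nrow(x) == length(y))
dim(x)   # 3 4
```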

Let's see if that fixes the problem.

nbat64 commented 1 month ago

My rows are already samples (328 rows) and my columns are OTU taxa names (821); the last column (822) is my response variable.

IkaStat commented 1 month ago

OK, I read "Most of the row (OTU) . . ." in your previous message and I thought that was the mistake.

Without a reproducible example it is difficult to guess where the issue is, but I think it comes from the roc function. I suggest you run the code line by line and see where the issue appears. It is a lot of work, but it is also the only way to see where it fails.

nbat64 commented 1 month ago

Hi, I ran the code step by step and I think I found the issue. It is due to the high proportion of missing data in some of my samples. At line 779:

 logc <- log(cmultRepl2(x, zero.rep = zero.rep))

The function cmultRepl2 (line 2227) calls new.x <- cmultRepl(x, suppress.print = T), whose default parameters are z.warning = 0.8 and z.delete = TRUE. It therefore deletes the samples with more than 80% missing data from the new object logc, but not from numy, my dichotomous response variable.

I think that is why the code later fails in pROC with the error message "Response and predictor must be vectors of the same length.".
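The mismatch can be reproduced on toy data. In the sketch below, drop_sparse_rows is a hypothetical stand-in that only mimics the row deletion described above; it is not the real cmultRepl, and the data are made up.

```r
# Hypothetical stand-in for the row deletion described above. It is NOT
# the real cmultRepl; it only mimics "delete rows with >80% zeros".
drop_sparse_rows <- function(x, z.warning = 0.8) {
  x[rowMeans(x == 0) <= z.warning, , drop = FALSE]
}

x <- matrix(1:40, nrow = 4)       # 4 samples x 10 OTUs, no zeros yet
x[3, ] <- c(rep(0, 9), 7)         # sample 3 now has 90% zeros
y <- factor(c("Negative", "MonoInfec", "Negative", "MonoInfec"))

x_new <- drop_sparse_rows(x)

nrow(x_new)   # 3: one sample was dropped
length(y)     # 4: the response was not subset

# Any later step that pairs predictions built from x_new with y
# now receives vectors of different lengths.
lengths_match <- nrow(x_new) == length(y)
lengths_match   # FALSE
```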

For now, I have removed the sample rows flagged by the warning from my dataset, and it works. I could also change the call to new.x <- cmultRepl(x, suppress.print = T, z.delete = FALSE), but how can I include that change in the selbal package installed in R?

Thanks

IkaStat commented 1 month ago

Hi @nbat64!

Thank you for your comment!

When using selbal, we suggest not including OTUs with more than 80% zeros. If you consider them important, include them as a presence/absence covariate, in other words, as a vector where 0 means the OTU is absent from the sample and 1 means it is present.
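A minimal sketch of building such a covariate in base R, on made-up counts. The object names (covar, x_dense) and the "_present" suffix are hypothetical, and passing the result through a covar argument of selbal.cv() is an assumption that should be checked against the package documentation.

```r
# Toy counts: 6 samples (rows) x 3 OTUs (columns), filled column by column.
x <- matrix(c(0, 0, 0, 0, 0, 3,    # OTU1: zero in 5 of 6 samples
              4, 2, 0, 5, 1, 6,    # OTU2
              7, 1, 3, 2, 8, 4),   # OTU3
            nrow = 6,
            dimnames = list(paste0("sample", 1:6), paste0("OTU", 1:3)))

# Flag OTUs with more than 80% zeros.
sparse <- colMeans(x == 0) > 0.8

# Presence/absence covariates: 1 if the OTU is present in the sample, else 0.
covar <- as.data.frame((x[, sparse, drop = FALSE] > 0) * 1)
names(covar) <- paste0(names(covar), "_present")

# Remaining OTUs are kept as the compositional input.
x_dense <- x[, !sparse, drop = FALSE]

# Assumed usage (not run here; check the argument name in the selbal docs):
# selbal::selbal.cv(x = x_dense, y = y, covar = covar)
```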

Nevertheless, what you suggest (including z.delete = FALSE) could be a good proposal, provided we first check that the code runs without errors.

Thank you!