statOmics / msqrob2

Implementation of the MSqRob analysis of differentially expressed proteins using the Features infrastructure
9 stars 10 forks source link

MsqRobHurdle #36

Open JobbeG opened 2 years ago

JobbeG commented 2 years ago

The problem concerns the msqrobHurdle model, I've been trying to run it for quite a while now and I keep on getting subscript too long errors, which if I'm not mistaken, points towards problems with memory overflow. However, I've got a machine with 16 GB of working memory and 8 cores and do not have access to an even more powerful laptop, is there a workaround?

Additionally I get a few errors every now and then that seem unrelated to this issue and I can't understand the logic behind them but it seems to have something to do with the datatypes of my random factors in my annotation table: This is the error I get when I append them as they were coerced by R, which is numeric: Error in if (m == 0) { : missing value where TRUE/FALSE needed In addition: There were 50 or more warnings (use warnings() to see the first 50)

To circumvent it, I tried forcing them as factors. This make the code run up until the subscript too long error pops up, however, then I do get the warning: '|' is not meaninful for factors (I specified them as random effects with the following syntax: 1|randomfactor1 + 1|randomfactor.

Subsequently I tried forcing them to character strings, but this leads to error again.

Any help or clarification on where it goes wrong would be dearly appreciated.

Best regards

Jobbe

JobbeG commented 2 years ago

Extra info: apparently it could also be due to problems with dimensions not matching, however, I have 106 samples. I checked the dimensions of my Qfeatures object and it has the right dimensions. Furthermore I've checked under colData, and here I could also find 106 values.

JobbeG commented 2 years ago

I further checked the problem, and the problem disappears as soon as I drop my random effects, however, I would prefer to keep my random effects. They code badge effects for trypsin (6 levels) and badge effects for MS_batch (7 levels). Since it are many levels, it doesn't seem suited to code it as a bunch of fixed dummy variables. However, it doesn't seem to work with the standard lmer syntax, although it is the implementation within Msqrob if I'm not mistaken?

ococrook commented 2 years ago

Hey Jobbe,

I'll try and take a look a this. Any chance you could post a small reproducible example of the error and a copy the output and your sessionInfo?

Do you get errors when you fit a different model?

JobbeG commented 2 years ago

Hey Oliver,

Well, I'm not allowed to share my dataset, but I can copy the code and I'll make a dummy dataset and annotation table to go with it. Is it okay if I also comment on some subsequent issues here? Or would you rather have that I open a new issue for that? The thing is that issues persist even when I ignore the random variables. Is there a place where I can find more on the prerequisites for running a hurdle model?

JobbeG commented 2 years ago

I can't seem to add them as files, I'll try copy the code here:

exp_annotation_xlsx <- openxlsx::read.xlsx("C:/Users/jobbe/Documents/Data analyse/Takeda processed files/MSQROB/Annotation_Table2.xlsx") pepHurdle <- MSnbase::grepEcols("C:/Users/jobbe/Documents/Data analyse/Takeda processed files/MSQROB/peptides7.txt", "Intensity ", split = "\t")

PH <- readQFeatures( table = "C:/Users/jobbe/Documents/Data analyse/Takeda processed files/MSQROB/peptides7.txt", fnames = 1, ecol = pepHurdle, name = "peptide", sep="\t")

colData(PH)$Age <- exp_annotation_xlsx$Age colData(PH)$Tissue_age <- exp_annotation_xlsx$Tissue_age colData(PH)$ExpGroup <- as.factor(exp_annotation_xlsx$Expgroup) colData(PH)$ExpGroupR <- as.factor(exp_annotation_xlsx$ExpgroupR) colData(PH)$Gender <- as.character(exp_annotation_xlsx$Gender) colData(PH)$MS_batch <- exp_annotation_xlsx$MS_batch colData(PH)$Trypsin_batch <- exp_annotation_xlsx$Trypsin_batch colData(PH)$Cause_of_death <- exp_annotation_xlsx$Cause_of_death colData(PH)$FullRunName <- exp_annotation_xlsx$FullRunName colData(PH)$order <- exp_annotation_xlsx$order

Extra issue: I went through aggregateFeatures, which I rather not do since I prefer the peptidemodel without intermediate. However, I do not have the skill to implement this through the code with the Hurdle paper and the MsqRob2 package implementation of the hurdle model doesn't seem to accept peptide level models.

PH <- aggregateFeatures(PH, i = "peptideLog", fcol = "Proteins", name = "protein")

PH <- msqrobHurdle( PH, i = "protein", ~AgeExpGroupTissue_age + (1|Trypsin_batch)+(1|MS_batch), modelColumnName = "msqrobHurdle", overwrite = FALSE, robust = TRUE, ridge = TRUE, maxitRob = 40, tol = 1e-06, doQR = TRUE, lmerArgs = list(control = lmerControl(calc.derivs = FALSE)), priorCount = 0.1, binomialBound = TRUE)

Initially my random effects were coded as numbers, however this is not desirable since these are not really meaningful numbers in my opinion. However, it didn't run either way, tried to force them as factors and as characters, but neither one worked. Extra issue 2: When I ignore the random effects, I can plot the model. (but I'd rather not ignore them)

However, I get the following warning: "50: In eval(family$initialize) : non-integer counts in a binomial glm!" Should I be concerned about this?

Additionally, when I want to extract contrasts it doesn't seem to work. I based my code on the vignettes and I've pasted it underneath.

Side note: I tried this with several different numbers between the square brackets, since it differed throughout the vignettes but all return NA.

getCoef(rowData(PH[["protein"]])$msqrobModels[[1]])

When I changed the name to match the modelcolumnname argument in the hurdle function, it no longer returns NA but it returns

"Error in (function (classes, fdef, mtable): unable to find an inherited method for function ‘getCoef’ for signature ‘"NULL"’"

getCoef(rowData(PH[["protein"]])$msqrobHurdle[[1]])

Errors are also returned for subsequent steps involving contrasts, here it is suggested that both elements do not exist,

eventhough the initial error messages where different

"Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 't': argument 2 is empty

L <- makeContrast( c( "ExpGroupPD - ExpGroupHC", ), parameterNames = rowData(PH[["protein"]])$msqrobModels[[2]] %>% getCoef %>% names )

L <- makeContrast( c( "ExpGroupPD - ExpGroupHC", ), parameterNames = rowData(PH[["protein"]])$msqrobHurdle[[2]] %>% getCoef %>% names )

JobbeG commented 2 years ago

As for the dataset, I made a randomized version of the dataset and the annotation table, which I can send if you want, the annotation table looks something like this:

Picture_AT

The peptide files is just standard MaxQuant output and too big to screenshot.

JobbeG commented 2 years ago

I have done some digging in the functions, here I found that one of the problems that is encountered can be circumvented by changing the underlying linear algebra package to lapack rather than linpack. In this way it can run past the step behind the QR-decomposition. There it now leads to NA results since it declares that the number of parameters for the decomposition does not match the original number of parameters. Lapack does return the correct number of parameters after decomposition, however, the loop where it is parallelized no longer functions. Could it be that the implementation just doesn't work for dataset where a lot of factors are modelled simultaneously?