lgatto / synapter

Label-free data analysis pipeline for optimal identification and quantitation
https://lgatto.github.io/synapter/

question about memory allocation #97

Closed pavel-shliaha closed 7 years ago

pavel-shliaha commented 9 years ago

Synapter objects have become very big now, and I am doing many synapter analyses in a loop (i.e. I perform up to 20 transfers). To save memory I am doing the following:

  1. using the same variable for the synapter analysis in the loop
  2. saving the analysis at the end of each loop iteration
  3. removing the object from memory at the end of each loop iteration, e.g.:
for (i in 1:20) {

    l <- list(identpeptide = masterFile,
              quantpeptide = FPFiles[i],
              quantpep3d = P3DFiles[i],
              fasta = fastaFile,
              quantspectra = specFiles[i])

    synapterAnalysis <- Synapter(l, master = TRUE)
    ## ... (analysis here)
    saveRDS(synapterAnalysis, file = paste0(outputFileNames[i], ".RDS"))
    rm(synapterAnalysis)
}

Nonetheless memory usage keeps growing and the computer eventually freezes, i.e. I guess the memory is still allocated even though the synapter object has been removed by rm().

See the screenshot: note that R takes 4 GB of memory after only 3 synapter analyses AND there is no synapterAnalysis object that could be accessed!

[screenshot: synapter memory allocation, 31/7/15]

sgibb commented 9 years ago

You are right: because we store the complete spectral data for the identification fragments and quantitation spectra, the synapter objects are much larger than before. Nevertheless I can't reproduce your findings (maybe it is a Windows-specific problem?). On my Linux laptop the following code runs just fine and uses more or less 1.2 GB for the whole run time.

for (i in 1:20) {
    l <- list(identpeptide = "masterFile.RDS",
              quantpeptide = "BC_F24_CW_HDMSE_01_IA_final_peptide.csv",
              quantpep3d = "BC_F24_CW_HDMSE_01_Pep3DAMRT.csv",
              fasta = "TAIR10_comb_CC.fasta",
              quantspectra = "BC_F24_CW_HDMSE_01_Pep3D_Spectrum.xml")

    synapterAnalysis <- Synapter(l, master = TRUE)
    saveRDS(synapterAnalysis, file = paste0("_synapterAnalysis", i, ".RDS"))
    rm(synapterAnalysis)
}

Maybe calling the garbage collector (via gc()) after rm helps? BTW: neither gc nor rm should be needed at all, because the variable is overwritten in the next iteration of the loop. But give it a try.
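
For illustration, a minimal sketch of the loop with an explicit gc() after rm() (same placeholder variables as in the original snippet):

for (i in 1:20) {
    l <- list(identpeptide = masterFile,
              quantpeptide = FPFiles[i],
              quantpep3d = P3DFiles[i],
              fasta = fastaFile,
              quantspectra = specFiles[i])

    synapterAnalysis <- Synapter(l, master = TRUE)
    ## ... (analysis here)
    saveRDS(synapterAnalysis, file = paste0(outputFileNames[i], ".RDS"))
    rm(synapterAnalysis)
    gc()   ## explicitly trigger garbage collection so freed memory can be returned to the OS
}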

pavel-shliaha commented 9 years ago

Having the same problem again. The computer freezes after 5(!) synapter analyses. Should we report it to the R core team? Laurent, your thoughts please.

lgatto commented 9 years ago

On 13 August 2015 11:01:01 BST, pavel-shliaha notifications@github.com wrote:

Having the same problem again. The computer freezes after 5(!) synapter analyses. Should we report it to the R core team? Laurent, your thoughts please.

No, definitely not R core. I am currently travelling and will have a look next week.

Laurent

lgatto commented 8 years ago

I will investigate this. Could one of you point me to a set of files I can use to reproduce it? What version/branch has this been observed with?

sgibb commented 8 years ago

As noted above, I can't reproduce this. It was observed with branch 2.0. E.g. I tried it with the Kuharev data on molerat and it works fine (molerat:/disk1/sg777/synapter2paper/kuharev2015/synapter2).

lgatto commented 8 years ago

I will try it on a Windows machine too. Of course, if we can't reproduce, it will be difficult to do anything, but worth a try.

@pavel-shliaha could you try to check/set the memory limit on your computer - see ?memory.limit.
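
For reference, a minimal sketch of checking and raising the limit (Windows only; the 32000 MB value is just an example):

memory.limit()              ## report the current memory limit in MB
memory.limit(size = 32000)  ## raise the limit to ~32 GB (the limit cannot be lowered)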

pavel-shliaha commented 8 years ago

Let's keep this one open. Still seeing the problem.

pavel-shliaha commented 7 years ago

The problem has become more severe recently with synergise2. I cannot run synergise2 more than once before I get the following message:

Quitting from lines 147-147 (synergise2.Rmd)
Error: cannot allocate vector of size 2.5 Gb
In addition: Warning message:
In dir.create(outputdir) :
  'Y:\RAW\pvs22_QTOF_DATA_data3\synapter2paper\kuharev2015\synapter2_synergise\output\UDMSE\S130423_05' already exists

Judging by the Task Manager, R now takes 16(!) GB of memory. Interestingly, it was taking 20 GB when synergise2 threw the error message above, but gc() helped recover 4 GB, taking the total memory used down to 16 GB.

sgibb commented 7 years ago

Unfortunately I can't test this at the moment. My laptop doesn't have enough RAM and R was killed in the grid search (I am currently testing with a large swapfile on my external hard disk), and molerat doesn't provide an X11 interface, so the png generation stops the whole synergise2 run. But the pep3d file is 4.8-5.5 GB and the Spectrum.xml file 2-4 GB, so I am not surprised that synapter needs 16 GB, because we keep a lot of copies of the data in memory (e.g. MergedFeatures is a copy of some rows of IdentPeptideData/QuantPep3Data, the same for MatchedEMRTs and FragmentMatching, and the whole spectra/fragments of course ...). It is not the most efficient way of storing these data. E.g. MergedFeatures/MatchedEMRTs etc. could just contain indices into the original data in IdentPeptideData and QuantPep3Data, but that would require a complete rewrite of synapter.
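
To illustrate the index-based idea with a toy example (the data frame and variable names below are made up for the illustration, not actual synapter slots):

quantPep3D <- data.frame(mz = runif(1e6), rt = runif(1e6), intensity = runif(1e6))
hits <- sample(nrow(quantPep3D), 1e5)

matchedCopy <- quantPep3D[hits, ]   ## current style: a full copy of the matched rows
matchedIdx  <- hits                 ## index style: only the row numbers are kept

object.size(matchedCopy)   ## a few MB
object.size(matchedIdx)    ## a few hundred KB; rows are retrieved via quantPep3D[matchedIdx, ]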

pavel-shliaha commented 7 years ago

Hey Sebastian,

sorry for not explaining the problem clearer. The problem is not that an instance of running synapter takes up memory. That is something I would understand. The problem is that once synapter has run, it leaves something behind in memory, even if I call gc(). E.g. after synergise2 has been run once and gc() has been called, R still consumes 12 GB of memory!

So as a result the problem is not that I can't run synergise2 (or synapter2) at all; the problem is that I can only run it once, because every time I call synergise2 it leaves more and more behind that keeps consuming memory. So I have to close and restart R. When I was running with just synapter I had to close R every 4th-5th synapter analysis; now I have to close it after every synapter analysis.

sgibb commented 7 years ago

Ok, I indeed understood it wrong. Regarding multiple synapter runs I am afraid we have to live with the memory problem (at least under Windows). I am not very familiar with memory management and garbage collection in R on Windows. But according to (the quite old) answers on Stack Overflow we can't do anything but restart R:

(maybe R's memory management has improved in the last 6 years?)

pavel-shliaha commented 7 years ago

Can we at least make the problem less severe with synergise2?

I mean, the pipeline, when run command by command, runs out of memory much less severely than synergise.

sgibb commented 7 years ago

I ran the following example multiple times on Linux and there were no "out of memory" errors nor any difference in the resulting objects (regarding size and content):

library("synapter")

inlist <- list(
  identpeptide="fermentor_03_sample_01_HDMSE_01_IA_final_peptide.csv.gz",
  identfragments="fermentor_03_sample_01_HDMSE_01_IA_final_fragment.csv.gz",
  quantpeptide="fermentor_02_sample_01_HDMSE_01_IA_final_peptide.csv.gz",
  quantpep3d="fermentor_02_sample_01_HDMSE_01_Pep3DAMRT.csv.gz",
  quantspectra="fermentor_02_sample_01_HDMSE_01_Pep3D_Spectrum.xml.gz",
  fasta="S.cerevisiae_Uniprot_reference_canonical_18_03_14.fasta")

outputDir <- "output"
fdr <- 0.01
fdrMethod <- "BH"
fpr <- 0.01
peplen <- 1
missedCleavages <- 2
IisL <- TRUE
identppm <- 20
quantppm <- 20
uniquepep <- TRUE
span.rt <- 0.02
span.int <- 0.02
grid.ppm.from <- 2
grid.ppm.to <- 20
grid.ppm.by <- 2
grid.nsd.from <- 0.5
grid.nsd.to <- 5
grid.nsd.by <- 0.5
grid.imdiffs.from <- 0.6
grid.imdiffs.to <- 1.6
grid.imdiffs.by <- 0.2
grid.subset <- 1
grid.param.sel <- "auto"
fm.ppm <- 25
fm.ident.minIntensity <- 0
fm.quant.minIntensity <- 0
fm.minCommon <- 1
fm.fdr.nonunique <- 0.05
mergedEMRTs <- "rescue"

(system.time({
syn2 <- synergise2(filenames=inlist,
                   outputdir=file.path(outputDir, "synergise2"),
                   fdr=fdr,
                   fdrMethod= fdrMethod,
                   fpr=fpr,
                   peplen=peplen,
                   missedCleavages=missedCleavages,
                   IisL=IisL,
                   identppm=identppm,
                   quantppm=quantppm,
                   uniquepep=uniquepep,
                   span.rt=span.rt, span.int=span.int,
                   grid.ppm.from=grid.ppm.from, grid.ppm.to=grid.ppm.to,
                   grid.ppm.by=grid.ppm.by,
                   grid.nsd.from=grid.nsd.from, grid.nsd.to=grid.nsd.to,
                   grid.nsd.by=grid.nsd.by,
                   grid.imdiffs.from=grid.imdiffs.from,
                   grid.imdiffs.to=grid.imdiffs.to,
                   grid.imdiffs.by=grid.imdiffs.by,
                   grid.subset=grid.subset,
                   grid.param.sel=grid.param.sel,
                   fm.ppm=fm.ppm,
                   fm.ident.minIntensity=fm.ident.minIntensity,
                   fm.quant.minIntensity=fm.quant.minIntensity,
                   fm.minCommon=fm.minCommon,
                   fm.fdr.nonunique=fm.fdr.nonunique,
                   mergedEMRTs=mergedEMRTs)
}))

(system.time({
expl <- Synapter(inlist)

filterUniqueDbPeptides(expl,
                       missedCleavages=missedCleavages,
                       IisL=IisL)
filterPeptideLength(expl, l=peplen)
filterQuantPepScore(expl, method=fdrMethod, fdr=fdr)
filterIdentPepScore(expl, method=fdrMethod, fdr=fdr)
filterQuantPpmError(expl, ppm=quantppm)
filterIdentPpmError(expl, ppm=identppm)
filterIdentProtFpr(expl, fpr=fpr)
filterQuantProtFpr(expl, fpr=fpr)

mergePeptides(expl)
setLowessSpan(expl, span.rt)
modelRt(expl)

searchGrid(expl,
           imdiffs=seq(grid.imdiffs.from, grid.imdiffs.to, grid.imdiffs.by),
           ppms=seq(grid.ppm.from, grid.ppm.to, grid.ppm.by),
           nsds=seq(grid.nsd.from, grid.nsd.to, grid.nsd.by))
setBestGridParams(expl, what=grid.param.sel)
findEMRTs(expl)

filterFragments(expl, what="fragments.ident", minIntensity=fm.ident.minIntensity)
filterFragments(expl, what="spectra.quant", minIntensity=fm.quant.minIntensity)
fragmentMatching(expl, ppm=fm.ppm)

fragmentMatchingStats <- fragmentMatchingPerformance(expl, what="non-unique")
sel <- which(fragmentMatchingStats[, "fdr"] < fm.fdr.nonunique)
nonUniqueThreshold <- min(fragmentMatchingStats[sel, "deltacommon"])

filterUniqueMatches(expl, minNumber=1)
filterNonUniqueMatches(expl, minDelta=nonUniqueThreshold)
filterNonUniqueIdentMatches(expl)

rescueEMRTs(expl, method="rescue")
setLowessSpan(expl, span.int)
modelIntensity(expl)

writeMergedPeptides(expl, file=file.path(outputDir, "explicit", "MergedPeptides.csv"))
writeMatchedEMRTs(expl, file=file.path(outputDir, "explicit", "MatchedPeptides.csv"))
writeIdentPeptides(expl, file=file.path(outputDir, "explicit", "IdentPeptides.csv"))
writeQuantPeptides(expl, file=file.path(outputDir, "explicit", "QuantPeptides.csv"))
saveRDS(expl, file=file.path(outputDir, "explicit", "SynapterObject.rds"))
}))

## remove processingData to allow comparison
syn2$IdentFragmentData@processingData <- new("MSnProcess")
expl$IdentFragmentData@processingData <- new("MSnProcess")
syn2$QuantSpectrumData@processingData <- new("MSnProcess")
expl$QuantSpectrumData@processingData <- new("MSnProcess")

## remove log
syn2$SynapterLog <- character()
expl$SynapterLog <- character()

all.equal(syn2, expl)

pavel-shliaha commented 7 years ago

Let me try to run it on Windows and I will report back to you.

sgibb commented 7 years ago

Ok, I ran the whole analysis on molerat now (unfortunately with IisL = TRUE, so I have to rerun it). Please note that I never call rm(...) or gc() manually. I could run all samples without restarting R and the whole process needs ~20 GB RAM (during the RDS export it sometimes uses up to 25 GB; blue and green lines represent the RAM usage, the purple line CPU usage, red vertical lines are synapter events):

[plot: RAM/CPU usage over the whole run]

So at least on Linux there seems to be no problem with accumulating memory usage. I will close this with the label wontfix if nobody has a different opinion/idea.

I used the following scripts: https://gist.github.com/sgibb/5bb4625364f076cc5ca0c4bbb57d630c

pavel-shliaha commented 7 years ago

I posted a question about the memory consumption on Stack Overflow. Basically I wanted to know if there is a function in R that will open a new R session, execute functions inside it and then close it, which would solve our problem:

http://stackoverflow.com/questions/41791788/problems-with-r-memory-management-in-windows-can-i-restart-r-within-a-loop

It seems I am not the first to ask. Are you familiar with this function?

makeActiveBinding("refresh", function() { shell("Rgui"); q("no") }, .GlobalEnv)

sgibb commented 7 years ago

As I mentioned a few posts above, the question has been asked many times on SO. This refresh function just restarts R and closes the current session. This won't help. In general: could it be that your Windows machine simply does not have enough memory? As you can see from my plot, you would need at least 25 GB (with the Windows GUI/RStudio and all this stuff, 32 GB would be better).

pavel-shliaha commented 7 years ago

Yes, but that's what I need: 25 GB. I mean I can easily get through 1 analysis. Please note that I need to go through each of the 10 analyses of Kuharev's dataset individually by closing R after every analysis, e.g.

Pavel runs analysis 1
Pavel closes R
Pavel runs analysis 2
Pavel closes R
....

this works.

What does not work is

for (i in 1:10) {
    Pavel runs analysis 1
    Pavel runs analysis 2
    error....
}

so there must be a way I can automate the first process and just ask R to close and reopen within a loop to clear memory in between, e.g.

for (i in 1:10) {
    Pavel runs analysis[i]
    R is restarted and memory gets cleaned
}

Isn't this what those guys are discussing on Stack Overflow under the link I provided?

sgibb commented 7 years ago

I can't test it on Windows, but something like the following could work (please note that you have to fill in the ... with the quantpep3d, spectrum, fragment and fasta files):

identfiles <- c("identfile1.csv", "identfile2.csv")
quantfiles <- c("quantfile1.csv", "quantfile2.csv")
...
outputdir <- c("outputdir1", "outputdir2")

for (i in seq(along = identfiles)) {
  system(sprintf("R -e 'library(synapter); synergise2(filenames=list(identpeptide=\"%s\", quantpeptide=\"%s\", ...), outputdir=\"%s\")'",
                 identfiles[i], quantfiles[i], ..., outputdir[i]))
}

However that is not very clean and we could just run the analysis on molerat where it is working!
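
An alternative not discussed in this thread would be the callr package, which runs a function in a fresh R session and therefore releases all memory when that session exits. A rough sketch, assuming the same file vectors as above and that the remaining files still need to be filled in:

library("callr")

for (i in seq_along(identfiles)) {
  callr::r(function(ident, quant, out) {
    library("synapter")
    synergise2(filenames = list(identpeptide = ident,
                                quantpeptide = quant),  ## add quantpep3d, spectra and fasta here
               outputdir = out)
  }, args = list(identfiles[i], quantfiles[i], outputdir[i]))
}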

pavel-shliaha commented 7 years ago

1) Thanks for the code, I'll try it tomorrow.
2) Sorry guys to be a pain, but I strongly believe that if there is a solution for Windows we have to implement it in the package itself (perhaps as an argument to the function, or as a very explicit explanation in the vignette?). I mean, the vast majority of users are Windows users and it's kind of unfair to release something that works only under Linux. Please let me know what you think. Bottom line: I think "works under Windows" really beats "clean" in this particular case...

lgatto commented 7 years ago

@pavel-shliaha what are the specs of your machine (in particular RAM) and the output of sessionInfo()? Please post them as an issue.

lgatto commented 7 years ago

FYI, I am running @sgibb's scripts (from molerat) on a Windows computer, to try to reproduce the problems.

pavel-shliaha commented 7 years ago

Thanks, Laurent! Hopefully you can make this work so we have a solution for Windows.

lgatto commented 7 years ago

I have executed the molerat:/disk1/sg777/synapter2paper/kuharev2015/synapter2 scripts provided by @sgibb on a Windows computer. I successfully ran the synapter1 and synapter2 analyses for UDMSE. Should I proceed with more?

pavel-shliaha commented 7 years ago

Could you please elaborate on the configuration of the Windows machine? In particular, how much memory did it have?

lgatto commented 7 years ago

64 bit OS, 32 GB RAM.

pavel-shliaha commented 7 years ago

Did you perhaps monitor the memory consumption? If so, did you observe a steady increase in consumed memory, or did it peak as shown above in @sgibb's investigation and remain flat afterwards?

lgatto commented 7 years ago

I didn't monitor RAM usage; it's a shared computer and it's painful enough already to use it occasionally. The memory went up quite a bit and most of it was used by R. I could re-run it more systematically (I ran the different analyses one by one and inadvertently restarted R at one point), but it would be useful to know exactly which ones seem to be the issue. Also, I used a recent R and updated the packages before running the scripts.
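
If a more systematic re-run happens, a minimal base-R way to record memory usage after each analysis could look like the following (just a sketch, not part of @sgibb's scripts; identfiles stands for whatever drives the loop):

memLog <- numeric()
for (i in seq_along(identfiles)) {
  ## ... run analysis i here ...
  memLog[i] <- sum(gc()[, 2])   ## total MB reported as used after a garbage collection
}
memLog   ## a steadily growing series would indicate accumulating memory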