nanxstats / Rcpi

💊 Molecular informatics toolkit with integration of bioinformatics and cheminformatics tools for drug discovery
https://nanx.me/Rcpi/
Artistic License 2.0

extractDrugOBFP4 or similar, consumes too much RAM #16

Closed tcaceresm closed 4 months ago

tcaceresm commented 5 months ago

Hi there, I'm trying to calculate fingerprints for ~50,000 molecules. However, RAM usage increases steadily until memory is completely exhausted. I don't understand how this is possible, given that the matrix extractDrugOBFP4 uses to store the fingerprints is preallocated with the correct dimensions. I checked, and the size of the matrix stays constant (~1.6 GB) in every iteration, yet the RAM used by the R session keeps growing as the loop continues. Furthermore, the process is sequential, molecule by molecule, which should not increase RAM usage.

Below is the code of the function, with the section that increases RAM usage marked. I know this is the problematic section because when I replace it with an arbitrary vector of length 512 (instead of the fingerprint returned by ChemmineOB), the process does not consume more RAM. I'm not an R expert, so any help would be useful. Thanks, and sorry about my English.

function (molecules, type = c("smile", "sdf")) 
{
  check_ob()
  if (type == "smile") {
    if (length(molecules) == 1L) {
      molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', molecules, identity)"))
      fp = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
    }
    else if (length(molecules) > 1L) {
      fp = matrix(0L, nrow = length(molecules), ncol = 512L)
      for (i in 1:length(molecules)) {
        molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', molecules[i], identity)"))
###########################################################
####### This is the step which increases RAM usage in each loop step
        fp[i, ] = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
###########################################################
      }
    }
  }
  else if (type == "sdf") {
    smi = eval(parse(text = "ChemmineOB::convertFormat(from = 'SDF', to = 'SMILES', source = molecules)"))
    smiclean = strsplit(smi, "\\t.*?\\n")[[1]]
    if (length(smiclean) == 1L) {
      molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', smiclean, identity)"))
      fp = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
    }
    else if (length(smiclean) > 1L) {
      fp = matrix(0L, nrow = length(smiclean), ncol = 512L)
      for (i in 1:length(smiclean)) {
        molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', smiclean[i], identity)"))
        fp[i, ] = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
      }
    }
  }
  else {
    stop("Molecule type must be \"smile\" or \"sdf\"")
  }
  return(fp)
}

nanxstats commented 5 months ago

@tcaceresm Great question. I'm not exactly sure where the problem comes from, but my educated guess is that it sits under the hood in ChemmineOB or Open Babel, possibly a memory leak.
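One rough way to probe the leak hypothesis, using only base R: `gc()` accounts for memory managed by R itself, not memory allocated by compiled code such as Open Babel. If the R heap stays flat across iterations while the process memory reported by the operating system keeps growing, the growth is probably happening in native code. The helper names below (`r_heap_mb`, `trace_r_heap`) are hypothetical and not part of Rcpi; this is a sketch, not a definitive diagnostic.

```r
# Hypothetical helper (not part of Rcpi): memory used by R objects, in MB,
# measured right after a full garbage collection.
r_heap_mb <- function() {
  g <- gc(full = TRUE)
  sum(g[, 2L])  # second column: MB currently used by Ncells and Vcells
}

# Hypothetical helper: call `step(i)` for i = 1..n and record the R heap size
# before the loop and after every iteration.
trace_r_heap <- function(step, n) {
  usage <- numeric(n + 1L)
  usage[1L] <- r_heap_mb()
  for (i in seq_len(n)) {
    step(i)
    usage[i + 1L] <- r_heap_mb()
  }
  usage
}
```

For example, `step` could compute the fingerprint of the i-th molecule for a few hundred molecules. A full garbage collection per iteration is slow, so this is only suitable for a short test run.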

From the R perspective, I have an alternative solution. See this blog post on using callr to create more robust wrappers for similar batch processing tasks: https://nanx.me/blog/post/disposable-computing-with-callr/.

In brief, split the fingerprint calculations into much smaller batches, and use callr to launch a new, separate R process for each batch. Each process exits when its batch is done, so any memory it accumulated is returned to the operating system. This can potentially eliminate runtime issues related to low-level code.
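The batching approach described above could look roughly like the sketch below. It assumes the callr, Rcpi, and ChemmineOB packages are installed and that the input is a character vector of SMILES strings. The helper names (`split_batches`, `fp4_in_batches`) and the default batch size of 500 are illustrative assumptions, not part of Rcpi or of the linked blog post.

```r
# Hypothetical helper: split a vector into consecutive batches of at most
# `batch_size` elements.
split_batches <- function(x, batch_size = 500L) {
  stopifnot(batch_size >= 1L)
  if (length(x) == 0L) return(list())
  unname(split(x, ceiling(seq_along(x) / batch_size)))
}

# Hypothetical wrapper: compute FP4 fingerprints batch by batch, running each
# batch in a fresh R process via callr::r(). The child process exits after
# its batch, so memory held by low-level code is released with it.
fp4_in_batches <- function(smiles, batch_size = 500L) {
  batches <- split_batches(smiles, batch_size)
  results <- lapply(batches, function(batch) {
    callr::r(
      # This function runs in a clean session, so it must be self-contained
      # and refer to packages with the `pkg::` prefix.
      function(mols) Rcpi::extractDrugOBFP4(mols, type = "smile"),
      args = list(mols = batch)
    )
  })
  # Stack the per-batch results into one matrix with 512 columns.
  do.call(rbind, results)
}
```

A call would then look like `fp <- fp4_in_batches(smiles, batch_size = 500L)`, where `smiles` is the user's own vector. Smaller batches cap memory growth more tightly, but each one pays the startup cost of a new R process, so the batch size is a trade-off worth tuning.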

tcaceresm commented 4 months ago

Yup, the solution was to split up the fingerprint calculations! Thanks, I'll close this issue.