girke-lab / ChemmineOB

OpenBabel wrapper package for R

Possible RAM leakage in fingerprint calculations using fingerprint_OB() function #39

Open tcaceresm opened 3 months ago

tcaceresm commented 3 months ago

Hi there, I opened an issue in the Rcpi package, but I think it's more appropriate to open one here because Rcpi relies on ChemmineOB. I have ~50k molecules for which I want to calculate fingerprints. I use a function from the Rcpi package (see code below), which creates a matrix to store the fingerprints, then iterates over the molecules and calculates each fingerprint using ChemmineOB::fingerprint_OB. However, RAM usage increases on every loop iteration, even though the size of the matrix is constant. I noticed that it is the memory of the R session that grows, not the size of the object. I calculated the fingerprints using the Open Babel CLI, and it runs smoothly. Thanks, and sorry about my English.

function (molecules, type = c("smile", "sdf")) {
  if (type == "smile") {
    if (length(molecules) == 1L) {
      molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', molecules, identity)"))
      fp = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
    }
    else if (length(molecules) > 1L) {
      fp = matrix(0L, nrow = length(molecules), ncol = 512L)
      for (i in 1:length(molecules)) {
        molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', molecules[i], identity)"))
        # This is the step that increases RAM usage on each loop iteration
        fp[i, ] = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
      }
    }
  }
  else if (type == "sdf") {
    smi = eval(parse(text = "ChemmineOB::convertFormat(from = 'SDF', to = 'SMILES', source = molecules)"))
    smiclean = strsplit(smi, "\\t.*?\\n")[[1]]
    if (length(smiclean) == 1L) {
      molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', smiclean, identity)"))
      fp = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
    }
    else if (length(smiclean) > 1L) {
      fp = matrix(0L, nrow = length(smiclean), ncol = 512L)
      for (i in 1:length(smiclean)) {
        molRefs = eval(parse(text = "ChemmineOB::forEachMol('SMILES', smiclean[i], identity)"))
        fp[i, ] = eval(parse(text = "ChemmineOB::fingerprint_OB(molRefs, 'FP4')"))
      }
    }
  }
  else {
    stop("Molecule type must be \"smile\" or \"sdf\"")
  }
  return(fp)
}
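In case it helps to narrow this down, here is a rough sketch of how the observation "session memory grows, object size does not" could be measured. It logs the R heap size (from `gc()`) next to the resident memory of the whole R process every few iterations. If the process memory climbs while the R heap stays flat, that would point at memory held outside R's garbage collector. The helper names `process_rss_mb` and `track_memory` are made up for this example and are not part of ChemmineOB or Rcpi. The process measurement assumes a Unix-like system with the `ps` utility and returns `NA` otherwise.

```r
# Hypothetical helper: resident set size of this R process in MB,
# read from the Unix `ps` utility (which reports RSS in kilobytes).
# Returns NA where `ps` is unavailable (for example on Windows).
process_rss_mb <- function() {
  out <- tryCatch(
    suppressWarnings(
      system2("ps", c("-o", "rss=", "-p", Sys.getpid()),
              stdout = TRUE, stderr = FALSE)
    ),
    error = function(e) NA_character_
  )
  if (length(out) == 0L) return(NA_real_)
  suppressWarnings(as.numeric(trimws(out[1]))) / 1024
}

# Hypothetical helper: call f() on each input and, every `every`
# iterations, record the R heap in use and the process RSS.
track_memory <- function(inputs, f, every = 100L) {
  log <- list()
  for (i in seq_along(inputs)) {
    f(inputs[[i]])
    if (i %% every == 0L) {
      log[[length(log) + 1L]] <- data.frame(
        iteration = i,
        r_heap_mb = sum(gc()[, 2]),  # column 2 of gc() is memory used, in Mb
        rss_mb    = process_rss_mb()
      )
    }
  }
  do.call(rbind, log)
}

# Stand-in workload so the sketch runs without ChemmineOB installed.
mem <- track_memory(as.list(1:300), function(x) sqrt(x), every = 100L)
print(mem)

# With ChemmineOB installed, the loop body from the function above
# could be passed as f instead, for example:
# f <- function(smi) {
#   ref <- ChemmineOB::forEachMol("SMILES", smi, identity)
#   ChemmineOB::fingerprint_OB(ref, "FP4")
# }
```

Calling `gc()` on every iteration would be slow for ~50k molecules, which is why the sketch only samples every `every` iterations.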