Revised autocorrelation descriptor calculator function returns a different value on each individual run

Pratham-Deshmukh commented 3 years ago

Hello, I'm using molsimplify to calculate revised autocorrelations by using SMILES as input . But I noticed that for every individual run the function returns different values for some properties. Following is the code and images showing the varying values that are returned. How can this problem be solved ? I wish to continue using SMILES as input so a solution that uses SMILES will be preferred over any other workarounds. Thanks .

def dataset_generator(input_data):
  descriptor_dataset=pd.DataFrame()
  for smiles in input_data:
    descriptor_set=pd.DataFrame()
    complex_mol=mol3D()
    complex_mol.read_smiles(smiles=smiles,ff='uff')
    results_dictionary=Informatics.lacRACAssemble.generate_metal_autocorrelations(complex_mol,depth=8,loud=False)
    results=results_dictionary["results"]
    results=np.reshape(results,(1,-1))
    results=pd.DataFrame(results)

    for i in range(len(results)):
      descriptor_set=pd.concat([descriptor_set,results.iloc[i,:]],axis=0,ignore_index=True)
    descriptor_dataset=pd.concat([descriptor_dataset,descriptor_set],axis=1)

  descriptor_dataset=descriptor_dataset.T
  print(descriptor_dataset)
  scaler=StandardScaler()
  scale=scaler.fit(descriptor_dataset)
  normalized_data=scale.transform(descriptor_dataset)
  normalized_data=pd.DataFrame(normalized_data)
  return normalized_data

Data-1 Data-2

adityanandy commented 3 years ago

Hi Pratham,

This is an artifact of building from SMILES within openbabel. Unfortunately, openbabel does not save the connectivity from SMILES, and rather interprets the connectivity of the molecule after building the 3D coordinates. For this reason, sometimes the molecular graph is incorrect (because it's based on the interpreted connectivity), leading to an incorrect molecular graph.

Since you are interested in using SMILES, I recommend the following workaround, using the pysmiles package (see here: https://stackoverflow.com/questions/57062757/how-to-generate-a-graph-from-a-smiles-molecule-representation). From pysmiles, you can get the graph you want, interpreted directly from SMILES. Code example below, with smiles_string being your smiles:

from pysmiles import read_smiles
import networkx as nx
molecule_pysmiles = read_smiles(smiles_string,explicit_hydrogen=True)
adjacency_matrix = nx.to_numpy_matrix(molecule_pysmiles)
complex_mol.graph = adjacency_matrix

The above code reassigns the molecular graph to be what is directly encoded in smiles. I highly recommend checking this graph (at least for one molecule) by hand to make sure things are correct. Once this is done, your graph will be the same each time, as will the RACs.

Please feel free to follow up if you have more questions.

Pratham-Deshmukh commented 3 years ago

Thank you for the response Aditya. The solution does interpret the SMILES correctly. Could you please tell me how I could use the graph to calculate RACs ? I tried using the "complex_mol.graph" above as input but the RAC function returns "Error, no metal found in mol object!" .

hjkgrp / molSimplify

Revised autocorrelation descriptor calculator function returns a different value on each individual run #39