Apply the get_most_stable_tautomer function to pandas dataframe (from csv)

isayevlab / Auto3D_pkg

Auto3D generates low-energy conformers from SMILES/SDF

MIT License


Apply the get_most_stable_tautomer function to pandas dataframe (from csv) #71

Open mmagithub opened 6 months ago

mmagithub commented 6 months ago

Hi,

I am wondering if there is a straightforward way to use the get_most_stable_tautomer function as a standalone rdkit function that can be applied to a SMILES column in a pandas dataframe, for example:

df['most_stable_taut'] = df['smiles'].apply(get_most_stable_tautomer)

It seems the function currently requires a strict input file format (smi / sdf), and I cannot find a typical use case that covers what I am looking for.

Any clue?

Thanks, M

LiuCMU commented 6 months ago

I guess you meant get_stable_tautomers. The initial design takes a smi/sdf file as the input file because auto3d can process them in parallel. This will be much more efficient than processing a molecule at a time.

It is indeed a good idea to make it applicable to a pandas dataframe. The only concern is whether the function is fast enough. For each tautomer, auto3d needs to search, optimize and compare conformers, which is a relatively expensive process to run on one molecule at a time.

mmagithub commented 6 months ago

Thanks for the answer, this repo is amazing. Considering pandas' powerful capabilities for handling and manipulating large datasets, including features that support parallel processing, I believe a tailored solution might be within reach. Specifically, a wrapper function for the (get_stable_tautomers) code designed to work directly on the SMILES column of a pandas DataFrame could significantly streamline the process. This function would essentially bridge the gap between pandas' data management and auto3d's parallel processing capabilities. I am wondering if you have a code snippet to do that.

LiuCMU commented 6 months ago

Thanks for the encouragement!

When applying a function to a specific column of a data frame, for example df[col].apply(func), the function func itself takes a single element of the column as its input. The data frame then calls the function on every element in the column, one element at a time. So the function itself has no parallel feature; the data frame only manages the iteration over the column.
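To make the point about element-wise calls concrete, here is a minimal sketch. The function `func` and the column name `smiles` are placeholders for illustration, not part of auto3d; the function only records what it receives so the per-element calls become visible.

```python
import pandas as pd

# Record every argument the function receives, to show that
# Series.apply hands it one element at a time.
calls = []


def func(smiles: str) -> int:
    """Toy per-molecule function: returns the length of the SMILES string."""
    calls.append(smiles)
    return len(smiles)


df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})
df["n_chars"] = df["smiles"].apply(func)

# The function saw each SMILES separately, never the whole column.
print(calls)
print(df)
```

An expensive per-molecule function used this way would run once per row, with no opportunity to batch work across rows.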

If get_stable_tautomers is to be applied to a column of a data frame, then the input should be a single SMILES. However, get_stable_tautomers is more efficient when processing a batch of SMILES together so that it can optimize all conformers at the same time. That's why this function takes a smi file of molecules.

In short, going through the data frame's element-wise apply would disable the parallel capacity of auto3d. I'm not sure if there is a way to work around this. If so, I'd love to learn about it :-)
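One possible workaround, sketched here under assumptions rather than offered as a tested auto3d recipe, is to skip apply altogether: write the whole SMILES column to a temporary smi file, run the batch function once on that file, and map the results back to the rows by molecule id. In the sketch below, `batch_fn` is a hypothetical stand-in for a wrapper around get_stable_tautomers that returns a dict from molecule id to SMILES. How to build the auto3d arguments and parse its output is left out on purpose, and `echo_batch` is only a placeholder so the example runs without auto3d installed.

```python
import os
import tempfile
from typing import Callable, Dict, List

import pandas as pd


def write_smi(df: pd.DataFrame, smiles_col: str, path: str) -> List[str]:
    """Write one 'SMILES id' line per row and return the ids in row order."""
    ids = [f"mol{i}" for i in range(len(df))]
    with open(path, "w") as fh:
        for smi, mol_id in zip(df[smiles_col], ids):
            fh.write(f"{smi} {mol_id}\n")
    return ids


def apply_batch(
    df: pd.DataFrame,
    smiles_col: str,
    batch_fn: Callable[[str], Dict[str, str]],
    out_col: str = "most_stable_taut",
) -> pd.DataFrame:
    """Run batch_fn once on the whole column and attach the results."""
    with tempfile.TemporaryDirectory() as tmp:
        smi_path = os.path.join(tmp, "input.smi")
        ids = write_smi(df, smiles_col, smi_path)
        # Single call for all molecules, so the batch function can
        # process them together instead of row by row.
        results = batch_fn(smi_path)
    out = df.copy()
    # Rows with no result get None.
    out[out_col] = [results.get(mol_id) for mol_id in ids]
    return out


def echo_batch(smi_path: str) -> Dict[str, str]:
    """Placeholder batch function (assumption): returns each input SMILES
    unchanged. A real version would call auto3d here and parse its output."""
    results = {}
    with open(smi_path) as fh:
        for line in fh:
            smi, mol_id = line.split()
            results[mol_id] = smi
    return results


df = pd.DataFrame({"smiles": ["CCO", "Oc1ccccn1"]})
result = apply_batch(df, "smiles", echo_batch)
print(result)
```

The design choice is that the data frame is used only for bookkeeping (ids in, results out), while the expensive step still sees the full batch in one call.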