Open mmagithub opened 6 months ago
I guess you meant get_stable_tautomers. The initial design takes an smi/sdf file as the input because auto3d can process the molecules in it in parallel. This is much more efficient than processing one molecule at a time.
It is indeed a good idea to make it applicable to a pandas dataframe. The only concern is whether the function is fast enough. For each tautomer, auto3d needs to search, optimize and compare conformers, which is a relatively expensive process to broadcast to individual molecules.
Thanks for the answer, this repo is amazing. Considering pandas' powerful capabilities for handling and manipulating large datasets, including features that support parallel processing, I believe a tailored solution might be within reach. Specifically, a wrapper function for the (get_stable_tautomers) code designed to work directly on the SMILES column of a pandas DataFrame could significantly streamline the process. This function would essentially bridge the gap between pandas' data management and auto3d's parallel processing capabilities. I am wondering if you have a code snippet to do that.
Thanks for the encouragement!
When applying a function to a specific column of a data frame, for example df[col].apply(func), the function func itself takes a single element of the column as the input. The data frame then calls the function once for every element in the column. So the function itself does not have a parallel feature; any efficiency comes from how the data frame manages the data.
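To make the element-by-element behaviour concrete, here is a minimal sketch. The column name, the example SMILES and the toy function func are made up for illustration; func only records what it receives and returns the string length, so it stands in for any per-molecule function.

```python
import pandas as pd

calls = []  # records every argument that func receives


def func(smiles):
    """Toy per-element function: sees one SMILES string at a time."""
    calls.append(smiles)
    return len(smiles)


df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})

# Series.apply hands func a single element per call, never the whole column.
df["n_chars"] = df["smiles"].apply(func)

print(calls)
print(df)
```

Since func never sees more than one SMILES per call, a function that relies on batching has nothing to batch in this setting.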
If get_stable_tautomers is to be applied to a column of a data frame, then the input should be a single SMILES. However, get_stable_tautomers is more efficient when processing a batch of SMILES together so that it can optimize all conformers at the same time. That's why this function takes an smi file of molecules.
In short, using the data frame's data management would disable the parallel capability of auto3d. I'm not sure if there is a way to work around this. If so, I'd love to learn about it :-)
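One possible way to keep the batch behaviour is to skip apply entirely: write the whole SMILES column to a single smi file, run the batch function once, and map the results back by molecule ID. The sketch below only shows that plumbing. The names batch_over_smiles and echo_batch are hypothetical, the "SMILES ID" line layout of the smi file is an assumption, and batch_func is a placeholder for a user-written wrapper around get_stable_tautomers that returns a dict from ID to result; that wrapper is not shown because it depends on how auto3d writes its output.

```python
import os
import tempfile
from typing import Callable, Dict, List, Optional


def batch_over_smiles(
    smiles: List[str],
    batch_func: Callable[[str], Dict[str, str]],
) -> List[Optional[str]]:
    """Write all SMILES to one .smi file, call batch_func once, and
    return the results in the same order as the input list."""
    ids = [f"mol{i}" for i in range(len(smiles))]
    with tempfile.TemporaryDirectory() as tmp:
        smi_path = os.path.join(tmp, "input.smi")
        with open(smi_path, "w") as fh:
            for smi, mol_id in zip(smiles, ids):
                # Assumed layout: one "SMILES ID" pair per line.
                fh.write(f"{smi} {mol_id}\n")
        # Single call for the whole batch, so the batch function can
        # process all molecules together.
        results = batch_func(smi_path)
    # Molecules missing from the results are returned as None.
    return [results.get(mol_id) for mol_id in ids]


def echo_batch(smi_path: str) -> Dict[str, str]:
    """Stand-in for the real batch call: returns each SMILES unchanged."""
    out = {}
    with open(smi_path) as fh:
        for line in fh:
            smi, mol_id = line.split()
            out[mol_id] = smi
    return out


# With a data frame, the column would be passed as a list, e.g.
# df["most_stable_taut"] = batch_over_smiles(df["smiles"].tolist(), my_batch_func)
print(batch_over_smiles(["CCO", "CC(=O)O"], echo_batch))
```

The data frame is then only used to hold the input and the output, while the expensive step still runs as a single batch.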
Hi,
I am wondering if there is a straightforward way to use the get_most_stable_tautomer function as a standalone rdkit function that can be applied to a SMILES column in a pandas dataframe, for example:
df['most_stable_taut'] = df['smiles'].apply(get_most_stable_tautomer)
It seems the current use of the function requires a strict input file format (smiles / sdf), and I cannot find a typical use case to satisfy what I am looking for.
Any clue?
Thanks, M