volkamerlab / teachopencadd

TeachOpenCADD: a teaching platform for computer-aided drug design (CADD) using open source packages and data
https://projects.volkamerlab.org/teachopencadd
Creative Commons Attribution 4.0 International

T018: Deprecation warnings and images not showing #334

Closed mbackenkoehler closed 1 year ago

mbackenkoehler commented 1 year ago

Currently, there are several:

AndreaVolkamer commented 1 year ago

Adding this line after generating the molecule column should fix it: PandasTools.RenderImagesInAllDataFrames(True)

mbackenkoehler commented 1 year ago

See PR #330, for example

mbackenkoehler commented 1 year ago

"Adding this line after generating the molecule column should fix it: PandasTools.RenderImagesInAllDataFrames(True)"

This does not fix the error from the screenshot, unfortunately.

AAriam commented 1 year ago

@AndreaVolkamer @mbackenkoehler

I submitted a pull request (#350) that solves the problems in T018. However, there are still some points to discuss, which are also important for other talktorials. One of the problems was the rendering of RDKit structures inside Pandas DataFrames, and all other problems were deprecation warnings. Both are discussed separately below.

Rendering RDKit structures in Pandas

TLDR

The problem arises from the RDKit version pinned in the TeachOpenCADD environment (2021.09.5), and calling PandasTools.RenderImagesInAllDataFrames(True), as suggested above, does not change it (see code snippets below). This is not unique to T018 and will happen in all talktorials. The only solution I could find that fixes the problem in all situations is to remove the pin from the TeachOpenCADD environment (at devtools/test_env.yml), which updates RDKit to version 2022.09.1 (note that this is the latest version compatible with the other pinned packages in the environment, not the latest RDKit version, which is now 2023.03.1). Once RDKit is updated, there is no longer any need to set PandasTools.RenderImagesInAllDataFrames at all.

Problem Description

There are two factors contributing to this problem in T018; one factor is unique to T018, and the other one affects all TeachOpenCADD notebooks.

The one unique to T018 is straightforward and discussed first:

Unique problem in T018

Earlier RDKit versions were able to render Mol structures even in the index columns of a Pandas DataFrame. However, this does not work in later RDKit versions. For example:

from IPython.display import display
import rdkit
import pandas as pd
from rdkit.Chem import PandasTools

PandasTools.RenderImagesInAllDataFrames(True)

mol = rdkit.Chem.MolFromSmiles("c1ccccc1")

# Create dataframe containing `Mol` objects
df = pd.DataFrame({"mol":[mol]})
display(df)  # Dataframe displays correctly

# Put Mol object in index
display(
    pd.DataFrame({"col1":[1]}, index=df.mol)
)  # Doesn't display correctly

# Another way of putting Mol in an index column
display(
    pd.concat({df.loc[0, "mol"]: pd.DataFrame({"a":[1,2,3],"b":[3,4,4]})}, names=["Structure"])
)  # Doesn't display correctly

Output: Screenshot 2023-05-06 at 18 33 23

Assuming there won't be a fix anytime soon, I simply solved this in the PR by moving the structure into a normal column.
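A minimal sketch of that workaround with plain pandas, assuming the structure sits in an index level named "Structure" as in the snippet above. The SMILES string is only a stand-in for the Mol object so that the example runs without RDKit, and this is not necessarily the exact change made in the PR:

```python
import pandas as pd

# Stand-in for an RDKit Mol object (assumption: any hashable object behaves the same here)
structure = "c1ccccc1"

values = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 4]})

# Structure ends up in the outer index level, which is the case that fails to render
indexed = pd.concat({structure: values}, names=["Structure"])

# Move the "Structure" level out of the index into a normal column
flat = indexed.reset_index(level="Structure")
print(flat)
```

With real Mol objects, the "Structure" column would then be rendered like any other Mol column.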

Bug in pinned RDKit version (2021.09.5)

The main problem affecting all talktorials is a bit more complex and requires further work:

There is a bug in the RDKit version pinned in the current TeachOpenCADD environment (2021.09.5) that affects the rendering of structures in dataframes. The bug occurs regardless of the method used for rendering (i.e. even when PandasTools.RenderImagesInAllDataFrames(True) is called as suggested above). The good news is that this bug is already fixed in later RDKit versions. However, picking up the fix requires unpinning the RDKit version in the TeachOpenCADD environment, which in turn requires re-checking all other talktorials for issues that might arise from doing so.
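To make the version dependence explicit, a notebook could check the installed RDKit release before relying on structure rendering. The helper names below (parse_release, rendering_status) are hypothetical, and the thresholds only encode what is reported in this thread: 2021.09.5 shows the bug and 2022.09.1 does not. Other older releases were not tested here, so the sketch reports them as unknown:

```python
import re


def parse_release(version):
    """Turn a release string such as '2021.09.5' into a tuple like (2021, 9, 5)."""
    parts = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Values taken from this thread, not from the RDKit release notes
KNOWN_AFFECTED = parse_release("2021.09.5")
KNOWN_WORKING = parse_release("2022.09.1")


def rendering_status(version):
    """Classify a release based only on the two versions discussed in this thread."""
    release = parse_release(version)
    if release >= KNOWN_WORKING:
        return "expected to work"
    if release == KNOWN_AFFECTED:
        return "affected"
    return "unknown"


for candidate in ("2021.09.5", "2022.09.1", "2023.03.1"):
    print(candidate, "->", rendering_status(candidate))
```

In a notebook, the argument would be rdkit.__version__.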

Bug description and examples

Within a single runtime (kernel session), regardless of the method used to render Mol structures in Pandas, the first Mol column created in the first dataframe always renders correctly. But as soon as another Mol column is added to a dataframe (either the same one or a new one), all existing and future dataframes stop rendering their Mol columns correctly.

Below are several code snippets to reproduce the bug in the currently pinned RDKit version, using different methods. Note that all methods work fine in later versions mentioned above. Thus there seems to be no need to change the code; we only have to update the RDKit version.

Current method used in TeachOpenCADD

Here, we are using the old method that is currently used in talktorials, i.e. using PandasTools.AddMoleculeColumnToFrame to create a Mol column from a SMILES column:

from IPython.display import display
import rdkit
import pandas as pd
from rdkit.Chem import PandasTools

print(f"RDKit version: {rdkit.__version__}")
print(f"Pandas version: {pd.__version__}")

# Create dataframe with SMILES column
df = pd.DataFrame({"smiles":["c1ccccc1"]})
# Add Mol column using RDKit
PandasTools.AddMoleculeColumnToFrame(df, "smiles")
print("First dataframe:")
display(df)  # Dataframe displays correctly

# Now do the same for a second dataframe
df2 = pd.DataFrame({"smiles":["c1ccccc1"]})
PandasTools.AddMoleculeColumnToFrame(df2, "smiles")
print("Second dataframe:")
display(df2)  # Dataframe does not display correctly
print("First dataframe again:")
display(df)  # The first dataframe also becomes corrupted

As shown in the output below, the first dataframe renders correctly the first time, but when another dataframe is created both of them stop working correctly: Screenshot 2023-05-06 at 16 58 30

Using PandasTools.RenderImagesInAllDataFrames

Adding PandasTools.RenderImagesInAllDataFrames after creating the Mol column doesn't change this behavior; the code snippet below generates the same output as above:

from IPython.display import display
import pandas as pd
from rdkit.Chem import PandasTools

# Create dataframe with SMILES column
df = pd.DataFrame({"smiles":["c1ccccc1"]})
# Add Mol column using RDKit
PandasTools.AddMoleculeColumnToFrame(df, "smiles")
PandasTools.RenderImagesInAllDataFrames(True)
print("First dataframe:")
display(df)  # Dataframe displays correctly

# Now do the same for a second dataframe
df2 = pd.DataFrame({"smiles":["c1ccccc1"]})
PandasTools.AddMoleculeColumnToFrame(df2, "smiles")
PandasTools.RenderImagesInAllDataFrames(True)
print("Second dataframe:")
display(df2)  # Dataframe does not display correctly
print("First dataframe again:")
display(df)  # The first dataframe also becomes corrupted

Using PandasTools.RenderImagesInAllDataFrames only once

Adding it only once before any column creation doesn't work either (still same behavior):

from IPython.display import display
import pandas as pd
from rdkit.Chem import PandasTools

PandasTools.RenderImagesInAllDataFrames(True)

# Create dataframe with SMILES column
df = pd.DataFrame({"smiles":["c1ccccc1"]})
# Add Mol column using RDKit
PandasTools.AddMoleculeColumnToFrame(df, "smiles")
print("First dataframe:")
display(df)  # Dataframe displays correctly

# Now do the same for a second dataframe
df2 = pd.DataFrame({"smiles":["c1ccccc1"]})
PandasTools.AddMoleculeColumnToFrame(df2, "smiles")
print("Second dataframe:")
display(df2)  # Dataframe does not display correctly
print("First dataframe again:")
display(df)  # The first dataframe also becomes corrupted

Directly creating Mol column

Notice that when PandasTools.RenderImagesInAllDataFrames is called before column creation, there is no need to use PandasTools.AddMoleculeColumnToFrame afterwards at all, since a column containing RDKit Mol objects is then rendered as images automatically:

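from IPython.display import display
import rdkit
import pandas as pd
from rdkit.Chem import PandasTools

PandasTools.RenderImagesInAllDataFrames(True)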

mol = rdkit.Chem.MolFromSmiles("c1ccccc1")

# Create dataframe containing `Mol` objects
df = pd.DataFrame({"mol":[mol]})

display(df)  # Dataframe displays correctly

Output: Screenshot 2023-05-06 at 17 10 46

However, creating any other dataframe with the same or different Mol objects will again break all existing and future dataframes:

df2 = pd.DataFrame({"mol":[mol]})
display(df2) # Dataframe doesn't display correctly anymore
display(df)  # Dataframe doesn't display correctly anymore

Deprecation warnings

Other than the two deprecation warnings @mbackenkoehler has already fixed (i.e. passing height to nglwidget, and using the new function in pypdb) and a couple more that I could find, all other warnings are raised by dependencies, not by our code. Therefore, there is not much we can change in our code to resolve them.

In most cases, this happens because the current TeachOpenCADD environment (at devtools/test_env.yml) pins some packages but not others. Thus, when a new environment is created, the latest possible versions of the unpinned packages are downloaded. Some of the pinned packages then make deprecated API calls to the up-to-date unpinned packages, which raise deprecation warnings.
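To see which packages fall on which side, the dependency list of an environment file can be split into constrained and unconstrained entries. This is a rough sketch using only the standard library. The SAMPLE_ENV content is made up for illustration (only the rdkit pin is taken from this thread) and is not the actual devtools/test_env.yml, and split_pins is a hypothetical helper:

```python
import re

# Illustrative content only, not the real environment file
SAMPLE_ENV = """\
name: example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - rdkit=2021.09.5
  - pandas
  - biopandas
"""

# Package name, optionally followed by a version constraint such as "=1.2" or ">=1.0"
SPEC = re.compile(r"([A-Za-z0-9_.\-]+)\s*([=<>!~].*)?$")


def split_pins(env_text):
    """Return (pinned, unpinned): a dict of name -> constraint and a list of names."""
    pinned, unpinned = {}, []
    in_dependencies = False
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # A non-indented line that is not a list item starts a new top-level key
        if not raw[0].isspace() and not line.startswith("-"):
            in_dependencies = line.startswith("dependencies:")
            continue
        if not (in_dependencies and line.startswith("- ")):
            continue
        match = SPEC.match(line[2:].strip())
        if match is None:
            continue  # e.g. a nested "pip:" header
        name, constraint = match.groups()
        if constraint:
            pinned[name] = constraint.strip()
        else:
            unpinned.append(name)
    return pinned, unpinned


pinned, unpinned = split_pins(SAMPLE_ENV)
print("pinned:", pinned)
print("unpinned:", unpinned)
```

Any entry with a version constraint counts as pinned here; the unpinned ones are the packages that will drift to newer releases whenever the environment is recreated.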

A simple example to reproduce the deprecation warning encountered in cell 4 of T018:

import pypdb
from biopandas.pdb import PandasPdb

Output:

DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  pd_version = LooseVersion(pd.__version__)

I fixed these temporarily by suppressing the warnings, but a definitive fix is to update all dependencies of TeachOpenCADD, which of course requires much more work to check all other talktorials for incompatibilities.
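For reference, a scoped way to suppress such warnings with the standard warnings module looks roughly like this. It is a sketch, not necessarily how the PR does it; legacy_call is a hypothetical stand-in for a dependency import or call that raises a DeprecationWarning:

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a dependency that triggers a DeprecationWarning."""
    warnings.warn("distutils Version classes are deprecated.", DeprecationWarning)
    return "ok"


def call_quietly(func):
    """Run func with DeprecationWarning ignored only for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return func()


result = call_quietly(legacy_call)
print(result)
```

Because the filter is restored when the with block exits, deprecation warnings raised elsewhere in the notebook stay visible.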

mbackenkoehler commented 1 year ago

Thanks! This really helps a lot! I will go through the details later on.

mbackenkoehler commented 1 year ago

The related PR #350 that Armin made updates rdkit from the previously pinned version. This solves the dataframe/molecule rendering issue. @dominiquesydow Do you think this will cause any problems?