theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

Adjust RDKit embeddings #56

Closed: siboehm closed this issue 2 years ago

siboehm commented 2 years ago

The RDKit embedding (fingerprint) as saved on the server has some Inf and NaN values, which triggers immediate early stopping.

Options:

MxMstrmn commented 2 years ago

This is not yet covered in #58.

MxMstrmn commented 2 years ago

The notebook is updated in EMBEDDING_DIR /storage/groups/ml01/projects/2021_chemicalCPA_leon.hetzel/embeddings/rdkit.

I cannot make this a PR since the notebook lives outside of the repository, so here is a quick summary:

The features where this happens are all related to partial charge; see cell 11 in the notebook:

array([['MaxAbsPartialCharge', <class 'numpy.float64'>],
       ['MaxPartialCharge', <class 'numpy.float64'>],
       ['MinAbsPartialCharge', <class 'numpy.float64'>],
       ['MinPartialCharge', <class 'numpy.float64'>]], dtype=object)

As I do not expect this to make a huge difference, I set all affected entries to 0. @siboehm, please reset the MongoDB and restart the RDKit runs; if they run successfully, this can be closed from my side.
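For reference, a minimal sketch of that clean-up (the DataFrame here is a toy stand-in rather than the actual embedding file; the column names are taken from the cell output above):

import numpy as np
import pandas as pd

# Toy stand-in for the saved RDKit descriptor embedding
# (in the notebook this is loaded from EMBEDDING_DIR).
emb_df = pd.DataFrame({
    "MaxAbsPartialCharge": [0.3, np.inf, np.nan],
    "MaxPartialCharge": [0.3, np.inf, 0.1],
    "MinAbsPartialCharge": [0.1, np.nan, 0.2],
    "MinPartialCharge": [-0.4, -np.inf, -0.2],
})

charge_cols = [
    "MaxAbsPartialCharge", "MaxPartialCharge",
    "MinAbsPartialCharge", "MinPartialCharge",
]

# Replace Inf/-Inf with NaN, then zero out every NaN in the affected columns.
emb_df[charge_cols] = (
    emb_df[charge_cols].replace([np.inf, -np.inf], np.nan).fillna(0.0)
)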

siboehm commented 2 years ago

I just restarted the RDKit runs. Can you add the notebook here nevertheless? Then it's tracked and we won't lose it.

siboehm commented 2 years ago

Still doesn't work; all of the runs get NaNs in the first epoch. I couldn't find any more NaNs or Infs, but column 34 looks terrible in terms of value distribution:

>>> df.describe()
          latent_31     latent_32     latent_33     latent_34     latent_35  \
count  17869.000000  17869.000000  17869.000000  1.786900e+04  17869.000000   
mean      -2.769221     32.124965    422.670055  4.222259e+50     22.915930   
std        1.032440      9.656746    126.018583  4.285530e+52      7.760063   
min      -17.360000      2.000000     40.021000  0.000000e+00      2.513645   
25%       -3.370000     27.000000    350.272000  1.119745e+06     18.373752   
50%       -2.750000     32.000000    424.149000  1.928494e+07     22.608138   
75%       -2.160000     37.000000    485.342000  2.329593e+08     26.805200   
max        2.880000    170.000000   2245.454000  5.537438e+54    150.996536

Should we just normalize all columns using mean and std?
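A minimal sketch of what that per-column standardisation could look like, assuming the embedding is held in the DataFrame df from the describe() output above (this is not something the repo does yet):

import numpy as np

# Column-wise z-scoring of the descriptor embedding; a small epsilon
# guards against constant columns with zero standard deviation.
eps = 1e-8
df_norm = (df - df.mean()) / (df.std() + eps)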

MxMstrmn commented 2 years ago

Yes, we could try normalising this; alternatively, we could log1p-transform column 34?
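
A quick sketch of that log1p variant, again on the same df (the column name is taken from the describe() output; log1p is safe here since the column minimum is 0):

import numpy as np

# Compress the enormous dynamic range of column 34 before any z-scoring.
df["latent_34"] = np.log1p(df["latent_34"])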

I will add the notebook in a sec.