Closed siboehm closed 2 years ago
This is not yet covered in #58.
The notebook is updated in EMBEDDING_DIR:
/storage/groups/ml01/projects/2021_chemicalCPA_leon.hetzel/embeddings/rdkit
I cannot make this a PR because the notebook is outside of the repository, so here is a quick summary:
The embedding contains NaNs (56) and Infs (6); it has shape [17869, 201].
The features where this happens are related to charge, see cell 11 in notebook:
array([['MaxAbsPartialCharge', <class 'numpy.float64'>],
['MaxPartialCharge', <class 'numpy.float64'>],
['MinAbsPartialCharge', <class 'numpy.float64'>],
['MinPartialCharge', <class 'numpy.float64'>]], dtype=object)
As I do not expect this to make a huge difference, I set all affected entries to 0.
@siboehm, please try to reset the MongoDB and start the RDKit runs. If they run successfully, this can be closed from my side.
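For tracking purposes, the fix described above (count the NaN/Inf entries, find the affected features, set them to 0) could look roughly like the sketch below. The function name `zero_non_finite` and the toy data are made up for illustration; the actual notebook may do this differently.

```python
import numpy as np


def zero_non_finite(embedding, feature_names):
    """Count NaN/Inf entries, list the affected features, and zero those entries.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    embedding = np.asarray(embedding, dtype=np.float64)
    nan_mask = np.isnan(embedding)
    inf_mask = np.isinf(embedding)
    bad = nan_mask | inf_mask

    # Features (columns) that contain at least one NaN or Inf.
    affected = [name for name, hit in zip(feature_names, bad.any(axis=0)) if hit]

    # Replace every non-finite entry with 0, leave the rest untouched.
    cleaned = np.where(bad, 0.0, embedding)
    return cleaned, int(nan_mask.sum()), int(inf_mask.sum()), affected


# Toy stand-in for the [17869, 201] embedding (values are invented).
names = ["MaxPartialCharge", "MolWt"]
toy = np.array([[0.1, 1.0], [np.nan, 2.0], [np.inf, 3.0]])
cleaned, n_nan, n_inf, affected = zero_non_finite(toy, names)
print(n_nan, n_inf, affected)
```

Setting the entries to 0 is the choice made in this thread; imputing a column median would be an alternative, but that was not discussed here.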
I just restarted the rdkit runs. Can you add the notebook here nevertheless? Then it's tracked and we won't lose it.
It still doesn't work: all of the runs get NaNs in the first epoch. I couldn't find any more NaNs or Infs; however, column 34 looks horrible in terms of value distribution:
>> df.describe()
          latent_31     latent_32     latent_33     latent_34     latent_35  \
count  17869.000000  17869.000000  17869.000000  1.786900e+04  17869.000000
mean      -2.769221     32.124965    422.670055  4.222259e+50     22.915930
std        1.032440      9.656746    126.018583  4.285530e+52      7.760063
min      -17.360000      2.000000     40.021000  0.000000e+00      2.513645
25%       -3.370000     27.000000    350.272000  1.119745e+06     18.373752
50%       -2.750000     32.000000    424.149000  1.928494e+07     22.608138
75%       -2.160000     37.000000    485.342000  2.329593e+08     26.805200
max        2.880000    170.000000   2245.454000  5.537438e+54    150.996536
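One thing worth checking, as a possible explanation only: the maximum of latent_34 (5.537438e+54) is larger than the largest finite float32 value (about 3.4e38). If the training code casts the embedding to float32 (an assumption, the thread does not say), such entries would turn into Inf even though none are present in float64. A minimal sketch for flagging such columns, with a hypothetical helper name and toy values:

```python
import numpy as np


def columns_overflowing_float32(values):
    """Return indices of columns with finite entries too large for float32.

    Hypothetical helper for illustration; assumes a 2-D numeric array.
    """
    values = np.asarray(values, dtype=np.float64)
    limit = np.finfo(np.float32).max  # largest finite float32, about 3.4e38
    finite = np.isfinite(values)
    too_large = finite & (np.abs(values) > limit)
    return np.flatnonzero(too_large.any(axis=0))


# Toy data: the second column mimics the magnitude reported for latent_34.
toy = np.array([[1.0, 1.1e6], [2.0, 5.5e54]])
overflowing = columns_overflowing_float32(toy)
print(overflowing)
```

If this flags column 34 on the real embedding, that would be consistent with the NaNs in the first epoch, but it does not prove that this is the cause.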
Should we just normalize all columns using mean and std?
Yes, we could try normalising this. Alternatively, apply a log1p transform to column 34?
I will add the notebook in a sec.
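The two suggestions above could be combined roughly as follows: apply the log transform (assumed here to mean `np.log1p`) to the heavy-tailed column first, then standardise every column with its mean and std. This is only a sketch; the function name, the `eps` guard, and the toy values are illustrative and not taken from the repository.

```python
import numpy as np


def log1p_then_standardize(values, log_columns=(), eps=1e-8):
    """Apply log1p to selected columns, then z-score every column.

    Hypothetical helper for illustration; assumes a 2-D array without NaN/Inf.
    """
    values = np.array(values, dtype=np.float64)  # np.array copies the input
    for col in log_columns:
        if (values[:, col] < 0).any():
            # The thread reports min == 0 for column 34, so we only accept
            # non-negative input here to keep the sketch simple.
            raise ValueError(f"column {col} has negative entries")
        values[:, col] = np.log1p(values[:, col])

    mean = values.mean(axis=0)
    std = values.std(axis=0)
    # eps keeps constant columns (std == 0) from dividing by zero.
    return (values - mean) / (std + eps)


# Toy data: the first column mimics the range reported for latent_34.
toy = np.array([[0.0, 1.0], [1.1e6, 2.0], [5.5e54, 3.0]])
scaled = log1p_then_standardize(toy, log_columns=[0])
print(scaled)
```

Going by the `describe()` output above, standardising alone would still leave the largest latent_34 entry roughly 129 standard deviations from the mean, which is why the log transform is applied first in this sketch.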
The RDKit embedding (fingerprint) as saved on the server has some Inf and NaN values, which triggers immediate early stopping.
Options: