reymond-group / map4

The MinHashed Atom Pair fingerprint of radius 2
MIT License

How to preprocess MAP4 before training? #7

Closed tevang closed 3 years ago

tevang commented 3 years ago

Hello,

I tried today to train a simple MLP classifier to predict the bioactivity of a set of small molecules (not peptidomimetics or peptides). This is how I preprocess the MAP4 vectors before training:

x = np.array(MAP4.calculate_many(mol_list), dtype=np.int)
ColorPrint("Scaling features in the range [0,1].", "OKBLUE")
scaler = MinMax_Scaler()  # scaler with memory, to be used later on the xtest
x = scaler.fit_transform(x)
ColorPrint("Removing only uniform features.", "OKBLUE")
x = remove_uniform_columns(x)

The 2D array x is the input to the MLP. Oddly enough, the performance of the MLP in 5-fold cross-validation is poorer than with any other fingerprint I tested. See the results below:

Results for feature vector type ECFPL: average AUC-ROC=0.754875+-0.061359, average DOR=16.044811+-13.862254, average MK=0.509749+-0.122718
Results for feature vector type FCFPL: average AUC-ROC=0.754583+-0.067879, average DOR=17.411853+-14.740475, average MK=0.509166+-0.135758
Results for feature vector type AvalonFPL: average AUC-ROC=0.755908+-0.072506, average DOR=22.775238+-29.836969, average MK=0.511817+-0.145013
Results for feature vector type gCSFP: average AUC-ROC=0.733945+-0.072095, average DOR=14.667879+-16.226123, average MK=0.467889+-0.144191
Results for feature vector type CSFPL: average AUC-ROC=0.716520+-0.053522, average DOR=8.701626+-4.768975, average MK=0.433040+-0.107043
Results for feature vector type tCSFPL: average AUC-ROC=0.719829+-0.058146, average DOR=8.996485+-5.867861, average MK=0.439658+-0.116293
Results for feature vector type iCSFPL: average AUC-ROC=0.768058+-0.091553, average DOR=58.927698+-100.427486, average MK=0.536115+-0.183106
Results for feature vector type fCSFPL: average AUC-ROC=0.741297+-0.062825, average DOR=13.043251+-8.988206, average MK=0.482595+-0.125651
Results for feature vector type pCSFPL: average AUC-ROC=0.749496+-0.063083, average DOR=14.137692+-10.776114, average MK=0.498991+-0.126165
Results for feature vector type gCSFPL: average AUC-ROC=0.738966+-0.083883, average DOR=17.951455+-21.426471, average MK=0.477931+-0.167767
Results for feature vector type AP: average AUC-ROC=0.779686+-0.067921, average DOR=23.284337+-19.439457, average MK=0.559371+-0.135841
Results for feature vector type cAP: average AUC-ROC=0.785569+-0.056231, average DOR=21.165699+-13.829002, average MK=0.571139+-0.112462
Results for feature vector type TT: average AUC-ROC=0.728741+-0.093040, average DOR=26.068889+-41.656358, average MK=0.457481+-0.186081
Results for feature vector type cTT: average AUC-ROC=0.725244+-0.087174, average DOR=21.606984+-33.135657, average MK=0.450488+-0.174347
Results for feature vector type ErgFP: average AUC-ROC=0.722143+-0.042805, average DOR=7.991111+-3.364458, average MK=0.444285+-0.085610
Results for feature vector type 2Dpp: average AUC-ROC=0.754694+-0.036142, average DOR=13.907273+-8.282393, average MK=0.509388+-0.072285
Results for feature vector type MAP4: average AUC-ROC=0.713780+-0.055750, average DOR=8.804872+-4.667274, average MK=0.427561+-0.111501

Am I doing something wrong in the preparation of the MAP4 feature vectors? Is this the right way to train a network using MAP4 as input? I am asking because I read in the documentation that, due to MinHashing, the order of the features matters and the distance cannot be calculated "feature-wise". I wonder if this property also affects the neural network's training.

Thanks in advance. Thomas

alicecapecchi commented 3 years ago

Hi Thomas,

MinHashed vectors cannot be scaled between 0 and 1, and their distance cannot be calculated using the log loss function. The feature number and the feature order are important, so you would need to implement a custom loss function (see also https://github.com/reymond-group/MAP4-Chemical-Space-of-NPAtlas and https://chemrxiv.org/s/4c5a0ac4e65cd2f90a5a).

Since the MLP classifier cannot be used with a custom loss function (at least in its scikit-learn implementation), I would suggest using the MAP4 fingerprint as a bit vector. To calculate MAP4 as a bit vector, initialize the MAP4Calculator with is_folded = True (this applies the SHA-1 hashing and then the modulo operation on the unique set of shingles).

Otherwise, you could give the classifier a similarity vector to the training set as input. In this case you would need to calculate, for each input molecule, the similarity of its MinHashed fingerprint fp1 to each training-set fingerprint fp2 as np.float(np.count_nonzero(fp1 == fp2))/np.float(len(fp1)).

Let me know if it works :) Cheers, Alice
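The similarity Alice describes (the fraction of matching components between two MinHashed vectors) can be sketched with plain NumPy; the fingerprints below are small hypothetical stand-ins for real MAP4 output, which defaults to 1024 dimensions:

```python
import numpy as np

def minhash_similarity(fp1, fp2):
    """Fraction of positions where two MinHashed vectors agree.

    This approximates the Jaccard similarity of the underlying shingle
    sets; feature-wise distances (Euclidean, etc.) are NOT meaningful
    for MinHashed vectors.
    """
    fp1 = np.asarray(fp1)
    fp2 = np.asarray(fp2)
    return np.count_nonzero(fp1 == fp2) / len(fp1)

# Toy 8-dimensional "fingerprints" for illustration.
a = np.array([3, 7, 1, 9, 4, 4, 2, 8])
b = np.array([3, 5, 1, 9, 4, 6, 2, 0])
print(minhash_similarity(a, b))  # 5 matching positions out of 8 -> 0.625
```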

tevang commented 3 years ago

Hello Alice and thank you for your advice. Indeed, conversion to bit vector improves the performance of the Classifier.

Results for feature vector type MAP4: average AUC-ROC=0.741585+-0.079234, average DOR=16.686330+-15.898923, average MK=0.483170+-0.158467

Can I remove uniform features (columns with the same value in all rows)?

I don't understand the alternative approach you suggested using similarities. A classifier must receive the same type of feature vectors as input in both training and testing. If the input feature vectors of the test set contain as features the similarities to every single molecule of the training set, what will the feature vectors of the training set contain? Similarities of every training molecule to the rest of the training set? That doesn't make sense to me.

Finally, speaking of the parameters of MAP4Calculator(), which values do you suggest for 'dimensions' and 'radius' to minimize bit collisions? With ECFP fingerprints, I use 8192 bits and radius=3. Also, what does 'is_counted' do?

Best, Thomas

PS: Btw, I am building custom Classifiers with TensorFlow and Keras thus I can control the loss function.


alicecapecchi commented 3 years ago

Hi Thomas,

Can I remove uniform features (columns with the same value in all rows)?

If you are talking about the bit vector, yes (you can treat it as ECFP4).
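Removing uniform columns from the folded bit matrix needs nothing beyond NumPy; this is a minimal sketch (`remove_uniform_columns` in the original post is Thomas's own helper, whose implementation is not shown):

```python
import numpy as np

def remove_uniform_columns(x):
    """Keep only columns that take more than one value across rows."""
    x = np.asarray(x)
    mask = x.min(axis=0) != x.max(axis=0)
    return x[:, mask]

bits = np.array([[0, 1, 0, 1],
                 [0, 0, 0, 1],
                 [0, 1, 0, 1]])
# Only column 1 varies; columns 0, 2, and 3 are constant and get dropped.
print(remove_uniform_columns(bits))  # [[1] [0] [1]]
```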

I don't understand the alternative approach you suggested using similarities. A Classifier must receive as input the same type of feature vectors both in training and testing. If the input feature vectors of the test set contain as features similarities to every single molecule of the training set, what will the feature vectors of the training contain? Similarities of every training molecule to the rest in the training set? It doesn't make sense.

Yes, it makes sense: it is like giving a similarity matrix, split into its individual rows, as input. Each training molecule is encoded by its similarities to all training molecules, and each test molecule by its similarities to those same training molecules, so train and test vectors have the same length.
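The scheme can be sketched as follows: every molecule, train or test, becomes a vector of its similarities to the N training molecules, so both sets yield feature vectors of length N. The random integer arrays here are hypothetical stand-ins for MinHashed fingerprints:

```python
import numpy as np

rng = np.random.default_rng(0)
train_fps = rng.integers(0, 50, size=(5, 16))  # 5 training "fingerprints"
test_fps = rng.integers(0, 50, size=(2, 16))   # 2 test "fingerprints"

def similarity_features(fps, train_fps):
    """Encode each fingerprint as its similarity to every training one."""
    return np.array([[np.count_nonzero(fp == t) / len(fp) for t in train_fps]
                     for fp in fps])

x_train = similarity_features(train_fps, train_fps)  # shape (5, 5)
x_test = similarity_features(test_fps, train_fps)    # shape (2, 5)
print(x_train.shape, x_test.shape)  # (5, 5) (2, 5)
```

Note that each training molecule's similarity to itself is 1.0, so the diagonal of `x_train` is all ones.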

Finally, speaking of parameters in MAP4Calculator(), which values do you suggest for 'dimensions' and 'radius' to minimize bit collision? With ECFP fingerprints, I use 8192 bits and radius=3.

Assuming you are talking about the folded version: I am not sure. I have used the folded version of MAP4 only for the benchmark, where I used 2048 dimensions and radius 2. To make it comparable to your ECFP setup, I would use the same dimensions and radius.

What does 'is_counted' do?

In its default version, the shingles representing a molecule are made unique, then hashed and MinHashed. When using is_counted=True, all shingles are considered instead, including duplicates.
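The distinction can be illustrated at the shingle level: by default, duplicate shingles collapse to a set before hashing, whereas a counted variant keeps multiplicity, for example by tagging each repeat so it hashes differently. This is a conceptual sketch with made-up shingle strings, not the package's internal code:

```python
shingles = ["C-1-O", "C-1-O", "C-2-N"]  # "C-1-O" occurs twice

# Default behaviour: duplicates are removed before hashing.
unique_shingles = set(shingles)  # {"C-1-O", "C-2-N"}

# Counted behaviour: every occurrence contributes, e.g. by appending
# an occurrence index so repeats produce distinct hash inputs.
seen = {}
counted_shingles = []
for s in shingles:
    seen[s] = seen.get(s, 0) + 1
    counted_shingles.append(f"{s}|{seen[s]}")

print(len(unique_shingles), len(counted_shingles))  # 2 3
print(counted_shingles)  # ['C-1-O|1', 'C-1-O|2', 'C-2-N|1']
```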

Cheers, Alice

mohammad-saber commented 3 years ago

Hi, Thank you very much for sharing your great work. I would also like to use MAP4 to train a machine learning model. I understand how to get the bit vector by setting is_folded = True.

I have a question regarding the meaning of the bit vector. What is the meaning of each bit? Does it mean, for example, that a specific subgraph or chemical bond is present in the molecule?

alicecapecchi commented 3 years ago

Hi Mohammad,

Thanks for your interest! One bit represents the absence or presence of one atom pair shingle (or of more than one, when bit collisions occur).

Cheers, Alice

mohammad-saber commented 3 years ago

Thank you Alice for your comprehensive explanations.

According to the paper, for an atom pair (j, k) the resulting list of shingles is hashed using SHA-1 to a set of integers Si. I guess the number of entries in Si is equal to the number of shingles.

I understood how to fold the set of unique numbers into a vector of the desired length when (is_folded = True).
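The folding mentioned here (an SHA-1 hash of each unique shingle, followed by a modulo into a fixed-length bit vector) can be sketched like this; it is an illustration with made-up shingle strings, not the package's internal code:

```python
import hashlib
import numpy as np

def fold(shingles, dimensions=1024):
    """Hash each unique shingle with SHA-1 and set the bit at hash % dimensions."""
    bits = np.zeros(dimensions, dtype=np.uint8)
    for s in set(shingles):  # unique shingles, as in the default MAP4 setup
        h = int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")
        bits[h % dimensions] = 1
    return bits

fp = fold(["C-1-O", "C-2-N", "N-1-O"], dimensions=16)
print(fp.sum())  # at most 3 bits set; fewer if two shingles collide
```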

I was wondering how to transform the set of unique numbers into a vector of the desired length when is_folded = False. According to the explanations in Readme.md, if we have 2 shingles, the length of hmin(Si, x, y) will be two. So the length of the MinHashed vector would depend on the number of shingles, and the vector size would not be fixed.
Could you please let me know how the size of the MinHashed vector is fixed to the desired length?

Thank you very much, and sorry for the lengthy question.

alicecapecchi commented 3 years ago

The length of the MinHashed vector does not depend on the number of shingles. Referring to the readme, the length of the MinHashed vector is the same as that of x. By default it is 1024, and it can be changed to 128, 512, or 2048 using the argument dimensions.
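The fixed length comes from using one hash function per output dimension: each of the `dimensions` functions takes a minimum over all shingle hashes, so the output length never depends on how many shingles a molecule has. A minimal MinHash sketch under that assumption (the prime, seed, and toy hash values are illustrative, not MAP4's actual constants):

```python
import numpy as np

def minhash(shingle_hashes, dimensions=8, prime=2147483647):
    """One (a*s + b) % prime hash function per output dimension; taking
    the min over all shingles gives a vector of length `dimensions`
    regardless of how many shingles there are."""
    rng = np.random.default_rng(42)
    a = rng.integers(1, prime, size=dimensions, dtype=np.int64)
    b = rng.integers(0, prime, size=dimensions, dtype=np.int64)
    s = np.asarray(shingle_hashes, dtype=np.int64)
    # For each dimension i: min over shingles of (a[i]*s + b[i]) % prime.
    return ((a[:, None] * s[None, :] + b[:, None]) % prime).min(axis=1)

print(len(minhash([123, 456])))       # 8: two shingles, still 8 dimensions
print(len(minhash([123, 456, 789])))  # 8: three shingles, still 8 dimensions
```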

mohammad-saber commented 3 years ago

Thank you. So, in the formula written in the Readme, the vector Si contains all the hashed shingles. Is my understanding correct?

alicecapecchi commented 3 years ago

Yes, Si contains all hashed unique shingles (duplicates are removed unless is_counted = True).