gicsaw / ARAE_SMILES

BSD 3-Clause "New" or "Revised" License

generation problem #1

Open superk1200 opened 4 years ago

superk1200 commented 4 years ago

When I ran "python gen_ARAE_ZINC.py", I got an empty "/result"+modelname+"%d.txt" file.

How can I get the generated SMILES in the final step?

Hope you could help me.

Thank you.

gicsaw commented 4 years ago

"ARAE_SMILES/out_ARAE_ZINC/39/ARAE_SMILES/smiles_fake.txt" is generated SMILES file. please check this file.

superk1200 commented 4 years ago

Sorry, I have a problem.

I read my "smiles_fake.txt", and I got a bad result.

How can I get a good result in "smiles_fake.txt"?

[screenshot: contents of smiles_fake.txt]

gicsaw commented 4 years ago

That is the correct output; 'Y' is the end code.

I updated del_end_code.py. You can get clean generated SMILES with: python del_end_code.py out_ARAE_ZINC/39

Generated SMILES: out_ARAE_ZINC/39/smiles_gen.txt
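For illustration only, stripping the end code amounts to trimming each generated line at the first 'Y'. This is not the repository's del_end_code.py, just a minimal sketch (the output file name here is made up; the real script writes smiles_gen.txt):

```python
# Illustrative sketch: trim each generated line at the end code 'Y'.
# Assumes one generated sequence per line, as in smiles_fake.txt.
def strip_end_code(line, end_code="Y"):
    line = line.strip()
    idx = line.find(end_code)
    return line if idx < 0 else line[:idx]

with open("out_ARAE_ZINC/39/smiles_fake.txt") as fin, \
     open("out_ARAE_ZINC/39/smiles_clean.txt", "w") as fout:  # hypothetical output name
    for raw in fin:
        smi = strip_end_code(raw)
        if smi:
            fout.write(smi + "\n")
```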

I am writing an ARAE implementation in PyTorch: https://github.com/gicsaw/ARAE_torch This code is much simpler and cleaner. It will be updated within 3 days.

superk1200 commented 4 years ago

First, I appreciate your help very much.

Second, I have caused you trouble because of my ignorance. I'm very sorry.

Your response really helped me a lot.

I will try the ARAE code using PyTorch, and I will give you feedback!

superk1200 commented 4 years ago

Sorry, I have a problem again...

How long can the SMILES strings input to ARAE's encoder be?

gicsaw commented 4 years ago

The model was trained with seq_length = 110 (including the start code X and the end code Y). However, if you change this value, longer SMILES may be generated, although the probability is low. The RNN does not limit the maximum length of the string at the generation step; I stop generating if the end code has not appeared by the 110th character. You can change the maximum SMILES length for training or generation.
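Schematically, the generation loop just keeps sampling characters and stops at the end code or the length cap. A toy sketch (not the repository code; a random sampler stands in for the RNN, and the character set is made up):

```python
import random

MAX_LEN = 110                      # seq_length used for training (includes start/end codes)
CHARS = list("CNOc1()=") + ["Y"]   # toy character set; 'Y' is the end code

def sample_next_char():
    # Placeholder for the RNN's per-step character distribution.
    return random.choice(CHARS)

def generate_one():
    out = []
    for _ in range(MAX_LEN):
        ch = sample_next_char()
        if ch == "Y":              # end code reached: stop
            break
        out.append(ch)
    return "".join(out)            # may be discarded if the end code never appeared

print(generate_one())
```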

superk1200 commented 4 years ago

Could I use Yfake.npy to reconstruct molecules from the latent vectors?

And, thank you for your help.

[figure: latent vector diagram]

gicsaw commented 4 years ago

I don't understand what exactly you want. Do you want to give a specific molecule and create similar molecules? Or, as in Figure 3, do you want to find molecules between two specific molecules?

superk1200 commented 4 years ago

As in Figure 3, I want to find molecules between two specific molecules. I'm not good at English. Sorry for the trouble.

ned000 commented 4 years ago

I uploaded a new file, "interpolation_ARAE_ZINC.py", and I updated "model/ARAE.py". Example:

smi logP SAS QED MW TPSA
CCOC(=O)[C@@H]1CCCN(C(=O)c2nc(-c3ccc(C)cc3)n3c2CCCCC3)C1 4.000 2.823 0.691 409.237 64.430
COc1ccc(C(=O)N(C)C@@HC/C(N)=N/O)cc1O 0.998 2.852 0.327 281.138 108.380

First, the two molecules to be interpolated must be recorded in a text file (see the example). Next, using the "data_char_ZINC.py" file, create the files "Xtest.npy", "Ytest.npy", and "Ltest.npy". You will probably need to edit the "data_char_ZINC.py" file. Finally, run the script: python interpolation_ARAE_ZINC.py. You can change some parameters, e.g. at line 60, epoch = 39, and at line 91, Ninterpolation = 1000 (change the number of interpolations only in multiples of the batch size of 100).

Unlike a VAE, ARAE does not directly design the latent space. Instead, it designs the probability distribution of the data in code (or seed) space, like a GAN. Therefore, interpolation and perturbation should also be done in code space, not latent space. The auto-encoder of ARAE converts SMILES space into latent space, so you need to convert the latent space to code space. This part is implemented in the "recover_sample_vector_df" method of "ARAE.py".
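As an illustration of interpolating in code (seed) space, here is a minimal sketch assuming you already have two seed vectors s1 and s2 recovered for your endpoint molecules (e.g. via "recover_sample_vector_df"). The names, dimension, and random placeholders are assumptions, not the script's API:

```python
import numpy as np

latent_dim = 300   # assumed seed dimension; use the model's actual value
s1 = np.random.randn(latent_dim).astype("float32")  # stands in for the seed recovered for molecule A
s2 = np.random.randn(latent_dim).astype("float32")  # stands in for the seed recovered for molecule B

n_interp = 10
alphas = np.linspace(0.0, 1.0, n_interp)

# Linear interpolation in code space; each row is one seed to feed the generator,
# whose output latent vector is then decoded back to SMILES.
seeds = np.stack([(1.0 - a) * s1 + a * s2 for a in alphas])
print(seeds.shape)  # (10, latent_dim)
```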

superk1200 commented 3 years ago

I would like to ask a stupid question again.

I want to use specific molecules for interpolation.

How can I specify the SMILES in "interpolation_ARAE_ZINC.py"?

Or did I read the wrong file?

Also, I got a successful interpolation result.

Thanks for your guidance.


gicsaw commented 3 years ago

I uploaded an example file for interpolation. You can replace the contents of the file with your own SMILES.

cat example/interpolation.txt

SMILES

CCOC(=O)[C@@H]1CCCN(C(=O)c2nc(-c3ccc(C)cc3)n3c2CCCCC3)C1
COc1ccc(C(=O)N(C)C@@HC/C(N)=N/O)cc1O

(Note: some characters are automatically converted in GitHub comments.)

python data_char_interpolation.py example/interpolation.txt
This code converts the SMILES file into input data for ARAE. The input data is prepared in the "data_interpolation/" directory. Then run:
python interpolation_ARAE_ZINC.py

superk1200 commented 3 years ago

Excuse me, can I ask about the details of generating EGFR inhibitors?

I want to know the batch size, num_epochs, and the numbers of training and test data in train_ARAE_ZINC.py.

I cannot get molecules similar to my targets.

Sorry for being such a hassle.

ned000 commented 3 years ago

ARAE can be trained with EGFR inhibitor data. However, since there are not enough EGFR inhibitor data, it is better to use conditional generation (CARAE). I updated the git code. The DUD-E EGFR data is in "dataset/EGFR_DUDE/".

Run the scripts below:
python data_char_EGFR.py
python train_CARAE_EGFR.py
python gen_CARAE_EGFR.py
The training parameters have not been uploaded because the training has not been completed yet. They will be uploaded within 3 days.

superk1200 commented 3 years ago

Thanks for your reply. I read the update information. So you don't put the EGFR data set into the ZINC data.

I always put my target molecule data into the ZINC data before running the training script. I will try your updated version. If I want to change my target molecule data, how can I modify data_char_EGFR.py?

gicsaw commented 3 years ago

I updated the weight parameter files. You can generate EGFR inhibitor-like molecules with the commands:
python gen_CARAE_EGFR.py 1
python del_end_code.py out_CARAE_EGFR/490
python valid.py out_CARAE_EGFR/490
where 1 means active. The final output file is out_CARAE_EGFR/490/smiles_unique.txt

Prepare the data by labeling 1 for active and 0 for inactive. Canonicalize the molecular SMILES using RDKit (see dataset/EGFR_DUDE/canonical.py).
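A minimal RDKit canonicalization pass over such a file might look like this (not the repository's canonical.py; the input file name is a placeholder, the output name mirrors the actives_canonical.txt used below):

```python
from rdkit import Chem

def canonicalize_file(in_path, out_path):
    # Read raw SMILES (first token per line), canonicalize with RDKit,
    # and drop anything RDKit cannot parse.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            smi = line.strip().split()[0]
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue
            fout.write(Chem.MolToSmiles(mol) + "\n")

# hypothetical input file name; adjust to your data
canonicalize_file("dataset/EGFR_DUDE/actives.txt", "dataset/EGFR_DUDE/actives_canonical.txt")
```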

In data_char_EGFR.py, you can change the lines below:

########################
seq_length = 123  # maximum length of your SMILES string

data_dir = './dataset/EGFR_DUDE'  # input data dir
...
active_filename = data_dir + "/actives_canonical.txt"
data_active = [[x.strip().split()[0], 1] for x in open(active_filename)]
decoy_filename = data_dir + "/decoys_canonical.txt"
data_decoy = [[x.strip().split()[0], 0] for x in open(decoy_filename)]
data_list = data_active + data_decoy
...

data_dir2 = "./data/EGFR/"  # output data dir
########################

DUD-E has active data and decoy data in separate files. I merged the two into data_list. data_list contains data in the following format: [[SMILES1, activity1], [SMILES2, activity2], [SMILES3, activity3], ...]

You can also change the SMILES character set (char_dict and char_list). SMILES containing characters not included in char_list in the current code, or SMILES exceeding the maximum length limit, are excluded.
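The filtering described above can be sketched like this (char_list here is a toy subset for illustration; the real list and the exact length convention live in data_char_EGFR.py):

```python
seq_length = 123
char_list = ["C", "N", "O", "c", "n", "o", "1", "2", "3",
             "(", ")", "=", "[", "]", "@", "H", "Br", "Cl"]

def keep_smiles(smi):
    # Greedily consume two-character symbols (e.g. 'Br', 'Cl') first,
    # then single characters; reject anything not in char_list or too long.
    i, n_tokens = 0, 0
    while i < len(smi):
        if smi[i:i + 2] in char_list:
            i += 2
        elif smi[i] in char_list:
            i += 1
        else:
            return False
        n_tokens += 1
    return n_tokens <= seq_length - 2   # room for start/end codes (assumed convention)

print(keep_smiles("BrCC(=O)N"))  # True with this toy list
```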

superk1200 commented 3 years ago

Sorry, I have a question again...

Can I add $logP, $SAS, and $TPSA to "python gen_CARAE_EGFR.py 1", like "python gen_CARAE_con_logP_SAS_TPSA.py $logP $SAS $TPSA"?

gicsaw commented 3 years ago

Activity is a classification task (1 or 0), while logP, SAS, and TPSA are regression tasks. So I uploaded a new script, "data_char_EGFR_property.py". This script uses the SA_Score module from rdkit.Contrib. You must manually enter the path of this module in "data_char_EGFR_property.py":
git clone https://github.com/rdkit/rdkit.git
The relative path is "rdkit/Contrib/SA_Score". After this, run the script:
python data_char_EGFR_property.py
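As an aside, one common way to make sascorer importable is to use the Contrib directory bundled with an installed RDKit instead of a separate clone; a small sketch (adjust sys.path to "rdkit/Contrib/SA_Score" if you use the cloned repo as described above):

```python
import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# Point Python at the SA_Score contrib module shipped with RDKit,
# then import sascorer as used in data_char_EGFR_property.py.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

m = Chem.MolFromSmiles("CCO")
print(sascorer.calculateScore(m))
```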

The training code is "train_CARAE_EGFR_logP_SAS_TPSA.py". This script uses "model/CARAE_cla_reg.py", which works only when both a binary classification task and regression tasks are given.

The generation script is:
python gen_CARAE_EGFR_logP_SAS_TPSA.py activity logP SAS TPSA
Example: python gen_CARAE_EGFR_logP_SAS_TPSA.py 1.0 1.0 3.0 100.0

The weight parameter files have not been uploaded yet.

superk1200 commented 3 years ago

Could I add the values of logP, SAS, QED, MW, and TPSA after the SMILES string in the actives and decoys data?

Like the ZINC data.

[screenshot: ZINC data format]

gicsaw commented 3 years ago

Yes, you can add any property. In data_char_EGFR_property.py, I used RDKit to calculate the properties when inserting them into the data (Pdata), and the data were normalized between 0 and 1 (or -1 and 1). However, it is also possible to use pre-calculated values from the data. Because of the implementation in the code, the binary classification task data (activity, arr[1] below) must precede the regression tasks (logP, SAS, ...).

See data_char_EGFR_property.py:

m = Chem.MolFromSmiles(smiles)
logP = MolLogP(m)
SAS = sascorer.calculateScore(m)
tpsa0 = TPSA(m)
Xdata += [X_d]
Ydata += [Y_d]
Ldata += [istring+1]
cdd = [arr[1], logP/10.0, SAS/10.0, tpsa0/150.0]
Pdata += [cdd]  # affinity classification

And in gen_CARAE_EGFR_logP_SAS_TPSA.py:

property_task = 4
classification_task = 1
regression_task = 3

task_nor = np.array([1.0, 10.0, 10.0, 150.0])
task_low = np.array([0.0, -1.0, 1.0, 0.0])
task_high = np.array([1.0, 5.0, 8.0, 150.0])
task_low = task_low / task_nor
task_high = task_high / task_nor

task_val = task_val / task_nor
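Putting those pieces together, the normalized condition vector for one molecule could be computed like this. This is a sketch reusing the normalization constants above; the SAS value is passed in as a plain number here so the snippet runs without the Contrib import (in practice use sascorer.calculateScore as shown earlier):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.Descriptors import TPSA

task_nor = np.array([1.0, 10.0, 10.0, 150.0])   # activity, logP, SAS, TPSA scales

def condition_vector(smiles, activity, sas_value):
    # Build [activity, logP, SAS, TPSA] and normalize as in the training data.
    m = Chem.MolFromSmiles(smiles)
    raw = np.array([activity, MolLogP(m), sas_value, TPSA(m)])
    return raw / task_nor

print(condition_vector("CCO", 1, 2.0))  # toy SAS value for illustration
```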

superk1200 commented 3 years ago

Excuse me, can I ask about the details of Xdata, Ydata, Ldata, and Pdata? And could I go from the latent vector z to the molecular structure x, like in this picture? [figure: latent vector z and molecular structure x]

gicsaw commented 3 years ago

As in the picture above, Xdata and Ydata are SMILES converted into one-hot code. The characters that make up a SMILES string are mapped to numeric codes. (An element symbol such as 'Br' is treated as a single character even though it consists of two letters.) The x-axis represents the index of the character, and the y-axis represents the code corresponding to the character. Xdata and Ydata correspond to the input and output of the auto-encoder, respectively. The end code is appended at the end of Xdata, and the start code is inserted at the beginning of Ydata. Xdata and Ydata are rank 3 tensors. In Xdata[i,j,k], i is the index of the data point, j is the index of the character within the SMILES string, and k is the code corresponding to the character.

Ldata stores long-type data corresponding to the length of each SMILES string. It is a rank 1 tensor giving how many SMILES codes each data point has. The length of the string differs per SMILES, so it is needed to handle dynamic lengths in the RNN.

Pdata contains property information. It can be float data, or it can be class or binary labels. It is a rank 3 tensor.

z-space is the space encoded from x-space. The x-space is a rank 2 space consisting of the string index and character code, while the z-space is a rank 1 vector space. Zdata is a rank 2 tensor, and the first rank corresponds to the index of the data. Xdata holds discrete one-hot values, while Zdata holds continuous float values.

You can extract Zdata values from the AutoEncoder. If you feed Xdata into the encoder, Zdata is output. However, since it has a different rank than Xdata, it will look different from the picture above.
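As a small, self-contained illustration of the Xdata/Ydata/Ldata encoding described above (toy character set and padding; the real char_list and codes live in the data_char scripts):

```python
import numpy as np

# Toy vocabulary: 'X' = start code, 'Y' = end code, the rest are SMILES characters.
char_list = ["X", "Y", "C", "N", "O", "c", "1", "(", ")", "="]
char_dict = {c: i for i, c in enumerate(char_list)}
seq_length = 12
n_chars = len(char_list)

def encode(smiles):
    # Xdata: SMILES followed by the end code; Ydata: start code followed by SMILES.
    x_codes = [char_dict[c] for c in smiles] + [char_dict["Y"]]
    y_codes = [char_dict["X"]] + [char_dict[c] for c in smiles]
    X = np.zeros((seq_length, n_chars), dtype="float32")
    Y = np.zeros((seq_length, n_chars), dtype="float32")
    for j, k in enumerate(x_codes):
        X[j, k] = 1.0
    for j, k in enumerate(y_codes):
        Y[j, k] = 1.0
    L = len(smiles) + 1   # Ldata entry: length including the end/start code
    return X, Y, L

# Stacking many such (X, Y) pairs along a new first axis gives the rank 3 tensors.
X, Y, L = encode("CCO")
print(X.shape, Y.shape, L)  # (12, 10) (12, 10) 4
```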

In test_n_ARAE_ZINC.py:

Y, cost1, cost2, cost3, cost4, latent_vector, mol_encoded0 = model.test(x, y, l, s, n)

mol_encoded0 is Zdata encoded from Xdata. (latent_vector is generated from the random seed Sdata.)

superk1200 commented 3 years ago

Thanks for your reply. Is Sdata saved in model.ckpt-X.index or .meta? And is qλ(y|z) the vector of the scaffold after the training phase?

gicsaw commented 3 years ago

Sdata is just a random seed generated from a Gaussian random number generator, so it is not saved. qλ(y|z) is the predictor function that predicts properties from z.

superk1200 commented 3 years ago

Thanks for your reply again. 'Y, cost1, cost2, cost3, cost4, latent_vector, mol_encoded0 = model.test(x, y, l, s, n)' — can I get the vector of the scaffold of the molecules (training data or actives) before the Gaussian random number is added? I've been bothering you in such a cold winter. How severe is the weather in Korea now?

gicsaw commented 3 years ago

Does the scaffold mean a point in z-space (a latent vector)? Or does it mean a molecular scaffold? mol_encoded0 will be what you want. No noise is added to the latent vector in the test or generation stage. The random number mentioned above is a seed (S) for the generator. The test function returns the results of the autoencoder and the generator together, but the autoencoder and the generator are separate. mol_encoded0 and latent_vector are the outputs of the encoder and the generator, respectively.

Since I am working from home, there was no problem except that the windows of my house were frozen. Today it wasn't very cold, and the roads have been cleared of snow. However, the alleyways are still slippery.

superk1200 commented 3 years ago

Maybe my presentation skills are not good, or my understanding may be wrong. I want to get the feature vectors of the training data or activity data, like the original data in the picture. And is mol_encoded0 saved in test.npy?

Is the epidemic in your city still serious? Hope you are in a safe environment. [screenshot: original code]

gicsaw commented 3 years ago
1. What is the meaning of your intended feature vector? In general, a feature vector means the input vector (Xdata), and that is what you already showed above, so I don't think you want the input vector. Does the feature vector mean the output data (Ydata in my notation, corresponding to the input data) from the auto-encoder? See line 118:

Y, cost1, cost2, cost3, cost4, latent_vector, mol_encoded0 = model.test(x, y, l, s, n)

Here, Y is argmax(Ydata).

If you want Ydata before the argmax function is applied, see model/ARAE.py, line 90:

self.mol_decoded_softmax, mol_decoded_logits = self.total_decoder(self.mol_encoded, mol_onehot)

self.mol_decoded_softmax is Ydata.

The activity data are simply the data you already have. (Although there is a predictor in ARAE, it does not really exist for accurate prediction.)

2. mol_encoded0 is saved in Zreal.npy. See test_n_ARAE_ZINC.py:

120 latent_vector_real.append(mol_encoded0)
188 latent_vector_real = np.array(latent_vector_real, dtype="float32").reshape(-1, latent_size)
192 outfile = out_dir + "/Zreal.npy"
193 np.save(outfile, latent_vector_real)

Thank you for asking my regards. Recently, the number of new infections per day has been decreasing. Where do you live? Is it safe there?

superk1200 commented 3 years ago

I found mol_encoded0 in Zreal.npy. Thank you! If I want to decode mol_encoded0, do I only need to change "gen_ARAE_ZINC.py", like in the picture? When I decode mol_encoded0, do I only get the training data back? My adviser wants to get the features of the dataset, or the mean vector of the dataset. Sorry, I am the first student to use deep learning in my laboratory, so I have no seniors to ask. My thesis is in its final stage. Thank you for helping me a lot. I live in Taiwan. Today there were two local infection cases in Taiwan. [screenshot]

gicsaw commented 3 years ago

I heard that Taiwan is one of the safest places. I hope the corona situation stays well controlled.

latent_vector in my code is a generated (fake) data point in Z-space (Zfake). mol_encoded0 is encoded (real) data in Z-space (Zreal). (Z-space is the latent or code space; it is the hidden space between the encoder and decoder.) In ARAE, the generator generates a fake Z vector from the random variable "S". If the ARAE is well trained, the fake z will have a distribution similar to the real z encoded from the real data X. The picture you uploaded shows the code that decodes fake data (Zfake) generated from the random vector s. The gen_ARAE_ZINC.py file is for generation only, so it does not include the code that encodes Zreal from Xreal and decodes Yreal from Zreal. That part is in the test_ARAE_ZINC.py file.

I didn't understand exactly which feature of the dataset you want. Do you mean the distribution or mean of the Zreal vectors? Do you want to compare the distribution differences between Zreal and Zfake?

superk1200 commented 3 years ago

I want the distribution and mean of the Zreal vectors. Can I use PCA or a t-SNE plot to present the distribution and mean of the Zreal vectors?

But the vaccine seems to have many problems. I hope we can quell this disaster and return to our past lifestyle.

gicsaw commented 3 years ago

See, in test_n_ARAE_ZINC.py (or test_n_CARAE_uncon_logP_SAS_TPSA.py):

192 outfile = out_dir + "/Zreal.npy"
193 np.save(outfile, latent_vector_real)
194 outfile = out_dir + "/Zfake.npy"
195 np.save(outfile, latent_vector_fake)

Zreal.npy and Zfake.npy are the latent vectors from the real SMILES and from the random seed, respectively. You can do PCA analysis using the Zreal file. I haven't, but others use PCA or t-SNE to see whether the distributions of Zreal and Zfake are similar.
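A quick PCA comparison of the two files could look like this (a sketch using scikit-learn and matplotlib; the output directory path is an example, adjust it to your run):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

out_dir = "out_ARAE_ZINC/39"                 # example output directory
z_real = np.load(out_dir + "/Zreal.npy")
z_fake = np.load(out_dir + "/Zfake.npy")

print("Zreal mean (first 5 dims):", z_real.mean(axis=0)[:5])

# Fit PCA on the real latent vectors, then project both sets into 2D.
pca = PCA(n_components=2).fit(z_real)
r2 = pca.transform(z_real)
f2 = pca.transform(z_fake)

plt.scatter(r2[:, 0], r2[:, 1], s=3, alpha=0.3, label="Zreal")
plt.scatter(f2[:, 0], f2[:, 1], s=3, alpha=0.3, label="Zfake")
plt.legend()
plt.savefig("z_pca.png", dpi=150)
```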

superk1200 commented 3 years ago

Excuse me, I have a question. Why did you choose the DUD-E database?

gicsaw commented 3 years ago

Because I thought it would be better to use as much data as possible. When molecules are collected directly from ChEMBL, only hundreds to thousands of molecules can be obtained even when active and inactive molecules are combined, but DUD-E contains more data. DUD-E is composed of actives and decoys. The active data are those with high activity (probably IC50 < 1 uM) among those collected in ChEMBL. The decoys are a collection of active-unlike structures from public molecular databases (ZINC?), in addition to those with low activity among those collected from ChEMBL. Therefore, the decoys contain molecules of uncertain activity in addition to inactives.