jkwang93 / MCMG


Where's the code for model distill #2

Closed · charlesxu90 closed this issue 2 years ago

charlesxu90 commented 2 years ago

Dear authors,

I didn't see the code for transformer distillation. Could you please tell me where it is done?

I looked into the code. Below is what I learned.

jkwang93 commented 2 years ago

Model distillation aims to transfer the knowledge learned by a large model (or an ensemble of models) to a lightweight single model that is easier to fine-tune. 1 is the Transformer model and 3 is the RNN; we use the RNN to replace the Transformer, and that replacement is the distillation process. The code only contains the Distilled-Molecules (DM) method, not the Distilled-Likelihood (DL) method.
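
For readers looking for what this step amounts to in code, below is a minimal sketch of the Distilled-Molecules idea as described above: sample SMILES from the trained Transformer, then train the GRU on them with ordinary next-token maximum likelihood. The names `transformer`, `gru`, `vocab`, and `transformer.sample()` are hypothetical placeholders, not identifiers from this repository.

```python
# Sketch of Distilled-Molecules (DM) distillation, not the repository's actual code.
# `transformer`, `gru`, and `vocab` are assumed, hypothetical objects.
import torch
import torch.nn.functional as F

def distill_by_molecules(transformer, gru, vocab, optimizer, n_samples=10000, epochs=10):
    # Step 1: the trained Transformer samples a constrained set of SMILES,
    # which defines the new (constrained) chemical space.
    transformer.eval()
    with torch.no_grad():
        smiles_set = [transformer.sample() for _ in range(n_samples)]

    # Step 2: the lightweight GRU is trained on the generated molecules
    # with standard next-token cross-entropy (maximum likelihood).
    gru.train()
    for _ in range(epochs):
        for smi in smiles_set:
            tokens = torch.tensor(vocab.encode(smi))       # [seq_len] token ids
            inp, target = tokens[:-1], tokens[1:]
            logits = gru(inp.unsqueeze(0))                  # [1, seq_len-1, vocab_size]
            loss = F.cross_entropy(logits.squeeze(0), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return gru
```

The distilled GRU is then the model that gets fine-tuned with RL, as discussed below.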

charlesxu90 commented 2 years ago

@jkwang93 I was expecting a training process that uses both the Transformer and the GRU model. But in fact, the Transformer is only used to generate molecules, which are then used to train the GRU model?

What's the difference from training the GRU model directly on the original dataset? Shouldn't the results be similar?

jkwang93 commented 2 years ago

The Transformer generates constrained molecules (a constrained chemical space), which is different from the chemical space of the original data. In this space, the RL process can find the desired molecules faster while maintaining high molecular diversity.

charlesxu90 commented 2 years ago

Got it. Thanks for answering my questions.

charlesxu90 commented 2 years ago

One additional question: why not directly train your Transformer model on goal-directed generation? Is it for computational cost concerns?

I thought it might be better to train the Transformer rather than the GRU, as it can learn faster with better abstractions such as scaffolds.

jkwang93 commented 2 years ago

There are two reasons: one is computational cost, and the other is to ensure the diversity of the generated molecules. We did an experiment in which an RNN was trained to learn the Transformer's likelihood; we called the resulting RNN the DL model. The distribution of the DL model was basically the same as that of the original Transformer. But when we fine-tuned the DL model with RL, we found that the diversity of the generated molecules was very low. In order for the model to strike a balance between diversity and the generation of desired molecules, we adopted the DM distillation operation (the existing code). We analyzed this part in detail in the paper.
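
For completeness, here is a rough sketch of what the Distilled-Likelihood (DL) experiment described above could look like: the GRU student is trained to reproduce the Transformer teacher's next-token distribution with a KL-divergence loss. This is an illustration under assumed interfaces (`transformer`, `gru`, `batch_tokens`), not code from this repository.

```python
# Sketch of Distilled-Likelihood (DL) training, not the repository's actual code.
# `transformer`, `gru`, and `batch_tokens` are assumed, hypothetical objects.
import torch
import torch.nn.functional as F

def dl_step(transformer, gru, batch_tokens, optimizer, temperature=1.0):
    """One likelihood-distillation step: the GRU is trained to match the
    Transformer's next-token distribution at every sequence position."""
    transformer.eval()
    with torch.no_grad():
        teacher_logits = transformer(batch_tokens)          # [B, T, vocab_size]
    student_logits = gru(batch_tokens)                      # [B, T, vocab_size]

    # KL(teacher || student) over the vocabulary dimension.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As noted above, a student trained this way matched the Transformer's distribution, but RL fine-tuning on it produced low molecular diversity, which is why the released code takes the DM route instead.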