lsh0520 / 3D-MoLM


Question about the training for stage 3 #5

Closed QizhiPei closed 7 months ago

QizhiPei commented 7 months ago

Thanks for the interesting work that bridges 3D molecules and text. I'm a little confused about the training for stage 3.

  1. Is the model jointly trained on the PubChem and PubChemQC datasets at the same time? `bash ./scripts/stage3_train.sh` seems to train the model on only one of them. Do I need to manually merge the two datasets provided on Hugging Face?
  2. `bash ./scripts/stage3_train.sh` seems to use the pretrain subsets of the PubChem and PubChemQC datasets, so I'm confused about when their train subsets are used. I understand that the pretrain subset of PubChem is used for stage 1 and stage 2 pre-training, and the train subset of PubChem is used for stage 1 and stage 2 fine-tuning, but for stage 3 I'm unsure.

Any help you can provide is appreciated; thanks for your time and attention.

lsh0520 commented 7 months ago

Thanks for your interest in our work.

(1) Yes, it is jointly trained on both PubChem and PubChemQC. Please check the data provider functions for details (`instruct_dataset.py` & `balance_dataset.py` in the `data_provider` folder).
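
For intuition, here is a minimal sketch (in plain PyTorch) of how two datasets can be interleaved with roughly balanced sampling. The class below is illustrative only; it is not the actual implementation in `instruct_dataset.py` / `balance_dataset.py`.

```python
from torch.utils.data import Dataset

class BalancedConcatDataset(Dataset):
    """Illustrative sketch: interleaves two datasets (e.g. PubChem
    molecule-text pairs and PubChemQC property QA), up-sampling the
    smaller one so each epoch draws from both about equally."""

    def __init__(self, dataset_a: Dataset, dataset_b: Dataset):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.size = max(len(dataset_a), len(dataset_b))

    def __len__(self):
        # One pass covers `size` samples from each source.
        return 2 * self.size

    def __getitem__(self, idx):
        # Even indices draw from A, odd from B; the modulo wrap-around
        # effectively up-samples the shorter dataset.
        if idx % 2 == 0:
            return self.dataset_a[(idx // 2) % len(self.dataset_a)]
        return self.dataset_b[(idx // 2) % len(self.dataset_b)]
```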

(2) Sorry for the confusion in our code. In stage 3, only the train subsets of PubChem and PubChemQC are used. We have renamed the mode (pretrain -> train) in stage 3 to avoid this confusion.
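
As a hypothetical illustration of what the rename implies for data loading (the helper and paths below are made up, not the repo's actual API), stage 3 should point at the train subsets rather than the pretrain subsets:

```python
from pathlib import Path

def pick_subset(root: str, mode: str) -> Path:
    # Hypothetical helper: resolves which data subset a stage reads.
    # After the rename, stage 3 uses mode="train", not mode="pretrain".
    assert mode in {"pretrain", "train", "valid", "test"}
    return Path(root) / mode

pubchem_dir = pick_subset("data/PubChem", mode="train")      # stage 3
pubchemqc_dir = pick_subset("data/PubChemQC", mode="train")  # stage 3
```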

QizhiPei commented 7 months ago

Thanks for your response. Does this mean that the pretrain subset of PubChemQC is not used at all for 3D-MoLM training?

lsh0520 commented 7 months ago

Yes

QizhiPei commented 7 months ago

Thanks for your quick reply~

lhkhiem28 commented 6 months ago

@QizhiPei @lsh0520 Hi, I'm also a little confused about stage 2 and stage 3. What is the difference between stage2-ft and stage3-train? Do they correspond to the Specialist and Generalist models, respectively?