Closed QizhiPei closed 7 months ago
Thanks for your interest in our work.
(1) Yes, it is jointly trained on both PubChem and PubChemQC. Please check the data provider functions for details (`instruct_dataset.py` & `balance_dataset.py` in the `data_provider` folder).
(2) Sorry for the confusion in our code. In stage 3, only the `train` subsets of PubChem and PubChemQC are used. We have renamed the mode (`pretrain` -> `train`) in stage 3 to avoid this confusion.
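To make (1) concrete, here is a toy, pure-Python sketch of how one dataset can jointly index two underlying sources. This is not the repo's actual code; `JointDataset` and the sample names are made up for illustration, and the real logic lives in `instruct_dataset.py` / `balance_dataset.py`.

```python
# Illustrative sketch only: 3D-MoLM's data providers jointly sample from
# PubChem and PubChemQC. This toy version exposes both sources behind a
# single flat index; all names here are hypothetical.
class JointDataset:
    def __init__(self, pubchem, pubchemqc):
        self.sources = [pubchem, pubchemqc]

    def __len__(self):
        # Total size is the sum over both sources.
        return sum(len(s) for s in self.sources)

    def __getitem__(self, idx):
        # Map a flat index into the appropriate underlying source.
        for source in self.sources:
            if idx < len(source):
                return source[idx]
            idx -= len(source)
        raise IndexError(idx)

joint = JointDataset(["pc_0", "pc_1", "pc_2"], ["qc_0", "qc_1"])
print(len(joint))  # 5
print(joint[3])    # qc_0 (index falls through into the second source)
```

A PyTorch `torch.utils.data.ConcatDataset` does essentially the same flat-index mapping, so a standard `DataLoader` over the concatenated dataset draws batches from both sources.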
Thanks for your response. Does it mean that the `pretrain` subset of PubChemQC is not used for 3D-MoLM training?
Yes
Thanks for your quick reply~
@QizhiPei @lsh0520 Hi, I'm also a little confused about stage 2 and stage 3. What is the difference between stage2-ft and stage3-train? Do they correspond to the Specialist and Generalist models, respectively?
Thanks for the interesting work that bridges 3D molecules and text. I'm a little confused about the training for stage 3.

`bash ./scripts/stage3_train.sh`

seems to only train the model on one of the two datasets. Do I need to manually merge the two datasets provided on Hugging Face?

Also, `bash ./scripts/stage3_train.sh` seems to use the `pretrain` subset of the PubChem and PubChemQC datasets, so I'm a little confused about when their `train` subsets are used. I know that the `pretrain` subset of PubChem is used for stage 1 and stage 2 pre-training, and the `train` subset of PubChem is used for stage 1 and stage 2 fine-tuning, but for stage 3 I'm confused. Any help you might provide is appreciated; thanks for your time and attention.