lsh0520 / 3D-MoLM


Questions about the performance of baselines in 3D-MOLM and MolCA paper #11

Closed lhkhiem28 closed 6 months ago

lhkhiem28 commented 6 months ago

Hi Zhiyuan and Sihang,

1) I went through both your 3D-MoLM and MolCA papers and found that the reported performance of MolT5 and MoMu for molecule captioning on the PubChem dataset differs between the two, without an explanation (Table 2a in the MolCA paper and Table 3a in the 3D-MoLM paper). Could you please explain the discrepancy and let me know which baseline numbers are reliable?

2) Additionally, do you have a justification for why MolCA-Galac1.3B outperforms 3D-MoLM-Llama-7B on the Molecule Captioning task?

Thank you very much for your response. Looking forward to hearing from you again.

[Attached screenshots: Screenshot 2024-04-22 005542, Screenshot 2024-04-22 005518]

Thanks.

acharkq commented 6 months ago

Hi Khiem,

The processed datasets used in 3D-MoLM and MolCA are different. The two works used different pre-processing strategies: 3D-MoLM includes a molecule's IUPAC name in the caption, while MolCA does not. Therefore, the baseline results differ between the two papers, and the results are not directly comparable.

There are other differences in data processing in addition to the one mentioned above. I recommend referring to the appendix sections of the two papers for details on the data processing.
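
To illustrate why a change in the reference captions alone can shift baseline scores, here is a minimal sketch. It is not the pre-processing or evaluation code from either paper: the caption strings, the way the IUPAC name is spliced into the reference, and the `unigram_f1` helper are all hypothetical, and a simple token-overlap F1 stands in for the actual captioning metrics.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two lowercased, whitespace-tokenized strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the same model output scored against two
# differently pre-processed references for the same molecule.
prediction = "the molecule is a primary alcohol"
reference_plain = "the molecule is a primary alcohol"
# Assumed format for a caption that also carries the IUPAC name.
reference_with_name = "the molecule with iupac name ethanol is a primary alcohol"

score_plain = unigram_f1(prediction, reference_plain)
score_with_name = unigram_f1(prediction, reference_with_name)

print(f"reference without name: {score_plain:.2f}")
print(f"reference with name:    {score_with_name:.2f}")
```

The identical prediction scores lower against the longer reference because recall drops, which is the kind of effect that makes numbers from two differently processed versions of a dataset not directly comparable.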