(1) @acharkq Zhiyuan, can you help with the first question? Thanks. (2) No. I have uploaded the training log with the evaluation results of the stage 1 pretrained model to Hugging Face; please check its performance. The results in the picture don't make sense, which could be (I guess): (1) an environment problem, can you check whether your environment matches requirements.txt, or (2) I wrongly uploaded a corrupted ckpt. I will look into this later this week and make sure you can reproduce the results in our paper with the ckpt and scripts.
Thanks @icycookies for raising the issue about Uni-Mol's encoder mask. I have checked, and it is indeed contrary to the format of Hugging Face's Transformers. Luckily, the padding tokens can still capture some global information, so the model still works. I will discuss with Sihang and see how we will deal with this bug.
Hi,
The padding tokens are fixed to zeros with no gradients in the nn.Embedding of the Uni-Mol encoder, and the padding part is masked with -inf in the attention.
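For concreteness, here is a minimal PyTorch sketch of what I mean (not the repo's actual code): a zero, gradient-free padding row via nn.Embedding's padding_idx, and -inf masking of the padding keys in the attention bias.

```python
import torch
import torch.nn as nn

PAD_IDX = 0
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=PAD_IDX)
# With padding_idx set, the padding row of the embedding table is initialized to
# zeros and receives no gradient updates, so it stays fixed at zero during training.
tokens = torch.tensor([[3, 5, PAD_IDX, PAD_IDX]])   # one sequence, last two positions are padding
x = emb(tokens)                                     # padding positions map to all-zero vectors

# Padding keys are masked with -inf in the attention bias, so real tokens cannot
# attend to padding positions (their softmax weight becomes exactly 0).
padding_mask = tokens.eq(PAD_IDX)                   # (batch, seq_len), True at padding
attn_bias = torch.zeros(1, 4, 4)
attn_bias.masked_fill_(padding_mask.unsqueeze(1), float("-inf"))
scores = x @ x.transpose(1, 2)                      # dummy attention logits
weights = (scores + attn_bias).softmax(dim=-1)      # columns for padding keys are 0
```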
My question is: are the padding tokens really capturing certain global information, or do they carry no 3D information at all?
Best,
Hi,
Thanks for your interest. It is clear that it works somehow; otherwise, stage 1 would have zero performance.
Hi,
I checked the attn_bias added to the self-attention of the Uni-Mol encoder. It seems that the attn_bias is actually not properly processed: the last few rows (corresponding to padding tokens) still get attention weights towards the atom tokens, e.g., 17.8750.
I think this is the reason for the global information in the padding tokens.
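To make this concrete, here is a small illustration with dummy values (not the actual Uni-Mol code): if the attn_bias masks only the padding columns (keys) but leaves the padding rows (queries) untouched, those rows still attend to the atom tokens with non-zero weight, so their outputs become averages of atom representations.

```python
import torch

seq_len, n_pad = 6, 2                             # 4 atom tokens + 2 padding tokens
scores = torch.full((seq_len, seq_len), 17.8750)  # dummy raw attention logits
attn_bias = torch.zeros(seq_len, seq_len)
attn_bias[:, -n_pad:] = float("-inf")             # padding *keys* are masked ...
# ... but the padding *query* rows are left as-is.
weights = (scores + attn_bias).softmax(dim=-1)

print(weights[-n_pad:, :-n_pad])                  # padding rows: uniform 0.25 over the atom tokens
# Their outputs are therefore averages of atom representations, which is the
# "global information" that later reaches the Q-Former through the padding positions.
```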
Best,
Thanks for the effort in proving this.
Thanks again for sharing this innovative work! However, I encountered several problems during reproduction:
(1) In model/blip2qformer.py, line 317, the encoder attention mask is directly derived from Uni-Mol. However, I observed that this mask is contrary to the format expected by Hugging Face Transformers. As a result, the Q-Former extracts representations based on the padding representations of Uni-Mol, which seems to integrate certain global information about the molecule.
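Here is a hedged sketch of the convention mismatch I mean (variable names are illustrative, not the repo's actual ones): Uni-Mol/fairseq-style encoders typically produce a padding mask with True at padding positions, while the Hugging Face Q-Former expects an encoder_attention_mask with 1 at real tokens, so the mask would need to be inverted before being passed on.

```python
import torch

# True = padding (Uni-Mol / fairseq convention)
unimol_padding_mask = torch.tensor([[False, False, False, True, True]])

# Passing this tensor directly as encoder_attention_mask makes cross-attention
# attend only to the padding positions; it has to be inverted first:
encoder_attention_mask = (~unimol_padding_mask).long()  # 1 = attend, 0 = ignore (Transformers convention)
# -> tensor([[1, 1, 1, 0, 0]])
```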
(2) stage1-ft.ckpt works fine with cross-modal retrieval. However, when I evaluate stage1.ckpt by running
python stage1.py --filename stage1_ft --mode eval --max_epochs 10 --warmup_steps 200 --lm --init_checkpoint all_checkpoints/stage1.ckpt
on the test set of PubChem, I got the results shown in the attached screenshot. Since the Stage 1 checkpoint is pre-trained on the pre-training set of PubChem, I wonder whether this sub-optimal result is expected for 3D-MoLM.
Looking forward to your reply!