(1) @acharkq Zhiyuan, can you help with the first question? Thanks. (2) No. I have uploaded the training log with the evaluation results of the stage 1 pretrained model to Hugging Face; please check its performance. The results in the picture don't make sense, which could be (I guess): (1) an environment problem, can you check whether your environment matches requirements.txt, or (2) I wrongly uploaded a corrupted ckpt. I will look into this later this week and make sure you can reproduce the results in our paper with the ckpt and scripts.
Thanks @icycookies for raising the issue about Uni-Mol's encoder mask. I have checked, and it is indeed contrary to the format of Hugging Face's Transformers. Luckily, the padding tokens can still capture some global information, so the model still works. I will discuss with Sihang and see how we will deal with this bug.
Hi,
The padding tokens are fixed to zeros with no gradients in the nn.Embedding of the Uni-Mol encoder, and the padding part is masked with -inf in the attention.
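For concreteness, here is a minimal PyTorch sketch of what I mean (not the repo's actual code): a zero, gradient-free padding row via nn.Embedding's padding_idx, and -inf masking of the padding keys in the attention bias.

```python
import torch
import torch.nn as nn

PAD_IDX = 0
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=PAD_IDX)
# With padding_idx set, the padding row of the embedding table is initialized to
# zeros and receives no gradient updates, so it stays fixed at zero during training.
tokens = torch.tensor([[3, 5, PAD_IDX, PAD_IDX]])   # one sequence, last two positions are padding
x = emb(tokens)                                     # padding positions map to all-zero vectors

# Padding keys are masked with -inf in the attention bias, so real tokens cannot
# attend to padding positions (their softmax weight becomes exactly 0).
padding_mask = tokens.eq(PAD_IDX)                   # (batch, seq_len), True at padding
attn_bias = torch.zeros(1, 4, 4)
attn_bias.masked_fill_(padding_mask.unsqueeze(1), float("-inf"))
scores = x @ x.transpose(1, 2)                      # dummy attention logits
weights = (scores + attn_bias).softmax(dim=-1)      # columns for padding keys are 0
```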
My question is: are the padding tokens really capturing certain global information, or do they carry no 3D information at all?
Best,
Hi,
Thanks for your interest. It is clear that it works somehow; otherwise, stage 1 would have zero performance.
Hi,
I checked the attn_bias added to the self-attention of the Uni-Mol encoder. It seems that the attn_bias is actually not properly processed: the last few rows (corresponding to padding tokens) still get attention weights towards the atom tokens, e.g., 17.8750.
I think this is the reason for the global information in the padding tokens.
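To make this concrete, here is a small illustration with dummy values (not the actual Uni-Mol code): if the attn_bias masks only the padding columns (keys) but leaves the padding rows (queries) untouched, those rows still attend to the atom tokens with non-zero weight, so their outputs become averages of atom representations.

```python
import torch

seq_len, n_pad = 6, 2                             # 4 atom tokens + 2 padding tokens
scores = torch.full((seq_len, seq_len), 17.8750)  # dummy raw attention logits
attn_bias = torch.zeros(seq_len, seq_len)
attn_bias[:, -n_pad:] = float("-inf")             # padding *keys* are masked ...
# ... but the padding *query* rows are left as-is.
weights = (scores + attn_bias).softmax(dim=-1)

print(weights[-n_pad:, :-n_pad])                  # padding rows: uniform 0.25 over the atom tokens
# Their outputs are therefore averages of atom representations, which is the
# "global information" that later reaches the Q-Former through the padding positions.
```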
Best,
Thanks for the effort in proving this.
Thanks again for sharing this innovative work! However, I encountered several problems during reproduction:
(1) In model/blip2qformer.py, line 317, the encoder attention mask is directly derived from Uni-Mol. However, I observed that this mask is contrary to the format expected by Hugging Face Transformers. As a result, the Q-Former extracts representations based on the padding representations of Uni-Mol, which seems to integrate certain global information about the molecule.
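Here is a hedged sketch of the convention mismatch I mean (variable names are illustrative, not the repo's actual ones): Uni-Mol/fairseq-style encoders typically produce a padding mask with True at padding positions, while the Hugging Face Q-Former expects an encoder_attention_mask with 1 at real tokens, so the mask would need to be inverted before being passed on.

```python
import torch

# True = padding (Uni-Mol / fairseq convention)
unimol_padding_mask = torch.tensor([[False, False, False, True, True]])

# Passing this tensor directly as encoder_attention_mask makes cross-attention
# attend only to the padding positions; it has to be inverted first:
encoder_attention_mask = (~unimol_padding_mask).long()  # 1 = attend, 0 = ignore (Transformers convention)
# -> tensor([[1, 1, 1, 0, 0]])
```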
(2) stage1-ft.ckpt works fine with cross-modal retrieval. However, when I evaluate stage1.ckpt by running
python stage1.py --filename stage1_ft --mode eval --max_epochs 10 --warmup_steps 200 --lm --init_checkpoint all_checkpoints/stage1.ckpt
on the test set of PubChem, I got the results shown in the attached screenshot. Since the Stage 1 checkpoint is pre-trained on the pre-training set of PubChem, I wonder whether this sub-optimal result is expected for 3D-MoLM.
Looking forward to your reply!