lsh0520 / 3D-MoLM


Questions about 3D-MoLM #8

Closed by icycookies 4 months ago

icycookies commented 6 months ago

Thanks again for sharing this innovative work! However, I encountered several problems during reproduction:

[Screenshot of reproduction results, 2024-04-09]

Since the Stage 1 checkpoint is pre-trained on the PubChem pre-training set, I wonder whether this sub-optimal result is expected for 3D-MoLM.

Looking forward to your reply!

lsh0520 commented 6 months ago

(1) @acharkq Zhiyuan, can you help with the first question? Thanks. (2) No. I have uploaded the training log with the evaluation results of the Stage 1 pretrained model to HuggingFace; please check its performance. The results in the picture don't make sense. My guesses are: (1) an environment problem, so please check that your environment matches requirements.txt; or (2) I wrongly uploaded a corrupted ckpt. I will look into this later this week and make sure you can reproduce the results in our paper with the ckpt and scripts.

acharkq commented 6 months ago

Thanks @icycookies for raising the issue about Uni-Mol's encoder mask. I have checked, and it is indeed contrary to the mask format used by HuggingFace's transformers. Luckily, the padding tokens can still capture some global information, so the model still works. I will discuss with Sihang and see how we will deal with this bug.
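(For context, HuggingFace-style transformers expect attention_mask to be 1 for real tokens and 0 for padding, which is then converted into an additive bias roughly as follows. This is a generic sketch of that convention, not this repo's code:)

```python
import torch

# Generic sketch of the HuggingFace convention: attention_mask is 1 for real
# tokens and 0 for padding, and is converted into an additive bias where
# padding key positions get a large negative value before the softmax.
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])                  # (B, L)
extended = attention_mask[:, None, None, :].to(torch.float32)     # (B, 1, 1, L)
additive_bias = (1.0 - extended) * torch.finfo(torch.float32).min
# additive_bias is added to the attention scores; a mask with the opposite
# convention (or one that is never converted) leaves padding attendable.
```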

pansanity666 commented 4 months ago

Hi,

The padding tokens are fixed to zeros, with no gradients, in the nn.Embedding of the Uni-Mol encoder, and the padding positions are masked with -inf in the attention.
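(Roughly, the setup I mean looks like this in PyTorch; the identifiers are illustrative, not the actual Uni-Mol code:)

```python
import torch
import torch.nn as nn

# Illustrative sketch: padding_idx pins the padding embedding to zeros and
# excludes it from gradient updates.
vocab_size, dim, pad_idx = 32, 8, 0
embed = nn.Embedding(vocab_size, dim, padding_idx=pad_idx)

tokens = torch.tensor([[5, 7, 9, pad_idx, pad_idx]])   # 3 atoms + 2 padding
x = embed(tokens)
assert torch.all(x[0, 3:] == 0)                        # padding rows are zero

# Additive attention bias: -inf on the padding *key* positions, so that real
# tokens cannot attend to padding.
key_padding = tokens.eq(pad_idx)                       # (B, L)
attn_bias = torch.zeros(1, 5, 5)
attn_bias.masked_fill_(key_padding.unsqueeze(1), float('-inf'))
```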

My question is: are the padding tokens really capturing any global information, or do they carry no 3D information at all?

Best,

acharkq commented 4 months ago

Hi,

Thanks for your interest. It clearly works to some extent; otherwise, Stage 1 would have zero performance.

pansanity666 commented 4 months ago

Hi,

I checked the attn_bias added to the self-attention of the Uni-Mol encoder:

[Screenshot of the attn_bias tensor values]

It seems that the attn_bias is actually not properly processed: the last few rows (corresponding to the padding tokens) still get attention weight towards the atom tokens, e.g., 17.8750.

I think this is why the padding tokens pick up global information.
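(A toy sketch of this effect, with made-up numbers rather than the repo's actual attn_bias: if the bias masks padding only in the key columns of the atom-query rows and leaves the padding-query rows untouched, the padding rows end up as weighted sums of atom features.)

```python
import torch

# 3 atom tokens + 2 padding tokens, toy attention scores.
L, pad_start = 5, 3
scores = torch.randn(L, L)               # raw scores, e.g. q @ k^T / sqrt(d)

# Bias that masks padding *columns* only for the atom-query rows, while the
# padding-query rows are left untouched (the situation observed above).
attn_bias = torch.zeros(L, L)
attn_bias[:pad_start, pad_start:] = float('-inf')

weights = torch.softmax(scores + attn_bias, dim=-1)
print(weights[pad_start:])               # padding rows: nonzero weights on atom tokens
# => the padding embeddings are updated as weighted sums of atom features,
#    which is how they end up carrying "global" information.
```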

Best,

acharkq commented 3 months ago

Thanks for the effort in verifying this.