smiles724 / VQMAE


Error loading model weights #2

Open BSharmi opened 1 month ago

BSharmi commented 1 month ago

Hello!

Great work!

I tried to load the light weight model following

https://github.com/smiles724/VQMAE/blob/master/infer_new_pdb.py#L62C5-L62C47

using the code

    ckpt = torch.load(args.ckpt, map_location='cpu')
    cfg_ckpt = ckpt['config']
    model = get_model(cfg_ckpt.model).to(args.device)
    lsd = model.load_state_dict(ckpt['model'])

and get the following error

 lsd = model.load_state_dict(ckpt['model'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/SurfVQMAE/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SurfaceTransformerV2:
    Missing key(s) in state_dict: "classifier.0.weight", "classifier.0.bias", "classifier.2.weight", "classifier.2.bias", "classifier.4.weight", "classifier.4.bias". 
    Unexpected key(s) in state_dict: "surface_encoder.orientation_scores.0.weight", "surface_encoder.orientation_scores.0.bias", "surface_encoder.orientation_scores.2.weight", "surface_encoder.orientation_scores.2.bias". 
    size mismatch for surface_encoder.conv.layers.0.net_in.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for surface_encoder.conv.linear_transform.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).

Do I need to change something in the code to load the model?

Thank you!

smiles724 commented 1 month ago

Thanks for your interest in our code. It is great to know that people in the community are interested in SurfVQMAE and are attempting to reproduce the results.

Regarding your question, could you please try infer.py instead? infer_new_pdb.py is an older inference script, and I have now deleted it.

If you still encounter errors, please do not hesitate to let me know!

smiles724 commented 1 month ago

The major difference between infer.py and infer_new_pdb.py is that I set the strict option to False:

    model.load_state_dict(ckpt['model'], strict=False)

This is because when training the VAE, I built several additional blocks, including tokenizer.mlp, decoder, and hbond_mlp, for predicting unsupervised features. When transferring to downstream tasks like epitope prediction, those blocks are not required, and a new classifier is needed instead. As a result, there is a key mismatch when loading the pretrained model weights. Hope this explanation helps.
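(Editorial note: a minimal, self-contained sketch of the behavior described above, using toy stand-in modules rather than the actual SurfVQMAE classes. With `strict=False`, `load_state_dict` tolerates missing and unexpected keys and reports them in its return value instead of raising. Note that it does *not* tolerate same-named parameters whose shapes differ; those still raise a size-mismatch error, which is why the error in this thread persists.)

```python
import torch.nn as nn

# Toy stand-ins: the pretraining model carries extra heads (decoder),
# while the downstream model replaces them with a fresh classifier.
class Pretrain(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.decoder = nn.Linear(16, 16)   # only needed during pretraining

class Finetune(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.classifier = nn.Linear(16, 2)  # new head for the downstream task

ckpt = Pretrain().state_dict()
# strict=False skips the key mismatches and reports them instead of raising.
result = Finetune().load_state_dict(ckpt, strict=False)
print(sorted(result.missing_keys))     # the classifier.* params, freshly initialised
print(sorted(result.unexpected_keys))  # the decoder.* params, silently ignored
```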

BSharmi commented 1 month ago

Thank you! Right after sending my message I realized I should have tried that first :) but I still get a shape mismatch error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/SurfVQMAE/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SurfaceTransformerV2:
    size mismatch for surface_encoder.conv.layers.0.net_in.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for surface_encoder.conv.linear_transform.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).

smiles724 commented 1 month ago

I have uploaded both the pretraining and the fine-tuning model weights to the weight/ folder. Can you please try them again?

Loading these parameters works fine on my computer, so perhaps I previously uploaded a weight file from an older version. Sorry for the inconvenience.

BSharmi commented 1 month ago

Thank you so much for addressing both issues. I still cannot load the model; maybe I am doing something wrong. Here is how I try to load it:

from src.models import get_model
import torch
device = "cuda:1"

ckpt = torch.load("/efs/home/sharmiba/SurfVQMAE_V2/VQMAE/weight/light_pretrain.pt", device)
cfg = ckpt['config']
model = get_model(cfg.model).to(device)
model.load_state_dict(ckpt['model'], strict=False)

I get the error:

RuntimeError: Error(s) in loading state_dict for SurfaceTransformerV2:
    size mismatch for surface_encoder.conv.layers.0.net_in.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for surface_encoder.conv.linear_transform.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).
>>> 

I also tried the fine-tuned model; the only difference is that I need to change cfg = ckpt['config'] to cfg = ckpt['cfg']. The code is copied below:

from src.models import get_model
import torch
device = "cuda:1"

ckpt = torch.load("/efs/home/sharmiba/SurfVQMAE_V2/VQMAE/weight/light_finetune.pt", device)
cfg = ckpt['cfg']
model = get_model(cfg.model).to(device)
model.load_state_dict(ckpt['model'], strict=False)

and get the same error:

RuntimeError: Error(s) in loading state_dict for SurfaceTransformerV2:
    size mismatch for surface_encoder.conv.layers.0.net_in.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for surface_encoder.conv.linear_transform.0.weight: copying a param with shape torch.Size([16, 26]) from checkpoint, the shape in current model is torch.Size([16, 16])

Did I do something wrong? I have updated the GitHub repository and downloaded the new weights, but I am still not able to load the model :(

Sorry for bugging you again; I just want to predict the tokens for a given structure. Is there an easy way to do that?

Thank you very much

smiles724 commented 1 month ago

Sorry for the late reply! I have double-checked the checkpoints and encountered the same problem.

Let me give you a comprehensive overview of those checkpoints:

I used the vae_2024_02_01__19_03_24 checkpoint for fine-tuning. (screenshot)

However, as you discovered, the layers in dMaSIFConv_seg are mismatched. (screenshot)

To understand the problem: the mismatch comes from the net_in module in the geometry.py script. (screenshot)

To my knowledge, I directly used dMaSIF (https://github.com/FreyrS/dMaSIF/blob/0dcc26c3c218a39d5fe26beb2e788b95fb028896/benchmark_models.py#L233) as the surface point cloud encoder and did not change its internal modules. Therefore, the size of surface_encoder.conv.layers.0.net_in.0.weight ought to be (res_dim, hidden_dim), namely (16, 16) in this case.

I believe that in the previous SurfFormer v2, I modified the architecture of dMaSIF for pretraining but later reverted the modification. However, as that was half a year ago, I forget what the original version looked like (a lesson that I need to do better version control).

This is entirely my fault, and I understand your need to predict the tokens for a given structure. As a partial resolution, I have provided the latest fine-tuned model weights for surface-based epitope prediction (see checkpoints/light_finetune_new.pt), which have a consistent weight shape of (16, 16). I hope this meets your needs.
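(Editorial note: until a shape-consistent pretrained checkpoint is available, a generic workaround, not code from this repo, is to filter the checkpoint down to entries whose name and shape both match the current model before loading. The skipped tensors remain randomly initialised, so this is useful for diagnosis or partial weight transfer, not a substitute for a correct checkpoint.)

```python
import torch.nn as nn

def load_compatible(model, ckpt_state):
    """Load only checkpoint entries whose name AND shape match the model.

    Mismatched entries (e.g. a (16, 26) weight where the model expects
    (16, 16)) are skipped and left at their fresh initialisation.
    Returns the sorted list of skipped keys for inspection.
    """
    model_state = model.state_dict()
    compatible = {k: v for k, v in ckpt_state.items()
                  if k in model_state and v.shape == model_state[k].shape}
    model.load_state_dict(compatible, strict=False)
    return sorted(set(ckpt_state) - set(compatible))

# Toy demo: the checkpoint's 'proj' layer has input width 26 while the model
# expects 16, mirroring the (16, 26) vs (16, 16) mismatch in this thread.
old = nn.ModuleDict({"proj": nn.Linear(26, 16), "head": nn.Linear(16, 16)})
new = nn.ModuleDict({"proj": nn.Linear(16, 16), "head": nn.Linear(16, 16)})
skipped = load_compatible(new, old.state_dict())
print(skipped)  # only the mismatched weight is skipped; biases still match
```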

If you urgently need the pretrained model weights, please let me know.

BSharmi commented 1 month ago

Thank you, that makes sense now. I am able to load the model!

I am not in a rush, but it would be great if you could provide the pretrained model weights at some point, the same ones you used in the paper, for reproducibility.

Thank you very much, Sharmi