eliasm56 opened 2 years ago
Well, I managed to figure this out by modifying optimizer_state_dict with the following:
ckpt_mod['optimizer_state_dict']['state'][196]['exp_avg'] = checkpoint['optimizer_state_dict']['state'][196]['exp_avg'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][196]['exp_avg_sq'] = checkpoint['optimizer_state_dict']['state'][196]['exp_avg_sq'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][197]['exp_avg'] = checkpoint['optimizer_state_dict']['state'][197]['exp_avg'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][197]['exp_avg_sq'] = checkpoint['optimizer_state_dict']['state'][197]['exp_avg_sq'][np.r_[2,5,6]]
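For context, np.r_[2, 5, 6] builds the index array [2, 5, 6], so this fancy indexing keeps only the rows for the three retained classes. A minimal sketch with a NumPy stand-in for one optimizer buffer (the real entries are torch tensors, and the (13, 32) shape here is assumed for illustration):

```python
import numpy as np

# Stand-in for an Adam buffer with one row per class: 13 S3DIS classes,
# 32 features per row (shape assumed for illustration).
exp_avg = np.random.randn(13, 32)

# Keep only rows 2, 5 and 6 -- the three classes of interest.
sliced = exp_avg[np.r_[2, 5, 6]]

print(sliced.shape)  # (3, 32)
```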
Now I'm running into a new weird issue: I get Loss train: nan eval: nan at every epoch. Any suggestions?
Hi @eliasm56, I have run into the same bug. Have you solved this problem?
To fix this issue, you should modify the model architecture so that the last layer has the correct number of output channels. You can do this by updating the num_classes parameter when creating the RandLANet model instance; in your case, set num_classes=3 to match the three classes you are training on. Below is the updated code:
import torch
import open3d.ml as _ml3d
import open3d.ml.torch as ml3d
import numpy as np
from ml3d.datasets.customdataset import Custom3D

cfg_file = "ml3d/configs/randlanet_s3dis.yml"
cfg = _ml3d.utils.Config.load_from_file(cfg_file)
cfg.model.num_classes = 3
model = ml3d.models.RandLANet(**cfg.model)
cfg.dataset['dataset_path'] = '/mnt/d/auto_class/Open3D-ML-master/NIST_data/NPY'

def freeze_all_but_last(model):
    # Freeze every parameter except the final classification layer (fc1.3).
    for n, p in model.named_parameters():
        if 'fc1.3' not in n:
            p.requires_grad = False

freeze_all_but_last(model)

dataset = Custom3D(cfg.dataset.pop('dataset_path', None), **cfg.dataset)
pipeline = ml3d.pipelines.SemanticSegmentation(model, dataset=dataset, device="cpu", **cfg.pipeline)
ckpt_path = 'randlanet_finetune.pth'
pipeline.load_ckpt(ckpt_path=ckpt_path)
pipeline.run_train()
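As a side note, before calling run_train() it can help to confirm that only the intended parameters remain trainable. A dependency-free sketch of the same name-matching logic (the Param class and the parameter names are stand-ins for torch parameters; the real names come from model.named_parameters()):

```python
class Param:
    """Hypothetical stand-in for a torch parameter (only requires_grad matters here)."""
    def __init__(self):
        self.requires_grad = True

def freeze_all_but_last(named_params):
    # Freeze every parameter whose name does not contain 'fc1.3',
    # mirroring the loop in the training script above.
    for name, p in named_params:
        if 'fc1.3' not in name:
            p.requires_grad = False

# Parameter names assumed for illustration.
named = [('encoder.conv0.weight', Param()),
         ('fc1.3.conv.weight', Param()),
         ('fc1.3.conv.bias', Param())]
freeze_all_but_last(named)

trainable = [n for n, p in named if p.requires_grad]
print(trainable)  # ['fc1.3.conv.weight', 'fc1.3.conv.bias']
```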
@AdityaRajThakur hey! I tried out your code but was still getting the error of:
RuntimeError: Error(s) in loading state_dict for RandLANet:
size mismatch for fc1.3.conv.weight: copying a param with shape torch.Size([13, 32, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 32, 1, 1]).
size mismatch for fc1.3.conv.bias: copying a param with shape torch.Size([13]) from checkpoint, the shape in current model is torch.Size([3]).
It seems like setting num_classes=3 helps to load the model to size 3 but the checkpoint is still size 13. Am I missing a step? Thanks!
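One possible missing step (an assumption, not verified against the library): load_ckpt restores the unmodified 13-class checkpoint, so the mismatched checkpoint tensors would need to be sliced down to 3 rows before loading. A minimal NumPy sketch of finding and slicing the mismatched keys (the real entries are torch tensors; key names and shapes are taken from the error message, and indices 2, 5, 6 follow the optimizer-state fix quoted earlier in the thread):

```python
import numpy as np

# Hypothetical shapes: checkpoint trained on 13 S3DIS classes,
# model built with num_classes=3.
ckpt_state = {'fc1.3.conv.weight': np.zeros((13, 32, 1, 1)),
              'fc1.3.conv.bias': np.zeros(13)}
model_state = {'fc1.3.conv.weight': np.zeros((3, 32, 1, 1)),
               'fc1.3.conv.bias': np.zeros(3)}

# Find every key whose shape disagrees, then slice those checkpoint entries
# down to the retained class rows before loading the checkpoint.
keep = np.r_[2, 5, 6]
mismatched = [k for k in ckpt_state if ckpt_state[k].shape != model_state[k].shape]
for k in mismatched:
    ckpt_state[k] = ckpt_state[k][keep]

print(ckpt_state['fc1.3.conv.weight'].shape)  # (3, 32, 1, 1)
```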
My Question
Hello everyone,
First of all, here are the versions I am running:
Python 3.8
Open3D 0.15.2
Torch 1.8.2+cpu
I would like to finetune RandLA-Net (pretrained on S3DIS) on my own custom point cloud dataset for indoor scene semantic segmentation. I've managed to load the dataset correctly and run inference with the randlanet_s3dis_202201071330utc.pth checkpoint. I have also managed to modify the final fc.conv.weight and fc.conv.bias in model_state_dict so that they only include weights for my classes of interest (implemented a solution from #277). However, when bringing in this modified model checkpoint for training, I run into an issue: RuntimeError: The size of tensor a (13) must match the size of tensor b (3) at non-singleton dimension 0. I know this error is occurring because S3DIS contains 13 classes, while I am trying to train on only 3 of them. So, I must have forgotten some necessary modification. I will elaborate further below.
Below is how I manually sliced the weights in the last layer, since I am only interested in walls, windows, and doors. But I am not sure if this is technically the right thing to do. Please let me know if I need to make other modifications in order to finetune:
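The slicing code itself is not reproduced in this thread; based on the optimizer-state fix quoted earlier, it was presumably of the following form (a NumPy stand-in sketch -- the real values are torch tensors inside the loaded checkpoint's model_state_dict, the shapes come from the error messages, and indices 2, 5, 6 are the wall/window/door rows used elsewhere in the thread):

```python
import numpy as np

# Stand-in for the final layer's entries in model_state_dict: the last conv
# maps 32 features to one output channel per class (shapes from the errors).
model_state = {
    'fc.conv.weight': np.random.randn(13, 32, 1, 1),  # 13 S3DIS classes
    'fc.conv.bias': np.random.randn(13),
}

keep = np.r_[2, 5, 6]  # wall, window, door (indices as used in the thread)
model_state['fc.conv.weight'] = model_state['fc.conv.weight'][keep]
model_state['fc.conv.bias'] = model_state['fc.conv.bias'][keep]

print(model_state['fc.conv.weight'].shape)  # (3, 32, 1, 1)
```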
Below is my training script. As you see, I froze all but the last layer's parameters, that is, fc1.3.conv.bias and fc1.3.conv.weight:
However, when running the above training script, I run into the following error:
Based on where the error is pointing, I assume this has something to do with the optimizer. Do I need to make any modifications to optimizer_state_dict as I did to model_state_dict? Any help is appreciated, as being able to successfully implement the library would be a big breakthrough in my own research. Please let me know if I have left out any important details.