isl-org / Open3D-ML

An extension of Open3D to address 3D Machine Learning tasks

What is the proper procedure for finetuning a pretrained model on a custom dataset using this library? #558

Open eliasm56 opened 2 years ago

eliasm56 commented 2 years ago

My Question

Hello everyone,

First of all, here are the versions I am running:

Python 3.8, Open3D 0.15.2, Torch 1.8.2+cpu

I would like to finetune RandLA-Net (pretrained on S3DIS) on my own custom point cloud dataset for indoor scene semantic segmentation. I've managed to load the dataset correctly and run inference with the randlanet_s3dis_202201071330utc.pth checkpoint. I have also modified the final layer's fc1.3.conv.weight and fc1.3.conv.bias in model_state_dict so that they only include weights for my classes of interest (following a solution from #277). However, when I bring this modified checkpoint in for training, I run into an issue: RuntimeError: The size of tensor a (13) must match the size of tensor b (3) at non-singleton dimension 0. I know this error occurs because S3DIS contains 13 classes, while I am trying to train on only 3 of them, so I must have forgotten some necessary modification. I will elaborate below.

Below is how I manually sliced the weights of the last layer, since I am only interested in walls, windows, and doors. However, I am not sure whether this is technically the right approach. Please let me know if I need to make other modifications in order to finetune:

import torch
import numpy as np

# Load the pretrained S3DIS checkpoint on CPU
checkpoint = torch.load('randlanet_s3dis_202201071330utc.pth', map_location=torch.device('cpu'))
ckpt_mod = checkpoint  # note: this is an alias, so the keys below are overwritten in place

# Modify pretrained model weights for fine-tuning
# Slice out weights for the 3 classes of interest
# (S3DIS label indices: 2 = wall, 5 = window, 6 = door)
ckpt_mod['model_state_dict']['fc1.3.conv.bias'] = checkpoint['model_state_dict']['fc1.3.conv.bias'][np.r_[2, 5, 6]]
ckpt_mod['model_state_dict']['fc1.3.conv.weight'] = checkpoint['model_state_dict']['fc1.3.conv.weight'][np.r_[2, 5, 6]]
# Save modified weights
save_mod = dict(epoch=checkpoint['epoch'],
                model_state_dict=ckpt_mod['model_state_dict'],
                optimizer_state_dict=checkpoint['optimizer_state_dict'],
                scheduler_state_dict=checkpoint['scheduler_state_dict'])
torch.save(save_mod, 'randlanet_finetune.pth')
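
As a quick sanity check (assuming the key names and shapes above), the sliced tensors can be inspected before saving; after slicing, their leading dimension should be 3 instead of 13:

# Sanity check: the sliced final-layer tensors should now have 3 output channels
print(save_mod['model_state_dict']['fc1.3.conv.weight'].shape)  # expected: torch.Size([3, 32, 1, 1])
print(save_mod['model_state_dict']['fc1.3.conv.bias'].shape)    # expected: torch.Size([3])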

Below is my training script. As you can see, I froze all parameters except those of the last layer, i.e. fc1.3.conv.weight and fc1.3.conv.bias:

import torch
import open3d.ml as _ml3d
import open3d.ml.torch as ml3d
import numpy as np
from ml3d.datasets.customdataset import Custom3D

cfg_file = "ml3d/configs/randlanet_s3dis.yml"
cfg = _ml3d.utils.Config.load_from_file(cfg_file)

model = ml3d.models.RandLANet(**cfg.model)

cfg.dataset['dataset_path'] = '/mnt/d/auto_class/Open3D-ML-master/NIST_data/NPY'

def freeze_all_but_last(model):
    # named_parameters is a tuple with (parameter name: string, parameters: tensor)
    for n, p in model.named_parameters():
        if 'fc1.3' in n:
            pass
        else:
            p.requires_grad = False

freeze_all_but_last(model)

dataset = Custom3D(cfg.dataset.pop('dataset_path', None), **cfg.dataset)
pipeline = ml3d.pipelines.SemanticSegmentation(model, dataset=dataset, device="cpu", **cfg.pipeline)

ckpt_path = 'randlanet_finetune.pth'
pipeline.load_ckpt(ckpt_path=ckpt_path)

pipeline.run_train()
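
As a quick check that the freeze behaves as intended (just a sketch, assuming the fc1.3 naming above), the parameters that remain trainable can be listed; only the final layer's weight and bias should show up:

# List the parameters that are still trainable after freezing
for n, p in model.named_parameters():
    if p.requires_grad:
        print(n, tuple(p.shape))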

However, when running the above training script, I run into the following error:

INFO - 2022-08-09 16:04:56,085 - semantic_segmentation - DEVICE : cpu
INFO - 2022-08-09 16:04:56,086 - semantic_segmentation - Logging in file : ./logs/RandLANet_Custom3D_torch/log_train_2022-08-09_16:04:56.txt
INFO - 2022-08-09 16:04:56,090 - customdataset - Found 70 pointclouds for train
INFO - 2022-08-09 16:04:56,091 - customdataset - Found 9 pointclouds for validation
INFO - 2022-08-09 16:04:56,096 - semantic_segmentation - Loading checkpoint randlanet_finetune.pth
INFO - 2022-08-09 16:04:56,241 - semantic_segmentation - Loading checkpoint optimizer_state_dict
INFO - 2022-08-09 16:04:56,265 - semantic_segmentation - Loading checkpoint scheduler_state_dict
INFO - 2022-08-09 16:04:56,277 - semantic_segmentation - Writing summary in train_log/00023_RandLANet_Custom3D_torch.
INFO - 2022-08-09 16:04:56,278 - semantic_segmentation - Started training
INFO - 2022-08-09 16:04:56,279 - semantic_segmentation - === EPOCH 0/200 ===
training:   0%|                                                                                  | 0/35 [00:03<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 pipeline.run_train()

File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py:421, in SemanticSegmentation.run_train(self)
    418 if model.cfg.get('grad_clip_norm', -1) > 0:
    419     torch.nn.utils.clip_grad_value_(model.parameters(),
    420                                     model.cfg.grad_clip_norm)
--> 421 self.optimizer.step()
    423 self.metric_train.update(predict_scores, gt_labels)
    425 self.losses.append(loss.cpu().item())

File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:65, in _LRScheduler.__init__.<locals>.with_counter.<locals>.wrapper(*args, **kwargs)
     63 instance._step_count += 1
     64 wrapped = func.__get__(instance, cls)
---> 65 return wrapped(*args, **kwargs)

File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/optimizer.py:89, in Optimizer._hook_for_profile.<locals>.profile_hook_step.<locals>.wrapper(*args, **kwargs)
     87 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     88 with torch.autograd.profiler.record_function(profile_name):
---> 89     return func(*args, **kwargs)

File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.__class__():
---> 27         return func(*args, **kwargs)

File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/adam.py:108, in Adam.step(self, closure)
    105             state_steps.append(state['step'])
    107     beta1, beta2 = group['betas']
--> 108     F.adam(params_with_grad,
    109            grads,
    110            exp_avgs,
    111            exp_avg_sqs,
    112            max_exp_avg_sqs,
    113            state_steps,
    114            group['amsgrad'],
    115            beta1,
    116            beta2,
    117            group['lr'],
    118            group['weight_decay'],
    119            group['eps'])
    120 return loss

File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/_functional.py:84, in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
     81     grad = grad.add(param, alpha=weight_decay)
     83 # Decay the first and second moment running average coefficient
---> 84 exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
     85 exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
     86 if amsgrad:
     87     # Maintains the maximum of all 2nd moment running avg. till now

RuntimeError: The size of tensor a (13) must match the size of tensor b (3) at non-singleton dimension 0

Based on where the error points, I assume this has something to do with the optimizer. Do I need to modify optimizer_state_dict in the same way I modified model_state_dict? Any help is appreciated, as being able to successfully use this library would be a big breakthrough in my research. Please let me know if I have left out any important details.

eliasm56 commented 2 years ago

Well, I managed to figure this out by modifying optimizer_state_dict with the following:

ckpt_mod['optimizer_state_dict']['state'][196]['exp_avg'] = checkpoint['optimizer_state_dict']['state'][196]['exp_avg'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][196]['exp_avg_sq'] = checkpoint['optimizer_state_dict']['state'][196]['exp_avg_sq'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][197]['exp_avg'] = checkpoint['optimizer_state_dict']['state'][197]['exp_avg'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][197]['exp_avg_sq'] = checkpoint['optimizer_state_dict']['state'][197]['exp_avg_sq'][np.r_[2,5,6]]
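
For anyone who prefers not to hard-code the state indices (196 and 197 here), a small sketch that reuses ckpt_mod and np from the slicing script above; it assumes that, in this checkpoint, only the final fc1.3 tensors still have a leading dimension of 13:

# Slice every Adam moment tensor whose leading dimension is still 13
# (assumed to correspond only to the final fc1.3 layer in this checkpoint)
keep = np.r_[2, 5, 6]
for state in ckpt_mod['optimizer_state_dict']['state'].values():
    for key in ('exp_avg', 'exp_avg_sq'):
        if key in state and state[key].shape[0] == 13:
            state[key] = state[key][keep]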

Now I'm running into a new weird issue. I get Loss train: nan eval: nan at each epoch. Any suggestions?

whuhxb commented 1 year ago

Hi @eliasm56, I have run into the same bug. Have you solved this problem?

AdityaRajThakur commented 1 year ago

To fix this issue, you should modify the model architecture so that the last layer has the correct number of output channels. You can do this by updating the num_classes parameter when creating the RandLANet model instance; in your case, set num_classes=3 to match the three classes you are training on. Below is the updated code:

import torch
import open3d.ml as _ml3d
import open3d.ml.torch as ml3d
import numpy as np
from ml3d.datasets.customdataset import Custom3D

cfg_file = "ml3d/configs/randlanet_s3dis.yml"
cfg = _ml3d.utils.Config.load_from_file(cfg_file)

# Update the number of classes in the last layer
cfg.model.num_classes = 3

model = ml3d.models.RandLANet(**cfg.model)

cfg.dataset['dataset_path'] = '/mnt/d/auto_class/Open3D-ML-master/NIST_data/NPY'

def freeze_all_but_last(model):
    # named_parameters is a tuple with (parameter name: string, parameters: tensor)
    for n, p in model.named_parameters():
        if 'fc1.3' in n:
            pass
        else:
            p.requires_grad = False

freeze_all_but_last(model)

dataset = Custom3D(cfg.dataset.pop('dataset_path', None), **cfg.dataset)
pipeline = ml3d.pipelines.SemanticSegmentation(model, dataset=dataset, device="cpu", **cfg.pipeline)

ckpt_path = 'randlanet_finetune.pth'
pipeline.load_ckpt(ckpt_path=ckpt_path)

pipeline.run_train()

samuelrawrs commented 1 year ago

@AdityaRajThakur hey! I tried out your code but was still getting the error of:

RuntimeError: Error(s) in loading state_dict for RandLANet:
    size mismatch for fc1.3.conv.weight: copying a param with shape torch.Size([13, 32, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 32, 1, 1]).
    size mismatch for fc1.3.conv.bias: copying a param with shape torch.Size([13]) from checkpoint, the shape in current model is torch.Size([3]).

It seems like setting num_classes=3 builds the model with an output size of 3, but the checkpoint weights are still size 13. Am I missing a step? Thanks!
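
One likely missing step, based on eliasm56's first post above (so treat this as a sketch rather than a confirmed fix): with num_classes=3 the model expects 3-channel fc1.3 tensors, so the final layer of the S3DIS checkpoint has to be sliced down to those 3 classes before load_ckpt is called, and the modified file is what gets loaded. Instead of also slicing the Adam moments as done above, one simpler option is to drop the optimizer/scheduler states entirely so training starts with fresh statistics (assuming load_ckpt only restores them when those keys are present):

import copy
import numpy as np
import torch

# Slice the final-layer weights of the 13-class S3DIS checkpoint down to
# the 3 classes of interest (S3DIS indices: 2 = wall, 5 = window, 6 = door)
checkpoint = torch.load('randlanet_s3dis_202201071330utc.pth', map_location='cpu')
ckpt_mod = copy.deepcopy(checkpoint)
keep = np.r_[2, 5, 6]
for key in ('fc1.3.conv.weight', 'fc1.3.conv.bias'):
    ckpt_mod['model_state_dict'][key] = checkpoint['model_state_dict'][key][keep]

# Save only the model weights; omitting the optimizer/scheduler states lets
# training start with a fresh optimizer for the 3-class head
save_mod = dict(epoch=0, model_state_dict=ckpt_mod['model_state_dict'])
torch.save(save_mod, 'randlanet_finetune.pth')

Then point ckpt_path at this file, keeping cfg.model.num_classes = 3 as in the script above.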