isl-org / Open3D-ML

An extension of Open3D to address 3D Machine Learning tasks

Error "Weight tensor should be defined either for all or no classes" when training on S3DIS #580

Closed shayan-nikoo closed 1 year ago

shayan-nikoo commented 1 year ago

Describe the issue

I am trying to train the KPConv PyTorch semantic segmentation model on the S3DIS dataset, but I keep getting RuntimeError: weight tensor should be defined either for all 13 classes or no classes but got weight tensor of shape: [1, 13]. First, I noticed that pickles were not created for all files: preprocessing stops at Area_5/office_18. I fixed the line below, and the preprocessing now works fine:

In line `323474` of the file `S3DIS/Stanford3dDataset_v1.2/Area_5/office_19/Annotations/ceiling_1.txt`, correct the number `103.0�0000` (which contains an invalid byte) to `103.000000`.
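
For anyone hitting the same preprocessing stop, here is a minimal sketch (not part of Open3D-ML; the dataset root mirrors the path in the command below, so adjust it to your own copy) that scans the annotation files for tokens that fail to parse as floats:

# Sketch: report any token in the S3DIS annotation files that is not a
# valid float, together with its file and line number.
from pathlib import Path

root = Path("/media/data/S3DIS/Stanford3dDataset_v1.2")
for txt in sorted(root.glob("Area_*/*/Annotations/*.txt")):
    with open(txt, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            for token in line.split():
                try:
                    float(token)
                except ValueError:
                    print(f"{txt}:{lineno}: bad token {token!r}")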

Steps to reproduce the bug

This is the command I run:


$ python3 scripts/run_pipeline.py torch -c ml3d/configs/kpconv_s3dis.yml --dataset.dataset_path /media/data/S3DIS/Stanford3dDataset_v1.2 --device gpu --pipeline SemanticSegmentation --dataset.use_cache True

The config file is the same as ml3d/configs/kpconv_s3dis.yml; I only added two parameters to the pipeline config:

  num_workers: 0 # added by me
  pin_memory: false # added by me

Error message

regular arguments
backend: gloo
batch_size: null
cfg_dataset: null
cfg_file: ml3d/configs/kpconv_s3dis.yml
cfg_model: null
cfg_pipeline: null
ckpt_path: null
dataset: null
dataset_path: null
device: gpu
device_ids:
- '0'
framework: torch
host: localhost
main_log_dir: null
max_epochs: null
mode: null
model: null
pipeline: SemanticSegmentation
port: '12355'
seed: 0
split: train

extra arguments
dataset.dataset_path: /media/data/S3DIS/Stanford3dDataset_v1.2
dataset.use_cache: 'True'

creating dataset
100%|████████████████████| 272/272 [02:10<00:00,  2.09it/s]
INFO - 2023-02-10 12:16:32,218 - semantic_segmentation - DEVICE : cuda
INFO - 2023-02-10 12:16:32,218 - semantic_segmentation - Logging in file : ./logs/KPFCNN_S3DIS_torch/log_train_2023-02-10_12:16:32.txt
INFO - 2023-02-10 12:16:32,228 - s3dis - Found 249 pointclouds for train
preprocess: 100%|████████████████████| 249/249 [00:24<00:00, 10.19it/s]
INFO - 2023-02-10 12:16:56,666 - s3dis - Found 23 pointclouds for validation
preprocess: 100%|████████████████████| 23/23 [00:14<00:00,  1.64it/s]
INFO - 2023-02-10 12:17:10,706 - semantic_segmentation - Initializing from scratch.
INFO - 2023-02-10 12:17:10,707 - semantic_segmentation - Writing summary in train_log/00020_KPFCNN_S3DIS_torch.
INFO - 2023-02-10 12:17:10,707 - semantic_segmentation - Started training
INFO - 2023-02-10 12:17:10,707 - semantic_segmentation - === EPOCH 0/10 ===
training:   0%|                    | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/xx/repos/kpconv-seg/scripts/run_pipeline.py", line 245, in <module>
    sys.exit(main())
  File "/home/xx/repos/kpconv-seg/scripts/run_pipeline.py", line 179, in main
    pipeline.run_train()
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py", line 411, in run_train
    loss, gt_labels, predict_scores = model.get_loss(
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/open3d/_ml3d/torch/models/kpconv.py", line 337, in get_loss
    self.output_loss = Loss.weighted_CrossEntropyLoss(scores, labels)
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1164, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all 13 classes or no classes but got weight tensor of shape: [1, 13]

Expected behavior

Run the training on S3DIS without errors.

Open3D, Python and System information

- Operating system: Ubuntu 22.04
- Python version: 3.10.6 [GCC 11.3.0] (output from `import sys; print(sys.version)`)
- Open3D version: 0.16.0 (output from python: `print(open3d.__version__)`)
- System type: x86_64
- Is this a remote workstation?: no
- How did you install Open3D?: pip

Additional information
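
For context, the loss failure can be reproduced in isolation: torch.nn.CrossEntropyLoss expects its weight argument to be a 1-D tensor with one entry per class, so a weight of shape [1, 13] raises exactly this error. A minimal sketch (independent of Open3D-ML):

# Minimal repro of the weight-shape RuntimeError.
import torch
import torch.nn as nn

scores = torch.randn(8, 13)          # (N, C) logits for 13 classes
labels = torch.randint(0, 13, (8,))  # (N,) ground-truth class indices

ok = nn.CrossEntropyLoss(weight=torch.ones(13))      # 1-D weight: works
ok(scores, labels)

bad = nn.CrossEntropyLoss(weight=torch.ones(1, 13))  # [1, 13] weight
bad(scores, labels)  # RuntimeError: weight tensor should be defined ...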

shayan-nikoo commented 1 year ago

I tested this in a CPU-only Docker environment, but the error persists.

shayan-nikoo commented 1 year ago

I could skip the error and run the training by passing class_weights: [] (empty) in the config file. However, the training and test results don't look right to me; the mIoU is very low. The test result is Overall Testing Accuracy : 0.108, mIoU : 0.058. Is the mIoU really almost zero, or am I missing something?

Training results (S3DIS)

INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - Loss train: 1.693  eval: 1.692
INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - Mean acc train: 0.174  eval: 0.267
INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - Mean IoU train: 0.087  eval: 0.148
INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - === EPOCH 798/800 ===
training: 100%|████████████████████| 63/63 [00:01<00:00, 42.48it/s]
validation: 100%|████████████████████| 6/6 [00:00<00:00, 79.69it/s]
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - Loss train: 1.635  eval: 1.396
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - Mean acc train: 0.151  eval: 0.352
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - Mean IoU train: 0.081  eval: 0.182
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - === EPOCH 799/800 ===
training: 100%|████████████████████| 63/63 [00:01<00:00, 42.03it/s]
validation: 100%|████████████████████| 6/6 [00:00<00:00, 84.19it/s]
INFO - 2023-02-10 17:32:12,779 - semantic_segmentation - Loss train: 1.659  eval: 1.652
INFO - 2023-02-10 17:32:12,779 - semantic_segmentation - Mean acc train: 0.163  eval: 0.267
INFO - 2023-02-10 17:32:12,779 - semantic_segmentation - Mean IoU train: 0.086  eval: 0.109
INFO - 2023-02-10 17:32:12,780 - semantic_segmentation - === EPOCH 800/800 ===
training: 100%|████████████████████| 63/63 [00:01<00:00, 39.32it/s]
validation: 100%|████████████████████| 6/6 [00:00<00:00, 75.36it/s]
INFO - 2023-02-10 17:32:14,463 - semantic_segmentation - Loss train: 1.726  eval: 2.458
INFO - 2023-02-10 17:32:14,463 - semantic_segmentation - Mean acc train: 0.169  eval: 0.226
INFO - 2023-02-10 17:32:14,463 - semantic_segmentation - Mean IoU train: 0.086  eval: 0.072
INFO - 2023-02-10 17:32:14,552 - semantic_segmentation - Epoch 800: save ckpt to ./logs/KPFCNN_S3DIS_torch/checkpoint

Test

INFO - 2023-02-10 18:15:06,649 - semantic_segmentation - Accuracy : [0.21310890361166648, 0.3248931736760441, 0.5463872361721226, 0.0, 0.0, 0.0, 0.0, 0.0011374866459067771, 0.0, 0.0, 0.0, 0.0, 0.3207777518363869, 0.10817727322631744]
INFO - 2023-02-10 18:15:06,649 - semantic_segmentation - IoU : [0.1751161378342312, 0.23549265987632134, 0.25156889849491443, 0.0, 0.0, 0.0, 0.0, 0.0010764818849628398, 0.0, 0.0, 0.0, 0.0, 0.09091394402481037, 0.05801293247040308]
INFO - 2023-02-10 18:15:06,658 - s3dis - Saved Area_3_office_6 in ./test/S3DIS/Area_3_office_6.npy.
test 21/23: 100%|████████████████████| 53173/53173 [00:01<00:00, 44756.97it/s]
test 22/23: 100%|████████████████████| 13400/13400 [00:00<00:00, 119477.80it/s]
INFO - 2023-02-10 18:15:06,953 - semantic_segmentation - Accuracy : [0.21271639007411344, 0.3281782025759704, 0.5457846251452136, 0.0, 0.0, 0.0, 0.0, 0.0011374866459067771, 0.0, 0.0, 0.0, 0.0, 0.32096158074793024, 0.1083675603991642]
INFO - 2023-02-10 18:15:06,953 - semantic_segmentation - IoU : [0.1746411967770861, 0.23722240797212982, 0.2534656981039331, 0.0, 0.0, 0.0, 0.0, 0.0010754038786806772, 0.0, 0.0, 0.0, 0.0, 0.09036740278810888, 0.05821323919384143]
INFO - 2023-02-10 18:15:06,953 - s3dis - Saved Area_3_hallway_6 in ./test/S3DIS/Area_3_hallway_6.npy.
INFO - 2023-02-10 18:15:06,954 - semantic_segmentation - Overall Testing Accuracy : 0.1083675603991642, mIoU : 0.05821323919384143
INFO - 2023-02-10 18:15:06,954 - semantic_segmentation - Finished testing

I trained the model on S3DIS/Stanford3dDataset_v1.2, not the aligned version, with max_epochs set to 800. Does anyone else get such poor results?

srzxDragon commented 1 year ago

I have the same problem, and I remedied this issue by modifying SemSegLoss in semseg_loss.py (from line 40). I don't know why the previous code returns a 2-D array from DataProcessing.get_class_weights(dataset.cfg.class_weights); dataset.cfg.class_weights is already a 1-D array, and that is all we need. You can try this by modifying the site package site-packages/open3d/_ml3d/torch/modules/semseg_loss.py:

class SemSegLoss(object):
    """Loss functions for semantic segmentation."""

    def __init__(self, pipeline, model, dataset, device):
        super(SemSegLoss, self).__init__()
        # weighted_CrossEntropyLoss
        if 'class_weights' in dataset.cfg.keys() and len(
                dataset.cfg.class_weights) != 0:
            # Was: class_wt = DataProcessing.get_class_weights(
            #          dataset.cfg.class_weights)
            # which returns a 2-D array of shape [1, C]; use the 1-D
            # config values directly instead.
            class_wt = dataset.cfg.class_weights
            weights = torch.tensor(class_wt, dtype=torch.float, device=device)

            self.weighted_CrossEntropyLoss = nn.CrossEntropyLoss(weight=weights)
        else:
            self.weighted_CrossEntropyLoss = nn.CrossEntropyLoss()
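
Note that this uses the raw values from the config directly as the loss weights, bypassing the inverse-frequency weighting that DataProcessing.get_class_weights computes from them (the helper is quoted later in this thread). If the config values are per-class point counts, a smaller change that keeps the intended weighting would be to squeeze out the extra dimension instead, e.g. replacing the two lines above with (a sketch, not an official patch):

# Keep the helper's inverse-frequency weighting; drop the leading
# dimension that np.expand_dims adds, turning [1, C] into [C].
class_wt = DataProcessing.get_class_weights(dataset.cfg.class_weights)
weights = torch.tensor(class_wt, dtype=torch.float,
                       device=device).squeeze(0)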

shayan-nikoo commented 1 year ago

Thank you @srzxDragon, this fix worked. Now the code runs without passing empty class_weights. But the results are still very poor: mIoU 0.10.

INFO - 2023-02-22 11:14:22,952 - semantic_segmentation - Loss train: 1.231  eval: 1.247
INFO - 2023-02-22 11:14:22,952 - semantic_segmentation - Mean acc train: 0.175  eval: 0.237
INFO - 2023-02-22 11:14:22,953 - semantic_segmentation - Mean IoU train: 0.101  eval: 0.123
INFO - 2023-02-22 11:14:23,090 - semantic_segmentation - Epoch 800: save ckpt to ./logs/KPFCNN_S3DIS_torch/checkpoint

Are your training results with KPConv also like this?

srzxDragon commented 1 year ago

I didn't train KPConv on S3DIS, but I trained RandLA-Net on Semantic3D. I found a similar issue: the performance is not as good as the official results. My results are as follows, but the official results reach 76.0 mIoU. I have no idea about this issue.

[results screenshot]

Marine98k commented 1 year ago

Hi, I met the same problem, but besides this there is another warning, as follows:

UserWarning: An output with one or more elements was resized since it had shape [180224], which does not match the required output shape [4, 45056]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:17.) return torch.stack(batch, 0, out=out)

Have you met this problem?

shayan-nikoo commented 1 year ago

I don't remember exactly, but I don't think I encountered this warning. You can see my error log in the post.

weypro commented 1 year ago

Let's look at the get_class_weights function:

@staticmethod
def get_class_weights(num_per_class):
    # num_per_class: pre-calculated number of points in each category
    num_per_class = np.array(num_per_class, dtype=np.float32)

    weight = num_per_class / float(sum(num_per_class))
    ce_label_weight = 1 / (weight + 0.02)

    return np.expand_dims(ce_label_weight, axis=0)

So, does anyone know why np.expand_dims is needed here?

I changed it to the following:

return ce_label_weight

It works.
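
To sanity-check the change, here is a quick sketch (the per-class counts are made up) showing that the patched function now returns the 1-D array that CrossEntropyLoss accepts:

# Sketch: verify the weight shape returned by the patched function.
import numpy as np

def get_class_weights(num_per_class):
    num_per_class = np.array(num_per_class, dtype=np.float32)
    weight = num_per_class / float(sum(num_per_class))
    ce_label_weight = 1 / (weight + 0.02)
    return ce_label_weight  # was: np.expand_dims(ce_label_weight, axis=0)

print(get_class_weights([100, 200, 700]).shape)  # (3,) -- 1-D, as required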

Marine98k commented 1 year ago

I don't remember exactly, but I don't think I encountered this warning. You can see my error log in the post.

thank you for your reply

minhtcai commented 1 year ago

Did you guys find a solution for the low IoU? @shayan-nikoo @srzxDragon

shayan-nikoo commented 1 year ago

Unfortunately not. I think it is an implementation issue in Open3D-ML, because I am using everything with their default settings.