isl-org / Open3D-ML

An extension of Open3D to address 3D Machine Learning tasks

Error "Weight tensor should be defined either for all or no classes" when training on S3DIS #580

Closed shayan-nikoo closed 1 year ago

shayan-nikoo commented 1 year ago

Describe the issue

I am trying to train the KPConv PyTorch semantic segmentation model on the S3DIS dataset, but I keep getting RuntimeError: weight tensor should be defined either for all 13 classes or no classes but got weight tensor of shape: [1, 13]. First, I noticed that pickles were not created for all files: preprocessing stops at Area_5/office_18. I fixed the line below, and the preprocessing now works fine:

In line `323474` of the file `S3DIS/Stanford3dDataset_v1.2/Area_5/office_19/Annotations/ceiling_1.txt`, correct the number `103.0�0000` (which contains an invalid byte) to `103.000000`.
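
For anyone hitting the same preprocessing stop, here is a minimal sketch (not part of Open3D-ML; the dataset root mirrors the path in the command below, so adjust it to your own copy) that scans the annotation files for tokens that fail to parse as floats:

# Sketch: report any token in the S3DIS annotation files that is not a
# valid float, together with its file and line number.
from pathlib import Path

root = Path("/media/data/S3DIS/Stanford3dDataset_v1.2")
for txt in sorted(root.glob("Area_*/*/Annotations/*.txt")):
    with open(txt, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            for token in line.split():
                try:
                    float(token)
                except ValueError:
                    print(f"{txt}:{lineno}: bad token {token!r}")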

Steps to reproduce the bug

This is the command I run:


$ python3 scripts/run_pipeline.py torch -c ml3d/configs/kpconv_s3dis.yml --dataset.dataset_path /media/data/S3DIS/Stanford3dDataset_v1.2 --device gpu --pipeline SemanticSegmentation --dataset.use_cache True

The config file is the same as ml3d/configs/kpconv_s3dis.yml; I only added two parameters to the pipeline config:

  num_workers: 0 # added by me
  pin_memory: false # added by me

Error message

regular arguments
backend: gloo
batch_size: null
cfg_dataset: null
cfg_file: ml3d/configs/kpconv_s3dis.yml
cfg_model: null
cfg_pipeline: null
ckpt_path: null
dataset: null
dataset_path: null
device: gpu
device_ids:
- '0'
framework: torch
host: localhost
main_log_dir: null
max_epochs: null
mode: null
model: null
pipeline: SemanticSegmentation
port: '12355'
seed: 0
split: train

extra arguments
dataset.dataset_path: /media/data/S3DIS/Stanford3dDataset_v1.2
dataset.use_cache: 'True'

creating dataset
100%|████████████████████| 272/272 [02:10<00:00,  2.09it/s]
INFO - 2023-02-10 12:16:32,218 - semantic_segmentation - DEVICE : cuda
INFO - 2023-02-10 12:16:32,218 - semantic_segmentation - Logging in file : ./logs/KPFCNN_S3DIS_torch/log_train_2023-02-10_12:16:32.txt
INFO - 2023-02-10 12:16:32,228 - s3dis - Found 249 pointclouds for train
preprocess: 100%|████████████████████| 249/249 [00:24<00:00, 10.19it/s]
INFO - 2023-02-10 12:16:56,666 - s3dis - Found 23 pointclouds for validation
preprocess: 100%|████████████████████| 23/23 [00:14<00:00,  1.64it/s]
INFO - 2023-02-10 12:17:10,706 - semantic_segmentation - Initializing from scratch.
INFO - 2023-02-10 12:17:10,707 - semantic_segmentation - Writing summary in train_log/00020_KPFCNN_S3DIS_torch.
INFO - 2023-02-10 12:17:10,707 - semantic_segmentation - Started training
INFO - 2023-02-10 12:17:10,707 - semantic_segmentation - === EPOCH 0/10 ===
training:   0%|                    | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/xx/repos/kpconv-seg/scripts/run_pipeline.py", line 245, in <module>
    sys.exit(main())
  File "/home/xx/repos/kpconv-seg/scripts/run_pipeline.py", line 179, in main
    pipeline.run_train()
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py", line 411, in run_train
    loss, gt_labels, predict_scores = model.get_loss(
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/open3d/_ml3d/torch/models/kpconv.py", line 337, in get_loss
    self.output_loss = Loss.weighted_CrossEntropyLoss(scores, labels)
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1164, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/xx/VENV/o3dml-torch117/lib/python3.10/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all 13 classes or no classes but got weight tensor of shape: [1, 13]

Expected behavior

Run the training on S3DIS without errors.

Open3D, Python and System information

- Operating system: Ubuntu 22.04
- Python version: 3.10.6 [GCC 11.3.0] (output from `import sys; print(sys.version)`)
- Open3D version: 0.16.0 (output from python: `print(open3d.__version__)`)
- System type: x86_64
- Is this a remote workstation?: no
- How did you install Open3D?: pip

Additional information
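
For context, the loss failure can be reproduced in isolation: torch.nn.CrossEntropyLoss expects its weight argument to be a 1-D tensor with one entry per class, so a weight of shape [1, 13] raises exactly this error. A minimal sketch (independent of Open3D-ML):

# Minimal repro of the weight-shape RuntimeError.
import torch
import torch.nn as nn

scores = torch.randn(8, 13)          # (N, C) logits for 13 classes
labels = torch.randint(0, 13, (8,))  # (N,) ground-truth class indices

ok = nn.CrossEntropyLoss(weight=torch.ones(13))      # 1-D weight: works
ok(scores, labels)

bad = nn.CrossEntropyLoss(weight=torch.ones(1, 13))  # [1, 13] weight
bad(scores, labels)  # RuntimeError: weight tensor should be defined ...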

shayan-nikoo commented 1 year ago

I tested this in a CPU-only Docker environment, but the error persists.

shayan-nikoo commented 1 year ago

I could skip the error and run the training by passing class_weights: [] (empty) in the config file. However, the training and test results don't look right to me; the mIoU is very low. The test result is Overall Testing Accuracy : 0.108, mIoU : 0.058. Is the mIoU really almost zero, or am I missing something?

Training results (S3DIS)

INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - Loss train: 1.693  eval: 1.692
INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - Mean acc train: 0.174  eval: 0.267
INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - Mean IoU train: 0.087  eval: 0.148
INFO - 2023-02-10 17:32:09,648 - semantic_segmentation - === EPOCH 798/800 ===
training: 100%|████████████████████| 63/63 [00:01<00:00, 42.48it/s]
validation: 100%|████████████████████| 6/6 [00:00<00:00, 79.69it/s]
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - Loss train: 1.635  eval: 1.396
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - Mean acc train: 0.151  eval: 0.352
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - Mean IoU train: 0.081  eval: 0.182
INFO - 2023-02-10 17:32:11,208 - semantic_segmentation - === EPOCH 799/800 ===
training: 100%|████████████████████| 63/63 [00:01<00:00, 42.03it/s]
validation: 100%|████████████████████| 6/6 [00:00<00:00, 84.19it/s]
INFO - 2023-02-10 17:32:12,779 - semantic_segmentation - Loss train: 1.659  eval: 1.652
INFO - 2023-02-10 17:32:12,779 - semantic_segmentation - Mean acc train: 0.163  eval: 0.267
INFO - 2023-02-10 17:32:12,779 - semantic_segmentation - Mean IoU train: 0.086  eval: 0.109
INFO - 2023-02-10 17:32:12,780 - semantic_segmentation - === EPOCH 800/800 ===
training: 100%|████████████████████| 63/63 [00:01<00:00, 39.32it/s]
validation: 100%|████████████████████| 6/6 [00:00<00:00, 75.36it/s]
INFO - 2023-02-10 17:32:14,463 - semantic_segmentation - Loss train: 1.726  eval: 2.458
INFO - 2023-02-10 17:32:14,463 - semantic_segmentation - Mean acc train: 0.169  eval: 0.226
INFO - 2023-02-10 17:32:14,463 - semantic_segmentation - Mean IoU train: 0.086  eval: 0.072
INFO - 2023-02-10 17:32:14,552 - semantic_segmentation - Epoch 800: save ckpt to ./logs/KPFCNN_S3DIS_torch/checkpoint

Test

INFO - 2023-02-10 18:15:06,649 - semantic_segmentation - Accuracy : [0.21310890361166648, 0.3248931736760441, 0.5463872361721226, 0.0, 0.0, 0.0, 0.0, 0.0011374866459067771, 0.0, 0.0, 0.0, 0.0, 0.3207777518363869, 0.10817727322631744]
INFO - 2023-02-10 18:15:06,649 - semantic_segmentation - IoU : [0.1751161378342312, 0.23549265987632134, 0.25156889849491443, 0.0, 0.0, 0.0, 0.0, 0.0010764818849628398, 0.0, 0.0, 0.0, 0.0, 0.09091394402481037, 0.05801293247040308]
INFO - 2023-02-10 18:15:06,658 - s3dis - Saved Area_3_office_6 in ./test/S3DIS/Area_3_office_6.npy.
test 21/23: 100%|████████████████████| 53173/53173 [00:01<00:00, 44756.97it/s]
test 22/23: 100%|████████████████████| 13400/13400 [00:00<00:00, 119477.80it/s]
INFO - 2023-02-10 18:15:06,953 - semantic_segmentation - Accuracy : [0.21271639007411344, 0.3281782025759704, 0.5457846251452136, 0.0, 0.0, 0.0, 0.0, 0.0011374866459067771, 0.0, 0.0, 0.0, 0.0, 0.32096158074793024, 0.1083675603991642]
INFO - 2023-02-10 18:15:06,953 - semantic_segmentation - IoU : [0.1746411967770861, 0.23722240797212982, 0.2534656981039331, 0.0, 0.0, 0.0, 0.0, 0.0010754038786806772, 0.0, 0.0, 0.0, 0.0, 0.09036740278810888, 0.05821323919384143]
INFO - 2023-02-10 18:15:06,953 - s3dis - Saved Area_3_hallway_6 in ./test/S3DIS/Area_3_hallway_6.npy.
INFO - 2023-02-10 18:15:06,954 - semantic_segmentation - Overall Testing Accuracy : 0.1083675603991642, mIoU : 0.05821323919384143
INFO - 2023-02-10 18:15:06,954 - semantic_segmentation - Finished testing

I trained the model on S3DIS/Stanford3dDataset_v1.2, not the aligned version, with max_epochs set to 800. Does anyone else get such poor results?

srzxDragon commented 1 year ago

I have the same problem, and I remedied this issue by modifying SemSegLoss in semseg_loss.py (from line 40). I don't know why the previous code returns a 2-D array from DataProcessing.get_class_weights(dataset.cfg.class_weights); dataset.cfg.class_weights is already a 1-D array, and that is all we need. You can try this by modifying the site package site-packages/open3d/_ml3d/torch/modules/semseg_loss.py:

class SemSegLoss(object):
    """Loss functions for semantic segmentation."""

    def __init__(self, pipeline, model, dataset, device):
        super(SemSegLoss, self).__init__()
        # weighted_CrossEntropyLoss
        if 'class_weights' in dataset.cfg.keys() and len(
                dataset.cfg.class_weights) != 0:
            # Was: class_wt = DataProcessing.get_class_weights(
            #          dataset.cfg.class_weights)
            # which returns a 2-D array of shape [1, C]; use the 1-D
            # config values directly instead.
            class_wt = dataset.cfg.class_weights
            weights = torch.tensor(class_wt, dtype=torch.float, device=device)

            self.weighted_CrossEntropyLoss = nn.CrossEntropyLoss(weight=weights)
        else:
            self.weighted_CrossEntropyLoss = nn.CrossEntropyLoss()
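
Note that this uses the raw values from the config directly as the loss weights, bypassing the inverse-frequency weighting that DataProcessing.get_class_weights computes from them (the helper is quoted later in this thread). If the config values are per-class point counts, a smaller change that keeps the intended weighting would be to squeeze out the extra dimension instead, e.g. replacing the two lines above with (a sketch, not an official patch):

# Keep the helper's inverse-frequency weighting; drop the leading
# dimension that np.expand_dims adds, turning [1, C] into [C].
class_wt = DataProcessing.get_class_weights(dataset.cfg.class_weights)
weights = torch.tensor(class_wt, dtype=torch.float,
                       device=device).squeeze(0)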

shayan-nikoo commented 1 year ago

Thank you @srzxDragon, this fix worked. Now the code runs without passing empty class_weights. But the results are still very poor: mIoU 0.10.

INFO - 2023-02-22 11:14:22,952 - semantic_segmentation - Loss train: 1.231  eval: 1.247
INFO - 2023-02-22 11:14:22,952 - semantic_segmentation - Mean acc train: 0.175  eval: 0.237
INFO - 2023-02-22 11:14:22,953 - semantic_segmentation - Mean IoU train: 0.101  eval: 0.123
INFO - 2023-02-22 11:14:23,090 - semantic_segmentation - Epoch 800: save ckpt to ./logs/KPFCNN_S3DIS_torch/checkpoint

Are your training results with KPConv also like this?

srzxDragon commented 1 year ago

I didn't train KPConv on S3DIS, but I trained RandLA-Net on Semantic3D. I found a similar issue: the performance is not as good as the official results. My results are as follows, but the official results reach 76.0 mIoU. I have no idea about this issue.

[results screenshot]

Marine98k commented 1 year ago

Hi, I met the same problem, but besides this there is another warning, as follows:

UserWarning: An output with one or more elements was resized since it had shape [180224], which does not match the required output shape [4, 45056]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:17.) return torch.stack(batch, 0, out=out)

Have you met this problem?

shayan-nikoo commented 1 year ago

I don't remember exactly, but I don't think I encountered this warning. You can see my error log in the post.

weypro commented 1 year ago

Let's look at the get_class_weights function:

@staticmethod
def get_class_weights(num_per_class):
    # num_per_class: pre-calculated number of points in each category
    num_per_class = np.array(num_per_class, dtype=np.float32)

    weight = num_per_class / float(sum(num_per_class))
    ce_label_weight = 1 / (weight + 0.02)

    return np.expand_dims(ce_label_weight, axis=0)

So, does anyone know why np.expand_dims is needed here?

I changed it to the following:

return ce_label_weight

It works.
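
To sanity-check the change, here is a quick sketch (the per-class counts are made up) showing that the patched function now returns the 1-D array that CrossEntropyLoss accepts:

# Sketch: verify the weight shape returned by the patched function.
import numpy as np

def get_class_weights(num_per_class):
    num_per_class = np.array(num_per_class, dtype=np.float32)
    weight = num_per_class / float(sum(num_per_class))
    ce_label_weight = 1 / (weight + 0.02)
    return ce_label_weight  # was: np.expand_dims(ce_label_weight, axis=0)

print(get_class_weights([100, 200, 700]).shape)  # (3,) -- 1-D, as required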

Marine98k commented 1 year ago

I don't remember exactly, but I don't think I encountered this warning. You can see my error log in the post.

thank you for your reply

minhtcai commented 1 year ago

Did you guys find a solution for the low IoU? @shayan-nikoo @srzxDragon

shayan-nikoo commented 1 year ago

Unfortunately not. I think it is an implementation issue in Open3D-ML, because I am using everything with their default settings.