loicland / superpoint_graph

Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs

problems about training on custom dataset #166

Closed. predictwise closed this issue 3 years ago.

predictwise commented 4 years ago

Hello,

I am trying to retrain a model from scratch, without RGB values, on a custom dataset. My dataset has 9 classes, and the first class (say, powerline) is labelled 0. Additionally, I have no unlabelled data. Following your readme and some previous issues, I have modified the corresponding code:

After that, both partition/partition.py and learning/custom_dataset.py run successfully. However, training the model fails (with only one training file and one testing file):

CUDA_VISIBLE_DEVICES=0 python3 learning/main.py --dataset custom_dataset --CUSTOM_SET_PATH /home/zcq/oliverData/ISPRS/3DLabeling --epochs 500 --lr_steps '[350, 400, 450]' --test_nth_epoch 100 --model_config 'gru_10,f_9' --ptn_nfeat_stn 8 --nworkers 2 --pc_attribs xyzelpsv --odir "results/ISPRS/train_best" --batch_size 1

Here is the error I got:

Traceback (most recent call last):
  File "learning/main.py", line 477, in <module>
    main()
  File "learning/main.py", line 370, in main
    acc_test, loss_test, oacc_test, avg_iou_test, avg_acc_test = eval(False)
  File "learning/main.py", line 281, in eval
    confusion_matrix.count_predicted_batch(tvec_cpu, np.argmax(o_cpu,1))
  File "/home/zcq/oliverProjects/superpoint_graph/learning/../learning/metrics.py", line 22, in count_predicted_batch
    self.confusion_matrix[:,predicted[i]] += ground_truth_vec[i,:]
ValueError: operands could not be broadcast together with shapes (9,) (8,) (9,)

It works in the train() function, but fails in the eval() function. Therefore, I printed the shape of the variable label_vec_cpu and got Nx8. The shape of label_vec_cpu must be wrong; it should be Nx9 instead. It seems that the superpoint_graphs test file has not been generated correctly, but I don't know how to fix it.

Can you help me with it? Thanks a lot! @loicland

loicland commented 4 years ago

Hi,

the problem is probably linked to the use of 0 for unlabelled data.

Check out this issue and let me know if that doesn't clear up your confusion.
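For instance, a minimal sketch of the label convention the parsed files expect, assuming your raw labels run 0..8 for the 9 classes (the variable names here are illustrative, not the actual custom_dataset.py code):

import numpy as np

raw_labels = np.array([0, 3, 8])           # 0 = powerline, ..., 8 = last class
labels = raw_labels.astype(np.int64) + 1   # shift so that 0 is reserved for unlabelled points
# labels now run 1..9; 0 should only appear for points without ground truth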

predictwise commented 4 years ago

Hello @loicland ,

That issue doesn't solve my problem. I don't think the issue is linked to the use of 0 for unlabelled data.

A few questions:

Here is the model if it can help you:

Module(
  (ecc): GraphNetwork(
    (0): RNNGraphConvModule(
      (_cell): GRUCellEx(
        32, 32
        (ini): InstanceNorm1d(1, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
        (inh): InstanceNorm1d(1, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
        (ig): Linear(in_features=32, out_features=32, bias=True)
      )(ingate layernorm)
      (_fnet): Sequential(
        (0): Linear(in_features=13, out_features=32, bias=True)
        (1): ReLU(inplace)
        (2): Linear(in_features=32, out_features=128, bias=True)
        (3): ReLU(inplace)
        (4): Linear(in_features=128, out_features=64, bias=True)
        (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (6): ReLU(inplace)
        (7): Linear(in_features=64, out_features=32, bias=False)
      )
    )
    (1): Linear(in_features=352, out_features=9, bias=True)
  )
  (ptn): PointNet(
    (stn): STNkD(
      (convs): Sequential(
        (0): Conv1d(8, 64, kernel_size=(1,), stride=(1,))
        (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
        (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
        (6): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
        (7): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (8): ReLU(inplace)
      )
      (fcs): Sequential(
        (0): Linear(in_features=128, out_features=128, bias=True)
        (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): Linear(in_features=128, out_features=64, bias=True)
        (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
      )
      (proj): Linear(in_features=64, out_features=4, bias=True)
    )
    (convs): Sequential(
      (0): Conv1d(8, 64, kernel_size=(1,), stride=(1,))
      (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
      (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace)
      (6): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
      (7): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace)
      (9): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
      (10): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU(inplace)
      (12): Conv1d(128, 256, kernel_size=(1,), stride=(1,))
      (13): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (14): ReLU(inplace)
    )
    (fcs): Sequential(
      (0): Linear(in_features=257, out_features=256, bias=True)
      (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): Linear(in_features=256, out_features=64, bias=True)
      (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace)
      (6): Linear(in_features=64, out_features=32, bias=True)
    )
  )
)
predictwise commented 4 years ago

Hello,

I find that when f['sp_labels'] is empty, the shape of node_gt_size is hard-coded (to Nx9) at line 73 of spg.py, which in turn gives the variable targets a fixed size (Nx10) at line 105 of spg.py.

line 73 in spg.py:
# Nx9
node_gt_size = np.concatenate([f['sp_point_count'][:].astype(np.int64), np.zeros((N,8), dtype=np.int64)], 1)
node_gt = np.zeros((N,1), dtype=np.int64)  # Nx1

line 105 in spg.py:
targets = np.concatenate([node_gt, node_gt_size], axis=1)  # Nx10

In addition, since label_vec_cpu = targets[:, 2:] and, as you explained before, label_vec_cpu is the breakdown of the points across the classes (a vector of size 9 in my case), I should get size 9. However, I only get size 8. Is this a bug? Should I change the hard-coded 8 to n_labels (i.e. 9) at line 73 of spg.py?
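For clarity, a small worked example of the shape arithmetic I observe (shapes only, the values are made up):

import numpy as np

N = 4
sp_point_count = np.ones((N, 1), dtype=np.int64)                                        # Nx1
node_gt_size = np.concatenate([sp_point_count, np.zeros((N, 8), dtype=np.int64)], 1)    # Nx9
node_gt = np.zeros((N, 1), dtype=np.int64)                                              # Nx1
targets = np.concatenate([node_gt, node_gt_size], axis=1)                               # Nx10
label_vec = targets[:, 2:]
print(label_vec.shape)   # (4, 8) -- only 8 columns, not the expected 9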

loicland commented 4 years ago

So it works correctly in training but not in eval? Check the shape and minimum/maximum values of o_cpu, t_cpu and tvec_cpu at line 250 of main.py and let me know the results just before the error.

Check in the parsed files that the ones in train and eval have the same shape.
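For example, something like this (a sketch only; place it wherever o_cpu, t_cpu and tvec_cpu are already computed in eval()):

print('o_cpu   ', o_cpu.shape, o_cpu.min(), o_cpu.max())
print('t_cpu   ', t_cpu.shape, t_cpu.min(), t_cpu.max())
print('tvec_cpu', tvec_cpu.shape, tvec_cpu.min(), tvec_cpu.max())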

predictwise commented 4 years ago

So it works correctly in training but not in eval?

Yes, it works in training but not in eval.

Check the shape and minimum/maximum values of o_cpu, t_cpu and tvec_cpu at line 250 of main.py and let me know the results just before the error.

More detailed information about my dataset: it is the ISPRS Vaihingen 3D semantic labeling dataset, with a point density of approximately 8-10 points/m². The training set and testing set have 750,000+ and 110,000 points, respectively. It is a small dataset and the point density is not very high, so I set --voxel_width 0.3 and --reg_strength 0.2 in the partition phase.

EDIT: because the original xyz values are large, I shift them so that they are centered around the minimum coordinate (x_min, y_min, z_min). The preprocessed training data (columns x, y, z, intensity, label) is shown in the attached screenshot.

loicland commented 4 years ago

Can you post the range of t_cpu? Does it go up to 9? Either way, tvec_cpu should be of size Nx9.

Check the files in /superpoint_graphs for both training and test; their sp_labels field must be of the same size (Nx10).

predictwise commented 4 years ago

Check the files in /superpoint_graphs for both training and test; their sp_labels field must be of the same size (Nx10).

My training dataset has labels but my test dataset does not. Hence, the sp_labels field in the training files has shape Nx10, while in the test files its size is 0.

Either way, tvec_cpu should be of size Nx9.

No, in my case it is Nx8. So I debugged the code and found what is probably a bug in the spg_reader function of spg.py:

    if f['sp_labels'].size > 0:
        # column 0: no of unlabeled points, column 1+: no of labeled points per class
        node_gt_size = f['sp_labels'][:].astype(np.int64)
        node_gt = np.argmax(node_gt_size[:,1:], 1)[:,None]
        # superpoints without labels are to be ignored in loss computation
        node_gt[node_gt_size[:,1:].sum(1)==0,:] = -100
    else:
        N = f['sp_point_count'].shape[0]
        node_gt_size = np.concatenate([f['sp_point_count'][:].astype(np.int64), np.zeros((N,8), dtype=np.int64)], 1)
        node_gt = np.zeros((N,1), dtype=np.int64)

When f['sp_labels'].size == 0, your code concatenates f['sp_point_count'][:] with np.zeros((N, 8)), which produces Nx9. However, I think it should concatenate f['sp_point_count'][:] with np.zeros((N, n_labels)); in my case n_labels is 9. Am I right?
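In other words, something along these lines (a sketch of the fix I have in mind; n_labels would have to be passed in or read from the dataset info, it is not currently a parameter of spg_reader):

    else:
        N = f['sp_point_count'].shape[0]
        # use the actual number of classes instead of the hard-coded 8
        node_gt_size = np.concatenate([f['sp_point_count'][:].astype(np.int64),
                                       np.zeros((N, n_labels), dtype=np.int64)], 1)
        node_gt = np.zeros((N, 1), dtype=np.int64)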

loicland commented 4 years ago

Ah! You're correct. I think I let this slip because I had only used Semantic3D as a dataset without ground truth for the test set. I will correct it on Monday.

Alternatively, you can simply not concatenate the matrix of zeros and add a check in eval so that the metrics are not computed when there is no ground truth to compare against. The results would be meaningless anyway.
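Roughly like this (a sketch only, using the variable names around line 281 of main.py; the exact condition depends on how you parse the GT-less files):

if tvec_cpu.size > 0:
    confusion_matrix.count_predicted_batch(tvec_cpu, np.argmax(o_cpu, 1))
# otherwise there is no ground truth for this batch, so skip the metric update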

predictwise commented 4 years ago

Hello,

Thank you for your kind help! I have another question: since my original dataset has absolute xyz coordinates, the values are very large (e.g. 496848.93000031 5419405.36000013 265.33999634). Therefore, I shift them so that the xyz values are centered around the minimum coordinate (x_min, y_min, z_min). The preprocessed training data (columns x, y, z, intensity, label) is shown in the attached screenshot.
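For reference, the shift I apply is simply the following (a minimal sketch, assuming xyz is an Nx3 array of raw coordinates):

import numpy as np

xyz = np.array([[496848.93, 5419405.36, 265.34],
                [496850.10, 5419407.02, 266.01]])
xyz_shifted = xyz - xyz.min(axis=0)   # center around (x_min, y_min, z_min)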

In addition, since my dataset is small (the training and testing sets have 750,000+ and 110,000 points, respectively) and the point density is not very high (approximately 8-10 points/m²), I set --voxel_width 0.5 and --reg_strength 0.2:

python3 partition/partition.py --dataset custom_dataset --ROOT_PATH /home/zcq/oliverData/ISPRS/3DLabeling --voxel_width 0.5 --reg_strength 0.2

However, the partition phase does not produce good results. Can you give me any suggestions?

loicland commented 4 years ago

Hi,

can you please show an example of a failed partition in image form, as well as the geometric features?

This corresponds to option 'pf' in visualize.py.
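Something like the following (the dataset path is taken from your earlier command; the file name is a placeholder, and you should double-check the exact argument names against visualize.py's argparse):

python3 partition/visualize.py --dataset custom_dataset --ROOT_PATH /home/zcq/oliverData/ISPRS/3DLabeling --file_path <name_of_your_file> --output_type pf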