loicland / superpoint_graph

Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs
MIT License
758 stars, 214 forks

Float division by zero in main/learning.py using custom_dataset #143

Closed. mirceta closed this issue 5 years ago.

mirceta commented 5 years ago

Hello,

I have a custom dataset with no rgb values and only 2 classes. In the train() function I get a float division by zero error during confusion_matrix.get_average_intersection_union().

I also found that the loop above this call ran for 0 iterations. The partition is successful, but after calling

xyz, rgb, labels = libply_c.prune(xyz, args.voxel_width, rgb, labels, n_labels)

the labels become vectors of 3 components with a larger range of values (up to 76), whereas before they were just 0 or 1. Is this correct?

Here are the arguments to all the scripts I call (partition/partition.py, learning/custom_dataset.py, learning/main.py, in this order):

PARTITION/PARTITION.PY args

--dataset custom_dataset --ROOT_PATH /media/km/ad02048a-21c3-4454-b1b4-58c5a99df3c5/workspace --voxel_width 5 --reg_strength 0.8 --ver_batch 500

LEARNING/CUSTOM_DATASET.PY args

--CUSTOM_SET_PATH /media/km/ad02048a-21c3-4454-b1b4-58c5a99df3c5/workspace

LEARNING/MAIN.PY args

--dataset custom_dataset --CUSTOM_SET_PATH /media/km/ad02048a-21c3-4454-b1b4-58c5a99df3c5/workspace --epochs 10 --lr_steps '[275,320]' --test_nth_epoch 2 --model_config gru_10,f_2 --ptn_nfeat_stn 11 --nworkers 2 --pc_attribs xyzelpsv --odir "results"
/home/km/anaconda3/envs/newenv/bin/python /home/km/superpoint_graph/learning/main.py --dataset custom_dataset --CUSTOM_SET_PATH /media/km/ad02048a-21c3-4454-b1b4-58c5a99df3c5/workspace --epochs 10 --lr_steps '[275,320]' --test_nth_epoch 2 --model_config gru_10,f_2 --ptn_nfeat_stn 11 --nworkers 2 --pc_attribs xyzelpsv --odir results
Will save to results
/home/km/superpoint_graph/learning/graphnet.py:28: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  if orthoinit: init.orthogonal(fnet_modules[-1].weight, gain=init.calculate_gain('relu'))
/home/km/superpoint_graph/learning/graphnet.py:32: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  if orthoinit: init.orthogonal(fnet_modules[-1].weight)
/home/km/superpoint_graph/learning/pointnet.py:42: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.proj.weight, 0); nn.init.constant(self.proj.bias, 0)
Total number of parameters: 211462
Module(
  (ecc): GraphNetwork(
    (0): RNNGraphConvModule(
      (_cell): GRUCellEx(
        32, 32
        (ini): InstanceNorm1d(1, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        (inh): InstanceNorm1d(1, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        (ig): Linear(in_features=32, out_features=32, bias=True)
      )(ingate layernorm)
      (_fnet): Sequential(
        (0): Linear(in_features=13, out_features=32, bias=True)
        (1): ReLU(inplace)
        (2): Linear(in_features=32, out_features=128, bias=True)
        (3): ReLU(inplace)
        (4): Linear(in_features=128, out_features=64, bias=True)
        (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (6): ReLU(inplace)
        (7): Linear(in_features=64, out_features=32, bias=False)
      )
    )
    (1): Linear(in_features=352, out_features=2, bias=True)
  )
  (ptn): PointNet(
    (stn): STNkD(
      (convs): Sequential(
        (0): Conv1d(11, 64, kernel_size=(1,), stride=(1,))
        (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
        (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
        (6): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
        (7): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (8): ReLU(inplace)
      )
      (fcs): Sequential(
        (0): Linear(in_features=128, out_features=128, bias=True)
        (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): Linear(in_features=128, out_features=64, bias=True)
        (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
      )
      (proj): Linear(in_features=64, out_features=4, bias=True)
    )
    (convs): Sequential(
      (0): Conv1d(8, 64, kernel_size=(1,), stride=(1,))
      (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
      (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace)
      (6): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
      (7): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace)
      (9): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
      (10): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU(inplace)
      (12): Conv1d(128, 256, kernel_size=(1,), stride=(1,))
      (13): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (14): ReLU(inplace)
    )
    (fcs): Sequential(
      (0): Linear(in_features=257, out_features=256, bias=True)
      (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): Linear(in_features=256, out_features=64, bias=True)
      (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace)
      (6): Linear(in_features=64, out_features=32, bias=True)
    )
  )
)
Epoch 0/10 (results):
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/km/superpoint_graph/learning/main.py", line 404, in <module>
    main()
  File "/home/km/superpoint_graph/learning/main.py", line 303, in main
    acc, loss, oacc, avg_iou = train()
  File "/home/km/superpoint_graph/learning/main.py", line 222, in train
    return acc_meter.value()[0], loss_meter.value()[0], confusion_matrix.get_overall_accuracy(), confusion_matrix.get_average_intersection_union()
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torchnet/meter/classerrormeter.py", line 54, in value
    return [self.value(k_) for k_ in self.topk]
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torchnet/meter/classerrormeter.py", line 54, in <listcomp>
    return [self.value(k_) for k_ in self.topk]
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torchnet/meter/classerrormeter.py", line 50, in value
    return (1. - float(self.sum[k]) / self.n) * 100.0
ZeroDivisionError: float division by zero

Process finished with exit code 1

You can also check out https://github.com/FloatingObjectSegmentation/superpoint_graph/tree/adapt-to-mag/learning to see how the code is changed.

loicland commented 5 years ago

Hi,

labels become vectors of 3 components, and have a larger range of values (up to 76), whereas they were before just 0 or 1. Is this correct?

No. Class 0 is reserved for unlabelled data. If all your data is labelled, make the first class 1 and the second one 2. Be careful: the predicted values will be shifted by 1 (since 'unlabelled' is never predicted). It can be confusing.
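
As a minimal sketch of that remapping (assuming labels is a numpy array holding the original 0/1 annotations; purely illustrative):

import numpy as np

labels = np.array([0, 1, 1, 0])  # original binary annotations
labels = labels + 1              # now 1 or 2; class 0 stays reserved for 'unlabelled'
# at prediction time the network outputs classes starting at 0 again,
# so predicted class c maps back to the original label c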

Then, when pruning, the label associated with each voxel is the histogram of the labels of the points it contains. So there are 3 columns (unlabelled / class 1 / class 2), each holding a number of points. If you have values up to 76, you are probably subsampling a bit aggressively, but it really depends on your sensor (for example, if the density is variable).
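
To make the histogram encoding concrete, here is a minimal sketch of what pruning produces per voxel (the real aggregation happens inside the compiled libply_c.prune; this Python function is purely illustrative):

import numpy as np

def voxel_label_histogram(point_labels, n_labels):
    # one column per class, plus column 0 for 'unlabelled'
    return np.bincount(point_labels, minlength=n_labels + 1)

# a voxel that swallowed 2 unlabelled points and 76 points of class 1
labels_in_voxel = np.array([0, 0] + [1] * 76)
print(voxel_label_histogram(labels_in_voxel, n_labels=2))  # -> [ 2 76  0]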

I think the class_meter is confused because it gets clouds with only class 0 and thinks none are annotated. See if the above fixes it.

mirceta commented 5 years ago

Oh, I understand, I didn't know they were histograms. I pruned a lot because I wanted it to finish quickly until I got it working; after that I'll prune less.

These are the changes I made:

I also tried:

What confuses me is: knowing that there is a label reserved for 'unlabelled', should I then treat the dataset as having 3 classes or 2? In this case, neither of the attempts worked.

I also tried decreasing the voxel_width argument in partition.py to 1, so I would not prune so much.

I keep getting the same error. Do you have any other ideas?

loicland commented 5 years ago

It should be 2 classes in get_info, and f_2.

Before calling loss_meter.value(), print the following:

print(loss_meter.n)
print(loss_meter.sum[0])

It seems like n will be zero, but I am curious about sum.

More generally, print your prediction o_cpu and ground truth t_cpu at each iteration (note that they will already be shifted back to start at 0, with -100 for superpoints containing no ground truth points at all).
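
In sketch form (o_cpu and t_cpu are the names train() already uses for the per-superpoint predictions and targets):

print('pred o_cpu:', o_cpu)  # per-superpoint predictions, classes starting at 0
print('gt   t_cpu:', t_cpu)  # ground truth; -100 marks superpoints with no labelled points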

mirceta commented 5 years ago

Correct, loss_meter.n is 0. I cannot print the predictions because the loop under # iterate over dataset in batches is never entered. I think the loader was not built correctly. Perhaps something gets mixed up at line 165: train_dataset, test_dataset = create_dataset(args)

Still, it does get loaded, and the path is correct. It also reads the superpoint graph files, which are auto-generated, so I don't see where there is room for a mistake.

mirceta commented 5 years ago

I found another curious thing at line 176 of learning/main.py:

logging.getLogger().getEffectiveLevel() > logging.DEBUG

is true, and then

loader = tqdm(loader,ncols=100)

gets executed. Is this correct?

Edit: I tried deleting the lines under the if (the if logging... and loader = tqdm... lines) and the result was still the same: a zero division error.

mirceta commented 5 years ago

I compared this to the execution on Semantic3D and saw that only the number of files in the training set was different. So I added my file twice to trainlist in custom_dataset.py's get_datasets(), and now it goes through. It seems it was not working because all of my points were in a single file.
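
A hedged sketch of that workaround in custom_dataset.py's get_datasets() (the reader call and variable names follow the pattern of the repo's other dataset loaders, but treat them as illustrative):

# append the lone training file twice so the train loader yields a full batch
for fname in train_names:
    g = spg.spg_reader(args, path + fname, True)
    trainlist.append(g)
    trainlist.append(g)  # duplicated on purpose, as described above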

Though I get a new error in train(). The line

embeddings = ptnCloudEmbedder.run(model, *clouds_data)

gives me RuntimeError: Given groups=1, weight of size 64 11 1, expected input [835, 5, 128] to have 11 channels, but got 5 channels instead.

loicland commented 5 years ago

Hi,

If you only have one file for training, the problem might be the batch size. Try reducing the batch size to 1.

I assume your data does not have rgb? If so, you need to adapt --pc_attribs and --ptn_nfeat_stn.

mirceta commented 5 years ago

Yes, the data does not have rgb. But I already set rgb to an empty list at line 158 of partition.py as instructed, set --ver_batch 1 in partition, and in main.py set --pc_attribs xyzelpsv --ptn_nfeat_stn 8.

I still get RuntimeError: Given groups=1, weight of size 64 8 1, expected input [96, 5, 128] to have 8 channels, but got 5 channels instead.

After this, I set ptn_nfeat_stn to 5, and got:

Traceback (most recent call last):
  File "/home/km/superpoint_graph/learning/main.py", line 405, in <module>
    main()
  File "/home/km/superpoint_graph/learning/main.py", line 304, in main
    acc, loss, oacc, avg_iou = train()
  File "/home/km/superpoint_graph/learning/main.py", line 200, in train
    embeddings = ptnCloudEmbedder.run(model, *clouds_data)
  File "/home/km/superpoint_graph/learning/pointnet.py", line 131, in run_full_monger
    out = model.ptn(Variable(clouds, volatile=True), Variable(clouds_global, volatile=True))
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/superpoint_graph/learning/pointnet.py", line 90, in forward
    T = self.stn(input[:,:self.nfeat_stn,:])
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/superpoint_graph/learning/pointnet.py", line 47, in forward
    input = self.convs(input)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 196, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

Also if it helps you, here is the output from partition.py:

=================
   train/
=================
1 / 2---> data
    creating the feature file...
=========================
======== pruning ========
=========================
Voxelization into 133670 x 33680 x 206 grid
Reduced from 22713486 to 429575 points (1.89%)
(429575, 3)
[[4.49018625e+05 1.21040445e+05 4.05682587e+02]
 [4.49016094e+05 1.21042156e+05 4.19194794e+02]
 [4.49019000e+05 1.21042766e+05 4.05488037e+02]
 [4.49015844e+05 1.21042203e+05 4.05744904e+02]
 [4.49015812e+05 1.21045570e+05 4.05728302e+02]]
3.0
10
45
93% done          
    computing the superpoint graph...
        minimal partition...
L0-CUT PURSUIT WITH L2 FIDELITY
PARAMETERIZATION = SPECIAL SUPERPOINTGRAPH
Graph 429577 vertices and 10309800 edges and observation of dimension 4
        computation of the SPG...
Timer :   5.0 /  21.0 /  20.2 
2 / 2---> data2
    creating the feature file...
=========================
======== pruning ========
=========================
Voxelization into 133670 x 33680 x 206 grid
Reduced from 22713486 to 429575 points (1.89%)
(429575, 3)
[[4.49018625e+05 1.21040445e+05 4.05682587e+02]
 [4.49016094e+05 1.21042156e+05 4.19194794e+02]
 [4.49019000e+05 1.21042766e+05 4.05488037e+02]
 [4.49015844e+05 1.21042203e+05 4.05744904e+02]
 [4.49015812e+05 1.21045570e+05 4.05728302e+02]]
3.0
10
45
95% done          
    computing the superpoint graph...
        minimal partition...
L0-CUT PURSUIT WITH L2 FIDELITY
PARAMETERIZATION = SPECIAL SUPERPOINTGRAPH
Graph 429577 vertices and 10309800 edges and observation of dimension 4
        computation of the SPG...
Timer :   9.9 /  41.8 /  40.4 
=================
   test/
=================
1 / 2---> data
    creating the feature file...
=========================
======== pruning ========
=========================
Voxelization into 133670 x 33680 x 206 grid
Reduced from 22713486 to 429575 points (1.89%)
(429575, 3)
[[4.49018625e+05 1.21040445e+05 4.05682587e+02]
 [4.49016094e+05 1.21042156e+05 4.19194794e+02]
 [4.49019000e+05 1.21042766e+05 4.05488037e+02]
 [4.49015844e+05 1.21042203e+05 4.05744904e+02]
 [4.49015812e+05 1.21045570e+05 4.05728302e+02]]
3.0
10
45
95% done          
    computing the superpoint graph...
        minimal partition...
L0-CUT PURSUIT WITH L2 FIDELITY
PARAMETERIZATION = SPECIAL SUPERPOINTGRAPH
Graph 429577 vertices and 10309800 edges and observation of dimension 4
        computation of the SPG...
Timer :  14.8 /  62.7 /  60.5 
2 / 2---> data2
    creating the feature file...
=========================
======== pruning ========
=========================
Voxelization into 133670 x 33680 x 206 grid
Reduced from 22713486 to 429575 points (1.89%)
(429575, 3)
[[4.49018625e+05 1.21040445e+05 4.05682587e+02]
 [4.49016094e+05 1.21042156e+05 4.19194794e+02]
 [4.49019000e+05 1.21042766e+05 4.05488037e+02]
 [4.49015844e+05 1.21042203e+05 4.05744904e+02]
 [4.49015812e+05 1.21045570e+05 4.05728302e+02]]
3.0
10
45
95% done          
    computing the superpoint graph...
        minimal partition...
L0-CUT PURSUIT WITH L2 FIDELITY
PARAMETERIZATION = SPECIAL SUPERPOINTGRAPH
Graph 429577 vertices and 10309800 edges and observation of dimension 4
        computation of the SPG...
Timer :  19.6 /  83.4 /  80.3 

Process finished with exit code 0

loicland commented 5 years ago

  1. If you set the batch size to 1 (--batch_size 1 in learning/main.py), you don't need to double your dataset.
  2. Can you print your model?
  3. What branch/commit are you running?

mirceta commented 5 years ago
  1. Tried that too, got the same error.
  2. I'll post it shortly.
  3. It's commit dfb5b05 ("typo").

mirceta commented 5 years ago

Here is the model:

Module(
  (ecc): GraphNetwork(
    (0): RNNGraphConvModule(
      (_cell): GRUCellEx(
        32, 32
        (ini): InstanceNorm1d(1, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        (inh): InstanceNorm1d(1, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        (ig): Linear(in_features=32, out_features=32, bias=True)
      )(ingate layernorm)
      (_fnet): Sequential(
        (0): Linear(in_features=13, out_features=32, bias=True)
        (1): ReLU(inplace)
        (2): Linear(in_features=32, out_features=128, bias=True)
        (3): ReLU(inplace)
        (4): Linear(in_features=128, out_features=64, bias=True)
        (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (6): ReLU(inplace)
        (7): Linear(in_features=64, out_features=32, bias=False)
      )
    )
    (1): Linear(in_features=352, out_features=2, bias=True)
  )
  (ptn): PointNet(
    (stn): STNkD(
      (convs): Sequential(
        (0): Conv1d(5, 64, kernel_size=(1,), stride=(1,))
        (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
        (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
        (6): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
        (7): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (8): ReLU(inplace)
      )
      (fcs): Sequential(
        (0): Linear(in_features=128, out_features=128, bias=True)
        (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): Linear(in_features=128, out_features=64, bias=True)
        (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
      )
      (proj): Linear(in_features=64, out_features=4, bias=True)
    )
    (convs): Sequential(
      (0): Conv1d(8, 64, kernel_size=(1,), stride=(1,))
      (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
      (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace)
      (6): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
      (7): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace)
      (9): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
      (10): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU(inplace)
      (12): Conv1d(128, 256, kernel_size=(1,), stride=(1,))
      (13): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (14): ReLU(inplace)
    )
    (fcs): Sequential(
      (0): Linear(in_features=257, out_features=256, bias=True)
      (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): Linear(in_features=256, out_features=64, bias=True)
      (4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace)
      (6): Linear(in_features=64, out_features=32, bias=True)
    )
  )
)
loicland commented 5 years ago

  1. It won't solve the cuda error, but it will solve the training loop not being executed.
  3. This commit is obsolete. If you want to stay on it, at least revert it locally by changing track_running_stats back to False.

Your input type is not on the GPU (torch.FloatTensor instead of torch.cuda.FloatTensor). Line 127 of pointnet.py should convert them to cuda tensors. Are you running with --cuda 0?
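
In sketch form, the conversion expected around line 127 of pointnet.py would look like this (assuming the flag lives in self.args.cuda, as discussed later in the thread; the repository's exact code may differ):

if self.args.cuda:
    clouds = clouds.cuda()            # move the point features to the GPU
    clouds_global = clouds_global.cuda()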

mirceta commented 5 years ago

Good point. I was running from within PyCharm and didn't prepend CUDA_VISIBLE_DEVICES=0. I have set track_running_stats to False and added --cuda 0 (this is the same as prepending CUDA_VISIBLE_DEVICES=0, right?). The error message changes now, but it's again

RuntimeError: Given groups=1, weight of size 64 8 1, expected input [48, 5, 128] to have 8 channels, but got 5 channels instead.

At the same line. Here are the args again:

--dataset custom_dataset --CUSTOM_SET_PATH /media/km/ad02048a-21c3-4454-b1b4-58c5a99df3c5/workspace --epochs 10 --lr_steps '[275,320]' --test_nth_epoch 2 --model_config gru_10,f_2 --nworkers 2 --pc_attribs xyzelpsv --odir "results" --ptn_nfeat_stn 5 --batch_size 1 --cuda 0

Also tried changing ptn_nfeat_stn with no luck.

Also, is the 2nd dimension of the input matrix the features? It seems weird that there are 5. Though the input is the superpoint graph, so it must have been transformed?

I tried running from the terminal with CUDA_VISIBLE_DEVICES=0 too, with the same result.

Edit 3: Another really curious thing: in conv.py the forward() method, where the error occurs, gets executed 3 times before the error, so possibly the error is somewhere in the middle of the network.

loicland commented 5 years ago

Do you have a GPU? If so, you should run with --cuda 1.

At which line does the error occur?

Print the size of P at the end of the function load_superpoint in learning/spg.py; it seems that your point clouds load with the wrong number of columns for some reason.
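
That is, as a one-line check (sketch):

print(P.shape)  # expected (n_points, n_columns); a wrong column count would
                # explain the 5-vs-8 channel mismatch above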

mirceta commented 5 years ago

Yes, exactly, just found it! In spg.py, I forgot to pad the indices of e and lpsv by 3 to the left, because there are no rgb values. It's working now!
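
For anyone hitting the same wall, a hedged sketch of that index shift, assuming the pruned cloud is stored column-wise as [x y z (r g b) e l p s v] (the exact slices in learning/spg.py may differ; has_rgb is a hypothetical flag for illustration):

if has_rgb:
    e    = P[:, 6]       # elevation
    lpsv = P[:, 7:11]    # linearity / planarity / scattering / verticality
else:                    # no rgb: everything after xyz shifts 3 columns left
    e    = P[:, 3]
    lpsv = P[:, 4:8]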

And yes, I have a GPU, but when I set --cuda 1 it doesn't work again. I get

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

Traceback (most recent call last):
  File "/home/km/superpoint_graph/learning/main.py", line 405, in <module>
    main()
  File "/home/km/superpoint_graph/learning/main.py", line 304, in main
    acc, loss, oacc, avg_iou = train()
  File "/home/km/superpoint_graph/learning/main.py", line 200, in train
    embeddings = ptnCloudEmbedder.run(model, *clouds_data)
  File "/home/km/superpoint_graph/learning/pointnet.py", line 131, in run_full_monger
    out = model.ptn(Variable(clouds, volatile=True), Variable(clouds_global, volatile=True))
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/superpoint_graph/learning/pointnet.py", line 90, in forward
    T = self.stn(input[:,:self.nfeat_stn,:])
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/superpoint_graph/learning/pointnet.py", line 47, in forward
    input = self.convs(input)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/km/anaconda3/envs/newenv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 196, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

loicland commented 5 years ago

Add at line 130 of pointnet.py:

print(clouds.device)
print(clouds_global.device)

Maybe try removing the Variable since it's obsolete now? What version of PyTorch are you using?
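
For reference, Variable(..., volatile=True) was deprecated in PyTorch 0.4; a minimal sketch of the modern equivalent for the call at line 131 of pointnet.py:

with torch.no_grad():    # replaces volatile=True for inference-only forward passes
    out = model.ptn(clouds, clouds_global)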

mirceta commented 5 years ago

Whether I pass --cuda 1 or --cuda 0, the prints say the device is cpu. The torch version is 1.1.0. Which Variable do you mean?

loicland commented 5 years ago

Which Variable do you mean?

line 131 of pointnet.py

Both if I put --cuda 1 or --cuda 0, the prints say device is cpu.

Weird. Check whether the if clause at line 127 of pointnet.py is entered with --cuda 1, either by running in debug mode or by putting a print inside the clause, and print self.args.cuda just before line 127.

mirceta commented 5 years ago

Wow, weird, you are right. Even though I pass --cuda 1, self.args.cuda will be 0. Also, when I start main.py, args.cuda = 1, and even when CloudEmbedder is instantiated, self.args.cuda within CloudEmbedder is 1; but when the run_full_monger method runs, it becomes 0.

Edit: Sorry, it was a line at the start of the train() loop, where I manually set it to 0 to avoid a previous error.

mirceta commented 5 years ago

I now get the same error as in this issue: https://github.com/loicland/superpoint_graph/issues/98 . I will let you know what happens after I fix it.

mirceta commented 5 years ago

Hey, after applying your fix for the above issue, it now runs on the GPU as well. Thanks for all the help, Loic!

loicland commented 5 years ago

Glad to hear it!