rubenwiersma / hsn

Code for SIGGRAPH paper CNNs on Surfaces using Rotation-Equivariant Features
MIT License

Problems while running notebook examples #1

Closed pvnieo closed 4 years ago

pvnieo commented 4 years ago

Hi,

Thank you for the great contribution and for sharing the code.

I have two questions!

The first one is: how long does this network take to train, for example for classification or segmentation?

The second one is that, while trying to run the provided notebooks for classification and segmentation, I ran into errors while creating the dataloaders. In the classification case, using the SHREC dataset, I got the following error:

/content/transforms/normalize_area.py in __call__(self, data)
     17         # Normalize by surface area
     18         pos_vh, face_vh = data.pos.cpu().numpy(), data.face.cpu().numpy().T
---> 19         area = 1 / np.sqrt(vh.surface_area(pos_vh, face_vh))
     20         data.pos = data.pos * area
     21 

RuntimeError: GC_SAFETY_ASSERT FAILURE from /tmp/pip-req-build-7f8xsuhg/deps/geometry-central/src/surface/halfedge_factories.cpp:18 - polygon list has index 5081 >= num vertices 252

and for the segmentation case, I got the following error:

/content/datasets/shape_seg.py in process(self)
    131             seg_path = osp.join(mit_seg, filename.replace('.obj', '.eseg'))
    132             segs = torch.from_numpy(np.loadtxt(seg_path)).long()
--> 133             data.y = edge_to_vertex_labels(data.face, segs, data.num_nodes)
    134             if self.pre_filter is not None and not self.pre_filter(data):
    135                 continue

/content/utils/harmonic.py in edge_to_vertex_labels(faces, labels, n_nodes)
    124     edge_index = torch.LongTensor(0, 2)
    125     for face in faces.transpose(0, 1):
--> 126         edges = torch.stack([face[:2], face[1:], face[::2]], dim=0)
    127         for idx, edge in enumerate(edges):
    128             edge = edge.sort().values

RuntimeError: stack expects each tensor to be equal size, but got [1] at entry 0 and [0] at entry 1
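For context on this failure: the `stack` call in the traceback builds the three edges of one triangle from a length-3 face tensor. A minimal standalone sketch of what it expects (the error above indicates that `face` arrived with fewer than three entries, so the slices have mismatched sizes):

```python
import torch

def face_edges(face):
    # face: length-3 tensor [v0, v1, v2] holding one triangle's vertex indices.
    # The three undirected edges are (v0, v1), (v1, v2) and (v0, v2); each
    # slice below yields a length-2 tensor, so the stack only succeeds when
    # `face` really has three entries.
    return torch.stack([face[:2], face[1:], face[::2]], dim=0)
```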

Can you please tell me how to solve this problem, or provide a Colab link to run the notebooks directly?

Thank you in advance!

rubenwiersma commented 4 years ago

Hello and thanks for your interest! It seems like this is an issue with the obj loader/ply loader, related to this issue: https://github.com/rusty1s/pytorch_geometric/issues/1571. One thing that should solve this is to install a previous version of PyTorch Geometric (1.4.3 is the one I ran the experiments with). When you install PyTorch Geometric with pip, just append ==1.4.3 to the end of the command: pip install torch-geometric==1.4.3
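Both tracebacks are consistent with a malformed face tensor coming out of the loader. A hypothetical sanity check you could run on a loaded `data` object (the helper name and shape convention are assumptions, following PyTorch Geometric's usual `[3, F]` layout for triangle meshes):

```python
import torch

def check_face_tensor(face, num_vertices):
    # Hypothetical sanity check for a loaded mesh.
    # PyTorch Geometric conventionally stores triangles as a [3, F] LongTensor.
    # A loader that returns [F, 3] instead, or emits vertex indices beyond
    # num_vertices, produces errors like the two tracebacks above.
    assert face.dim() == 2 and face.size(0) == 3, "expected faces of shape [3, F]"
    assert int(face.max()) < num_vertices, "face references a missing vertex"
```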

With regards to training time: it depends on your system and task. On an RTX 2080 Ti it could take between 40 min and 4 hours. The original correspondence task and the preprocessing stage for segmentation in particular take a long time. In case you want more speed, try replacing the complex_product operation with PyTorch's new complex tensor operations: torch.view_as_complex(x); torch.view_as_complex(y); x * y. It's not part of this repo, as it's still an experimental feature and not part of the experiments we ran, but it could be beneficial if you want a bit more performance.
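The suggested replacement can be sketched roughly as follows, assuming the features are stored as real tensors with a trailing dimension of 2 (real, imaginary) and PyTorch >= 1.6; this is an illustration, not the repo's actual complex_product signature:

```python
import torch

def complex_product(x, y):
    # x, y: real tensors of shape [..., 2] storing (real, imag) pairs.
    # view_as_complex reinterprets the last dimension as complex numbers
    # (it requires contiguous input), so the elementwise complex product
    # becomes a single fused multiply instead of four real multiplies.
    xc = torch.view_as_complex(x.contiguous())
    yc = torch.view_as_complex(y.contiguous())
    return torch.view_as_real(xc * yc)
```

For example, (1 + 2i)(3 + 4i) = -5 + 10i, i.e. inputs [1, 2] and [3, 4] give [-5, 10].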

Let me know if you have any other questions!

pvnieo commented 4 years ago

Hi @rubenwiersma

Thanks for the quick reply. Concerning the notebook question, I updated the PyTorch Geometric version, but I'm still getting an error; this time, it's a vector_heat error:

/content/transforms/vector_heat.py in __call__(self, data)
     49         # We provide the degree of each vertex to the vector heat method,
     50         # so it can easily iterate over the edge tensor.
---> 51         vh_result = vh.precompute(pos, face, edge_index, deg, sample_idx)
     52         vh_weights = vh.weights(pos, face, sample_idx, np.arange(len(sample_idx)))
     53 

ValueError: Solver internals->factorization failed

You can find my notebook here: https://colab.research.google.com/drive/1pqey7-LWeAwoQFR3hkIEEbMEg98VcsW6?usp=sharing The only things I do manually are moving the hsn folder to the main folder after cloning, and downloading the dataset.

Concerning the timing question: I'm running this network for a shape matching experiment, and it's taking too long (4 h for one epoch). When inspecting, I found that the forward pass takes 0.8571 s on average and the backward pass takes 1.1932 s, which totals more than 2 s per iteration. I'm using an RTX 2080. Is this the same for you?
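As an aside, per-iteration GPU timings like these are only meaningful if CUDA work is synchronized before reading the clock. A minimal sketch of one way to measure them (this helper is hypothetical, not part of hsn):

```python
import time
import torch

def time_step(model, x, loss_fn, target):
    # Times one forward and one backward pass in seconds.
    # CUDA kernels launch asynchronously, so we synchronize before each
    # clock read; on CPU the guard simply does nothing.
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)

    sync()
    t0 = time.perf_counter()
    out = model(x)
    sync()
    t_fwd = time.perf_counter() - t0

    t0 = time.perf_counter()
    loss_fn(out, target).backward()
    sync()
    t_bwd = time.perf_counter() - t0
    return t_fwd, t_bwd
```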

rubenwiersma commented 4 years ago

I think I came across this same problem. What solved it for me was to install SuiteSparse and build the vector heat plugin with SuiteSparse (it will automatically detect your installation).

Are you running the exact same setup for correspondences (FAUST dataset, same number of parameters, etc.) or a new experiment? I ran most experiments on a Tesla V100 (from Google cloud). I don’t remember doing experiments that ran longer than a night for the entire training process, so 4hrs per epoch definitely sounds like a long time.

pvnieo commented 4 years ago

Hi @rubenwiersma

Thank you for your response. I'm using the same parameters as the segmentation setting, except that I'm using 64 features; but even with 8 features (as in the segmentation setting), the timing doesn't change.

rubenwiersma commented 4 years ago

Hi @pvnieo, there are a couple of factors that have a major impact on runtime. You can try tweaking each to get a reasonable runtime:

If it isn't one of these things, I'm not sure what could be the problem. One thing that really helps is to run your code through a debugger (copy it to a .py file first), for example in Visual Studio Code. It could help to check the number of points per neighbourhood (run torch_geometric.utils.degree(edge_index[0])), the size of the edge_index tensor, and the size of the x tensor. Hope this helps and good luck!
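The neighbourhood-size check above can be sketched without PyTorch Geometric installed: torch.bincount over the target row yields the same counts as torch_geometric.utils.degree(edge_index[0]) (assuming integer indices, as in any edge_index tensor):

```python
import torch

def neighbourhood_sizes(edge_index):
    # edge_index: [2, E] LongTensor of (target, source) pairs, as used by
    # PyTorch Geometric. Counting how often each target index occurs gives
    # the number of points in each vertex's neighbourhood.
    return torch.bincount(edge_index[0])
```

Unexpectedly large counts here (e.g. from a radius that is too big for the mesh scale) translate directly into slow convolutions and high memory use.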

pvnieo commented 4 years ago

Hi,

I tuned the radius parameter (0.1 and 0.2) and the model became faster and uses less memory. Thanks for the recommendations.

Closing the issue!