valeoai / WaffleIron


Questions Regarding WaffleIron-6-64 and Additional Techniques #3

Closed pierluigidovesi closed 9 months ago

pierluigidovesi commented 9 months ago

Hey,

I've just read your paper and found it quite intriguing. I have a few questions and would appreciate your insights:

Inference Time, Embedded Devices, and Ablations

Intermediate Output Heads and Residual Stacking

Benchmark Leaderboards

Congrats again on the work and see you at ICCV!

gpuy commented 9 months ago

Thank you for your interest in the paper and your questions. I hope you will excuse me for the time it took me to get back to you.

I wanted to mention that we recently uploaded the latest version of the paper on arXiv. Compared to the first version, there is no change in the method per se. The main changes concern the addition of data augmentation and regularization techniques used in other works, which we integrated into WaffleIron to improve the performance.

Inference Time, Embedded Devices, and Ablations

These are good questions which are also of interest to us. Yet, in this first phase, it seemed more important to concentrate on improving the performance than on the inference time, so I don't have more measurements to report than what is in the paper. I can however mention that, in the latest version of the code, I integrated a function `model.compress()` which can be called before inference. It merges all batch norms and scaling layers with the preceding or following convolutional layer, which gives some speed-up at inference.
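As a rough illustration of what this kind of folding buys (this is not the repo's `compress()` implementation, just the generic conv/batch-norm fusion idea on a toy 1×1 convolution, using PyTorch's built-in helper):

```python
import torch
import torch.nn as nn
from torch.nn.utils.fusion import fuse_conv_bn_eval

# Toy layers standing in for a point-wise convolution followed by a batch norm.
conv = nn.Conv1d(64, 64, kernel_size=1, bias=False).eval()
bn = nn.BatchNorm1d(64).eval()
bn.running_mean.uniform_(-1.0, 1.0)  # pretend these are trained statistics
bn.running_var.uniform_(0.5, 2.0)

# Fold the batch norm into the convolution: one layer instead of two at inference.
fused = fuse_conv_bn_eval(conv, bn)

x = torch.randn(2, 64, 1024)
with torch.inference_mode():
    assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```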

> Have you measured the inference time for the WaffleIron-6-64 model, or any other smaller architectures you've experimented with? Besides Table 6, do you have any ablation studies or numbers that show how changes in different components, such as changing the size of the voxel downsampling in preprocessing, affect the inference speed and metrics?

I have no rigorous numbers to report. Concerning the depth L, a linear increase of the inference time with L seems a reasonable approximation (we keep repeating the same operations at each layer).

Concerning voxel downsampling in the pre-processing, I tried smaller voxels (5 cm) on SemanticKITTI but did not notice an improvement in mIoU on the val set, while the compute time got longer. To optimize the compute time vs. performance tradeoff, using cylindrical voxels for downsampling could be an option as well (the number of nearest neighbors in the embedding layer might need to be adapted).
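For anyone experimenting with this tradeoff, a generic voxel-downsampling sketch (not the repo's actual pre-processing; only the 5 cm value comes from the comment above, the rest is illustrative) looks like:

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Keep one representative point per occupied voxel of side `voxel_size` (meters)."""
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)  # first point per voxel
    return points[np.sort(keep)]

cloud = np.random.rand(100_000, 3) * 50.0  # synthetic scene, coordinates in meters
print(voxel_downsample(cloud, voxel_size=0.10).shape)  # coarser voxels: fewer points
print(voxel_downsample(cloud, voxel_size=0.05).shape)  # 5 cm voxels: more points, more compute
```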

> Have you tested the model on embedded devices, such as NVIDIA Jetson?

We have not tested the model on embedded devices yet. We did have the chance to test it on AMD GPUs, though, where we could reproduce the results.

> Are there any ablation studies on the inference speed when changing the Waffle Iron (WI) resolution? I'm interested in understanding which components are driving the speed and by how much.

I don't have numbers on this but the resolution $\rho$ definitely has an influence on the speed in the spatial mixing layers.

Actually, to start optimizing the inference time, I would try going back to a very early version we had tested. Early on in this project, we were not keeping the resolution constant: we were trying to mimic the change of resolution found in U-Net architectures. We started from a fine grid, progressively reduced the resolution to a coarse grid when going deeper, and then increased the resolution again in the last layers just before classification.

We did not keep this strategy for two reasons. First, it introduces additional degrees of freedom which need to be tuned, and I wanted to keep the tuning simple for the first version. Second, on SemanticKITTI, I noticed that the network was overfitting more easily when changing the resolution than when keeping it fixed. Yet this was before introducing all the strong augmentations we currently use on this dataset. Hopefully, these augmentations can allow us to reintegrate the change of resolution to gain compute time.
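To make the change-of-resolution idea concrete, a toy schedule could look like the sketch below (the cell sizes are purely illustrative, not values from the paper or the code): the 2D grid gets coarser towards the middle of the network and finer again before the classification layers.

```python
def resolution_schedule(num_layers: int, fine: float = 0.4, coarse: float = 1.6) -> list[float]:
    """Toy fine -> coarse -> fine cell-size schedule (meters per 2D cell) over the L layers."""
    half = num_layers // 2
    down = [fine + (coarse - fine) * i / max(half - 1, 1) for i in range(half)]
    middle = [coarse] * (num_layers - 2 * half)  # one extra coarse layer if L is odd
    return down + middle + down[::-1]

print(resolution_schedule(12))
```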

Intermediate Output Heads and Residual Stacking

> Have you considered adding intermediate output heads to speed up and stabilize training? Have you explored stacking these heads residually? For instance, making the second head residual on the first, the third on the second, and so on. This concept aligns with what we tried in this paper. We found that early losses sometimes stabilize and accelerate the training phase. Additionally, this setup enables anytime settings, allowing control over the speed/throughput tradeoff at inference time.

Thank you for the suggestion! It seems to be a good idea to try indeed. I don't recall testing any idea similar to this one. I always kept only one output head at the end and applied the loss there.
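If someone wants to prototype this, a rough sketch of residually-stacked intermediate heads could look as follows (this is my reading of the suggestion, with illustrative shapes, not code from either paper): head i adds a correction on top of the logits of head i-1, and every head receives its own loss.

```python
import torch
import torch.nn as nn

class ResidualHeads(nn.Module):
    def __init__(self, channels: int, num_classes: int, num_heads: int):
        super().__init__()
        # One point-wise classifier per tapped intermediate stage.
        self.heads = nn.ModuleList(
            nn.Conv1d(channels, num_classes, kernel_size=1) for _ in range(num_heads)
        )

    def forward(self, feats_per_stage):
        # feats_per_stage: list of (B, C, N) feature maps tapped at intermediate layers.
        logits, outputs = 0, []
        for head, feats in zip(self.heads, feats_per_stage):
            logits = logits + head(feats)  # head i is residual on head i-1
            outputs.append(logits)
        return outputs  # supervise every entry; pick outputs[-1] (or an earlier one) at inference

heads = ResidualHeads(channels=64, num_classes=19, num_heads=3)
feats = [torch.randn(2, 64, 1024) for _ in range(3)]
labels = torch.randint(19, (2, 1024))
losses = [nn.functional.cross_entropy(out, labels) for out in heads(feats)]
loss = sum(losses) / len(losses)
```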

Benchmark Leaderboards

> Have you submitted your results to the official leaderboards for SemanticKITTI and nuScenes?

The results we report on SemanticKITTI in the first table are obtained by submitting the predictions to the leaderboard. We have not submitted our results to the nuScenes leaderboard.

> Have you tested it in indoor scenarios? Would you expect any required/recommended changes in the pipeline (including augmentations and pre-processing)?

Only on S3DIS. On this dataset, we noticed that WaffleIron was overfitting the train set very rapidly. I tried some data augmentation techniques used in the latest works; they helped but were not enough to catch up with the current best methods. We wanted to try larger indoor datasets to tune the backbone more easily, but I never had time to do so.

On S3DIS, it seemed better to process each room in chunks (e.g., cubes of 2.5 m × 2.5 m × 2.5 m) rather than the entire room at once, as is done, I believe, in some recent works.
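A minimal sketch of that chunking step, with the 2.5 m cube size taken from the comment above and everything else illustrative (not the actual S3DIS pipeline):

```python
import numpy as np

def split_into_chunks(points: np.ndarray, chunk_size: float = 2.5):
    """Yield one (N_i, 3) point subset per occupied cube of side `chunk_size` (meters)."""
    cube_idx = np.floor(points[:, :3] / chunk_size).astype(np.int64)
    keys, inverse = np.unique(cube_idx, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    for k in range(len(keys)):
        yield points[inverse == k]

room = np.random.rand(200_000, 3) * np.array([10.0, 8.0, 3.0])  # synthetic room in meters
chunks = list(split_into_chunks(room, chunk_size=2.5))
print(len(chunks), "chunks; first sizes:", [len(c) for c in chunks[:5]])
```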

I hope this helps. See you!

Kin-Zhang commented 8 months ago

Hi @gpuy, I'm wondering about the training time for the best score in Table 1: how many GPUs did you use, what type of GPU were they, and what was the total training time?

Thanks!