Training on Large Scale Dataset - Caching issue

RauchLukas commented 1 month ago

Sorry to bother you, but maybe this is an easy question for you.

Have you experience with the training of large-scale datasets with this MSECNet training pipeline? I am stuck on the task of finetuning the pre-trained PCPNet model with the Structured3D Indoor dataset.

The Structured3D dataset training split holds about 3000 shapes. If I try to use this much samples for the training, memory consumption crashes the system. This is probably due to the caching and It looks like I could handle the memory if i set the cache_capacity to 0. However, the training won't start because the [train] getting information for shape step in the PointcloudPatchDataset takes forever to preprocess all the samples, just to get this little information (the KDTree is deleted imidiatly since the cache size is 1)

So pre-processing every sample upfront looks unfeasible in case of a training set this big.

Is there a quick and dirty way around to avoid this behavior. Something like lazy loading, to load the data only when it is needed?

martianxiu commented 1 month ago

Hello,

I haven’t trained MSECNet on large datasets nor used Structured3D in my projects. However, here are a few suggestions that might help.

If the issue is that the dataset is too large to fit in memory, consider downsampling strategies, such as random sampling or voxel downsampling, either prior to or during training.

If the main bottleneck is data preprocessing time (particularly if constructing KDTrees is time-consuming), one approach could be to prebuild and save the KDTree for each shape in advance. Then, load them during training as needed.

RauchLukas commented 1 month ago

Thank you @martianxiu for your quick reply. I will consider the second option with saving the KDTree.

I also have a follow-up question to the fundamental preparation before training. If I use multiple GPUs, it looks like all the preprocessing is performed and also saved for every GPU separately. Is this desired behaviour?

martianxiu / MSECNet

Training on Large Scale Dataset - Caching issue #4