lxxue / FRNN

Fixed Radius Nearest Neighbor Search on GPU

GPU memory cache #10

Open hugobl1 opened 2 years ago

hugobl1 commented 2 years ago

Hello lxxue, thank you very much for this work! I just have a question about your code. I noticed that when I use it in a for loop, the allocated GPU memory keeps increasing until it hits an OOM error. Have you noticed this too? Do you know where it could come from?

Example code:

dists, idxs, nn, grid = frnn.frnn_grid_points(
    points1, points2, lengths1, lengths2, K, r,
    grid=None, return_nn=False, return_sorted=True,
)

for i in range(n):
    ## Some operations that do not allocate GPU memory
    dists, idxs, nn, grid = frnn.frnn_grid_points(
        points_i, points2, lengths_i, lengths2, K, r,
        grid=grid, return_nn=False, return_sorted=True,
    )
    ## Some operations that do not allocate GPU memory

where points_i is a new point cloud at each iteration. A minimal way to confirm the growth (sketched below, assuming the loop above) is to log PyTorch's allocator counters each iteration:
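
for i in range(n):
    dists, idxs, nn, grid = frnn.frnn_grid_points(
        points_i, points2, lengths_i, lengths2, K, r,
        grid=grid, return_nn=False, return_sorted=True,
    )
    # A steadily rising "allocated" count across iterations means tensors
    # from earlier iterations are never being freed.
    print(f"iter {i}: allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"peak={torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")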

lxxue commented 2 years ago

Sorry for the late reply; I only just got hold of an idle GPU.

I ran a local test like this (pts.pth is located under FRNN/tests/):

import torch
import frnn

points1 = torch.load("pts.pth")[None, ...].float().cuda()
points2 = torch.load("pts.pth")[None, ...].float().cuda()
print(points1)
n = 1000

# Pre-load n copies so every iteration gets a "new" point cloud tensor.
points_list = []
lengths_list = []
for i in range(n):
    points_list.append(torch.load("pts.pth")[None, ...].float().cuda())
    lengths_list.append(points1.shape[1] * torch.ones((1,), dtype=torch.long).cuda())

K = 5
# Search radius: one tenth of the largest bounding-box extent.
r = (points1.amax(dim=1) - points1.amin(dim=1)) / 10
r = r.amax()[None]

lengths_1 = points1.shape[1] * torch.ones((1,), dtype=torch.long).cuda()
lengths_2 = points1.shape[1] * torch.ones((1,), dtype=torch.long).cuda()

# First call builds the grid; the loop below reuses it via grid=grid.
dists, idxs, nn, grid = frnn.frnn_grid_points(
    points1, points2, lengths_1, lengths_2, K, r,
    grid=None, return_nn=False, return_sorted=True,
)

for i in range(n):
    ## Some operations that do not allocate GPU memory
    points_i = points_list[i]
    lengths_i = lengths_list[i]
    dists, idxs, nn, grid = frnn.frnn_grid_points(
        points_i, points2, lengths_i, lengths_2, K, r,
        grid=grid, return_nn=False, return_sorted=True,
    )
    ## Some operations that do not allocate GPU memory

print("done")

and my GPU memory usage stays stable at 2413MiB. Can you share an example and data that reproduce the error?

The GPU memory should be released and reallocated in every iteration if you do not store any results. So I am wondering whether you are saving the results (e.g. dists) for later use, or whether some iteration has a very large point cloud that cannot fit on your GPU.
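
To illustrate the point about reallocation (a generic PyTorch sketch, not FRNN-specific): rebinding the outputs each iteration lets the caching allocator reuse the freed blocks, while holding on to them makes usage grow.

import torch

saved = []
for i in range(100):
    x = torch.randn(1_000_000, device="cuda")  # ~4 MiB per tensor
    # Rebinding x frees the previous block for reuse, so usage stays flat.
    # Keeping a reference defeats that:
    # saved.append(x)  # uncomment -> ~4 MiB of growth per iteration
print(torch.cuda.memory_allocated() / 2**20, "MiB live")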

zjumsj commented 6 days ago

I met a similar issue, and here is my solution. However, I'm not sure whether this modification alters the original logic or might lead to other errors. @lxxue Would you mind taking a look? Thanks a lot.

[Screenshot 2024-11-27 183656: proposed code modification]

lxxue commented 3 days ago

Hi, thank you for pointing this out. Could you provide a minimal example that reproduces the memory leak, so I can check whether your modification fixes it?

zjumsj commented 2 days ago

Here is the test code. I tested it on both an RTX 4090 and an RTX 3070 Laptop GPU and observed similar out-of-memory (OOM) errors on each. I hope you are able to reproduce the issue as well.

'''
    A toy example to demonstrate OOM
'''
import numpy as np

import torch
import torch.nn as nn
import frnn

from tqdm import tqdm

def sample_unitsphere(P):
    # Uniformly sample P points on the unit sphere:
    # phi ~ U[0, 2*pi), cos(theta) ~ U[-1, 1].
    ans = np.empty([P, 3], dtype='float32')
    rnd_ = np.random.uniform(size=[P, 2])
    phi = 2 * np.pi * rnd_[:, 0]
    costheta = rnd_[:, 1] * 2. - 1.
    sintheta = np.clip(1 - costheta * costheta, 0., None)  # guard against FP negatives
    sintheta = np.sqrt(sintheta)
    ans[:, 2] = costheta
    ans[:, 0] = sintheta * np.cos(phi)
    ans[:, 1] = sintheta * np.sin(phi)
    return ans

class Nearest_Search:
    def __init__(self, K=4, r=0.05, device="cuda:0"):
        self.K = K
        self.r = r
        self.grid = None
        self.device = device

    def build(self):
        opt_points = np.random.rand(70_000, 3).astype(np.float32)
        opt_points = opt_points * 2. - 1. # map to [-1,1]
        opt_points = torch.from_numpy(opt_points).cuda()
        self.opt_points = nn.Parameter(opt_points)

        l = [
            {'params': [self.opt_points], 'lr': 0.01}
        ]
        self.optimizer = torch.optim.Adam(l)

        ref_points = sample_unitsphere(10_000).astype(np.float32)
        ref_points = torch.from_numpy(ref_points).cuda()
        self.ref_points = ref_points

    def run(self):

        self.optimizer.zero_grad()

        # In my case I don't need dists; I only use idxs.
        # return_nn=False, so the third return value is None; binding it
        # to "_" avoids shadowing the torch.nn import.
        dists, idxs, _, grid = frnn.frnn_grid_points(
            self.opt_points.unsqueeze(0),  # 1 x P x 3
            self.ref_points.unsqueeze(0),  # 1 x N_vertex x 3
            None, None,
            self.K, self.r, grid=self.grid,
            return_nn=False, return_sorted=True
        )
        idxs_ = idxs[0]
        ref_points = self.ref_points[idxs_].mean(dim=1)  # (P, K, 3) -> (P, 3)
        diff = (self.opt_points - ref_points)
        loss = (diff * diff).mean()
        loss.backward()

        self.dists = dists
        self.idxs = idxs
        # keep the grid to avoid recomputing it every call
        self.grid = grid

        self.optimizer.step()

def train():
    nearest_search = Nearest_Search()
    nearest_search.build()

    n_iters = 40_000
    for ii in tqdm(range(n_iters)):
        nearest_search.run()

if __name__ == "__main__":
    train()

[Screenshot 2024-12-01 142851] [Screenshot 2024-12-01 142835]
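
One way to bisect this (a suggestion, not a confirmed diagnosis): run() keeps dists, idxs, and grid alive on self across iterations, so it is worth first checking a variant that drops all Python-side references. A hypothetical run_no_cache method for the Nearest_Search class above:

# Hypothetical method for the Nearest_Search class above: cache nothing on
# self and rebuild the grid each call. If memory still grows with this
# variant, the allocation is held inside the frnn extension, not by Python.
def run_no_cache(self):
    self.optimizer.zero_grad()
    dists, idxs, _, grid = frnn.frnn_grid_points(
        self.opt_points.unsqueeze(0), self.ref_points.unsqueeze(0),
        None, None, self.K, self.r, grid=None,
        return_nn=False, return_sorted=True,
    )
    ref_points = self.ref_points[idxs[0]].mean(dim=1)
    loss = ((self.opt_points - ref_points) ** 2).mean()
    loss.backward()
    self.optimizer.step()
    # dists, idxs, and grid go out of scope here; nothing is retained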