traveller59 / spconv

Spatial Sparse Convolution Library
Apache License 2.0

Comparing with Facebook SparseCNN #18

Closed Benzlxs closed 5 years ago

Benzlxs commented 5 years ago

Hi Yan,

spconv is indeed faster than Facebook's SparseConvNet on the KITTI object detection task. However, when I use spconv for semantic segmentation on ScanNet, where a U-Net is built and many deconvolution layers are required, spconv (~12 s per iteration) is three times slower than SparseConvNet (~4 s per iteration) under a similar training setting. Do you have any ideas about the reason?

traveller59 commented 5 years ago

Could you provide a simple reproduction example? Are you using SparseConvTranspose? The Deconvolution layer in SparseConvNet is equivalent to SparseInverseConv, not SparseConvTranspose (the latter is very slow and can't be optimized).

Benzlxs commented 5 years ago

Hey Yanyan, sorry for the late reply. The following is the core U-Net code for semantic segmentation.

def U(nPlanes):  # recursive U-Net builder
    m = spconv.SparseSequential()
    if len(nPlanes) == 1:
        for _ in range(reps):
            m.add(spconv.SparseBasicBlock(nPlanes[0], nPlanes[0], 3,
                                          indice_key="subm{}".format(len(nPlanes))))
    else:
        for _ in range(reps):
            m.add(spconv.SparseBasicBlock(nPlanes[0], nPlanes[0], 3,
                                          indice_key="subm{}".format(len(nPlanes))))
        m.add(spconv.ConcatTable()
              .add(spconv.Identity())
              .add(spconv.SparseSequential()
                   .add(spconv.SparseConv3d(nPlanes[0], nPlanes[1], downsample[0],
                                            stride=downsample[1], bias=False,
                                            indice_key="conv{}".format(len(nPlanes))))
                   .add(nn.BatchNorm1d(nPlanes[1], eps=1e-3, momentum=0.01))
                   .add(nn.ReLU())
                   .add(U(nPlanes[1:]))
                   .add(spconv.SparseInverseConv3d(nPlanes[1], nPlanes[0], downsample[0],
                                                   bias=False,
                                                   indice_key="conv{}".format(len(nPlanes))))
                   .add(nn.BatchNorm1d(nPlanes[0], eps=1e-3, momentum=0.01))
                   .add(nn.ReLU())))
        m.add(spconv.JoinTable())
        for _ in range(reps):
            m.add(spconv.SubMConv3d(nPlanes[0] * 2, nPlanes[0], 3, bias=False,
                                    indice_key="end_pp{}".format(len(nPlanes))))
            m.add(nn.BatchNorm1d(nPlanes[0], eps=1e-3, momentum=0.01))
            m.add(nn.ReLU())
            m.add(spconv.SparseBasicBlock(nPlanes[0], nPlanes[0], 3,
                                          indice_key="end_pp{}".format(len(nPlanes))))
    return m
m = U(nPlanes)
return m
dingfuzhou commented 5 years ago

@Benzlxs Hi, have you solved the problem? By the way, how about the performance of your 3D Unet for 3D semantic segmentation on Scannet?

Benzlxs commented 5 years ago

Hey, I have tried the spconv-based 3D U-Net. Unfortunately, its speed is much slower than the Facebook version, so I did not run training experiments for 3D semantic segmentation on ScanNet. The accuracy should be similar, since both are 3D sparse CNNs that differ only in implementation.

traveller59 commented 5 years ago

@Benzlxs Could you provide a minimal reproduction in one file (spconv version only), along with its running time and the expected running time (the SparseConvNet time)? I have time to work on this now.

Benzlxs commented 5 years ago

According to my test, under a similar setting, SparseConvNet takes 3.2 s/epoch while spconv takes 7.4 s/epoch. It is possible that something is wrong with my reproduction code.

I have attached the code below. Remember to replace the '.txt' extension with '.py'.

unet.txt tables.txt

traveller59 commented 5 years ago

@Benzlxs I can't debug your code. Could you provide the code in a form like the following:

import time
import torch

def UNet_vgg():
    pass
net = UNet_vgg().cuda().float() # you may need to wrap this in a simple module
fake_input = # you can use pickle to save a typical scannet input for me.
fake_coords = # you can use pickle to save a typical scannet input for me.
fake_input = fake_input.cuda().float()
fake_input.requires_grad = True
fake_coords = fake_coords.cuda().int() # integer coords can't require grad
torch.cuda.synchronize()
t = time.time()
res = net(fake_input, fake_coords)
torch.cuda.synchronize()
print(time.time() - t, "forward time")
t = time.time()
fake_output_feature_grad = torch.ones_like(res.features)
res.features.backward(fake_output_feature_grad)
torch.cuda.synchronize()
print(time.time() - t, "backward time")

please don't add any SparseConvNet code in benchmark code.

traveller59 commented 5 years ago

@Benzlxs I have found a problem in your spconv code:

coors = in0.get_spatial_locations().int()[:,[3,2,1,0]] 
ret  = spconv.SparseConvTensor(in0.features, coors, dense_shape, data.batch_size)

you need to convert the coordinates to a CUDA tensor (coord.cuda()) to make full use of spconv's GPU rule-generation algorithm. spconv's CPU rule-generation algorithm doesn't use OpenMP, so our CPU code is slower than SparseConvNet's (I have no interest in CPU performance).

Benzlxs commented 5 years ago

@traveller59 I have added coord.cuda(), but the time only decreased slightly. Sorry for the messy code; I have put all of it in a public repository, here is the link. If you want to debug, you will need to download some ScanNet data.

Cheers.

traveller59 commented 5 years ago

@Benzlxs I have no access to the ScanNet data. If spconv's performance on the U-Net/ScanNet task is not important to you, I think it's time to close this issue.

Benzlxs commented 5 years ago

It used to be important, but I am currently busy with other projects, so sure, I can close this issue. Maybe I will come back to fix it later.