traveller59 / spconv

Spatial Sparse Convolution Library

efficiency problem of spconv library in C++ implementation #583

Open pengyusi opened 1 year ago

pengyusi commented 1 year ago

Hello, I'm trying to implement a sparse convolution layer in C++ with TensorRT, but I'm facing a serious efficiency problem. The core step of the sparse conv:

auto conv_res = ConvGemmOps::implicit_gemm(
        alloc2, tuner, input_features_real, Weights, pair_fwd_real,
        pair_mask_splits, mask_argsort_splits, num_act_out_real,
        mask_tensor, arch, false, is_subm,
        reinterpret_cast<std::uintptr_t>(stream),
        tv::CUDAKernelTimer(false), false, false, Bias,
        1.0 /*bias alpha, only used for leaky relu*/,
        0.0 /*unused for now*/, tv::gemm::Activation::kNone);

costs about 20 ms per fp32 forward pass when handling a subm input of 40000 voxels * 128 features. It's even much slower than the Python version. I run it on an NVIDIA Orin. Could you please give me some suggestions, or do you have any ideas about the inefficiency?

FindDefinition commented 1 year ago

When you run the conv for the first time, the tuner object iterates over all available algorithms, chooses the fastest one, and caches the result in the tuner. This costs a lot of time. Subsequent runs (with the previous tuner kept alive) should be faster.
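
For illustration, here is that caching pattern in miniature. This is a self-contained toy, not spconv's actual tuner API; the point is only that the expensive search happens once, the winner is cached inside the object, and the same object must therefore be kept alive between inferences.

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    // Toy "tuner": on the first run() every candidate algorithm is timed
    // and the fastest is cached; later calls reuse the cached choice.
    // This is NOT spconv's ConvGemmOps/tuner API, just the same idea.
    struct ToyTuner {
        std::vector<std::function<void()>> candidates;
        int cached_best = -1;  // -1 means "not tuned yet"

        void run() {
            if (cached_best >= 0) {          // warm path: cached algo only
                candidates[cached_best]();
                return;
            }
            double best = std::numeric_limits<double>::max();
            for (std::size_t i = 0; i < candidates.size(); ++i) {
                auto t0 = std::chrono::steady_clock::now();
                candidates[i]();             // cold path: try every candidate
                double dt = std::chrono::duration<double>(
                                std::chrono::steady_clock::now() - t0).count();
                if (dt < best) { best = dt; cached_best = static_cast<int>(i); }
            }
        }
    };

    int main() {
        ToyTuner tuner;                      // constructed ONCE, kept alive
        tuner.candidates = {[] { /* algo A */ }, [] { /* algo B */ }};
        tuner.run();                         // slow: benchmarks both candidates
        tuner.run();                         // fast: reuses the cached winner
    }

Applied to spconv, this means the tuner handed to ConvGemmOps::implicit_gemm should be a long-lived object (e.g. a member of your TensorRT plugin), not one constructed per forward pass.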

pengyusi commented 1 year ago

Thanks a lot for the quick and helpful suggestion. Now my model runs much faster. I checked the duration of each step (per-step timing sketched below) and found that the pair calculation is now the bottleneck:

pair_res = SpconvOps::get_indice_pairs_implicit_gemm(
            alloc, input_indices_real, batch_size, input_dims,
            static_cast<int>(conv_algo), mKernelShape, mStrides, mPads, mDilations,
            {0, 0, 0}, mIsSubM, false/*transpose*/, false /*is_train*/,
            reinterpret_cast<std::uintptr_t>(stream), out_inds_num_limit,
            tv::CUDAKernelTimer(false), use_direct_table);

For each layer it costs 0.4 ms. I find that all the inputs seem "static", but I still wonder whether there are any potential tricks that could accelerate this step?
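
For reference, a minimal sketch of how one such step can be timed on its stream with standard CUDA events; this uses only the generic CUDA runtime API, nothing spconv-specific, and the std::function step wrapper is illustrative:

    #include <cuda_runtime.h>
    #include <functional>

    // Times one GPU step (e.g. a wrapper around the
    // get_indice_pairs_implicit_gemm call above) with CUDA events recorded
    // on the same stream the ops are launched on.
    float time_step_ms(cudaStream_t stream,
                       const std::function<void(cudaStream_t)>& step) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, stream);   // marks the point before the step
        step(stream);
        cudaEventRecord(stop, stream);    // marks the point after the step
        cudaEventSynchronize(stop);       // wait until the step has finished
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }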

DayBeha commented 1 year ago

Hi, I'm trying to implement this in C++, too, but I'm having trouble printing the output features: they all seem to be zero. Could you please tell me how you get the exact values of the output features?

pengyusi commented 1 year ago

> Hi, I'm trying to implement this in C++, too, but I'm having trouble printing the output features: they all seem to be zero. Could you please tell me how you get the exact values of the output features?

How do you print it?

DayBeha commented 1 year ago

> > Hi, I'm trying to implement this in C++, too, but I'm having trouble printing the output features: they all seem to be zero. Could you please tell me how you get the exact values of the output features?
>
> How do you print it?

I modified implicit_gemm(), and there is a variable named out_features inside it. So I copy it to the CPU and print some of its elements:

    auto out_feas = out_features.cpu();            // copy device tensor to host
    auto out_feas_ptr = out_feas.data_ptr<__half>();
    tv::ssprint("output features:");
    // note: the upper bound here should not exceed the number of elements
    // in out_feas, otherwise this reads past the end of the buffer
    for (int i = 0; i < 100000; i += 1000) {
        std::cout << out_feas_ptr[i] << std::endl;               // implicit __half -> float conversion
        std::cout << __half2float(out_feas_ptr[i]) << std::endl; // explicit conversion, same value
    }

DayBeha commented 1 year ago

> When you run the conv for the first time, the tuner object iterates over all available algorithms, chooses the fastest one, and caches the result in the tuner. This costs a lot of time. Subsequent runs (with the previous tuner kept alive) should be faster.

I understand the time cost of the first run, but not how this works with the tuner. Could you please send me an example with several subsequent runs?

pengyusi commented 1 year ago

My suggestion: try e.g. tv::ssprint(out_feas[0]) to check whether your output actually lands in out_feas, so that you know whether it's a calculation problem or a printing problem.
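
A minimal sketch of such a check, reducing the whole host copy instead of printing single elements; it reuses only calls already shown in this thread, and treating out_feas.size() as the total element count is an assumption about the tv::Tensor API:

    #include <cstdint>
    #include <cuda_fp16.h>
    // tv::Tensor comes from spconv's tensorview dependency; the exact
    // header path varies per setup, so it is omitted here.

    // Sum |v| over all output features on the host. A result of ~0 means
    // the features really are all zero (a calculation problem); a clearly
    // nonzero sum points at a printing problem instead.
    float abs_sum(tv::Tensor out_features) {
        auto host = out_features.cpu();          // device -> host copy
        auto ptr = host.data_ptr<__half>();
        float acc = 0.f;
        for (int64_t i = 0; i < host.size(); ++i) {  // size(): assumed element count
            float v = __half2float(ptr[i]);
            acc += v < 0 ? -v : v;
        }
        return acc;
    }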


DayBeha commented 1 year ago

> My suggestion: try e.g. tv::ssprint(out_feas[0]) to check whether your output actually lands in out_feas, so that you know whether it's a calculation problem or a printing problem.

I tried. It's still all zeros.