Open pengyusi opened 1 year ago
When you run the conv for the first time, the tuner object iterates over all available algorithms and chooses the fastest one, then caches the result in the tuner. This costs a lot of time. Subsequent runs (keeping the previous tuner alive) should be faster.
Thanks a lot for the quick and helpful suggestion. My model now runs much faster. I checked the duration of each step and found that the pair calculation is now the bottleneck:
pair_res = SpconvOps::get_indice_pairs_implicit_gemm(
alloc, input_indices_real, batch_size, input_dims,
static_cast<int>(conv_algo), mKernelShape, mStrides, mPads, mDilations,
{0, 0, 0}, mIsSubM, false/*transpose*/, false /*is_train*/,
reinterpret_cast<std::uintptr_t>(stream), out_inds_num_limit,
tv::CUDAKernelTimer(false), use_direct_table);
For each layer it costs 0.4 ms. All the inputs seem "static", but I still wonder whether there are any tricks that could accelerate this step?
Hi, I'm trying to implement this in C++ too, but I have trouble printing the output features: they all seem to be zero. Could you please tell me how you get the exact values of the output features?
How do you print it?
I rewrote implicit_gemm(), and there is a variable named out_features. I copy it to the CPU and print some of its elements:
auto out_feas = out_features.cpu();
auto out_feas_ptr = out_feas.data_ptr<__half>();
tv::ssprint("output features:");
for (int i = 0; i < 100000; i += 1000) {
  std::cout << out_feas_ptr[i] << std::endl;               // raw __half
  std::cout << __half2float(out_feas_ptr[i]) << std::endl; // converted to float
}
I understand the time cost of the first run, but not how it works with the tuner. Could you please send me an example with several subsequent runs?
My suggestion: try using e.g. tv::ssprint(out_feas[0]) to check whether your output is actually stored in out_feas, so you know whether it's a calculation problem or a printing problem.
I tried. It's still all zeros.
Hello, I'm trying to implement a sparse convolution layer in C++ with TensorRT, but I'm facing a serious efficiency problem. The core step of sparse conv:
costs about 20 ms per FP32 calculation when handling a subm input of 40000 voxels × 128 features. It's even much slower than the Python version. I'm running it on an NVIDIA Orin. Could you please give me some suggestions, or do you have any ideas about the inefficiency?