mali-tintash closed this issue 4 years ago
Try writing your own multiplication shader instead.
------------------ Original ------------------ From: Muhammad Ali <notifications@github.com> Date: Tue,Jan 21,2020 5:54 PM To: sjy234sjy234/KinectFusion-ios <KinectFusion-ios@noreply.github.com> Cc: Subscribed <subscribed@noreply.github.com> Subject: Re: [sjy234sjy234/KinectFusion-ios] ICPReduceMatrix takes too long to be realtime (#3)
I am trying to optimise it for faster processing on an iPhone X. From my analysis, most of the processing time goes into ICPReduceMatrix. I have modified the pipeline to use as few command buffers as I thought possible. But ICPReduceMatrix already uses MPSMatrixMultiplication and is surprisingly slow, considering that it runs on the GPU.
You mentioned that with GPGPU practices we can reach 30 FPS on an iPhone X, but I don't understand what to do with this part of the code. Can you please guide me?
My guess is that the multiplication is slow "probably" because of stride misalignment in the input matrices, but I could be wrong. What did you base the 30 FPS suggestion on?
How do I optimise for stride? All the routines for getting the stride seem to be obsolete. Also, ICPReduceMatrix takes "OccupiedPixelNumber" as input, which is only available once ICPPrepareMatrix has finished with "waitUntilCompleted". Is there a way around that?
Secondly, because "occupiedPixelNumber" varies, we have to allocate a new buffer for each iteration. That seems to block the CPU on allocation, and we lose about 1.2 ms before the kernel is launched. Do you have any suggestions for that?
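On the stride question, one common approach is to pad each matrix row so that its byte length is a multiple of some alignment. The sketch below shows only that arithmetic on the CPU. The 16-byte alignment and the helper names `paddedRowBytes` and `elementIndex` are assumptions for illustration, not values taken from this repository or from MPS; on a device, the recommended row size should come from the MPS matrix descriptor rather than a hard-coded constant.

```cpp
#include <cstddef>

// Round the byte length of one row up to a multiple of `alignment`.
// `alignment` must be a power of two; 16 is an assumed value, not an MPS constant.
inline std::size_t paddedRowBytes(std::size_t columns, std::size_t elementSize,
                                  std::size_t alignment = 16) {
    const std::size_t raw = columns * elementSize;
    return (raw + alignment - 1) & ~(alignment - 1);
}

// Index of element (row, col) in a float buffer whose rows start `rowBytes` apart.
// `rowBytes` is expected to be a multiple of sizeof(float).
inline std::size_t elementIndex(std::size_t row, std::size_t col,
                                std::size_t rowBytes) {
    return row * (rowBytes / sizeof(float)) + col;
}
```

With these assumptions, a row of six floats occupies 24 bytes of data but is stored 32 bytes apart, so the writer and the reader of the buffer must both use the padded stride.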
Estimate an upper bound and allocate a fixed-size buffer in the first place.
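A minimal sketch of that idea, written as a CPU-side stand-in rather than actual Metal code: the buffer is allocated once with a chosen capacity, and each iteration only records how many rows are valid. The type `FixedRowBuffer` and its members are hypothetical names, not part of this repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side stand-in for a GPU buffer that is allocated once.
struct FixedRowBuffer {
    std::size_t capacityRows;   // upper bound chosen up front
    std::size_t columns;        // floats per row
    std::size_t validRows = 0;  // rows actually filled in the current iteration
    std::vector<float> storage; // capacityRows * columns floats, never resized

    FixedRowBuffer(std::size_t capacity, std::size_t cols)
        : capacityRows(capacity), columns(cols), storage(capacity * cols, 0.0f) {}

    // Record how many rows this iteration produced. The count is clamped to
    // the capacity instead of triggering a reallocation; the return value is
    // the number of rows that did not fit.
    std::size_t setValidRows(std::size_t requested) {
        validRows = std::min(requested, capacityRows);
        return requested - validRows;
    }
};
```

Because the storage never changes size, the per-iteration allocation cost described above should not recur. The trade-off is that rows beyond the capacity are dropped, so the capacity has to be chosen generously enough for the scene.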
I am unclear on how to estimate the number of occupied pixels returned from ICPPrepare. Any pointers, please?
Do you have a transpose + multiply kernel written for Metal? If so, could you share it?
There is an upper bound (w * h). Set a fixed ratio, such as 0.6 * w * h. The multiplication is quite straightforward; try it yourself.
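As a rough CPU reference for what such a kernel would compute, the sketch below sizes the buffer from a ratio of w * h and then accumulates A^T A and A^T b directly, without materialising the transpose. It assumes six columns per row, which is the usual point-to-plane ICP layout; the thread does not state the actual matrix shape. The names `estimateCapacity`, `NormalEquations`, and `reduceNormalEquations` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kCols = 6;  // assumed: six unknowns per ICP row

// Fixed capacity from the image size and an occupancy ratio (0.6 in the thread).
inline std::size_t estimateCapacity(std::size_t width, std::size_t height,
                                    double ratio) {
    return static_cast<std::size_t>(ratio * static_cast<double>(width * height));
}

struct NormalEquations {
    std::array<float, kCols * kCols> ata{};  // A^T A, row-major 6x6
    std::array<float, kCols> atb{};          // A^T b
};

// Accumulate A^T A and A^T b over the first `rows` rows of A.
// A is row-major with kCols floats per row; b has one float per row.
inline NormalEquations reduceNormalEquations(const std::vector<float>& A,
                                             const std::vector<float>& b,
                                             std::size_t rows) {
    NormalEquations out;
    for (std::size_t r = 0; r < rows; ++r) {
        const float* row = &A[r * kCols];
        for (std::size_t i = 0; i < kCols; ++i) {
            for (std::size_t j = 0; j < kCols; ++j) {
                out.ata[i * kCols + j] += row[i] * row[j];  // outer product of the row
            }
            out.atb[i] += row[i] * b[r];
        }
    }
    return out;
}
```

A GPU version would typically split the rows across threadgroups, accumulate partial sums per group, and combine them in a final pass. If the kernel reads the valid row count from a GPU-side counter, the CPU wait between the prepare and reduce steps might be avoidable, but that is an assumption about how the pipeline could be restructured, not something confirmed in this thread.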