psmoveservice / PSMoveService

A background service that communicates with the PSMove controller and stores pose and button data.
Apache License 2.0
586 stars, 148 forks

Performance improvement by progressive HSV elimination #301

Open gb2111 opened 7 years ago

gb2111 commented 7 years ago

Hi. This is to document an idea: we could potentially improve performance by calculating H, S and V progressively and eliminating pixels at each step. Right now the HSV values are pre-calculated, so the cost of "calculation" is simply the cost of accessing a pointer at an offset, but that cost is quite large. It might take less effort to calculate HSV manually but progressively: calculate H and eliminate, say, 1/3 of the pixels (very low cost), then S to eliminate another 1/3 (low cost), and only then calculate V for the remainder (a bit more expensive than now). I am only guessing here, but I would expect even fewer than 1/3 of the pixels to need all three HSV values calculated.
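The elimination order described above can be sketched per pixel. This is only an illustration (my code, not PSMoveService's); ranges follow OpenCV's 8-bit convention of H in [0,180) and S, V in [0,255]:

```cpp
#include <algorithm>
#include <cstdint>

// Progressive elimination sketch: threshold H first, bail out early, and
// only compute S and V for pixels that survive the earlier checks.
bool passesHSV(uint8_t r, uint8_t g, uint8_t b,
               int hMin, int hMax, int sMin, int sMax, int vMin, int vMax)
{
    int maxc = std::max({int(r), int(g), int(b)});
    int minc = std::min({int(r), int(g), int(b)});
    int delta = maxc - minc;

    // Step 1: hue. For most background pixels this is where we stop.
    int h = 0;
    if (delta != 0) {
        if (maxc == r)      h = (30 * (g - b) / delta + 180) % 180;
        else if (maxc == g) h = 30 * (b - r) / delta + 60;
        else                h = 30 * (r - g) / delta + 120;
    }
    if (h < hMin || h > hMax) return false;

    // Step 2: saturation (costs one integer division).
    int s = (maxc == 0) ? 0 : 255 * delta / maxc;
    if (s < sMin || s > sMax) return false;

    // Step 3: value is just the max channel, already computed.
    return maxc >= vMin && maxc <= vMax;
}
```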

cboulay commented 7 years ago

What do you mean by "calculate"? Do you mean to threshold hue first, then only threshold saturation on the values that passed the hue threshold, and so on? Can you take a look at the source code for cv::inRange to make sure that's not what they are doing anyway? If what you describe really provides a significant benefit then it's something that should happen in OpenCV.

BTW, did you see the ROI branch/pull request? There's a huge performance gain there. By using the ROI, the number of processed pixels that might pass one of the criteria but not others is really quite small. I don't think the method you proposed would save all that much time.

gb2111 commented 7 years ago

Hello Chad, yes, I heard about ROI. That will be a great improvement, and many people with an i5 have been waiting for it!

Regarding this topic: the difference is that currently you compute HSV for the whole bitmap, which takes time. Basically the cost is accessing an array of pre-calculated HSV values indexed by RGB, but that is a major cost, as I could see when doing performance analysis. In my approach you do calculation and elimination in one step for each pixel. You first calculate H and check whether you can eliminate the pixel. If yes, the job is done. Then calculate S, and if it is out of bounds, the job is done. Then finally calculate V and do the full comparison. So basically it is a function that takes RGB as input and produces a thresholded HSV. Please let me know if the difference is clear now and whether you see that it can potentially work.

cboulay commented 7 years ago

OK I see what you're saying, I think. With the ROI implementation, we only do the bgr2hsv on ROI'd pixels. And on top of that there is @HipsterSloth 's method of converting bgr2hsv using a pre-calculated lookup table. For each pixel in the ROI only, the bgr values are combined to get an index, then the hsv from that index is copied.
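For reference, the lookup-table scheme described here might look roughly like this (the names and layout are my guesses, not the actual PSMoveService code). Packing the three 8-bit BGR channels into a 24-bit index gives a table of 2^24 × 3 bytes, which is the ~50.3 MB figure quoted later in this thread:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch of a precomputed BGR -> HSV lookup table.
struct HsvLut {
    // 2^24 entries * 3 bytes (H, S, V) each = 50,331,648 bytes (~50.3 MB).
    std::vector<uint8_t> table = std::vector<uint8_t>((size_t(1) << 24) * 3);

    // For each pixel: combine the BGR bytes into an index, copy the HSV out.
    void lookup(uint8_t b, uint8_t g, uint8_t r,
                uint8_t &h, uint8_t &s, uint8_t &v) const
    {
        size_t idx = (size_t(b) << 16 | size_t(g) << 8 | size_t(r)) * 3;
        h = table[idx];
        s = table[idx + 1];
        v = table[idx + 2];
    }
};
```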

I'd be really interested to see if the ROI branch works for you and how much CPU savings you have. It is functional now. Please try it out. It's just missing a bit of cleanup before we merge.

cboulay commented 7 years ago

Sorry, after re-reading your last post, I realize that you understand how the custom bgr2hsv works. Is it really a major cost to index the array? I doubt it's less costly to calculate H, then S, then V, otherwise @HipsterSloth would not have used the LUT method. But you're saying that for some portion of pixels, you can get away with calculating only H, and on average that will be less costly than the LUT.

Currently the ROI is set to be twice as wide and twice as tall as the sphere projection. For simplicity, let's say the sphere projection in the middle of the image is a perfect circle of radius r. The area taken up by the sphere is then πr², while the area taken up by the ROI is (4r)² = 16r². So the fraction of pixels in the ROI that we know should pass all 3 thresholds is π/16, or about 20%.

In the remaining 80% of pixels there might be some savings. Let's say it's perfect and none of the non-sphere pixels pass the first step (H). So what you're saying is that 0.8·t(H) + 0.2·t(HSV) < t(LUT): the time it takes to calculate only H in 80% of the pixels, plus the time it takes to calculate full HSV in 20% of the pixels, plus the boolean checks, is less than the time it takes to do the LUT method in 100% of the pixels.
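The break-even condition above is easy to evaluate numerically. The timings below are placeholders purely for illustration, not measurements:

```cpp
// Returns true when progressive elimination beats the LUT, i.e. when
//   (1 - p) * t(H) + p * t(HSV) < t(LUT)
// where p is the fraction of pixels expected to pass all thresholds
// (roughly pi/16 ~ 0.2 with the ROI geometry described above).
bool progressiveWins(double tH, double tHSV, double tLUT, double passFraction)
{
    return (1.0 - passFraction) * tH + passFraction * tHSV < tLUT;
}
```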

If you can prove it then go ahead. Edit: Although I'm a bit hesitant to replace standardized and well-documented OpenCV functions with custom functions, so it would need to be a substantial savings. Also, if you compile the OpenCL version of OpenCV then I believe it has some parallelizations for cv::inRange.

cboulay commented 7 years ago

I was thinking about this a bit on the bus ride home. I think there are some advantages to this proposed approach. As it is now, we already have some custom functions to replace an OpenCV function (cvtColor). Your approach would add a custom function, but it would also get rid of a custom function. Another (major?) advantage to this approach is that it would have a much smaller memory footprint. Currently the LUT takes up 50.3 MB, and I think there is one of those for each tracker. We can even get rid of the hsvBuffer, but that's a much smaller savings (only about ~1 MB per PS Eye).

So if you want, then implement a function with a signature like the following: `void inRangeHSVofBGR(const cv::Mat &inBgrBuffer, cv::Scalar minHSV, cv::Scalar maxHSV, cv::Mat &outGsBuffer)`. As long as it is at least as fast as the current method, I would consider this an improvement.

As a starting point for how this function might look, I was trying to decipher opencv's inRange function. There's some weird template and OpenCL magic in there, but what I think might be the core non-opencl algorithm is here. I have no idea how important some of the optimizations are.
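Such a function's body might look roughly like the sketch below. To keep it self-contained it is written over raw interleaved BGR bytes rather than cv::Mat (the cv::Mat version would wrap the same inner loop over each row pointer); the early `continue`s implement the progressive elimination, and the thresholds follow OpenCV's 8-bit HSV convention:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch only: raw interleaved BGR in, one mask byte (0 or 255) per pixel out.
void inRangeHSVofBGR(const uint8_t *bgr, size_t nPixels,
                     int hMin, int hMax, int sMin, int sMax,
                     int vMin, int vMax, uint8_t *mask)
{
    for (size_t i = 0; i < nPixels; ++i, bgr += 3) {
        mask[i] = 0;
        int b = bgr[0], g = bgr[1], r = bgr[2];
        int maxc = std::max({b, g, r});
        int minc = std::min({b, g, r});
        int delta = maxc - minc;

        // Hue first, as proposed: most non-bulb pixels should fail here.
        int h = 0;
        if (delta != 0) {
            if (maxc == r)      h = (30 * (g - b) / delta + 180) % 180;
            else if (maxc == g) h = 30 * (b - r) / delta + 60;
            else                h = 30 * (r - g) / delta + 120;
        }
        if (h < hMin || h > hMax) continue;

        // Saturation second: one division, only for pixels that passed H.
        int s = (maxc == 0) ? 0 : 255 * delta / maxc;
        if (s < sMin || s > sMax) continue;

        // Value last: it is just max(B,G,R), already computed for the hue.
        if (maxc >= vMin && maxc <= vMax) mask[i] = 255;
    }
}
```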

gb2111 commented 7 years ago

Hi, thanks for the reply. Regarding implementation, there are a few examples on Stack Overflow. But first I am going to do an initial performance test of calculating H, HS and HSV vs. indexing the LUT + inRange. Then we add the percentages and see if it is worth it.

I'd be happy to test ROI, but my phone is in for service now, so I will have a forced break from VR :)

cboulay commented 7 years ago

@gb2111 , Did you get a chance to try to implement this yet? I'm curious if it works well.

gb2111 commented 7 years ago

@cboulay , actually I wanted to wait until the Kalman and ROI work gets stable. I wanted to propose moving some code around to a different place. If you say the code is not going to change much, I will start looking into implementing this.

cboulay commented 7 years ago

Well, the Kalman stuff still needs work, but that should be almost entirely separate from this work. The ROI work is mostly stable. It could be reorganized a bit, but HipsterSloth and I can manage merging your work with whatever happens between now and when it's done.

HipsterSloth commented 7 years ago

Yeah the ROI and Kalman code are stable for now. There are other larger sized improvements that Chad outlined for ROI and the Kalman filter, but it's going to be a few weeks before I get to those. I'm currently trying to finalize DS4 and PSNavi support. All of that work shouldn't touch any of the color or positional filtering code.

gb2111 commented 7 years ago

OK. In that case I will test whether it works with ROI disabled.

MHDante commented 7 years ago

That SIMD code is nuts.

From what I can tell, the InRange_ function always receives zeroes for all of its step parameters. So in this case, most of the SIMD optimization is involved in checking a whole block of 16 floats using 128-bit SIMD instructions (4 floats at a time). Other than that, the whole thing runs in a loop using an NAryIterator over the entire matrix (which is surprising, given that it could be parallelized).

Theoretically, all of that results in a 2-4x speedup compared to iterating over the floats individually. However, in your approach that isn't possible, as you're not applying a homogeneous operation over the entire RGB vector. In other words, I don't think you can glean much help from OpenCV's inRange implementation.

Their cvtColor has a fairly standard HSV algorithm, although they iterate over the matrix using a parallel_for. Apparently it uses whatever number of threads OpenCV was initialized with (setNumThreads()).

Another benefit to your approach is that it stops you from having to make all those ROI buffers.

cboulay commented 7 years ago

> Another benefit to your approach is that it stops you from having to make all those ROI buffers.

Just a point of clarification: the ROI variables don't take up much extra memory on top of the frame buffer backing them, because they merely copy a pointer to the data, not the data itself. So those lines don't impact performance or memory footprint much. However, I think what you meant is that we can get rid of the buffer altogether (and the ROI that uses it), which should lead to a decrease in memory usage.

gb2111 commented 7 years ago

@cboulay , I observed that addressing the buffered array takes CPU. So even though you would expect that, with pre-calculated values, merely indexing the array should perform well, that is not the case. At least that is what I observed when measuring performance. So the actual performance gain would come from getting rid of ~80% of the array indexing by filtering out on H (and eventually S), since that requires only a comparison.

I am going to do this next week, when I finally migrate to 6.x. HipsterSloth fixed something related to controller shaking, but disable_roi does not work...

MHDante commented 7 years ago

Oh, since you're optimizing, why not filter out V before S and H (since V is the least expensive to filter)? Also, in dark rooms there's a hue-matching halo around the controller bulb.

cboulay commented 7 years ago

I'm curious, why is V cheaper to filter? Is it because there are fewer false positives in V than in H, so the next steps will have fewer pixels to check? Or something else?

gb2111 commented 7 years ago

I mean to filter them in the order in which they are calculated. We'll see if the idea works when I test it.


MHDante commented 7 years ago

First off, it's a tiny optimization. 🤷‍♂️

I assume it gives fewer false positives due to the halo I mentioned earlier (caused by the bulb illuminating hue-neutral surroundings).

However, the real saving is that you know the value of V five lines into the HSV calculation, so it's your earliest exit (it requires no flops).

Honestly, though, it's probably less than a millisecond per frame.
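For concreteness (my fragment, not code from the project): in the standard RGB→HSV conversion, V is simply max(R,G,B), available before any division, so a V check can reject a pixel almost for free:

```cpp
#include <algorithm>

// V is just the max channel, so this check costs one max and two compares,
// with no division at all -- the "earliest exit" described above.
inline bool vOutOfRange(int r, int g, int b, int vMin, int vMax)
{
    int v = std::max({r, g, b});
    return v < vMin || v > vMax;
}
```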

gb2111 commented 7 years ago

We will see.
