Open ofTheo opened 2 years ago
This is the proposed change:
from:
//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint) const{
    if(!bIsAllocated) return;
#ifndef TARGET_OPENGLES // <-- non OPENGLES
    getTexture(attachmentPoint).readToPixels(pixels);
#else
    pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));
    bind();
    int format = ofGetGLFormatFromInternal(settings.internalformat);
    glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
    unbind();
#endif
}
to:
//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint) const{
    if(!bIsAllocated) return;
#ifndef TARGET_OPENGLES // <-- non OPENGLES
    if(settings.numSamples>0){
        getTexture(attachmentPoint).readToPixels(pixels);
    }else{
#endif
        pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));
        bind();
        int format = ofGetGLFormatFromInternal(settings.internalformat);
        glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
        unbind();
#ifndef TARGET_OPENGLES
    }
#endif
}
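To make the branching in the proposed version easier to follow, here is a minimal standalone sketch of the decision it encodes. The names ReadPath and chooseReadPath are hypothetical and do not exist in openFrameworks; the isGLES flag stands in for TARGET_OPENGLES being defined at compile time.

```cpp
// Hypothetical model of the path selection in the proposed readToPixels().
// Nothing here touches OpenGL; it only mirrors the #ifndef / if structure.
enum class ReadPath {
    TextureReadback,        // getTexture(attachmentPoint).readToPixels(pixels)
    FramebufferReadPixels   // bind(); glReadPixels(...); unbind();
};

// isGLES: assumption, models TARGET_OPENGLES being defined.
// numSamples: models settings.numSamples.
constexpr ReadPath chooseReadPath(bool isGLES, int numSamples) {
    // Desktop GL with multisampling keeps the existing texture readback.
    if (!isGLES && numSamples > 0) {
        return ReadPath::TextureReadback;
    }
    // Everything else goes through glReadPixels on the bound FBO.
    return ReadPath::FramebufferReadPixels;
}
```

In this model a non-multisampled desktop FBO now takes the same glReadPixels path that OpenGL ES builds already use, while multisampled desktop FBOs are unchanged.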
I made a branch with the proposed change as an alternate method, plus an app that tests how many downloads of a shader-redrawn GL_RGBA 1920x1080 texture a system can complete within a 1/30s frame.
results can be found here: https://github.com/artificiel/openFrameworks/tree/fbo_download/apps/devApps/FBODownloadTest
in short, the proposed method is consistently faster, more meaningfully so on x64 than arm64.
others might want to test more constrained systems/configs (rPi, windows, etc) but it seems the change is a net win. (it would be nice to see some cutting edge intel too).
however it does raise the question: why is the PBO method not the default implementation? (whenever I need to download textures I grab ofxFboFastReader without hesitation as it is systematically faster (and dramatically more so on arm64, perhaps due to the memory layout of M1/M2 chips)).
if there are backward-compatibility or platform-availability or other technical reasons why PBO might be counter-productive, couldn't there be a readToPixelsViaPBO() builtin method so it can be actively selected within the comfort of the core?
I think maybe PBO was not broadly possible in different video cards? Can it be detected? this way we could use PBO as default, and the actual one as a fallback
@artificiel weird that we get such different results for the ofxFastFboReader - when you were timing it did you have the readback in blocking or non blocking mode? As that might have affected the timing.
@dimitre - I am not 100% sure why the PBO method is not default, but I am guessing it could be fairly easily implemented - I just always suggest benchmarking changes first as sometimes our assumption of what is faster doesn't bear out in the data. :)
At the least though we could implement the suggested change and even add PBO based readback as a separate feature if available.
I just re-ran the tests on M2 machine and get similar numbers. @ofTheo not sure about "blocking or non blocking mode"? the other diff is that you worked 4096x4096 while my test is 1920x1080. maybe the gains are different on texture size?
@artificiel you mean similar numbers in both implementations or similar numbers as your previous benchmark in this readme? https://github.com/artificiel/openFrameworks/tree/fbo_download/apps/devApps/FBODownloadTest
ah sorry similar as previous test!
so I gave a look to the benchmark and noticed @ofTheo was testing 4096x4096 while this is based on 1920x1080. so I tried in 4096x4096 and get different results:
4096x4096 (old, new, fastfbo)
fps-based @ 1 iter: 56fps, 93fps, 102fps [1x, 1.7x, 1.8x] 3584Mbps, 6144Mbps, 6528Mbps
iter-based @ 30fps: 2, 7, 9 [1x, 3.5x, 4.5x] 3840Mbps, 13440Mbps, 17280Mbps
the first line mostly recreates @ofTheo's test.
so it's a different "skewing" than 1920x1080, for which the equivalence is:
fps-based @ 1 iter: 139fps, 174fps, 270fps [1x, 1.25x, 1.9x] 1042Mbps, 1305Mbps, 2025Mbps
iter-based @ 30fps: 9, 15, 77 [1x, 1.66x, 8.5x] 2025Mbps, 3375Mbps, 17325Mbps
so it seems 2 things affect the readings: the size of the texture, and whether it's under pressure with multiple readbacks within a frame (this benchmark approach) vs free-running fps. also, this benchmark might be flawed in the "pressure" approach in the sense that maybe some things are optimized/elided/dropped? 17325Mbps is still a realistic bandwidth with the M2 (supposedly 25000Mbps max).
it's interesting to see both resolutions top out at similar bandwidth with the fastFBO so texture size seems irrelevant. with the New method, larger textures have higher gains than smaller ones.
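For readers who want to check the bandwidth column, here is a small sketch of the arithmetic it appears to follow for the 4096x4096 rows: frame size in bytes (4 bytes per pixel for GL_RGBA with GL_UNSIGNED_BYTE) times readbacks per second, divided by 1024*1024. The function names are hypothetical. That the figures are megabytes rather than megabits per second is an inference from the numbers, not something stated in the thread.

```cpp
#include <cstdint>

// Bytes in one frame; bytesPerPixel defaults to 4 (RGBA, 8 bits per channel).
constexpr std::uint64_t frameBytes(std::uint64_t width,
                                   std::uint64_t height,
                                   std::uint64_t bytesPerPixel = 4) {
    return width * height * bytesPerPixel;
}

// Throughput in MiB per second for a given number of readbacks per second.
// For the iteration-based test, readsPerSecond = iterations * 30.
constexpr double throughputMiBPerSec(std::uint64_t width,
                                     std::uint64_t height,
                                     double readsPerSecond) {
    return static_cast<double>(frameBytes(width, height)) * readsPerSecond
           / (1024.0 * 1024.0);
}
```

With these definitions, throughputMiBPerSec(4096, 4096, 56) gives 3584 and throughputMiBPerSec(4096, 4096, 2 * 30) gives 3840, matching the "old" column of the 4096x4096 results.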
Great! I suppose using more than 3 buffers in ofxFastFboReader can be a good improvement also for big textures
ah buffering, good question. another round, this time with macOS game mode fullscreen (so the results are 10-15 % higher, but perceptibly (not measured) much more stable):
there seems to be little gain to bump buffers to 10. however (to confirm it has an effect) 1 is definitely less good. so it seems 3 buffers is a good default.
old, new, (fast1, fast3, fast10)
fps-based @ 1 iter:
4096x4096 : fps: 86, 222, (216, 240, 242) ; Gbps: 43, 112, (108, 120, 121)
1920x1080 : fps: 238, 283, (527, 840, 840) ; Gbps: 14, 17, (32, 51, 51)
iter-based @ 30fps
4096x4096 : iters: 3, 11, (9, 12, 12) ; Gbps: 50, 165, (140, 186, 186)
1920x1080 : iters: 11, 16, (49, 93, 96) ; Gbps: 20, 54, (90, 174, 175)
but the results are consistent: there is a very good gain between Old and New on large textures, less dramatic at HD size (but still a gain). Fast seems to be specifically more effective under pressure where it seems to saturate the memory bus (max 200). (we notice that at 1 buffer it's a bit less performant than New, but at 3 buffers the advantage always shifts to Fast)
@artificiel I think the "blocking" mode Theo is referring to is the async method in ofxFastFboReader. It is asynchronous by default, so it's not immediate like the proposed method. https://github.com/satoruhiga/ofxFastFboReader/blob/12b0069cc54d8496ada287c155018a771f7fc248/src/ofxFastFboReader.h#L15
@NickHardeman ah! ok I checked the implementation and blocking simply ignores buffering, so in the table above that would be equivalent to fast1. so New is better than Fast1/blocking with huge textures, but Fast1/blocking is better than New with HD textures.
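To illustrate why a buffered readback is not "immediate", here is a toy model of an N-slot ring, with no OpenGL involved. ReadbackRing and submitAndFetch are hypothetical names, integer frame ids stand in for pixel data, and this is not how ofxFastFboReader is actually written. It only shows the latency trade-off: with 1 slot you get back the frame you just submitted (the fast1/blocking case), while with 3 slots you get the frame submitted two calls earlier.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy model of round-robin readback buffers (assumption: one transfer
// is queued per call and the oldest queued transfer is the one returned).
class ReadbackRing {
public:
    explicit ReadbackRing(std::size_t numBuffers)
        : slots_(numBuffers == 0 ? 1 : numBuffers) {}

    // Queue frameId into the current slot, then return the content of the
    // oldest slot. Returns an empty optional until the ring has filled.
    std::optional<int> submitAndFetch(int frameId) {
        slots_[writeIndex_] = frameId;
        writeIndex_ = (writeIndex_ + 1) % slots_.size();
        // After advancing, writeIndex_ points at the oldest slot.
        // With a single slot that is the frame just written.
        return slots_[writeIndex_];
    }

private:
    std::vector<std::optional<int>> slots_;
    std::size_t writeIndex_ = 0;
};
```

In this model the extra slots add a latency of numBuffers - 1 frames in exchange for never waiting on the transfer that was just queued, which is why a buffered reader would change behaviour if it silently replaced an immediate readToPixels().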
@artificiel, ok, thank you for the clarification
ALSO the original benchmark compared 4 machines and shows that on Intel (Mac and linux) the relationship is different and there is not such a dramatic bump with Fast.
these latest tests above are on Apple M2 that I have on the desk — perhaps the integrated memory facilitates things in some way... all this to say the impact of New over Old is more positively pronounced on Intel.
Thanks @artificiel ! This is really comprehensive and makes sense why the old results don't match the newer tests ( Intel vs Arm ).
If we ended up integrating PBO read back into OF without an addon I think we'd want it to default to immediate, instead of buffered as that's matching the previous functionality.
But whether to bring this addon in as a core addon, integrate PBO read back into OF or leave it as a 3rd party addon I don't have any strong opinions :)
I think it would be great to have a similar feature to ofxFastFboReader in OF Core. It is useful for any fast access, like recording videos or transmitting NDI. @arturoc proposed an ofPBO object to upload to textures also, useful for reading NDI / video players
+1 for integration; it's a pretty tight and self-contained nugget and is a fundamentally performant feature that is easy to use and does not depend on external stuff (low maintenance risk). (unless there is concern that it does not work on supported platform? "core addon" then).
I took a look at the most recent code in the issue shared by @dimitre and my take on the approach is that it's a bit too low-level (requires allocating buffers yourself, and being familiar with the different types of pbos, etc). on a user-interface level, ofxFastFboReader is great as it solves the simple/single problem of "copy this texture into these pixels as efficiently as possible" with no additional fiddling. maybe think in terms of functions (more than trying to design a "class")? maybe drop the "fast" as in the context of the whole of OF it's a bit of a strange characteristic
ofFboToPixels(&fbo, &pix);
ofPixelsToFbo(&pix, &fbo);
of course the above "API" is not a strong proposal as some allocations are required so it's not like static functions can work but the idea is to streamline usage as much as possible.
i presume an instance of a "pbo" object has to be dedicated to reading or writing? It makes sense to define 2 classes to make the operational intent explicit:
ofPBOReader::FBOtoPixels(&fbo, &pix); // ofxFastFboReader
ofPBOWriter::PixelstoFBO(&pix, &fbo); // some new thing based on PBO above
// (maybe also to/from textures?)
that removes the possibility of someone accidentally using the same instance of PBO for both directions at the same time. and it preserves the current readback/loaddata within ofFbo.
was looking at integrating ofxFastFboReader approach into ofFbo but after doing some benchmarking I found a faster approach with what we have already.
However, while trying some different approaches I found that when we do multisampling the OPENGLES code stops working.
It can be fixed by blitting and is still faster than the current approach by 5fps. But the added complexity may not be worth it.
Here is the multisample friendly version:
Proposal: I am thinking of using the current approach for multisample and the TARGET_OPENGLES approach when numSamples = 0. That way we use the existing code but get a 10fps boost for non multisampled FBO capture. However if we do implement the above approach we would get multisample read back working on iOS / Android, where I think it is currently broken.
Thoughts?