faster fbo readback - GitHub Issues

ofTheo commented 2 years ago

Was looking at integrating the ofxFastFboReader approach into ofFbo, but after some benchmarking I found a faster approach using what we already have.

On desktop the current readback for ofFbo uses ofTexture::readToPixels and gives approx 30fps for a 4096x4096 GL_RGB FBO with 0 samples:

//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint){
    if(!bIsAllocated) return;
#ifndef TARGET_OPENGLES
    getTexture(attachmentPoint).readToPixels(pixels); // <-- desktop is currently using this approach
#else

With the ofxFastFboReader addon, which has always seemed the fastest approach: approx 33fps for a 4096x4096 GL_RGB FBO with 0 samples

However, while trying some different approaches I found that using the code OF already uses for OPENGLES gives approx 40fps for a 4096x4096 GL_RGB FBO with 0 samples <-- winner by 10fps

//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint){
    if(!bIsAllocated) return;
//#ifndef TARGET_OPENGLES
//  getTexture(attachmentPoint).readToPixels(pixels);
//#else
    //this code is faster on desktop than above, but it doesn't support multisampling
    pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));
    bind();
    int format = ofGetGLFormatFromInternal(settings.internalformat);
    glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
    unbind();
//#endif
}
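One detail worth checking before moving desktop onto the glReadPixels path (an editorial aside, not something measured in this thread): glReadPixels pads each row to GL_PACK_ALIGNMENT, which defaults to 4. For a 4096-wide GL_RGB FBO the rows are already a multiple of 4 bytes, so nothing changes, but for other widths the GL row stride can differ from a tightly packed destination buffer. A minimal sketch of that arithmetic, assuming byte-sized components; the function names are hypothetical and no GL calls are made:

```cpp
#include <cassert>
#include <cstddef>

// Bytes per row that GL writes for byte-sized components, given the current
// GL_PACK_ALIGNMENT (1, 2, 4 or 8; the GL default is 4).
std::size_t packedRowStride(std::size_t width, std::size_t bytesPerPixel,
                            std::size_t packAlignment = 4) {
    assert(packAlignment == 1 || packAlignment == 2 ||
           packAlignment == 4 || packAlignment == 8);
    const std::size_t rowBytes = width * bytesPerPixel;
    // Round the row up to the next multiple of the alignment.
    return (rowBytes + packAlignment - 1) / packAlignment * packAlignment;
}

// True when GL's row layout matches a tightly packed destination buffer.
bool rowsAreTightlyPacked(std::size_t width, std::size_t bytesPerPixel,
                          std::size_t packAlignment = 4) {
    return packedRowStride(width, bytesPerPixel, packAlignment) == width * bytesPerPixel;
}
```

If the rows are not tightly packed, setting GL_PACK_ALIGNMENT to 1 with glPixelStorei before the read is the usual remedy.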

When we do multisampling the OPENGLES code stops working.

It can be fixed by blitting and is still faster than the current approach by 5fps. But the added complexity may not be worth it.

Here is the multisample friendly version:

//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint){
    if(!bIsAllocated) return;
//#ifndef TARGET_OPENGLES
//  getTexture(attachmentPoint).readToPixels(pixels);
//#else
    if(settings.numSamples > 0){

        //we need a non multisample fbo to blit the multisampled fbo to.
        if( !nonMultiSampleFbo ){
            nonMultiSampleFbo = std::make_shared<ofFbo>();
            nonMultiSampleFbo->allocate(settings.width,settings.height,settings.internalformat,0);
        }

        //blit the multisample fbo to non multisample fbo
        glBindFramebufferEXT(GL_READ_FRAMEBUFFER_EXT, fbo);
        glBindFramebufferEXT(GL_DRAW_FRAMEBUFFER_EXT, nonMultiSampleFbo->getId());

        glBlitFramebufferEXT(0, 0, settings.width,settings.height, 0, 0, settings.width,settings.height, GL_COLOR_BUFFER_BIT, GL_NEAREST);

        //do the normal approach for reading fbo to pixels with the non multisample one
        pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));

        nonMultiSampleFbo->bind();
        int format = ofGetGLFormatFromInternal(settings.internalformat);
        glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
        nonMultiSampleFbo->unbind();

    }else{
        pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));
        bind();
        int format = ofGetGLFormatFromInternal(settings.internalformat);
        glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
        unbind();
    }
//#endif
}

Proposal: I am thinking to use the current approach for multisample and the TARGET_OPENGLES approach when numSamples = 0. That way we use the existing code but get a 10fps boost for non multisampled FBO capture.

However, if we do implement the above approach we would get multisample readback working on iOS / Android, where I think it is currently broken.

Thoughts?

ofTheo commented 2 years ago

This is the proposed change:

from:

//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint) const{
    if(!bIsAllocated) return;
#ifndef TARGET_OPENGLES // <-- non OPENGLES
    getTexture(attachmentPoint).readToPixels(pixels);
#else
    pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));
    bind();
    int format = ofGetGLFormatFromInternal(settings.internalformat);
    glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
    unbind();
#endif
}

to:

//----------------------------------------------------------
void ofFbo::readToPixels(ofPixels & pixels, int attachmentPoint) const{
    if(!bIsAllocated) return;

#ifndef TARGET_OPENGLES // <-- non OPENGLES
    if(settings.numSamples>0){
        getTexture(attachmentPoint).readToPixels(pixels);
    }else{
#endif
        pixels.allocate(settings.width,settings.height,ofGetImageTypeFromGLType(settings.internalformat));
        bind();
        int format = ofGetGLFormatFromInternal(settings.internalformat);
        glReadPixels(0,0,settings.width, settings.height, format, GL_UNSIGNED_BYTE, pixels.getData());
        unbind();
#ifndef TARGET_OPENGLES
    }
#endif
}
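The branching in the proposed change can be summarized as a small decision function. This is only a sketch of the control flow: `ReadbackPath` and `chooseReadbackPath` are hypothetical names for illustration, not part of openFrameworks, and `targetOpenGLES` stands in for the `TARGET_OPENGLES` preprocessor check.

```cpp
#include <cassert>

// Hypothetical labels for the two code paths discussed in this thread.
enum class ReadbackPath {
    TextureReadToPixels, // getTexture(attachmentPoint).readToPixels(pixels)
    BindAndReadPixels    // bind(); glReadPixels(...); unbind();
};

// Mirrors the branching of the proposed ofFbo::readToPixels:
// desktop keeps the texture path only for multisampled FBOs,
// everything else goes through bind + glReadPixels.
ReadbackPath chooseReadbackPath(bool targetOpenGLES, int numSamples) {
    assert(numSamples >= 0);
    if (!targetOpenGLES && numSamples > 0) {
        return ReadbackPath::TextureReadToPixels;
    }
    return ReadbackPath::BindAndReadPixels;
}
```

Note that the OPENGLES branch is unchanged by the proposal, so a multisampled FBO on OPENGLES still takes the glReadPixels path.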

artificiel commented 1 year ago

I made a branch with the proposed change as an alternate method, and an app that tests how many downloads of a shader-redrawn GL_RGBA 1920x1080 texture a system can complete within a 1/30s frame.

results can be found here: https://github.com/artificiel/openFrameworks/tree/fbo_download/apps/devApps/FBODownloadTest

in short, the proposed method is consistently faster, more meaningfully so on x64 than arm64.

others might want to test more constrained systems/configs (rPi, windows, etc) but it seems the change is a net win. (it would be nice to see some cutting edge intel too).

however it does raise the question: why is the PBO method not the default implementation? (whenever I need to download textures I grab ofxFastFboReader without hesitation as it is systematically faster (and dramatically more so on arm64, perhaps due to the memory layout of M1/M2 chips)).

if there are backward-compatibility or platform-availability or other technical reasons why PBO might be counter-productive, couldn't there be a readToPixelsViaPBO() builtin method so it can be actively selected within the comfort of the core?

dimitre commented 3 weeks ago

I think maybe PBO was not broadly supported across different video cards? Can it be detected? That way we could use PBO as the default, and the current method as a fallback.
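On the detection question, a minimal sketch under stated assumptions: pixel buffer objects are core in desktop OpenGL from 2.1 and in OpenGL ES from 3.0, and on older desktop contexts they are advertised as GL_ARB_pixel_buffer_object. The function names below are hypothetical, and the version numbers and extension list are passed in as plain values; how they are queried (glGetString, glGetStringi) depends on the context type.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if `name` appears as a whole token in a space-separated extension list.
bool hasExtension(const std::string& extensionList, const std::string& name) {
    std::istringstream stream(extensionList);
    std::string token;
    while (stream >> token) {
        if (token == name) return true;
    }
    return false;
}

// Hypothetical capability check for PBO-based readback.
bool supportsPbo(bool isGLES, int major, int minor, const std::string& extensionList) {
    assert(major >= 1 && minor >= 0);
    if (isGLES) return major >= 3;                            // core in ES 3.0
    if (major > 2 || (major == 2 && minor >= 1)) return true; // core in GL 2.1
    return hasExtension(extensionList, "GL_ARB_pixel_buffer_object");
}
```

A readback routine could call such a check once at allocation time and fall back to the non-PBO path when it returns false.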

ofTheo commented 3 weeks ago

@artificiel weird that we get such different results for the ofxFastFboReader - when you were timing it did you have the readback in blocking or non blocking mode? As that might have affected the timing.

@dimitre - I am not 100% sure why the PBO method is not default, but I am guessing it could be fairly easily implemented - I just always suggest benchmarking changes first as sometimes our assumption of what is faster doesn't bear out in the data. :)

ofTheo commented 3 weeks ago

At the least though we could implement the suggested change and even add PBO based readback as a separate feature if available.

artificiel commented 3 weeks ago

I just re-ran the tests on an M2 machine and get similar numbers. @ofTheo not sure about "blocking or non blocking mode"? the other diff is that you worked with 4096x4096 while my test is 1920x1080. maybe the gains depend on texture size?

dimitre commented 3 weeks ago

@artificiel you mean similar numbers in both implementations or similar numbers as your previous benchmark in this readme? https://github.com/artificiel/openFrameworks/tree/fbo_download/apps/devApps/FBODownloadTest

artificiel commented 3 weeks ago

ah sorry similar as previous test!

so I gave a look to the benchmark and noticed @ofTheo was testing 4096x4096 while this is based on 1920x1080. so I tried in 4096x4096 and get different results:

4096x4096 (old, new, fastfbo)

fps-based @ 1 iter: 56fps, 93fps, 102fps [1x, 1.7x, 1.8x] 3584Mbps, 6144Mbps, 6528Mbps
iter-based @ 30fps: 2, 7, 9 [1x, 3.5x, 4.5x] 3840Mbps, 13440Mbps, 17280Mbps

the first line mostly recreates @ofTheo's test.

so it's a different "skewing" than 1920x1080, for which the equivalence is:

fps-based @ 1 iter: 139fps, 174fps, 270fps [1x, 1.25x, 1.9x] 1042Mbps, 1305Mbps, 2025Mbps
iter-based @ 30fps: 9, 15, 77 [1x, 1.66x, 8.5x] 2025Mbps, 3375Mbps, 17325Mbps

so it seems 2 things affect the readings: the size of the texture, and whether it's under pressure with multiple readbacks within a frame (this benchmark approach) vs free-running fps. also, this benchmark might be flawed in the "pressure" approach in the sense that maybe some things are optimized/elided/dropped? 17325Mbps is still a realistic bandwidth with the M2 (supposedly 25000Mbps max).

it's interesting to see both resolutions top out at similar bandwidth with the fastFBO so texture size seems irrelevant. with New method, larger textures have higher gains than smaller.
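Most of the 4096x4096 bandwidth figures above can be reproduced as frame size times reads per second, assuming 4 bytes per pixel (GL_RGBA) and a result in mebibytes per second. That is an inference from the numbers, not taken from the benchmark's source. A short sketch of the arithmetic, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// Mebibytes read back per second for a given frame size and read rate.
std::uint64_t readbackMiBPerSecond(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bytesPerPixel,
                                   std::uint64_t readsPerSecond) {
    assert(width > 0 && height > 0 && bytesPerPixel > 0);
    const std::uint64_t bytesPerRead = width * height * bytesPerPixel;
    return bytesPerRead * readsPerSecond / (1024 * 1024);
}
```

For example, a 4096x4096 RGBA frame is 64 MiB, so 56 reads per second gives 3584, and 9 iterations at 30fps (270 reads per second) gives 17280, matching the first and last figures in the 4096x4096 rows.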

dimitre commented 3 weeks ago

Great! I suppose using more than 3 buffers in ofxFastFboReader can be a good improvement also for big textures

artificiel commented 3 weeks ago

ah buffering, good question. another round, this time with macOS game mode fullscreen (so the results are 10-15% higher, but perceptibly (not measured) much more stable):

there seems to be little gain to bump buffers to 10. however (to confirm it has an effect) 1 is definitely less good. so it seems 3 buffers is a good default.

old, new, (fast1, fast3, fast10)

fps-based @ 1 iter: 
4096x4096 : fps: 86, 222, (216, 240, 242) ; Gbps: 43, 112, (108, 120, 121)
1920x1080 : fps: 238, 283, (527, 840, 840) ; Gbps: 14, 17, (32, 51, 51)

iter-based @ 30fps
4096x4096 : iters: 3, 11, (9, 12, 12)     ; Gbps: 50, 165, (140, 186, 186)
1920x1080 : iters: 11, 16, (49, 93, 96)   ; Gbps: 20, 54, (90, 174, 175)

but the results are consistent: there is a very good gain between Old and New on large textures, less dramatic on HD size (but still a gain). Fast seems to be specifically more effective under pressure where it seems to saturate the memory bus (max 200). (we notice that at 1 buffer it's a bit less performant than New, but at 3 buffers the advantage always shifts to Fast)
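The trade-off behind the buffer count can be illustrated with a CPU-only model of N-buffered readback: each frame starts a transfer into one slot and collects the result from the oldest slot, so the data returned is N-1 frames old. This is a sketch of the general ring-buffer technique, not a transcription of ofxFastFboReader; `ReadbackRing` is a hypothetical name and no GL calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each slot stands in for one PBO; the int it holds is the id of the frame
// whose transfer was started into that slot (-1 means nothing yet).
class ReadbackRing {
public:
    static constexpr int kNoFrame = -1;

    explicit ReadbackRing(std::size_t numBuffers) : slots(numBuffers, kNoFrame) {
        assert(numBuffers > 0);
    }

    // Start a transfer for `frameId`, then return the id of the oldest
    // frame in the ring, or kNoFrame while the ring is still filling up.
    int submit(int frameId) {
        slots[writeIndex] = frameId;
        writeIndex = (writeIndex + 1) % slots.size();
        // After advancing, writeIndex points at the oldest slot.
        return slots[writeIndex];
    }

private:
    std::vector<int> slots;
    std::size_t writeIndex = 0;
};
```

With 3 buffers the third call to submit returns frame 0, a latency of two frames; with 1 buffer every call returns the frame just submitted, which corresponds to an immediate readback that has to wait for the transfer.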

NickHardeman commented 3 weeks ago

@artificiel I think the "blocking" mode Theo is referring to is the async method in ofxFastFboReader. It is asynchronous by default, so it's not immediate like the proposed method. https://github.com/satoruhiga/ofxFastFboReader/blob/12b0069cc54d8496ada287c155018a771f7fc248/src/ofxFastFboReader.h#L15

artificiel commented 3 weeks ago

@NickHardeman ah! ok I checked the implementation and blocking simply ignores buffering, so in the table above that would be equivalent to fast1. so New is better than Fast1/blocking with huge textures, but Fast1/blocking is better than New with HD textures.

NickHardeman commented 3 weeks ago

@artificiel, ok, thank you for the clarification

artificiel commented 3 weeks ago

ALSO the original benchmark compared 4 machines and shows that on Intel (Mac and linux) the relationship is different and there is not such a dramatic bump with Fast.

these latest tests above are on Apple M2 that I have on the desk — perhaps the integrated memory facilitates things in some way... all this to say the impact of New over Old is more positively pronounced on Intel.

ofTheo commented 3 weeks ago

Thanks @artificiel ! This is really comprehensive and makes sense why the old results don't match the newer tests ( Intel vs Arm ).

If we ended up integrating PBO read back into OF without an addon I think we'd want it to default to immediate, instead of buffered as that's matching the previous functionality.

But whether to bring this addon in as a core addon, integrate PBO read back into OF or leave it as a 3rd party addon I don't have any strong opinions :)

dimitre commented 3 weeks ago

I think it would be great to have a feature similar to ofxFastFboReader in OF Core. It is useful for any fast access, like recording videos or transmitting NDI. @arturoc proposed an ofPBO object to upload to textures also, useful for reading NDI / video players

https://github.com/openframeworks/openFrameworks/issues/1913

artificiel commented 3 weeks ago

+1 for integration; it's a pretty tight and self-contained nugget and is a fundamentally performant feature that is easy to use and does not depend on external stuff (low maintenance risk). (unless there is concern that it does not work on supported platform? "core addon" then).

I took a look at the most recent code in the issue shared by @dimitre and my take on the approach is that it's a bit too low-level (requires allocating buffers yourself, and being familiar with the different types of pbos, etc). on a user-interface level, ofxFastFboReader is great as it solves the simple/single problem of "copy this texture into these pixels as efficiently as possible" with no additional fiddling. maybe think in terms of functions (more than trying to design a "class")? maybe drop the "fast" as in the context of the whole of OF it's a bit of a strange characteristic

ofFboToPixels(&fbo, &pix);
ofPixelsToFbo(&pix, &fbo);

artificiel commented 3 weeks ago

of course the above "API" is not a strong proposal as some allocations are required so it's not like static functions can work but the idea is to streamline usage as much as possible.

i presume an instance of a "pbo" object has to be dedicated to reading or writing? It makes sense to define 2 classes to make the operational intent explicit:

ofPBOReader::FBOtoPixels(&fbo, &pix); // ofxFastFboReader
ofPBOWriter::PixelstoFBO(&pix, &fbo); // some new thing based on PBO above
// (maybe also to/from textures?)

that removes the possibility of someone accidentally using the same instance of PBO for both directions at the same time. and it preserves the current readback/loaddata within ofFbo.
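A minimal sketch of how the two-class split could look, with the direction fixed by the type. Everything here is hypothetical: `FakeFbo` and `FakePixels` are stand-ins so the example compiles without openFrameworks, the class and method names are illustrative, and a plain copy takes the place of the actual PBO transfer.

```cpp
#include <cassert>
#include <vector>

// Stand-ins for ofFbo and ofPixels (assumptions for illustration only).
struct FakeFbo    { std::vector<unsigned char> gpuData; };
struct FakePixels { std::vector<unsigned char> data; };

// Download only: there is no method that writes to the FBO.
class PboReader {
public:
    void fboToPixels(const FakeFbo& fbo, FakePixels& pix) {
        assert(!fbo.gpuData.empty()); // the fbo must be allocated
        pix.data = fbo.gpuData;       // placeholder for the PBO download
    }
};

// Upload only: there is no method that reads from the FBO.
class PboWriter {
public:
    void pixelsToFbo(const FakePixels& pix, FakeFbo& fbo) {
        assert(!pix.data.empty()); // the pixels must be allocated
        fbo.gpuData = pix.data;    // placeholder for the PBO upload
    }
};
```

Because each class exposes a single direction and takes its source by const reference, using one instance for both directions is a compile-time error rather than a runtime surprise.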

openframeworks / openFrameworks

faster fbo readback #7111