Open hansm629 opened 1 month ago
Newer does not always mean better...
@twaik I know. However, I requested it because I thought there might be some improvement in the latest src.
It seems that GPU acceleration is really difficult on SoCs with anything other than an Adreno GPU.
This is especially true for the Exynos 2400 with its Xclipse 940 (a custom AMD RDNA3-based 6WGP GPU), in both the `angle-android` and the `virglrenderer-android` environments.
It did not work properly even in the `mesa-zink` + `virglrenderer-mesa-zink` environment.
I'm hoping to see compatibility and performance improvements in the latest commits to angle or virglrenderer.
@licy183 what do you think about that?
`angle-android` and `virglrenderer-android` can be updated, but I'm afraid that compatibility won't improve much. Compatibility depends on Android's GL. There seems to be no way to improve it unless Android's GL exposes more of the extensions that vtest needs.
I don't know how much performance would be lost or gained with the latest version. More performance testing is needed...
I think I can improve performance for termux-x11 if virpipe reports the drawable ID for textures. I mean I can make virglrenderer use `ANativeWindow`s shared with Termux:X11; in that case we get rid of glReadPixels, and displaying the last frame becomes zero-copy.
I can share details if you are in.
`virpipe` is actually not GPU-based. If I understand correctly, when rendering, the GL libraries on the client side (Termux's mesa) send the commands to the vtest_server; the server renders to an offscreen framebuffer, reads the data back and sends it to the client side, and then the GL libraries on the client side use the swrast driver to talk to the application.
Patching `virpipe` to work with ANativeWindow is much more difficult, because it actually uses the swrast driver, which is designed for software rendering...
> Patching virpipe to work with ANativeWindow is much difficult, because it actually uses the swrast driver
I did not mean patching `virpipe` to work with ANativeWindows. I meant patching `virpipe` to report the XID of the drawable to `vtest_server` and to stop using memcpy to copy pixels into the memory region shared with the X server (only sending a present_pixmap event to trigger displaying the drawable on the X server side).
It is possible to make `vtest_server_android` draw to a SurfaceTexture-backed ANativeWindow. And the X server can (or will be able to, after some patches) get the image directly from the SurfaceTexture instead of a shared memory region.
Of course, this change will work only in the case of termux-x11, but I may try to port it to Xvfb or TigerVNC servers (only in termux, not in proot/chroot).
I'll explain why virglrenderer-android+virpipe is so bad. The texture that is going to be displayed is not rendered directly to the X11 window. It is:
1. Read back with `glReadPixels` into a shared memory fragment (created with `ashmem`, attached with `mmap`).
2. Copied to another shared memory fragment (created with `MIT-SHM`, attached with `shmat`) using a series of simple `memcpy` calls (via mesa's `util_copy_rect` function). There is a pretty simple reason why this can not be avoided by replacing the MIT-SHM shared memory fragment with the `ashmem` fragment attached directly to the X server: the ashmem fragment's image width is aligned to 64 bytes, so width does not equal stride, and xcb_shm_put_image can not specify stride, only width.
3. Copied by the `pixman_blt` function (NEON accelerated) during the GC's `CopyArea` operation, triggered by the `XShmPutImage` function on the X11 client side. BUT mesa's X11 sw backend does not wait for vblank and does not even wait for image drawing to complete. It simply sends frames as fast as possible, so if the X server gets 12 XShmPutImage requests and has only one vblank during 16 ms, it will draw the image only once, and the CPU time used for copying the image the other 11 times is simply wasted.
I think there are at least three options to achieve better performance in the `virpipe` + `virglrenderer-android` configuration.
1. […] `util_copy_rect` can be avoided. The code implementing `DRI3` in Termux:X11 can be easily ported to Xvfb, XWayland, TigerVNC and other termux X servers (or even `proot` or `chroot` versions) (if it is critical for other users).
2. Make `virpipe` wait for a `vblank` event before sending the next frame, so that the resources needed for drawing the image on the X server side are not wasted when there was no `vblank` event since the last frame was drawn.
3. Make `virglrenderer` work with some kind of bufferqueue and talk to the X server directly.
Currently `virglrenderer` can not communicate with the X server. When the time comes to draw something in an X11 window, it simply waits for commands from `virpipe` to download the contents of the frontbuffer and puts them into a shared memory fragment.
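The vblank option from the list above can be sketched as a tiny frame pacer. Everything here is hypothetical (the struct and function names are made up for illustration), and the sketch drops frames when no vblank has arrived, whereas a real implementation would more likely block until the event comes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical frame pacer: at most one frame is submitted per vblank. */
struct frame_pacer {
    bool vblank_seen;   /* a vblank arrived since the last submitted frame */
    unsigned submitted; /* frames actually sent to the X server */
    unsigned skipped;   /* frames dropped because no vblank happened */
};

void pacer_init(struct frame_pacer *p)
{
    assert(p);
    p->vblank_seen = true; /* let the very first frame through */
    p->submitted = 0;
    p->skipped = 0;
}

/* Call when the X server reports a vblank. */
void pacer_on_vblank(struct frame_pacer *p)
{
    assert(p);
    p->vblank_seen = true;
}

/* Call when a frame is ready; returns true if it should be sent. */
bool pacer_try_submit(struct frame_pacer *p)
{
    assert(p);
    if (!p->vblank_seen) {
        p->skipped++;
        return false;
    }
    p->vblank_seen = false;
    p->submitted++;
    return true;
}
```

With 12 frames produced between two vblanks, this pacer sends one frame and skips the other 11, which is the copying work described above as wasted.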
My idea is to implement some kind of simple IPC to make `virglrenderer` communicate with the X server directly through a Unix socket (`socketpair`?) that can be sent to the X server if it is connected through a Unix socket too.
In this case `virglrenderer` will be able to send the X server `AHardwareBuffer`s (attached as front and back buffers), get events about window resizing (to update the `AHardwareBuffer`s) and vblanks, and do some other stuff. And the X server will be able to attach these `AHardwareBuffer`s directly to the window pixmap, so it will be kind of zero-copy. I mean the `virpipe`+`virglrenderer` configuration will not need to download the image from the frontbuffer to a shared memory fragment with `glReadPixels`, will not need to blit the image to another shared memory fragment with `util_copy_rect`, and will not even need to blit the image from the pixmap drawable to the window drawable inside the X server during `XShmPutImage` or `xcb_present_pixmap` calls. The last available buffer may be picked automatically and copied to the root window pixmap during compositing, which happens at `vblank` time.
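Sending a socket or buffer handle to another process over a Unix socket, as the IPC idea above requires, is normally done with `SCM_RIGHTS` ancillary data. The following is a minimal sketch of that mechanism only; `send_fd` and `recv_fd` are hypothetical helper names, and nothing here is taken from virglrenderer or Termux:X11.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a Unix-domain socket. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* one byte of ordinary data travels with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctrl, 0, sizeof ctrl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new fd, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process but refers to the same open file. For the buffers themselves the NDK already provides `AHardwareBuffer_sendHandleToUnixSocket` and `AHardwareBuffer_recvHandleFromUnixSocket`, so a helper like this would mainly be needed for other descriptors such as the X connection socket.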
This solution may be the best one because it should work with `proot` and `chroot` and because it omits almost all copy operations except the actual compositing inside the X server.
But it will require a goddamn lot of work from me, and it will take some time. I have not implemented anything similar before.
Probably I've got some roadmap for this.
1. […] `VIRGL_CAP_V2_*` flag because all 32 slots for `VIRGL_BIND_*` are already taken. They can be stored in `caps->v2.capability_bits_v2`. These changes can be added to `vrend_renderer_fill_caps_v2` in `vrend_renderer.c` and somewhere in `virgl_hw.h` in the virglrenderer code, and to `virgl_hw.h` and `virgl_vtest_winsys.c` on the mesa/virpipe side. This way we ensure that patched versions of both mesa/virpipe and virglrenderer are in use, and we avoid segfaults if only one of them is patched.
2. […] `x2x` which connects to two X servers simultaneously, even without DISPLAY, by reading DISPLAY from command line args). I am considering using `getpeername` on the fd to extract the path of the socket, connecting to it again and passing the fd of the new socket to the virglrenderer process. The virglrenderer process will then use `xcb_connect_to_fd` to obtain an XCB connection from this fd.
3. […] `displaytarget`-related calls (`displaytarget_{create,destroy,map,unmap,display}`) to the virglrenderer process using the new types of requests created in step 1, and processing them in the virglrenderer process. Still no AHardwareBuffers here. In this step we only make virglrenderer work with X pixmaps the same way it is done in mesa/virpipe, with three basic changes: using `libxcb` instead of `libX11`; creating the pixmap using dri3 instead of `shmget`+`XShmCreateImage` (this lets us use the stride parameter, which is vital because GLES texture data read with glReadPixels is aligned to 64 bytes and `XShmPutImage` does not allow specifying this, assuming stride=width); and finally drawing it with `xcb_copy_area` instead of `XShmPutImage`. That will let us avoid at least one full-screen copy with `util_copy_rect`/`memcpy`, so it will be a pretty huge improvement, saving a lot of the CPU time needed to draw a picture to the screen.
4. […] `memcpy` or a cross-device (GPU-to-CPU) alternative somewhere in its internals. This optimisation will probably not be so huge on devices where the graphics accelerator does not share RAM with the CPU (because AHardwareBuffer_lock will use `memcpy`, but now in the X server process). I am considering using `AHardwareBuffer`s with format 5 (which stands for the non-SDK `AHARDWAREBUFFER_FORMAT_B8G8R8A8_UNORM`, the native X11 pixmap format, so we will not need flipping bits or something, but I'll probably write some processing to draw it upside down with almost the same performance as regular drawing) and probably usage `AHARDWAREBUFFER_USAGE_CPU_READ_RARELY | AHARDWAREBUFFER_USAGE_CPU_WRITE_NEVER` (because it is written by the GPU all the time, but read by the CPU only once per X drawing operation).
@licy183 @tareksander maybe you can suggest something? Do you have any additions or corrections?
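The capability handshake from the first roadmap step could look roughly like the sketch below. The bit name, its value and the `example_*` types are all hypothetical stand-ins; the real definition would have to go into `virgl_hw.h` on both sides and must not collide with existing `VIRGL_CAP_V2_*` bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit, for illustration only. */
#define EXAMPLE_CAP_V2_DIRECT_DISPLAY (1u << 31)

/* Stand-in for the relevant part of the v2 caps structure. */
struct example_caps_v2 {
    uint32_t capability_bits_v2;
};

/* Renderer side: advertise the feature, as a patched
 * vrend_renderer_fill_caps_v2 might do. */
void example_fill_caps(struct example_caps_v2 *caps, bool patched)
{
    assert(caps);
    caps->capability_bits_v2 = 0;
    if (patched)
        caps->capability_bits_v2 |= EXAMPLE_CAP_V2_DIRECT_DISPLAY;
}

/* Client side: use the new request types only if the renderer
 * advertised them, otherwise fall back to the existing path. */
bool example_can_use_direct_display(const struct example_caps_v2 *caps)
{
    assert(caps);
    return (caps->capability_bits_v2 & EXAMPLE_CAP_V2_DIRECT_DISPLAY) != 0;
}
```

A patched client talking to an unpatched renderer sees the bit cleared and keeps using the old `glReadPixels` path, which is the mismatch protection the roadmap asks for.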
Probably it would be good to make virglrenderer use ahardwarebuffers to avoid cpu swizzling rgba->bgra so we will save some cpu time.
@licy183 Can you please take a look?
Everything seems to be fine when I use AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM buffers, but if I use buffers with format 5 (HAL_PIXEL_FORMAT_BGRA_8888) I get "Failed to complete framebuffer 0x8cd6 glmark2" (which means GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT) in `vrend_set_framebuffer_state` after `vrend_hw_set_color_surface`. Can we fix it?
virglrenderer-android.tar.gz
Interesting thing. I checked performance on my phone and: Regular virglrenderer: 59-60. virglrenderer+ahardwarebuffer+swizzling: 81. virglrenderer+ahardwarebuffer: 89. virglrenderer+ahardwarebuffer+without copying to display: 126.
Not very much, but it is about a 35% performance improvement, or it would be close to 50% if we fix rendering for `HAL_PIXEL_FORMAT_BGRA_8888`.
The case with ahardwarebuffers but without copying to display is needed to see how much performance hit we get with copying.
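The percentages can be checked against the scores listed above with a small helper. Treating the "59-60" baseline as 60 is an assumption made here to keep the arithmetic simple.

```c
#include <assert.h>

/* Relative improvement of `score` over `baseline`, in percent. */
double improvement_percent(double baseline, double score)
{
    assert(baseline > 0.0);
    return (score - baseline) / baseline * 100.0;
}
```

With a baseline of 60, a score of 81 is a 35% improvement, 89 is about 48%, and 126 (no copying to the display at all) is 110%.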
@tareksander maybe you can help to fix `HAL_PIXEL_FORMAT_BGRA_8888` framebuffers?
What is there to fix?
@tareksander https://github.com/termux/termux-packages/issues/19529#issuecomment-2063409442
Open the vrend_renderer.c file and change `AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM` to `5`. In this case virgl_test_server_android will fail with the error message `Failed to complete framebuffer 0x8cd6 glmark2` (0x8cd6 = `GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT`).
> @licy183 Can you please take a look? Everything seems to be fine when I use AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM buffers, but if I use buffers with format 5 (HAL_PIXEL_FORMAT_BGRA_8888) I get "Failed to complete framebuffer 0x8cd6 glmark2" (which means GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT) in `vrend_set_framebuffer_state` after `vrend_hw_set_color_surface`. Can we fix it?
Emmm... I'll check it next weekend. I'm looking for an internship these days...
> Emmm... I'll check it next weekend
Hello. Any updates?
> Hello. Any updates?
Sorry for my late reply. I have no idea about why it happens...
Can you please investigate it a bit? If there is no solution, I will probably try to write a shader and some additional code that swizzles the pixels and draws them to an RGBA texture, but that will add some overhead which could be avoided with BGRA textures.
Emmm... According to the Android docs, the `format` param should be a value of `AHardwareBuffer_Format`, which doesn't include 5/`0x5`.
https://developer.android.com/ndk/reference/struct/a-hardware-buffer-desc
Public attributes | |
---|---|
format | uint32_t One of AHardwareBuffer_Format. |
https://developer.android.google.cn/ndk/reference/group/a-hardware-buffer#ahardwarebuffer_format
AHardwareBuffer_Format{
AHARDWAREBUFFER_FORMAT_R8G8B8A8_UNORM = 1,
AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM = 2,
AHARDWAREBUFFER_FORMAT_R8G8B8_UNORM = 3,
AHARDWAREBUFFER_FORMAT_R5G6B5_UNORM = 4,
AHARDWAREBUFFER_FORMAT_R16G16B16A16_FLOAT = 0x16,
AHARDWAREBUFFER_FORMAT_R10G10B10A2_UNORM = 0x2b,
AHARDWAREBUFFER_FORMAT_BLOB = 0x21,
AHARDWAREBUFFER_FORMAT_D16_UNORM = 0x30,
AHARDWAREBUFFER_FORMAT_D24_UNORM = 0x31,
AHARDWAREBUFFER_FORMAT_D24_UNORM_S8_UINT = 0x32,
AHARDWAREBUFFER_FORMAT_D32_FLOAT = 0x33,
AHARDWAREBUFFER_FORMAT_D32_FLOAT_S8_UINT = 0x34,
AHARDWAREBUFFER_FORMAT_S8_UINT = 0x35,
AHARDWAREBUFFER_FORMAT_Y8Cb8Cr8_420 = 0x23,
AHARDWAREBUFFER_FORMAT_YCbCr_P010 = 0x36,
AHARDWAREBUFFER_FORMAT_R8_UNORM = 0x38,
AHARDWAREBUFFER_FORMAT_R16_UINT = 0x39,
AHARDWAREBUFFER_FORMAT_R16G16_UINT = 0x3a,
AHARDWAREBUFFER_FORMAT_R10G10B10A10_UNORM = 0x3b
}
Yeah, but 0x5 works when it is a regular texture. I know that because it works pretty well in termux-x11. It does not work in virglrenderer when the texture is bound to a framebuffer as a rendering target (but I may be wrong here, and I am not sure whether that is called a renderbuffer).
In case 0x5 can't be made to work with virglrenderer, since you're already doing custom code in Termux:X11 to support it, couldn't you just use an `R8G8B8A8_UNORM` buffer? It basically just says "the contents are a 4 channel image with 8 bits per channel"; what you do with the content is up to you. Since you know X clients will draw flipped and with swapped colors, you can handle that. Ideally you would just adjust the texture sampling code in the shader you use to display the buffer so that the y axis is flipped and the color channels are switched. That way, all processing takes place on the GPU. I don't know how easy that is to integrate into the X rendering process though.
Yeah, I can add an extra shader pass that copies with swapped colors, but it will not be a zero-copy solution like in the case with BGRA textures... I'll just do the same thing virglrenderer does on the CPU, but on the GPU side. The main idea is to avoid it altogether.
The GPU side is highly efficient for these kinds of things though, it's what GPUs were made for after all. The bitshifts happen in parallel like everything else, and swapping the Y axis in a texture lookup is also trivial.
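For reference, the per-pixel work being discussed (channel swap plus vertical flip) looks like this when done on the CPU. This is an illustrative sketch, not the code virglrenderer uses; `swizzle_flip` is a hypothetical name, and it assumes 4 bytes per pixel and non-overlapping buffers with the same stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the first and third byte of each 4-byte pixel (RGBA <-> BGRA)
 * and flip the image vertically while copying from src to dst. */
void swizzle_flip(uint8_t *dst, const uint8_t *src,
                  size_t width, size_t height, size_t stride)
{
    assert(stride >= width * 4);
    for (size_t y = 0; y < height; y++) {
        const uint8_t *s = src + y * stride;
        uint8_t *d = dst + (height - 1 - y) * stride;
        for (size_t x = 0; x < width; x++) {
            d[4 * x + 0] = s[4 * x + 2];
            d[4 * x + 1] = s[4 * x + 1];
            d[4 * x + 2] = s[4 * x + 0];
            d[4 * x + 3] = s[4 * x + 3];
        }
    }
}
```

A fragment shader does the same per texel by sampling at a flipped y coordinate and reordering the channels, with every pixel handled in parallel, but it still reads and writes each pixel once, which is the extra pass the zero-copy approach tries to avoid.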
Ok, I ran glmark2 normally, with hardware buffers with and without glFinish, with and without blitting to compare performance. I've got intriguing results. Here we are.
run | 1 | 2 | 3 | 4 | avg |
---|---|---|---|---|---|
normal run | 81 | 83 | 83 | 82 | 82.25 |
with AHB, no glFinish, with blitting+swizzling | 117 | 111 | 101 | 101 | 107.5 |
with AHB, no glFinish, no blitting/swizzling | 111 | 111 | 116 | 118 | 114 |
with AHB, with glFinish, with blitting+swizzling | 71 | 72 | 74 | 73 | 72.5 |
with AHB, with glFinish, no blitting/swizzling | 115 | 110 | 110 | 105 | 110 |
Blitting+swizzling is an additional step where the current framebuffer is blitted to the AHB with a shader, converting BGRA to RGBA. Without it the AHB is simply left black (not an active action; it is filled by the system at allocation) and output as is. I did that intentionally to measure the performance hit of blitting/swizzling itself.
It seems that in the cases without glFinish we get an outdated frame (like the last frame before the current iteration), but the performance is much better (rendering is done async?). And I do not know why AHB+glFinish+blitting gives the worst result (I would expect that of the normal run, since it uses glReadPixels). Should I use glFinish in this case? @licy183 @tareksander I would like to hear your opinion. Sources for reference: virglrenderer-android.tar.gz
glFinish should be a fallback. The best performance possible would be with the Android native sync EGL extension, which allows you to create a sync object with an fd, send that fd to another process and reconstruct a sync object there. That sync object would then be waited for before rendering (or the last image would be displayed until rendering is finished). That also needs X server support though. With that you'd have async rendering and the correct frame data.
Of course the additional blitting+swizzling step makes a performance hit, you're essentially writing double the memory. The question is, is it faster than doing that step on the CPU?
> That also needs X server support though.
I want to make virglrenderer support AHardwareBuffers before making it support direct writing to X server (without drawing through virpipe).
> The question is, is it faster than doing that step on the CPU?
> it's what GPUs were made for after all. The bitshifts happen in parallel like everything else
> would then be waited for before rendering (or the last image would be displayed until rendering is finished).
How can I check if rendering is finished or not? Without waiting.
EGL can wait on a sync object, though that blocks the entire thread. There's an extension that inserts the wait into the client command stream, so you can issue further GL calls and be sure they'll happen after the sync has completed. For an implementation in the X server (which renders at predefined intervals, I think?), just polling the sync object before drawing the next frame would be enough.
Do you mean invoking the `poll` C function without a timeout, or is there another EGL-related API for that?
tareksander may be referring this API EGL_ANDROID_native_fence_sync, but I don't actually know how to use this...
> tareksander may be referring this API EGL_ANDROID_native_fence_sync, but I don't actually know how to use this...
Exactly.
You need to create a sync object with type `EGL_SYNC_NATIVE_FENCE_ANDROID`, and after calling `glFlush` the native fence fd is initialized and can be duplicated with `eglDupNativeFenceFDANDROID`. From that fd you can reconstruct a fence in another process by creating a fence of type `EGL_SYNC_NATIVE_FENCE_ANDROID` with `EGL_SYNC_NATIVE_FENCE_FD_ANDROID` set to the fd number.
Then you can use `eglClientWaitSyncKHR` or `eglWaitSyncKHR` to block the thread or the GL command processing on the current context respectively, or check the `EGL_SYNC_STATUS_KHR` attribute to see if it has been signaled.
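On the earlier question about checking for completion without waiting: besides reading `EGL_SYNC_STATUS_KHR`, a zero-timeout `poll` on the duplicated fd may also work. The sketch below assumes the native fence fd behaves like a Linux sync file descriptor that reports `POLLIN` once the fence has signaled; that assumption is not confirmed anywhere in this thread. `fd_signaled_now` is a hypothetical helper name.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h> /* pipe(), write(): handy for trying this out without a GPU */

/* Returns 1 if fd is ready (signaled), 0 if not yet, -1 on error.
 * Never blocks, because the poll timeout is zero. */
int fd_signaled_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int r;

    assert(fd >= 0);
    r = poll(&pfd, 1, 0);
    if (r < 0)
        return -1;
    if (r == 0)
        return 0;
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

The read end of a pipe can stand in for a fence when experimenting: it reports 0 before anything is written and 1 afterwards. An X server could call a check like this once per frame and keep showing the previous buffer while it returns 0.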
Wait, so I need to use eglCreateSyncKHR after using `glFlush`? I thought the whole thing is to avoid glFlush...
And at this step I still do not send sync fd to another process, I am only implementing stuff inside virglrenderer process.
> Wait, so I need to use eglCreateSyncKHR after using `glFlush`? I thought the whole thing is to avoid glFlush...
No, before flush. And flush is not finish: Flush makes sure the commands are delivered to the GL implementation, finish waits until the commands have finished executing.
Yeah, you are right, I was confused with some other stuff.
So can I use waiting for sync instead of using glFinish?
That depends on how you want the graphics pipelining to work. If you just want to churn out frames as fast as possible (like a game maybe would), no synchronization could be fine. If you want to ensure the most recent frame is definitely drawn, you need some synchronization. The waiting would be an optimization on the server side: the server would accept a new buffer to be displayed, use a client sync to ensure the buffer has been fully written, issue the GL commands to draw it into the compositor space, and then flush. The GPU will then wait on its own for the buffer to become available before continuing to draw, so the application and the server can both continue to queue up more commands.
Why is it worth to add this package?
@licy183
I understand that quite a bit of time has passed since the angle-android package was released last year.
With the currently released angle-android & virglrenderer-android, satisfactory compatibility or performance is not achieved on any of the Adreno, Mali, or Xclipse GPUs.
I am requesting this because I believe that compatibility and performance will improve slightly with angle-android & virglrenderer-android builds that reflect the latest source.
I also wonder whether glvnd-related compatibility may have been added in the latest ANGLE source.
Home page URL
https://github.com/google/angle
Source code URL
https://github.com/google/angle