[Package]: Requesting new versions of angle-android & virglrenderer-android.

hansm629 commented 1 month ago

Why is it worth to add this package?

@licy183

I understand that quite a bit of time has passed since the angle-android package was released last year.

With the currently released angle-android and virglrenderer-android, satisfactory compatibility or performance is not achieved on Adreno, Mali, or Xclipse GPUs.

I am requesting this because I believe that compatibility and performance will improve slightly in versions of angle-android and virglrenderer-android that reflect the latest source.

I also wonder whether glvnd-related compatibility may have been added in the latest ANGLE source.

Home page URL

https://github.com/google/angle

Source code URL

https://github.com/google/angle

Packaging policy acknowledgement

[X] I certify that I have read Termux Packaging Policy and understand that my request will be denied in case of violation.

Additional information

No response

twaik commented 1 month ago

Newer does not always mean better...

hansm629 commented 1 month ago

@twaik I know. However, I requested it because I thought there might be some improvement in the latest src.

It seems that GPU acceleration is really difficult for SoCs with GPUs other than Adreno...

Especially in the case of the Exynos 2400's Xclipse 940 (a custom 6-WGP GPU based on AMD RDNA3), it did not work properly in either the angle-android environment or the virglrenderer-android environment.

It did not work properly even in the mesa-zink + virglrenderer-mesa-zink environment...

I'm hoping to see compatibility and performance improvements in the latest commits to angle or virglrenderer.

twaik commented 1 month ago

@licy183 what do you think about that?

licy183 commented 1 month ago

angle-android and virglrenderer-android can be updated, but I'm afraid there won't be much improvement in compatibility. Compatibility depends on Android's GL. There seems to be no way to improve it unless Android's GL gains more of the extensions that vtest needs.

I don't know how much performance will be lost or gained with the latest version. More performance testing should be done...

twaik commented 1 month ago

I think I can improve performance for termux-x11 if virpipe reports the drawable ID for textures. I mean I can make virglrenderer use ANativeWindows shared with Termux:X11, and in that case we will get rid of glReadPixels and displaying the last fragment will be zero-copy. I can share details if you are in.

licy183 commented 1 month ago

virpipe is actually not GPU-based. If I understand correctly, when rendering, the GL libraries on the client side (Termux's mesa) send the commands to the vtest_server; the server then renders to an offscreen framebuffer, reads the data back and sends it to the client side, and the GL libraries on the client side then use the swrast driver to talk to the application.

Patching virpipe to work with ANativeWindow is very difficult, because it actually uses the swrast driver, which is designed for software rendering...

twaik commented 1 month ago

> Patching virpipe to work with ANativeWindow is very difficult, because it actually uses the swrast driver

I did not mean patching virpipe to work with ANativeWindows. I meant patching virpipe to report the XID of the drawable to vtest_server and to stop using memcpy to copy pixels to the memory region shared with the X server (only sending a present_pixmap event to trigger displaying the drawable on the X server side).

It is possible to make vtest_server_android draw to a SurfaceTexture-backed ANativeWindow. And the X server can (or will be able to, after some patches) get the image directly from the SurfaceTexture instead of the shared memory region.

Of course, this change will work only in the case of termux-x11, but I may try to port it to Xvfb or TigerVNC servers (only in termux, not in proot/chroot).

twaik commented 1 month ago

I'll explain why virglrenderer-android+virpipe is so bad. The texture that is going to be displayed is not rendered directly to the X11 window. It is:

1. downloaded from the GLES texture attached as a frontbuffer, via glReadPixels, to a shared memory fragment (created with ashmem, attached with mmap).
2. copied to a memory fragment shared with the X server (created with MIT-SHM, attached with shmat) using a series of simple memcpy calls (via mesa's util_copy_rect function). There is a fairly simple reason why this cannot be avoided by replacing the MIT-SHM shared memory fragment with the ashmem fragment attached directly to the X server: the ashmem fragment's image width is aligned to 64 bytes, so width does not equal stride, and xcb_shm_put_image cannot specify stride, only width.
3. copied to the window pixmap on the X server side via the pixman_blt function (NEON accelerated) during the GC's CopyArea operation, triggered by the XShmPutImage function on the X11 client side. BUT mesa's X11 sw backend does not wait for vblank, or even for image drawing to complete. It simply sends frames as fast as possible, so if the X server gets 12 XShmPutImage requests and has only one vblank during 16 ms, it will draw the image only once, and the CPU time used for copying the image the other 11 times is simply wasted.
4. ~~copied one more time during compositing image to root window pixmap.~~ The window pixmap is actually the screen pixmap, but the X server calculates intersections with other windows and copies only the actual regions (in the previous step).
5. finally displayed on the screen.
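To make the stride problem from the list above concrete, here is a minimal Python sketch. It assumes 4 bytes per pixel and that "aligned to 64 bytes" refers to the row stride; the function names are illustrative and are not taken from mesa. It shows why a padded source buffer has to be repacked row by row, roughly what a util_copy_rect-style copy does, when the consumer only accepts a width and assumes tightly packed rows.

```python
def aligned_stride(width_px, bytes_per_pixel=4, alignment=64):
    """Row size in bytes, rounded up to a multiple of `alignment` (assumed 64)."""
    row_bytes = width_px * bytes_per_pixel
    return (row_bytes + alignment - 1) // alignment * alignment


def copy_rect(src, src_stride, width_px, height, bytes_per_pixel=4):
    """Repack a padded (strided) image into tightly packed rows, one row at a time."""
    row_bytes = width_px * bytes_per_pixel
    out = bytearray(row_bytes * height)
    for y in range(height):
        start = y * src_stride
        # Only the first row_bytes of each source row carry pixels; the rest is padding.
        out[y * row_bytes:(y + 1) * row_bytes] = src[start:start + row_bytes]
    return bytes(out)
```

For a 1080-pixel-wide image, `aligned_stride(1080)` is larger than `1080 * 4`, so a consumer that assumes stride equals width would shear the picture unless each row is copied separately.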

It is a waste of resources. I think there are at least three options for achieving better performance in the virpipe + virglrenderer-android configuration.

1. Termux:X11 has a DRI3 implementation that can attach an fd while specifying width, height, offset and stride, so util_copy_rect can be avoided. The code implementing DRI3 in Termux:X11 can be easily ported to Xvfb, XWayland, TigerVNC and other Termux X servers (or even proot or chroot versions), if it is critical for other users.
2. Make virpipe wait for a vblank event before sending the next frame, so the resources needed for drawing the image on the X server side are not wasted when there was no vblank event after the last frame was drawn.
3. Make virglrenderer work with some kind of buffer queue and talk to the X server directly.
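The cost of not waiting for vblank can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: a fixed 16 ms vblank interval, and the X server showing at most one frame per interval. The function names are hypothetical.

```python
def frames_displayed(submit_times_ms, vblank_interval_ms=16.0):
    """Count frames that can reach the screen if at most one is shown per vblank interval."""
    intervals = {int(t // vblank_interval_ms) for t in submit_times_ms}
    return len(intervals)


def wasted_copies(submit_times_ms, vblank_interval_ms=16.0):
    """Frames that were copied to the X server but replaced before any vblank."""
    return len(submit_times_ms) - frames_displayed(submit_times_ms, vblank_interval_ms)


# Twelve frames pushed within a single 16 ms interval, as in the example above.
burst = [i * 16.0 / 12 for i in range(12)]
# One frame per interval, which is what waiting for vblank would give.
paced = [0.0, 16.0, 32.0]
```

With the burst input only one frame is displayed and eleven copies are wasted, matching the 12-requests-per-vblank example; with the paced input nothing is wasted.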

More details about option 3.

Currently virglrenderer cannot communicate with the X server. When the time comes to draw something in an X11 window, it simply waits for commands from virpipe to download the contents of the frontbuffer and puts them into a shared memory fragment.

My idea is to implement some kind of simple IPC so that virglrenderer communicates with the X server directly through a Unix socket (socketpair?) that can be sent to the X server, provided the client is connected to it over a Unix socket too.

In this case virglrenderer will be able to send the X server AHardwareBuffers (attached as front and back buffers), receive events about window resizing (to update the AHardwareBuffers) and vblanks, and do some other stuff. And the X server will be able to attach these AHardwareBuffers directly to the window pixmap, so it will be kind of zero-copy. I mean the virpipe+virglrenderer configuration will not need to download the image from the frontbuffer to a shared memory fragment with glReadPixels, will not need to blit the image to another shared memory fragment with util_copy_rect, and will not even need to blit the image from the pixmap drawable to the window drawable inside the X server during XShmPutImage or xcb_present_pixmap calls. The last available buffer may be picked automatically and copied to the root window pixmap during compositing, which happens at vblank time.

This solution may be the best one because it should work with proot and chroot and because it omits almost all copies except the actual compositing inside the X server.

But it will require a goddamn big amount of work from me, and it will take some time. I have not implemented anything similar before.

twaik commented 1 month ago

Probably I've got some roadmap for this.

[x] Updating virglrenderer-android to the last available stable release.
[x] Testing with active github and discord users and fixing possible incompatibilities.
[ ] Making virglrenderer-android report that it can manage connections to Termux:X11 or a compatible X server. Probably I will add a VIRGL_CAP_V2_* flag, because all 32 slots for VIRGL_BIND_* are already taken. The flags can be stored in caps->v2.capability_bits_v2. These changes can be added to vrend_renderer_fill_caps_v2 in vrend_renderer.c and somewhere in virgl_hw.h in the virglrenderer code, and to virgl_hw.h and virgl_vtest_winsys.c on the mesa/virpipe side. This way we ensure that patched versions of both mesa/virpipe and virglrenderer are in use, and avoid segfaults if only one of them is patched.
[ ] Adding new request types to both the virglrenderer (here or here) and mesa/virpipe (here or here) protocols. They will not be used if virglrenderer-android does not report them in caps, or if the mesa in proot or Termux is not patched.
[ ] Making mesa/virpipe open a duplicate XCB connection to DISPLAY. This is a problem, since we cannot simply make the program duplicate the fd and pass it to another process: the X server will consider it the same connection, and communication over this duplicated socket fd from the virglrenderer process will interfere with the mesa/virpipe-hosting process. And we cannot simply open a new connection based only on the "DISPLAY" variable, because there may be two or more DISPLAYs (see programs like x2x, which connect to two X servers simultaneously, even without DISPLAY, by reading the display from command-line args). I am considering using getpeername on the fd to extract the path of the socket, connecting to it again, and passing the fd of the new socket to the virglrenderer process. The virglrenderer process will then use xcb_connect_to_fd to obtain an XCB connection from this fd.
[ ] Making mesa/virpipe pass displaytarget-related calls (displaytarget_{create,destroy,map,unmap,display}) to the virglrenderer process using the new request types added in the earlier step, and processing them in the virglrenderer process. Still no AHardwareBuffers here. In this step we will only make virglrenderer work with X pixmaps the same way it is done in mesa/virpipe, with three basic changes: using libxcb instead of libX11; creating the pixmap using DRI3 instead of shmget+XShmCreateImage (this lets us use the stride parameter, which is vital because GLES texture data read with glReadPixels is aligned to 64 bytes and XShmPutImage does not allow specifying this, assuming stride=width); and finally drawing it with xcb_copy_area instead of XShmPutImage. That will let us avoid at least one full-screen copy with util_copy_rect/memcpy, so it will be a pretty huge improvement, saving a lot of the CPU time needed to draw a picture to the screen.
[ ] Testing with licy183, hansm629 and some other people from GitHub and Discord. The changes in steps 3 to 6 are incomplete on their own, but after step 6 they will be usable and can be merged to master (yeah, because it is a huge optimisation and improvement).
[ ] Making virglrenderer utilise Termux:X11's ability to attach AHardwareBuffers to X pixmaps. This step will be a huge optimisation too, since the glReadPixels call used to obtain the image contents is pretty slow and uses memcpy or a cross-device (GPU-to-CPU) alternative somewhere in its internals. Probably this optimisation will not be so huge on devices where the graphics accelerator chip does not share RAM with the CPU (because AHardwareBuffer_lock will use memcpy, but now in the X server process). I am considering using AHardwareBuffers with format 5 (which stands for the non-SDK AHARDWAREBUFFER_FORMAT_B8G8R8A8_UNORM; it is a native X11 pixmap format, so we will not need flipping bits or something, but probably I'll write some processing to draw it upside down with almost the same performance as regular drawing) and probably usage AHARDWAREBUFFER_USAGE_CPU_READ_RARELY | AHARDWAREBUFFER_USAGE_CPU_WRITE_NEVER (because it is written by the GPU all the time, but read by the CPU only once per X drawing operation).
[ ] Testing it with users again and probably merging to master since it is a good optimisation too.
[ ] Making virglrenderer output the pixmap using the PRESENT extension. mesa/virpipe does not wait for VBLANK or anything. It simply outputs the pixmap as fast as possible, assuming SwapInterval is always 0. Probably we will need to patch the mesa code to report SwapInterval to virpipe and make virpipe report it to virglrenderer. Probably that will bring no benefit in synthetic tests like glmark2 (only in the case where you force waiting for vblank without actually reporting it to mesa/virpipe). But it should make games consume less CPU and GPU resources (and maybe battery too).
[ ] Testing and merging again.
[ ] PROFIT?
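The getpeername idea from the roadmap can be tried out with a toy Unix socket server. The Python sketch below only illustrates the reconnect technique: the socket path `X0` and the helper names are hypothetical, a plain listener stands in for the X server, abstract-namespace sockets are not exercised, and handing the new fd to another process (and xcb_connect_to_fd itself) is not shown.

```python
import os
import socket
import tempfile


def reconnect_to_peer(existing):
    """Open a second, independent connection to the server `existing` talks to."""
    path = existing.getpeername()  # path the listening Unix socket was bound to
    if not path:
        raise ValueError("peer has no socket path to reconnect to")
    fresh = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    fresh.connect(path)
    return fresh


def demo():
    """Return (peer path, number of connections the toy server accepted)."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "X0")  # hypothetical stand-in for an X server socket
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(2)
    first = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    first.connect(path)
    second = reconnect_to_peer(first)
    peer_path = second.getpeername()
    # The server sees two separate connections rather than one shared fd.
    accepted = [server.accept()[0] for _ in range(2)]
    for sock in accepted + [first, second, server]:
        sock.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return peer_path, len(accepted)
```

Because the second socket comes from a fresh connect rather than from duplicating the first fd, the server treats it as a distinct client, which is the property the roadmap item needs.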

@licy183 @tareksander maybe you can suggest something? Do you have any additions or corrections?

twaik commented 1 month ago

Probably it would be good to make virglrenderer use AHardwareBuffers to avoid CPU swizzling of RGBA->BGRA, so we will save some CPU time.

twaik commented 4 weeks ago

@licy183 Can you please take a look? Everything seems to be fine when I use AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM buffers, but if I use buffers with format 5 (HAL_PIXEL_FORMAT_BGRA_8888) I get "Failed to complete framebuffer 0x8cd6 glmark2" (which means GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT) in vrend_set_framebuffer_state after vrend_hw_set_color_surface. Can we fix it? virglrenderer-android.tar.gz

twaik commented 4 weeks ago

Interesting thing. I checked performance on my phone:

Regular virglrenderer: 59-60.
virglrenderer+ahardwarebuffer+swizzling: 81.
virglrenderer+ahardwarebuffer: 89.
virglrenderer+ahardwarebuffer+without copying to display: 126.

Not very much, but it is a 33% performance improvement, or it will be a 50% improvement if we fix rendering for HAL_PIXEL_FORMAT_BGRA_8888. The case with AHardwareBuffers but without copying to the display is there to show how big a performance hit we get from copying.
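As a rough check of the percentages, the relative gains can be recomputed from the reported scores. This Python sketch assumes a baseline of 60 (the upper end of the reported 59-60 range); the dictionary keys are shortened labels, not names from the code.

```python
def improvement_percent(baseline, score):
    """Relative gain of `score` over `baseline`, in percent."""
    return (score / baseline - 1.0) * 100.0


BASELINE = 60  # assumption: upper end of the reported 59-60 range
SCORES = {
    "ahb + swizzling": 81,
    "ahb": 89,
    "ahb, no copy to display": 126,
}

for name, score in SCORES.items():
    print(f"{name}: {improvement_percent(BASELINE, score):.0f}% over baseline")
```

Under that assumption the first two gains come out close to the roughly 33% and 50% quoted above.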

virglrenderer-android.tar.gz

@tareksander maybe you can help to fix HAL_PIXEL_FORMAT_BGRA_8888 framebuffers?

tareksander commented 4 weeks ago

What is there to fix?

twaik commented 4 weeks ago

@tareksander https://github.com/termux/termux-packages/issues/19529#issuecomment-2063409442 Open vrend_renderer.c file and change AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM to 5. In this case virgl_test_server_android will fail with error message Failed to complete framebuffer 0x8cd6 glmark2 (0x8cd6 = GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT).

licy183 commented 4 weeks ago

> @licy183 Can you please take a look? Everything seems to be fine when I use AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM buffers, but if I use buffers with format 5 (HAL_PIXEL_FORMAT_BGRA_8888) I get "Failed to complete framebuffer 0x8cd6 glmark2" (which means GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT) in vrend_set_framebuffer_state after vrend_hw_set_color_surface. Can we fix it?

Emmm... I'll check it next weekend. I'm looking for an internship these days...

twaik commented 2 weeks ago

> Emmm... I'll check it next weekend

Hello. Any updates?

licy183 commented 2 weeks ago

> Hello. Any updates?

Sorry for my late reply. I have no idea why it happens...

twaik commented 2 weeks ago

Can you please investigate it a bit? If there is no solution, I will probably try to write a shader and some additional code for pixel swizzling and drawing to an RGBA texture, but that will add some overhead which could be avoided with BGRA textures.

licy183 commented 2 weeks ago

Emmm... According to the Android docs, the format param should be one of the AHardwareBuffer_Format values, and 5/0x5 is not among them.

https://developer.android.com/ndk/reference/struct/a-hardware-buffer-desc

Public attributes
format	uint32_t One of AHardwareBuffer_Format.

https://developer.android.google.cn/ndk/reference/group/a-hardware-buffer#ahardwarebuffer_format

AHardwareBuffer_Format{
  AHARDWAREBUFFER_FORMAT_R8G8B8A8_UNORM = 1,
  AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM = 2,
  AHARDWAREBUFFER_FORMAT_R8G8B8_UNORM = 3,
  AHARDWAREBUFFER_FORMAT_R5G6B5_UNORM = 4,
  AHARDWAREBUFFER_FORMAT_R16G16B16A16_FLOAT = 0x16,
  AHARDWAREBUFFER_FORMAT_R10G10B10A2_UNORM = 0x2b,
  AHARDWAREBUFFER_FORMAT_BLOB = 0x21,
  AHARDWAREBUFFER_FORMAT_D16_UNORM = 0x30,
  AHARDWAREBUFFER_FORMAT_D24_UNORM = 0x31,
  AHARDWAREBUFFER_FORMAT_D24_UNORM_S8_UINT = 0x32,
  AHARDWAREBUFFER_FORMAT_D32_FLOAT = 0x33,
  AHARDWAREBUFFER_FORMAT_D32_FLOAT_S8_UINT = 0x34,
  AHARDWAREBUFFER_FORMAT_S8_UINT = 0x35,
  AHARDWAREBUFFER_FORMAT_Y8Cb8Cr8_420 = 0x23,
  AHARDWAREBUFFER_FORMAT_YCbCr_P010 = 0x36,
  AHARDWAREBUFFER_FORMAT_R8_UNORM = 0x38,
  AHARDWAREBUFFER_FORMAT_R16_UINT = 0x39,
  AHARDWAREBUFFER_FORMAT_R16G16_UINT = 0x3a,
  AHARDWAREBUFFER_FORMAT_R10G10B10A10_UNORM = 0x3b
}
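The point can be checked mechanically by mirroring the listing above in a small Python enum (a sketch; the class and helper names are illustrative, and the values are copied from the listing rather than read from any header).

```python
from enum import IntEnum


class AHardwareBufferFormat(IntEnum):
    """Values copied from the documented AHardwareBuffer_Format listing above."""
    R8G8B8A8_UNORM = 1
    R8G8B8X8_UNORM = 2
    R8G8B8_UNORM = 3
    R5G6B5_UNORM = 4
    R16G16B16A16_FLOAT = 0x16
    R10G10B10A2_UNORM = 0x2B
    BLOB = 0x21
    D16_UNORM = 0x30
    D24_UNORM = 0x31
    D24_UNORM_S8_UINT = 0x32
    D32_FLOAT = 0x33
    D32_FLOAT_S8_UINT = 0x34
    S8_UINT = 0x35
    Y8Cb8Cr8_420 = 0x23
    YCbCr_P010 = 0x36
    R8_UNORM = 0x38
    R16_UINT = 0x39
    R16G16_UINT = 0x3A
    R10G10B10A10_UNORM = 0x3B


def is_documented_format(value):
    """True if `value` is one of the documented enum values."""
    return value in {fmt.value for fmt in AHardwareBufferFormat}
```

`is_documented_format(5)` is false, which matches the observation that format 5 is not part of the public enum.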

twaik commented 2 weeks ago

Yeah, but 0x5 works if it is a regular texture. I know that because it works pretty well in termux-x11. But it does not work in virglrenderer when the texture is bound to a framebuffer as a rendering target (but I may be wrong here, and I am not sure whether it is called a renderbuffer).

tareksander commented 2 weeks ago

In case 0x5 can't be made to work with virglrenderer, since you're already doing custom code in Termux:X11 to support it, couldn't you just use an R8G8B8A8_UNORM buffer? It basically just says "the contents are a 4 channel image with 8 bits per channel"; what you do with the content is up to you. Since you know X clients will draw flipped and with swapped colors, you can handle that. Ideally you would just adjust the texture sample code in the shader you use to display the buffer so the y axis is flipped and the color channels are switched. That way, all processing takes place on the GPU. I don't know how easy that is to integrate into the X rendering process though.
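For reference, this is what the suggested flip-and-swap amounts to per pixel, written as a CPU-side Python sketch. It is not shader code and not taken from Termux:X11; it assumes 4 bytes per pixel with tightly packed rows, and a shader would do the equivalent during the texture lookup instead of producing a copy.

```python
def flip_and_swap(pixels, width, height):
    """Return a copy with rows in reverse order and channels 0 and 2 swapped (RGBA <-> BGRA)."""
    row_bytes = width * 4
    if len(pixels) != row_bytes * height:
        raise ValueError("buffer size does not match width * height * 4")
    out = bytearray(len(pixels))
    for y in range(height):
        # Destination row y reads from the source row mirrored along the y axis.
        src_row = pixels[(height - 1 - y) * row_bytes:(height - y) * row_bytes]
        dst = y * row_bytes
        out[dst + 0:dst + row_bytes:4] = src_row[2::4]  # first channel <- third
        out[dst + 1:dst + row_bytes:4] = src_row[1::4]  # second channel unchanged
        out[dst + 2:dst + row_bytes:4] = src_row[0::4]  # third channel <- first
        out[dst + 3:dst + row_bytes:4] = src_row[3::4]  # alpha unchanged
    return bytes(out)
```

Applying the function twice returns the original buffer, since both the row flip and the channel swap are their own inverses.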

twaik commented 2 weeks ago

Yeah, I can add an additional shader copy that swaps colors, but it will not be a zero-copy solution like in the case with BGRA textures... I'll just do the same thing virglrenderer does on the CPU, but on the GPU side. The main idea is to avoid it altogether.

tareksander commented 2 weeks ago

The GPU side is highly efficient for these kinds of things though, it's what GPUs were made for after all. The bitshifts happen in parallel like everything else, and swapping the Y axis in a texture lookup is also trivial.

twaik commented 1 week ago

Ok, I ran glmark2 normally, with hardware buffers with and without glFinish, with and without blitting to compare performance. I've got intriguing results. Here we are.

| run | 1 | 2 | 3 | 4 | avg |
| --- | --- | --- | --- | --- | --- |
| normal run | 81 | 83 | 83 | 82 | 82.25 |
| with AHB, no glFinish, with blitting+swizzling | 117 | 111 | 101 | 101 | 107.5 |
| with AHB, no glFinish, no blitting/swizzling | 111 | 111 | 116 | 118 | 114 |
| with AHB, with glFinish, with blitting+swizzling | 71 | 72 | 74 | 73 | 72.5 |
| with AHB, with glFinish, no blitting/swizzling | 115 | 110 | 110 | 105 | 110 |
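The averages in the table can be reproduced, and compared against the normal run, with a few lines of Python. This is only a sketch: the dictionary keys are shortened labels for the table rows and the helper name is illustrative.

```python
from statistics import mean

RUNS = {
    "normal": [81, 83, 83, 82],
    "ahb, no glFinish, blit+swizzle": [117, 111, 101, 101],
    "ahb, no glFinish, no blit": [111, 111, 116, 118],
    "ahb, glFinish, blit+swizzle": [71, 72, 74, 73],
    "ahb, glFinish, no blit": [115, 110, 110, 105],
}


def summarize(runs, baseline="normal"):
    """Map each configuration to (average score, ratio to the baseline average)."""
    base = mean(runs[baseline])
    return {name: (mean(scores), mean(scores) / base) for name, scores in runs.items()}


for name, (avg, ratio) in summarize(RUNS).items():
    print(f"{name}: avg {avg}, {ratio:.2f}x normal")
```

The only configuration whose ratio falls below 1.0 is the one combining glFinish with blitting+swizzling.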

Blitting+swizzling is an additional step where the current framebuffer is blitted to the AHB with a shader, flipping BGRA to RGBA. Without it, the AHB is simply filled with black (not an active action; it is filled by the system at allocation) and output. I did it intentionally to measure the performance hit of blitting/swizzling itself.

It seems that in the cases without glFinish we get an outdated frame (like the last frame before the current iteration), but the performance is much better (rendering is done async?). And I do not know why AHB+glFinish+blitting gives the worst result (I would expect the normal run to be the worst, since it uses glReadPixels). Should I use glFinish in this case? @licy183 @tareksander I would like to hear your opinion. Sources for reference: virglrenderer-android.tar.gz

tareksander commented 1 week ago

glFinish should be a fallback. The best possible performance would come from the Android native sync EGL extension, which allows you to create a sync object with an fd, send that fd to another process and reconstruct a sync object there. That sync object would then be waited for before rendering (or the last image would be displayed until rendering is finished). That also needs X server support though. With that you'd have async rendering and the correct frame data.

Of course the additional blitting+swizzling step causes a performance hit; you're essentially writing double the memory. The question is, is it faster than doing that step on the CPU?

twaik commented 1 week ago

> That also needs X server support though.

I want to make virglrenderer support AHardwareBuffers before making it support direct writing to X server (without drawing through virpipe).

> The question is, is it faster than doing that step on the CPU?

> it's what GPUs were made for after all. The bitshifts happen in parallel like everything else

twaik commented 1 week ago

> would then be waited for before rendering (or the last image would be displayed until rendering is finished).

How can I check if rendering is finished or not? Without waiting.

tareksander commented 1 week ago

EGL can wait on a sync object, though that blocks the entire thread. There's an extension that inserts the wait into the client command stream, so you can issue further GL calls and be sure they'll happen after the sync completed. For the implementation in the X server (which renders at predefined intervals, I think?), just polling the sync object before drawing the next frame would be enough.

twaik commented 1 week ago

Do you mean invoking the poll C function without a timeout, or is there another EGL-related API for that?

licy183 commented 1 week ago

tareksander may be referring to the EGL_ANDROID_native_fence_sync API, but I don't actually know how to use it...

tareksander commented 1 week ago

> tareksander may be referring to the EGL_ANDROID_native_fence_sync API, but I don't actually know how to use it...

Exactly.

You need to create a sync object with type EGL_SYNC_NATIVE_FENCE_ANDROID and after calling glFlush the native fence fd is initialized and can be duplicated with eglDupNativeFenceFDANDROID. From that fd you can reconstruct a fence in another process by creating a fence of type EGL_SYNC_NATIVE_FENCE_ANDROID with EGL_SYNC_NATIVE_FENCE_FD_ANDROID set to the fd number.

Then you can use eglClientWaitSyncKHR or eglWaitSyncKHR to block the thread or GL command processing on the current context respectively, or check the EGL_SYNC_STATUS_KHR attribute to see if it has been signaled.
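The "check without waiting" idea can be sketched at the fd level in Python. This is only an analogy: a pipe stands in for the fence fd, and it assumes the fd returned by eglDupNativeFenceFDANDROID can be polled for readability the way Linux sync_file fds can. No EGL calls are made and the helper names are hypothetical.

```python
import os
import select


def is_signaled(fd):
    """Non-blocking check: True if `fd` is readable right now."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # A timeout of 0 makes poll() return immediately instead of waiting.
    return any(event & select.POLLIN for _, event in poller.poll(0))


def demo():
    """Return the fence state before and after the producer 'finishes rendering'."""
    read_end, write_end = os.pipe()  # stand-in for a native fence fd
    before = is_signaled(read_end)
    os.write(write_end, b"\x01")  # stand-in for the GPU signaling the fence
    after = is_signaled(read_end)
    os.close(read_end)
    os.close(write_end)
    return before, after
```

A server that draws at fixed intervals could make such a check before each frame and keep showing the previous buffer while the result is still false.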

twaik commented 1 week ago

Wait, so I need to use eglCreateSyncKHR after using glFlush? I thought the whole thing is to avoid glFlush... And at this step I still do not send sync fd to another process, I am only implementing stuff inside virglrenderer process.

tareksander commented 1 week ago

> Wait, so I need to use eglCreateSyncKHR after using glFlush? I thought the whole thing is to avoid glFlush...

No, before flush. And flush is not finish: Flush makes sure the commands are delivered to the GL implementation, finish waits until the commands have finished executing.

twaik commented 1 week ago

Yeah, you are right, I was confused with some other stuff.

twaik commented 1 week ago

So can I use waiting for sync instead of using glFinish?

tareksander commented 1 week ago

That depends on how you want the graphics pipelining: if you just want to churn out frames as fast as possible (like a game maybe would), no synchronization could be fine. If you want to ensure the most recent frame is definitely drawn, you need some synchronization. The waiting would be an optimization on the server side: the server would accept a new buffer to be displayed, use a client sync to ensure the buffer has finished being written, issue the GL commands to draw it into the compositor space, and then flush. The GPU will then wait for the buffer to be available on its own before continuing to draw, so the application and server can both continue to queue up more commands.
