lvgl / lvgl

Embedded graphics library to create beautiful UIs for any MCU, MPU and display type.
https://lvgl.io
MIT License
16.2k stars 3.18k forks source link

[Feature] Seperate Decoder List for Every GPU #4382

Closed XuNeo closed 11 months ago

XuNeo commented 1 year ago

Introduce the problem

Current decoders are organized in a global list including bin, png, gif etc. For GPU, not only the decoded image needs to meet some requirements(stride, address alignment etc.), but GPU also needs to register special decoders(ETC etc.).

Examples and cases

Suggested solution

@terry0012 suggests to add a decoder list to every GPU(sw renderer included), so when GPU needs to draw an image, it firstly decodes it from supported decoders. @FASTSHIFT implemented this idea and code looks cleaner.

kisvegabor commented 1 year ago

I'm trying to understand what the problem is with using the global decoder list:

Note that #4241 will unify the memory alignment and stride of all buffers related to drawing.

In your suggestion the the decoder list are stored in the lv_draw_unit_t?

XuNeo commented 1 year ago

It can be a valid use case as the output of the 2 PNG decoders might not be compatible.

I thinks that's the use case. #4241 is a solution, that we take all reasonable limitations of GPU into consideration. Strides width, memory alignement, and memory flags.

@FASTSHIFT @terry0012 Could you comment more? Will #4241 be enough?

XuNeo commented 1 year ago

There is one limitation still not mentioned, but looks like we don't need seperated decoder list.

Premultiply RGB with alpha after decoding. vglite for example, needs it. One possible solution is:

Current software rendere does't support premultiplied image. We can simply disable it just like tiled image and treat it as GPU only format.

Do you think it's doable? Does premultiplied image benefit software rendere as well?

kisvegabor commented 1 year ago

It can be useful in SW decoder as well, as we can save a multiplication during blending. I need to check if we can add it without too much code duplication.

To indicate if an image premultipled I'd would add

LV_COLOR_FORMAT_RGB565A8_PREMULTIPLIED,
LV_COLOR_FORMAT_ARGB8888_PREMULTIPLIED

What do you think?

kisvegabor commented 1 year ago

If you agree with adding LV_COLOR_FORMAT_..._PREMULTIPLIED I can add these and also update SW rendering.

For testing could you send some images as C array in premultiplied format for both ARGB8888 and RGB565A8?

XuNeo commented 1 year ago

Sorry for the late reply. I was thinking adding an extra uint32_t in image header for flags including:

- premultiplied: 1;
- tiled: 1; /*Image in tiled mode or linear mode*/
- rlecompressed: 1; /*Image data is RLE compressed*/
- reserved: 13; /*Reserved for lvgl use*/
- user_flags: 16; /*lvgl won't use it, use it for any purpose you want*/

Premultiply could also apply to LV_COLOR_FORMAT_I1/2/4/8.

For testing could you send some images as C array in premultiplied format for both ARGB8888 and RGB565A8?

I'm devoloping a new image tool for lvgl. I can add premultiplied image support quickly once decided.

https://github.com/lvgl/lvgl/compare/master...XuNeo:lvgl:lvgl-image-tool?expand=1

PeterBee97 commented 1 year ago

Hi Gabor, I made some slight modification to the original lvgl image converter and put the output premultiplied files here.

kisvegabor commented 1 year ago

Thank you @PeterBee97! I'm still fighting with the mem align and stride update, but I'll try this out right after that.

@XuNeo

I'm devoloping a new image tool for lvgl. I can add premultiplied image support quickly once decided.

Amazing! :rocket: Please don't forget to add an option to set the stride.
Having the converter in Python is fine for now, but in the end we need to make it in JS to allow hosting it on https://lvgl.io/tools/imageconverter

I was thinking adding an extra uint32_t in image header for flags including:

I'm not sure it's the best way to handle this. rlecompressed and tiled are not as standard things as premultiplied. So supporting some RLE compression in SW render doesn't mean that the same RLE compression works for a GPU too.

What I recommend is reserving several LV_COLOR_FORMAT_...s for custom formats. This way you can do

#define LV_COLOR_FORMAT_MY_SPECIAL_RLE     LV_COLOR_FORMAT_CUSTOM_1 

And your GPU will immediately see that this format only for it, and not for SW or any other generic decoder.

What do you think?

XuNeo commented 1 year ago

rlecompressed and tiled are not as standard things as premultiplied.

We can remove tiled from lvgl since we haven't seen the benefits now(potential faster rotation speed).

I think RLE can make lvgl much more friendly for low end MCUs by reducing flash size dramatically. See statistics here. RLE compress method can vary between GPUs, but the one lvgl software renderer uses won't need to be supported by other GPUs. For now, we only see Renesas uses it.

The suggestion to adding one additional uint32_t comes from the fact that those bits can combinate with cf. The method you mentioned will ends with more cfs to be defined:

LV_IMG_CF_RGBA8888_PREMULTIPLIED,
LV_IMG_CF_RGB565A8_PREMULTIPLIED,
LV_IMG_CF_INDEXED_1BIT_PREMULTIPLIED,
LV_IMG_CF_INDEXED_2BIT_PREMULTIPLIED,
LV_IMG_CF_INDEXED_4BIT_PREMULTIPLIED,
LV_IMG_CF_INDEXED_8BIT_PREMULTIPLIED,

If you also agree to add RLE, then cfs will double again.

Beside, for flags like tiled, we need a bit reserved for user. Adding it to image header looks more unified.

kisvegabor commented 1 year ago

I also agree with supporting RLE but it seems more like an image format (similar to PNG) and not a color format. So we can develop a simple lvgl-rle format, e.g.

XuNeo commented 1 year ago

I also agree with supporting RLE but it seems more like an image format (similar to PNG)

That can work.

In v8, there's decoder callback read_line, RLE can be used to compress everyline, and decompress on read. I'm not sure if it will be removed. The decoder I impemented in v8 doesn't support compress by line, but it's doable.

For tiled etc. flags, where do you think is better to place? Will lv_img_decoder_dsc_t work?

kisvegabor commented 1 year ago

In v8, there's decoder callback read_line, RLE can be used to compress everyline, and decompress on read. I'm not sure if it will be removed. The decoder I impemented in v8 doesn't support compress by line, but it's doable.

I think it'd be useful to keep it, but read_line will be replaced by get_area to have a more generic API. So it can get multiple lines, tiles or whatever the decoder wants.

For tiled etc. flags, where do you think is better to place? Will lv_img_decoder_dsc_t work?

Tiling is also not a color format, but rather an image format (how the pixels are stored). As tiling can be also different in GPUs and standards, I think we cannot add generic support for it. In the feat/draw_buf branch (related to mem. align and stride) I already updated lv_img_header_t. It's 6 bytes now, but obviously we can make it 8 bytes and some bytes for custom formats. But as it will be specific for a given GPU or decoder I'd keep it generic and up to the user how to handle it. E.g.

typedef struct {
    uint8_t always_zero;
    uint8_t cf;          /*Color format: See `lv_color_format_t`*/
    uint8_t format;
    uint8_t reserved;
    uint16_t w;        /*Width of the image*/
    uint16_t h;        /*Height of the image*/
} lv_img_header_t;

Actually in format we can store some standard formats too, e.g. LV_IMG_FORMAT_PNG/JPG/BMP/GIF etc.

XuNeo commented 1 year ago

Tiling is also not a color format, but rather an image format (how the pixels are stored).

Exactly. The same is true for premultiplied. So using an image format flags is better than IMAGE_FORMAT enum I think. Because an image can be both tiled and premultiplied.

typedef struct {
    uint8_t always_zero;
    uint8_t cf;          /*Color format: See `lv_color_format_t`*/
    //uint8_t format;
    uint8_t premultiplied:1;
    uint8_t reserved: 7;
    //uint8_t reserved;
    uint8_t userflag; /* can be tiled:1, reserved:7 for some GPU */
    uint16_t w;        /*Width of the image*/
    uint16_t h;        /*Height of the image*/
} lv_img_header_t;

reserved is not same as userflag, so I also suggest add specific field for user and reserved field for future lvgl use.

Actually in format we can store some standard formats too, e.g. LV_IMG_FORMAT_PNG/JPG/BMP/GIF etc.

I can't see the benefits now. For those images (file or c array), it doesn't have lv_img_header_t.

Edit: user --> userflag

kisvegabor commented 1 year ago

If INDEXED is a color format, I think premultiplied can be considered a color format too. Compared to tiled, premultiplied is standard (there is only one way to do premultiplication) and affects only a few color formats.

To put it differently: I think premultiplied is a color format not an image format is because it's local to a single pixel. While tiled or RLE describes the whole image.

In the practice when pass a buffer to a function (e.g. sw_transform) we also need to tell its color format. Is we stored pre-multipled as a new bit we needed to have a new parameter for the "modifiers".

So I like the idea of having LV_COLOR_FORMAT_..._PREMULTIPLIED.

Actually in format we can store some standard formats too, e.g. LV_IMG_FORMAT_PNG/JPG/BMP/GIF etc.

I can't see the benefits now. For those images (file or c array), it doesn't have lv_img_header_t.

E.g. for PNG as array we need to have lv_img_dsc_t. See here .

reserved is not same as userflag, so I also suggest add specific field for user and reserved field for future lvgl use.

True, so my updated proposal:

typedef struct {
    uint8_t always_zero;
    uint8_t cf;           /*Color format: See `lv_color_format_t`*/
    uint8_t format;       /*Raw pixels, LVGL-RLE, PNG, JPG, Tiled, etc*/
    uint8_t user_flags;   /* can be tiled:1, reserved:7 for some GPU */
    uint16_t width;       /*Width of the image*/
    uint16_t height;      /*Height of the image*/
} lv_img_header_t;
XuNeo commented 1 year ago

Hi Gabor,

My thought is to keep always_zero field same as v8. The bit length used for format and user can be tweaked.

typedef struct {
    uint32_t cf : 5;          /*Color format: See `lv_color_format_t`*/
    uint32_t always_zero : 3; /*It the upper bits of the first byte. Always zero to look like a
                                 non-printable character*/

    uint32_t format:8;        /*Image format? To be defined by LVGL*/
    uint32_t user:8;
    uint32_t reserved: 8;   /*Reserved to be used later*/

    uint32_t w: 16;
    uint32_t h: 16;
} lv_img_header2_t;
} lv_img_header_t;
kisvegabor commented 1 year ago

The image header is updated here : 4fe593363

kisvegabor commented 1 year ago

get_area is added in https://github.com/lvgl/lvgl/tree/feat/img-decoder-get-area See lv_bmp as an example. I'm not perfectly happy with it yet, but you can see the idea.

lvgl-bot commented 1 year ago

We need some feedback on this issue.

Now we mark this as "stale" because there was no activity here for 14 days.

Remove the "stale" label or comment else this will be closed in 7 days.

XuNeo commented 1 year ago

Not stale.

Image basic info is firstly inspected before dispatching to draw units. Now the issue becomes when to decode the image and should we use the same decoder list for all draw units. How to use cache etc.

Need more time, since cache hasn't be added back to software draw unit.

lvgl-bot commented 1 year ago

We need some feedback on this issue.

Now we mark this as "stale" because there was no activity here for 14 days.

Remove the "stale" label or comment else this will be closed in 7 days.

FASTSHIFT commented 1 year ago

Can we implement serial decoding of decoders? The PNG decoder is responsible for processing the PNG file into an ARGB8888 image. The GPU decoder then processes the ARGB8888 image into a memory and stride-aligned image (and releases the memory requested by the PNG decoder), and then processes the image. Perform alpha premultiplication, and the final cache only retains images processed by the GPU decoder.

FASTSHIFT commented 1 year ago

Can we refer to the design of FFmpeg's AVFilterGraph, which takes into account complex serial connections. We can imagine the decoder and cache as an object that can be input and output, and can be flexibly connected together.

kisvegabor commented 1 year ago

Are the "stride adjuster" and the "alpha premultiplication" are real decoders? Can't we say they are just helpers functions?

E.g. in my_gpu_draw_img

  1. we can get the cache entry of the image
    • if it wasn't cached (a simple image as a pixel map is not cached, but read from flash directly) create a cache entry now and also allocate memory for the image data in RAM
    • if it was cached resize the cache entry (no such API now but would be possible to add)
  2. call a stride_realign() or an alpha premultiplication function.

There is one catch though. Now we can find the images in the cache by a pointer (usually an lv_image_dsc_t) or string (usually file name). Let say it was a file, so in the cache it was stored by name. What do we do in my_gpu_draw_img ?

Check if "filename" is cached? If yes, how do we know if it's the output of the PNG decoder or our modified data?
lv_cache_find has a param1 and a param2 parameter. I though we can use them to store the frame_id and the color (for A8) but let's say in param1 we could store an ID like STRIDE_ALIGNED and try finding the cache with param1 = STRIDE_ALIGNED. So it would look like this:

IMO it might work, but instead of param1 and param2 we need a more flexible way to differentiate cache entries as in reality we can't use param1 for this as it might be used for frame_id. But it's an other questions.

In principle do you agree with this approach, or you can imagined is differently?

FASTSHIFT commented 1 year ago

@kisvegabor Do you mean to do post-processing in the image cache? Is it more appropriate to implement post-processing logic at the lv_draw_buf layer?

FASTSHIFT commented 1 year ago

We can implement the lv_draw_buf_post_processing method, and users can adjust buf parameters and content themselves. Users can call lv_draw_buf_post_processing every time before drawing in their own draw unit, and the internal implementation of lv_draw_buf_post_processing can use flags to mark the processing status of buf. If it has been processed, it can be returned directly. This solution will not affect the image cache logic, because lv_draw_buf only cares about memory and not about src. Both parties can concentrate on their own affairs.

kisvegabor commented 1 year ago

What if we handle this post processing in the decoders by having a global lv_image_decoder_post_process function which is called after any decoder_open?

FASTSHIFT commented 1 year ago

What if we handle this post processing in the decoders by having a global lv_image_decoder_post_process function which is called after any decoder_open?

I agree with this.

kisvegabor commented 1 year ago

I'm still trying to wrap my head around this feature and wondering if it's enough to have a global lv_image_decoder_open_post_process callback which is called after all decoding?

Similarly we might need an lv_image_decoder_open_pre_process callback too, to be sure we return the right entry from the cache. E.g. a simple pixel map image doesn't check the cache at all, so the processed image won't be used by default.

FASTSHIFT commented 1 year ago

I'm still trying to wrap my head around this feature and wondering if it's enough to have a global lv_image_decoder_open_post_process callback which is called after all decoding?

Similarly we might need an lv_image_decoder_open_pre_process callback too, to be sure we return the right entry from the cache. E.g. a simple pixel map image doesn't check the cache at all, so the processed image won't be used by default.

We can use flags in the lv_draw_buf_t to record and mark the processing status.

kisvegabor commented 12 months ago

Just to be sure I understand the intention: the goal is to allow a platform specific customization on the decoded data, right?

FASTSHIFT commented 12 months ago

Just to be sure I understand the intention: the goal is to allow a platform specific customization on the decoded data, right?

Yes, the goal is to have all decoded images meet the alignment and premultiplication requirements of the GPU.

kisvegabor commented 12 months ago

In this case a global callback should really do the job. Do you agree?

FASTSHIFT commented 12 months ago

In this case a global callback should really do the job. Do you agree?

Yes, I think global callback is enough for the task.

kisvegabor commented 12 months ago

Great, could you send a PR?

FASTSHIFT commented 11 months ago

Great, could you send a PR?

No problem. Over here: https://github.com/lvgl/lvgl/pull/4636 I've added a process_flag here to record the processing state to avoid duplicate processing, what do you think?

lvgl-bot commented 11 months ago

We need some feedback on this issue.

Now we mark this as "stale" because there was no activity here for 14 days.

Remove the "stale" label or comment else this will be closed in 7 days.

kisvegabor commented 11 months ago

4636 is merged so I close this issue. Thank you!