w3c / largest-contentful-paint

Specification for the LargestContentfulPaint API
https://w3c.github.io/largest-contentful-paint/

Reward progressive images #68

Closed jonsneyers closed 3 years ago

jonsneyers commented 3 years ago

If the LCP is an image, it doesn't really make sense to give the same score to a 100 kB AVIF, a 100 kB WebP, and a 100 kB progressive JPEG. The AVIF will only be shown after 100 kB has arrived (it gets rendered only when all data is available), while you'll see most of the image after 90 kB of the WebP (since it is rendered sequentially/incrementally), or after 70 kB of the JPEG.

Instead of defining LCP as the time when the image has fully loaded, it could be defined as the time when the image is rendered approximately, i.e. within some visual distance threshold of the final image.

The threshold should be high enough that a tiny, extremely blurry placeholder doesn't count as the LCP, but low enough to allow, say, an image that only misses some high-frequency chroma least significant bits, such as the final passes of a default mozjpeg-encoded JPEG.

yoavweiss commented 3 years ago

That makes sense at a high level, but it would be interesting to dig into the details and see how we can define something like that in terms that are:

eeeps commented 3 years ago

Exciting!

If we have/had:

  1. A format-agnostic way to determine the effective resolution of a partially painted file
  2. The layout size of the image

...it would be nice to be able to determine when "enough" resolution had been loaded in terms of image density (rendered image pixels ÷ painted CSS pixels). This way, the notion of "enough" works nicely across layout sizes and doesn't depend on final image density. After one minute of experimentation I'll throw out an initial, completely subjective value for "enough" of between 0.05x and 0.1x.
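
As a rough, purely illustrative sketch of that check (assuming, as in the next comment, that density is a linear per-dimension ratio and that the loaded resolution of the partial file is known; the names here are hypothetical):

# Hedged sketch: has "enough" resolution loaded, measured as image density
# (rendered image pixels per painted CSS pixel)? The 0.05x-0.1x value is the
# subjective threshold suggested above.
def enough_resolution_loaded(loaded_width_px: int, layout_width_css_px: int,
                             threshold: float = 0.1) -> bool:
    density = loaded_width_px / layout_width_css_px
    return density >= threshold

# e.g. a 1:8 preview of a 1600px-wide image laid out at 800 CSS px:
print(enough_resolution_loaded(1600 // 8, 800))  # 0.25x -> True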

jonsneyers commented 3 years ago

I would propose to define "enough" resolution as 1:8 (so 0.125x), where you look at the maximum of the layout dimensions and the intrinsic image dimensions.

So if the image gets upscaled, you still need more than the 1:8 of the image itself. If it doesn't get rescaled, then it's 1:8 of the image itself. If it gets downscaled, it's still 1:8 of the image itself - in theory if the image is twice as wide as the render width, a 1:16 version of the image would be enough already to get a 1:8 preview, but I don't think we should reward oversized images, which is why I would propose to still require 1:8 of the image itself in the oversized image case.

The format-agnostic definition could be something like this: LCP is the event when the image is rendered with sufficient precision such that at the maximum of the layout dimensions and intrinsic image dimensions, when comparing a 1:8 downscale of the final image to a 1:8 downscale of the rendered preview, the PSNR is above 30 dB (or whatever other simple-to-define difference threshold).
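
Purely as an illustration, here is a rough Python sketch of that format-agnostic check (assuming Pillow and NumPy and plain RGB PSNR; not meant as a normative algorithm):

# Hedged sketch: compare a 1:8 downscale of the rendered preview against a 1:8
# downscale of the final image and require PSNR above 30 dB.
import numpy as np
from PIL import Image

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def preview_counts_as_lcp(preview: Image.Image, final: Image.Image,
                          layout_w: int, layout_h: int,
                          factor: int = 8, threshold_db: float = 30.0) -> bool:
    # Reference size: the maximum of the layout and intrinsic dimensions.
    ref_w = max(layout_w, final.width)
    ref_h = max(layout_h, final.height)
    target = (max(1, ref_w // factor), max(1, ref_h // factor))
    a = np.asarray(preview.convert('RGB').resize(target, Image.LANCZOS))
    b = np.asarray(final.convert('RGB').resize(target, Image.LANCZOS))
    return psnr(a, b) > threshold_db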

Format-specific versions of this are easy to implement, at least in the no-upscaling case:

addyosmani commented 3 years ago

+1. Ensuring the approach we land on is format agnostic feels key. I like the two proposed requirements, establishing effective resolution and layout size from only the partial image, as a starting point.

Jon's exercise in suggesting a 1:8 resolution as "good enough" also resonates, and mapping this requirement to popular and emerging image formats is exactly the type of approach I'd like to see, so we aren't constraining the solution to a particular subset of formats.

kornelski commented 3 years ago

+1 to defining good enough as 1:8th of the resolution. It's simple enough to be applicable to future formats. It's very important that it perfectly matches JPEG's DC-only rendering, because progressive JPEG is going to be the most popular format for this case for a while.

vsekhar commented 3 years ago

I'd love for this to be somewhat user-defined or at least user-validated.

Short of a user study, we could perhaps run a known image set through a recent-ish classification model at full resolution and at various subsampling ratios to see how far we can go with minimal AUC loss, corresponding to our intent that LCP is the point at which the user "gets" what the image is just as well as if they saw the full-resolution image.

The risk is this test might trigger model-specific quirks that don't reflect human perception, possibly mitigated by using a few different models.

yoavweiss commented 3 years ago

Indeed! Personally, I doubt 1/8th will provide a level that's good enough. I understand why it would be easy to implement, but don't think this should be a major consideration.

jonsneyers commented 3 years ago

If 1:8 is not considered 'good enough', you could also go for 1:4 or 1:2. In JPEG, that roughly corresponds to progressive scans where DCT coefficients 0-5 or 0-14 are available.

What is 'good enough' will depend on the kind of image: if the most important part of the image is text in a small, thin font, then you'll probably need the full 1:1 image to read it. If it's a high-res photo, then probably the 1:8 image is enough to understand what the image is depicting (though of course to fully enjoy it, you'll need more detail). If it's a thumbnail photo, then probably 1:8 is not enough.

So perhaps the detail required should be linked to the area of the image: a full viewport width image may be usable with 1:8, while if the LCP happens to be just a 20vw wide image, maybe 1:2 is needed.

kornelski commented 3 years ago

I'm afraid that the diversity of use-cases for images is just way too broad for any single cut-off point to be unarguably good enough for everyone.

For example, images with small text in them may need to be 1:1 resolution, otherwise the small text will be a mushed mess. A shop that sells subtly patterned fabrics may require 100% quality and nothing less.

But then there are lots of websites with stock-photo-like hero images, with images that merely need to provide a matching background color, or news sites that have a photo from a press conference or a shot of people on the street. In such photos you only need the general idea of the subject.

Or if a user is browsing an online shop looking for orange socks, then even a single blue pixel preview is fully functional in telling them it's not the product they're looking for.

There are a couple more circumstances to keep in mind:

jonsneyers commented 3 years ago

Is there a way to get a representative sample of LCP images, in order to manually estimate if they require 1:1, 1:2, 1:4, 1:8, 1:16, 1:32 to be 'enough' to use the page?

My gut feeling is that for the bulk of the images, 1:8 will be OK to convey the main visual message, with exceptions where more detail is needed or where less detail is still OK.

anniesullie commented 3 years ago

To get a representative sample, I think the HTTP Archive is the best source.

I queried the lighthouse table, which outputs the LCP along with an HTML snippet, to get all the sources from the <img> tags' src attributes:

-- BigQuery (HTTP Archive): pull the LCP element's <img> src out of each Lighthouse report
SELECT
  RTRIM(url, '/') as origin,
  REGEXP_EXTRACT(JSON_EXTRACT(report, "$.audits.largest-contentful-paint-element.details.items[0].node.snippet"),
                 r'src=\\\"([^\\]*)') AS src
FROM `httparchive.lighthouse.2020_10_01_mobile`
WHERE 
  REGEXP_EXTRACT(JSON_EXTRACT(report, "$.audits.largest-contentful-paint-element.details.items[0].node.snippet"),
                 r'src=\\\"([^\\]*)') is not null;

Not all the urls have an image LCP, but I did get 3 million results. Here it is in CSV format. Note that I didn't attempt to correct for the fact that some src values are relative, but it should be pretty easy to combine the origin and the relative paths.

jonsneyers commented 3 years ago

Thanks, @anniesullie ! I'm taking a look at the csv, and the shorter urls can be used, but the longer ones are truncated and end with three dots, like this: https://cpnonline.co.uk/wp-content/uploads/2020/10/Audrey-Courant-of-Ducke…

But I'll just look at the shorter urls, there's no way I'm going to look at 3 million results anyway :)

jonsneyers commented 3 years ago

I quickly hacked something to take a look at what a 1:8 preview looks like for these images. This is what I did:

#!/bin/bash
# Quick hack: read (origin, src) pairs from the HTTP Archive CSV and emit
# paginated HTML comparing each original image to a 1:8 preview.

page=0
count=0
IFS=","
while read -a l; do
    origin=${l[0]}
    src=${l[1]}
    # Skip the CSV header and the truncated URLs (ending in "…").
    if [[ $origin == origin ]]; then continue; fi
    if [[ $src == *… ]]; then continue; fi
    # Start a new page every 20 images.
    if [[ $count -eq 0 ]]; then
        page=$(( page + 1 ))
        echo "<html><body><a href='foo$((page - 1)).html'>previous page</a> | <a href='foo$((page + 1)).html'>next page</a><table>" > foo$page.html
    fi
    # Resolve relative src values against the origin.
    if [[ $src == http* ]]; then
        img=$src
    else
        img=$origin/$src
    fi
    # Original image next to a 1:8 (w_0.125) preview fetched via Cloudinary.
    echo "<tr><td><img alt='original' src='$img' width='500'></td>" >> foo$page.html
    echo "<td><img alt='1:8' src='https://res.cloudinary.com/jon/image/fetch/w_0.125,f_auto,q_auto/$img' width='500'></td></tr>" >> foo$page.html
    count=$(( count + 1 ))
    if [[ $count -eq 20 ]]; then
        count=0
        echo "</table></body></html>" >> foo$page.html
    fi
done < mob*.csv

and the output of that, you can see here: http://sneyers.info/foo/foo1.html (I only uploaded the first 5000 pages or so, that should be more than enough)

The script skipped the truncated urls, but still there are some that don't work for whatever reasons. Just to get an idea though, this should be good enough.

Note that both the original and the 1:8 image get scaled to an arbitrary width of 500 css pixels, which can be larger or smaller than how it is on the actual page, so that's not perfect but still, it gives an idea. Also the browser upsampling of the 1:8 is probably not as nice as the state of the art DC upsampling methods.

jonsneyers commented 3 years ago

After going through the first 25 pages of this, I think the following heuristic makes sense: only consider 1:8 "good enough" when both intrinsic image dimensions are at least 300 pixels.

This ensures that the 1:8 image is at least 37 pixels in both dimensions.

300 pixels is of course an arbitrary threshold and can be replaced by some other number. Or perhaps a better heuristic would be to look at total (intrinsic) pixels, corrected for aspect ratio where less-square images need to be larger before the "1:8 is good enough" rule holds. In particular, when you have a very wide – or occasionally, very tall – image (e.g. 600x80), it tends to contain text and then the 1:8 version (eg. 75x10) will not be 'good enough'. But then again it also happens that you have a very wide photographic image where the 1:8 version is in fact OK, even if the image is only 200 or 300 pixels high.

So maybe something like this: let the intrinsic image dimensions be W and H, then

   corrected_area = W*H * sqrt(W>H ? H/W : W/H)
   if (corrected_area > 100000) progressive_LCP_OK = true
   else progressive_LCP_OK = false

For example, a 100x1000 image needs to be full resolution before LCP is counted (since the corrected area is only 31k = 100k sqrt(0.1)), and so would a 200x1000 image (corrected area is 89k = 200k sqrt(0.2)), but a 300x1000 image would be considered OK at 1:8 (corrected area is 164k = 300k * sqrt(0.3)).
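
A small script (a sketch only; the 100k threshold is the one proposed above) reproduces these numbers:

# Hedged sketch of the corrected-area heuristic described above.
from math import sqrt

def progressive_lcp_ok(w: int, h: int, threshold: float = 100_000) -> bool:
    # corrected_area = W*H * sqrt(min/max aspect); 1:8 is "good enough" only
    # when it exceeds the threshold.
    corrected_area = w * h * sqrt(min(w, h) / max(w, h))
    return corrected_area > threshold

for w, h in [(100, 1000), (200, 1000), (300, 1000), (317, 317),
             (1000, 220), (300, 400), (280, 500), (250, 600)]:
    print(w, h, progressive_lcp_ok(w, h))
# Only 300x1000, 317x317, 1000x220, 300x400 and 280x500 pass.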

With this heuristic, the smallest image where 1:8 would be considered OK would be a square with dimensions ⌈sqrt(100k)⌉, so a 317 x 317 image. A 1000x220 image would also (just barely) be OK, and so would a 300x400 image or a 280x500 image, but e.g. a 250x600 image would not be OK. So it's not very different from the "minimum 300 pixels on both dimensions" heuristic, but it does allow a dimension to go a bit below 300 if the total number of pixels gets large enough. But it does not just look at the total area, because then you might get cases like 100x1200 pixels which are probably not OK at 1:8 even though they have a lot of pixels.

Here's a visualization of potential dimension heuristics, where the x and y axis are the image width and height (up to 1000 pixels), and green means "1:8 is OK" while red means "need full image". On the left is the heuristic "both dimensions have to be at least 300", in the middle is "area has to be at least 100k pixels", on the right is "corrected area has to be at least 100k pixels".


jyrkialakuijala commented 3 years ago

This feels like a good direction to me. Could you ask Moritz to give you the same upsampling algorithm that is used in libjpeg-turbo? I believe the results are going to feel substantially better, as it is not going to produce as much gridding as bicubic. (In your example it might be just bilinear scaling, which looks quite a lot worse than even bicubic.)

jonsneyers commented 3 years ago

In my examples you're seeing browser upscaling from a 1:8 downscaled image with lossy compression, so that's quite a bit worse than what upsampled DC can look like. But it would probably be close to what you'd see if e.g. an AVIF contains a 1:8 preview...

npm1 commented 3 years ago

@jonsneyers thanks for the detailed investigation! IMO the images on the 1:8 side (right side) of http://sneyers.info/foo/foo1.html look too blurry. The idea here is that the image should look ready, so I think we should consider a later stage, where the image is almost indistinguishable from when it's fully loaded. WDYT?

jonsneyers commented 3 years ago

If the idea is that the image should look ready, almost indistinguishable from fully loaded, then I think something more like 1:2 (or maybe 1:4) would be needed.

Here's a comparison to get an idea: every row is original, 1:2, 1:4, 1:8 http://sneyers.info/foo/bar/foo1.html

Note that there's quite a big difference between "image looks ready" and "image is rendered with enough detail to start using the page". Obviously, unless horribly oversized images are used, a 1:8 image should not "look ready". But in many cases it will be "good enough" to start using the page.

If you go for a relatively strict "image looks ready" definition, i.e. something like 1:2, then that will mean that progressive JPEG will have a big advantage over other image formats – e.g. with AVIF you could embed a preview image to get some kind of progression (as @jakearchibald suggested), but if this embedded preview has to be good enough to be accurate relative to a 1:2 downscale, then it will not be much smaller than the image itself, making it kind of pointless.

It would make sense to require more detail to be available as the image gets smaller, instead of having a hard cutoff between "1:1 is needed" and "1:8 is OK". And also to take DPR into account and require less detail as the density goes up.

So with these refinements, here is my next iteration of a proposal for an updated LCP definition that properly rewards progressive previews, in a format-agnostic yet implementation-friendly and current-formats-compatible way:

Proposed definition:

If the LCP involves an image, then LCP is the event when the image gets rendered with sufficient precision such that, at the maximum of the layout dimensions and the intrinsic image dimensions, comparing a 1:N downscale of the final image to a 1:N downscale of the rendered preview gives a PSNR above 30 dB. N is defined as follows: let corrected_area = W * H * sqrt(W>H ? H/W : W/H) * dpr, where W and H are the intrinsic image dimensions and dpr is the device pixel ratio. Then

eeeps commented 3 years ago

@npm1 I think "almost indistinguishable" is too high a bar; "usable" is better. LCP rewards font-display: swap, which ≅ progressive text rendering; if there is value there, there is value in rewarding image rendering that lets users see and comprehend a "rough sketch" of image content before it looks complete.

As @kornelski points out, "comprehension" is context-and-content dependent, and probably impossible for LCP to capture. But doing something imperfect to reward progressive image rendering >> doing nothing, IMO.

jyrkialakuijala commented 3 years ago

Moritz proposed to use djpeg (from libjpeg-turbo 2.1) and truncated files for looking at interpolation:

For the progressive JPEG swiss_flag.jpg:

# Find the offset of the second SOS (FF DA) marker, i.e. the end of the first
# (typically DC-only) scan, truncate the file there, and decode with djpeg.
NUM_BYTES=`hexdump -v -e "1/1 \"%_ad %02x, \"" swiss_flag.jpg | grep -oE '[0-9]+ ff, [0-9]+ da,' | sed -n 2p | cut -d " " -f1`
head -c $NUM_BYTES swiss_flag.jpg > swiss_flag${NUM_BYTES}.jpg
djpeg -outfile swiss_flag.ppm swiss_flag${NUM_BYTES}.jpg

This is what would happen in reality and is substantially sharper and without the artefacts shown in Jon's earlier simulation. Note that it is necessary to have libjpeg-turbo 2.1 for this to work properly.

jonsneyers commented 3 years ago

Updated/tweaked the script quite a bit, new script is here: http://sneyers.info/foo/bar/baz/make_html Results are here: http://sneyers.info/foo/bar/baz/foo1.html

What you see is the original image on the left, with its dimensions and the LCP calculation (assuming dpr 1 and that the image dimensions are not smaller than the layout dimensions, i.e. no upscaling). Then there's an avif preview at the minimum scale required. For the 1:8 case, I've also included what happens if I convert the original image to a progressive JPEG (using jpegtran if it's a jpeg which is usually the case, or imagemagick if it's something else) and use the latest libjpeg-turbo to simulate the DC-only progressive preview.

In my previous proposed definition, I said PSNR above 30 dB, but that threshold is too low, especially if it's going to be upsampled 8x. Using low-quality previews at 1:8 resulted in quite poor results after upsampling, especially compared to what JPEG DC upsampling can give (since the DC is basically lossless modulo color conversion). So I bumped up the threshold to 32+N, which probably makes more sense. Still need to define what colorspace/weights to use for PSNR, but the general idea is that the preview needs to have more precision as the 'acceptable downscale factor' N goes up.

Proposed definition:

If the LCP involves an image, then LCP is the event when the image gets rendered with sufficient precision such that, at the maximum of the layout dimensions and the intrinsic image dimensions, comparing a 1:N downscale of the final image to a 1:N downscale of the rendered preview gives a PSNR above 32+N dB. N is defined as follows: let corrected_area = W * H * sqrt(W>H ? H/W : W/H) * dpr, where W and H are the intrinsic image dimensions and dpr is the device pixel ratio. Then

jonsneyers commented 3 years ago

After looking at more examples and thinking a bit more about it, I have some further thoughts.

A lossy preview at 1:8 that gets upscaled does not look as good as the upsampled JPEG DC – as any compression artifacts get magnified 8x, it's not quite the same as upsampling from an effectively lossless 1:8 like a JPEG DC.

Conceptually, PSNR thresholds are reasonable, but if you need to compare the preview with the final image in order to retroactively trigger the LCP event, it's perhaps not ideal from an implementation point of view.

So perhaps it makes more sense to just assume that a sufficiently high-resolution preview is 'good enough', without actually checking that the preview is within some error threshold of the final image. I think that simplifies things quite a bit. But I would distinguish between an 'intrinsic' preview (like the DC or a progressive scan of a JPEG), which is part of the actual image, and an 'independent' preview (like a preview embedded in Exif or in the avif container), which is not part of the actual image and which is (most likely) lossy so upscaling it will result in magnified artifacts. So it could be wise to require e.g. a 1:4 embedded avif preview in cases when the 1:8 upsampled-DC preview is just barely good enough – especially if for practical/implementation reasons, checking the accuracy of the preview is not actually going to happen.

To be on the safe side, I'm also bumping up the threshold for saying 1:8 is 'good enough' from a corrected_area of 400k pixels to one of 600k pixels, and the one for 1:16 from 2 megapixels to 3 megapixels.

Here is another bunch of random samples of LCP images (original on the left) where 1:8 would be called 'good enough' according to the definition below, with on the right a simulation of what the DC-only preview would look like: http://sneyers.info/qux/foo1.html

Proposed definition:

If the LCP involves an image, then LCP is the event when the image, or a preview of the image, gets rendered with sufficient resolution to obtain (at least) a 1:N downscaled version of the image (w.r.t. the maximum of the layout dimensions and intrinsic image dimensions). N is defined as follows: let corrected_area = W * H * sqrt(W>H ? H/W : W/H) * dpr * independent, where W and H are the intrinsic image dimensions, dpr is the device pixel ratio, and independent equals 0.5 if the preview is independent of the final image and 1 if the preview is intrinsically linked to the final image. Then

gunta commented 3 years ago

This is the single most important issue that will definitely decide the future of who wins the image-format wars.

By logical thinking, I agree that the measure should account for Image size, Layout size, and DPI.

But also, and more importantly for perceptual quality correctness:

Without knowing the user environment where it is being rendered, Image size and Layout size may mean very little, since the image could be presented in:

  1. A small browser window on a big monitor
  2. A small high dpi display smartphone (like iPhone 12)
  3. A small low dpi display Android feature phone
  4. A big 4K TV display looked at from far away

The blockiness experienced in each of these may vary a lot more than expected.

There is a hard limit on the data you can possibly know from the User-Agent.

However, having any thresholds may mislead and motivate developers to optimize for those thresholds. For example, if the image were a 280px progressive JPEG and the threshold were 300px, developers might end up upscaling the image offline to 301px in order to get better scores. 😅 We definitely do not want to reward this.

So, it is very important to not only think of the synthetic part of the problem but also of how it is experienced in real life and how developers might end up using (or abusing) the score.

Always prefer KISS metrics over complex metrics for the web. After all, this is the reason why LCP won over FMP (First Meaningful Paint): developers couldn't figure out what to optimize.

Three Proposed solutions

1. Largest Contentful Paint Minimum Viable Preview (LCP-MVP) metric

KISS enough. Just use a single value to define an "enough resolution". Can be used instead of the current LCP.

From Jon's samples 1:4 looks like a good balance between "text is almost already readable" and "enough so that when the user starts glaring at it, it's gonna become 1:2 in the next milliseconds where it will be already perfectly readable by then".

2. Add a new Largest Contentful Paint Start (LCP-S) metric

Keep LCP as-is. Instead, add a new metric so that the developer knows exactly when the element starts painting, and thus how much time is spent from the first-pixel paint to the full paint. By knowing this, the developer will have a real value at hand in order to decide the tradeoff for each particular use case. This is important since it means that not only file size, but also progressiveness and paint/decode CPU cost, are taken into account and can now be acted upon.

Alternative naming: Time to First Largest Contentful Paint (TTF-LCP)

3. Largest Contentful Paint Mean (LCP-M) metric

Measure at different time points and give the average value. Can/Should be used instead of the current LCP.

For example:

  1. 50 ms: Element ready but nothing painted
  2. 100 ms: Image is 1:8 loaded... (Proposed LCP-S)
  3. 200 ms: Image is 1:4 loaded... (Proposed LCP-MVP)
  4. 300 ms: Image is 1:2 loaded...
  5. 400 ms: Image is fully loaded (Current LCP)

The mean here would be (100+200+300+400)/4.

This means that the proposed LCP-M is 250 ms, which is a pretty easy-to-understand value.

The positive side of this is that it motivates developers to think, get creative, and optimize more.

Now, instead of either just optimizing for small file size (current LCP) or just plainly using progressive images, the developer will get better scores for including early optimizations such as LQIP, SQIP, or BlurHash, which improve the UX more than just using plain images.

Example

I've made a WebPageTest-like example here to illustrate the three proposals' concepts

[image: HowImagesRenderLCPNotEnough2x]

We need to step back a bit from the technical details and first discuss what we want developers to optimize for and what we consider to be "good UX", since the tradeoffs to decide here are very dependent on network speed.

jonsneyers commented 3 years ago

Thanks for the very insightful comments, @gunta !

Splitting it up into LCP-Start / LCP-MVP / LCP-End makes a lot of sense to me. That concept also applies to other (non-image) types of LCP; e.g. for text, LCP-MVP could be when the text is rendered in a fallback font and LCP-End when it is rendered in the actual correct font.

LCP-Mean seems a bit redundant if you already have LCP-Start, LCP-MVP and LCP-End. Just taking some weighted average of those three is probably good enough. If we want to have a single LCP metric to optimize for, it would be that, then, imo. For this weighted sum, I would suggest weights that emphasize LCP-MVP but still take the other two (LCP-Start and LCP-End) sufficiently into account to make it worthwhile to also optimize those.

Defining what is 'good enough' for LCP-Start may be a bit tricky – I think it should be better than just a solid color or a two-color gradient, but something like a WebP 2 triangulation or a blurhash or a 1:16 / 1:32 preview should be good enough. A 1:8 preview should certainly be good enough for this. I'm not sure how to define it in a way that can be measured though. Maybe just assume that any first paint besides just setting the background to a solid color or gradient is an LCP-Start? (that's rather easy to abuse though by just doing the same placeholder thing for every image)

Putting LCP-MVP at 1:4 makes sense to me – and I agree with your suggestion (in the illustration) to count that as either 1:4 resolution of the full image, or the top 3/4 of the image loaded in the case of sequential formats like WebP (or sequential JPEG, I assume).

gunta commented 3 years ago

Definitely.

How to define LCP-Mean

In the end, it will depend on how easy it is to implement, but if LCP-Mean is going to be the new LCP, it could be defined as one of the following (a small sketch of the first three follows the list):

  1. LCP-Mean Simple: (Start + End)/2. Easy to understand. Easy to implement. Not very fair and easy to cheat, but better than the current LCP.

  2. LCP-Mean MVP-included: (Start + MVP + End)/3. Not extremely hard to understand. MVP might be hard to implement depending on the MVP definition.

  3. LCP-Mean MVP-weighted: Start * 25% + MVP * 50% + End * 25%. Same as MVP-included but weighted toward MVP. Better priorities. Might be the best way theoretically. It may not be fully understood by developers if the MVP definition is too complex. Need to watch also that the implementation doesn't get complex.

  4. LCP-Mean Interval-included: (Start + a shot every 100 ms + End)/n. Easy to understand. The implementation may be compute-time costly. May not get consistent results on each run.
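
A small sketch of the first three variants (times in milliseconds; the formulas and weights are exactly the ones listed above, the example timings are only illustrative):

# Hedged sketch of the LCP-Mean variants listed above (Start/MVP/End in ms).
def lcp_mean_simple(start: float, end: float) -> float:
    return (start + end) / 2

def lcp_mean_mvp_included(start: float, mvp: float, end: float) -> float:
    return (start + mvp + end) / 3

def lcp_mean_mvp_weighted(start: float, mvp: float, end: float) -> float:
    return 0.25 * start + 0.50 * mvp + 0.25 * end

# Example: Start = 100 ms, MVP = 200 ms, End = 400 ms
print(lcp_mean_simple(100, 400))             # 250.0
print(lcp_mean_mvp_included(100, 200, 400))  # ~233.3
print(lcp_mean_mvp_weighted(100, 200, 400))  # 225.0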

How to define LCP-Start

For LCP-Start, yes, it is hard to define, so first we need to draw the lines of how much cheating we allow.

We have 2 main personas here:

  1. Regular web developers. Might set the background color/gradient in CSS. Normal use case. It can be easily detected by the User-Agent.

  2. Cheaters. Might set a random placeholder by using canvas or another replacement method. This is hard to detect and may start an unwanted rat race.

I'd say we could cater to Regular web developers first, and improve the measurement method over time if cheating becomes something normal.

The definition then could be something along the lines of:

  1. First pixel painted
  2. First 1% of the pixels painted
  3. First 1:32 of the preview painted
  4. First left-top maximally perceptually-distinct color 30% divergence limit of the first n pixels painted

jyrkialakuijala commented 3 years ago

Talking about cheating with, say, a fully white preview image artificially reducing the LCP:

This kind of cheating is not possible with progressive JPEG and progressive JPEG XL. A cheater has no way to emit data that will not also dominate the appearance of the final image.

jyrkialakuijala commented 3 years ago

All image formats can be combined with blur-hash-like approaches, where a very-low-resolution preview is thrown away when the actual image rendering starts. My thinking is that blur hash-like things are more for comfort than for utility, and my belief is that LCP tries to measure the point where the user can start digesting the information on the web page.

jyrkialakuijala commented 3 years ago

My personal preference (based on my own experimentation): with large images (and we are talking about LCP, where L means largest), I'm not a great fan of a ~50 byte blur hash, nor of a ~200 byte preview. However, when we are able to invest 2000 bytes in a preview, there can already be some utility in it, as main elements start to be consistently recognisable, while 50-200 byte images need to be blurred so much that they are exclusively for comfort in my viewpoint.

jonsneyers commented 3 years ago

I'm in favor of @gunta's option 3: LCP-Mean MVP-weighted: (Start * 25% + MVP * 50% + End * 25%)

Perhaps the weight of the Start should be less than that of the End though. I agree with @jyrkialakuijala that Start is more about comfort than about utility, while MVP and End are about utility. Something like (Start * 15% + MVP * 50% + End * 35%) might be better.

Both LCP-Start and LCP-MVP are tricky to define. Especially with LCP-Start, I think it's important to distinguish an arbitrary first paint (e.g. a background color / gradient / default placeholder that is the same for every image on the page) from a content-related first paint (something like an image-related blurhash, or a 1:32 preview image). This is not really a matter of catching cheating; it's just to distinguish two cases that naturally happen without conscious cheating.

For LCP-MVP, for now I don't think we need to worry too much about cheating. In principle you could cheaply add a 1:4 white dummy preview to an avif just to lower the LCP-MVP, but that would be something you specifically maliciously do to fool the metric – we can worry about that if and when it happens (there would be simple heuristics to block such cheating, like a suspiciously small bitstream size of the compressed preview compared to the full image bitstream size). It's not something that is currently happening, afaik.

Here's an attempt at reasonably simple and implementable definitions:

Proposed progressive LCP definition for images

LCP-Start

When the first paint happens on the element, the time and a downscaled copy of the painted element, at just 3x3 pixels (9 pixels), are saved. When the LCP-End event happens (image fully loaded), the saved 3x3 image is compared with the final image downscaled to 3x3 pixels; if for these 9 pixels the mean RGB difference per pixel is below 30%, then LCP-Start is retroactively triggered at the time of the first paint. If the pixels don't meet this matching criterion, the first paint doesn't count and LCP-Start coincides with LCP-MVP.

LCP-MVP

When a paint happens with at least 1:4 effective resolution (w.r.t. the maximum of intrinsic image dimensions and layout dimensions), then LCP-MVP is triggered. When at least 75% of the pixels are rendered as in the final image, LCP-MVP is also triggered. When a preview with minimum dimensions of 128 pixels (for both width and height) is rendered, then LCP-MVP is also triggered (so for a 1 megapixel image, 1:8 is enough). If there is no paint that satisfies these criteria, LCP-MVP coincides with LCP-End.

LCP-End

When the image is fully decoded and painted, LCP-End is triggered.

LCP

LCP is defined as a weighted sum of the above three times: LCP = (LCP-Start * 15% + LCP-MVP * 50% + LCP-End * 35%).
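
As a non-normative sketch of how this could look (assuming the 3x3 downscales are available as RGB uint8 arrays; all names are illustrative):

# Hedged sketch of the proposed definition above.
import numpy as np

def first_paint_counts(first_3x3: np.ndarray, final_3x3: np.ndarray,
                       max_mean_diff: float = 0.30) -> bool:
    # Mean RGB difference per pixel, as a fraction of the full 0-255 range.
    diff = np.abs(first_3x3.astype(float) - final_3x3.astype(float)) / 255.0
    return diff.mean() < max_mean_diff

def weighted_lcp(lcp_start: float, lcp_mvp: float, lcp_end: float,
                 weights=(0.15, 0.50, 0.35)) -> float:
    w_start, w_mvp, w_end = weights
    return w_start * lcp_start + w_mvp * lcp_mvp + w_end * lcp_end

# If the first paint doesn't match closely enough, LCP-Start falls back to
# LCP-MVP:
#   lcp_start = t_first_paint if first_paint_counts(saved_3x3, final_3x3) else lcp_mvp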

Non-image elements

The above approach can be extended to other content element types:

gunta commented 3 years ago

My thinking is that blur hash-like things are more for comfort than for utility, and my belief is that LCP tries to measure the point where the user can start digesting the information on the web page.

BlurHash-like methods are for User Experience and not for Utility; however, this depends on the context. A large, simple background image might be OK being blurred out, since in that case experience is more important than the information itself.

Even JPEG progressive decoding: each technique has its place and purpose, so let's define them clearly first. This will help us define the proper LCP weighted-sum ratio later on.

| Purpose | Meaning | Desired Effect |
| --- | --- | --- |
| Prevent Flash | These techniques are to prevent unpleasant Flash of Invisible Content | Accessibility: Prevent Seizures, Perceptually Faster Interaction |
| Show Progress | Show progress so that the user knows that something is happening | Manage User Loading Expectation, Provide Feedback |
| Preview Content | Show low-quality partial info so the user can tell if it's useful | Manage User Content Expectation, Perceptually Faster Interaction |
| Provide Joy | Do something interesting to keep the user engaged while waiting | Increase Engagement |

All these techniques improve UX in a different way. However, all of them will have a big impact on Abandonment.

For instance, we can be sure that the designers at Medium chose to implement an LQIP method to improve their engagement.

[image: PerceptualSpeedMethods]

So as we can see, some methods have multiple purposes. Should we reward Baseline and Progressive equally? Of course not.

Progressive should be rewarded the most since it has 3 different purposes: Prevents Flash, Shows Progress, and Previews Content.

We should reward all things that improve UX because, by the LCP definition, it's a user-centric metric for measuring perceived load speed.

Given the key questions for user-centric performance metrics, we can categorize the methods to reward and the new metrics as follows:

| Question | Action to Reward | Metric |
| --- | --- | --- |
| Is it happening? Has the server responded? | Show Progress | LCP-Start |
| Is it useful? Has enough content rendered that users can engage? | Preview Content | LCP-MVP, LCP |
| Is it usable? Can users interact with the page, or is it busy? | Show Progress | LCP-Start |
| Is it delightful? Are the interactions smooth and natural? | Prevent Flash, Provide Joy | LCP-Start, LCP |

So before deciding which metric we want to reward more, we need to decide which behavior we want to reward more:

Reward ratio example for each behavior

| Behavior | Reward |
| --- | --- |
| Prevent Flash | 15% |
| Show Progress | 25% |
| Preview Content | 50% |
| Provide Joy | 10% |

Once we reach a consensus on these numbers, it will be easy to agree on the weight for each new metric.

jonsneyers commented 3 years ago

Thanks for the above UX analysis, @gunta! Some remarks:

About 'show progress'

The 'show progress' UX aspect is imo a complicated one. To some extent, this depends on how browsers choose to present images as they're loading: e.g. in theory they could show a progress bar in every unloaded image to indicate what percentage of the image file is loaded, and that would 'show progress' even in a codec that cannot do incremental decoding. But they don't, as far as I know.

One could also argue that baseline/sequential/incremental loading is actually better at showing progress than progressive, since the image itself is basically a vertical progress bar, while in a progressive image it can get really hard at the end to see any change and know whether the image is actually done loading or is still going to get more detail ("is that a blurry picture or should I wait for it to load more detail?"). I would say progressive is good at showing progress in the early stages (from LCP-Start to LCP-MVP) but less good at it in the later stages. But again, that is also a matter of how browsers show loading images: I could imagine a browser overlaying a semi-transparent little progress bar on progressive images that are still late-stage loading. Maybe this should be an optional setting, for those who care deeply about the 'show progress' aspect of the UX.

I agree that LCP-Start is important for the 'show progress' aspect, but I would say LCP-MVP also plays a big role in this.

About 'provide joy'

The 'provide joy' aspect seems hard to translate into a measurable metric. I can see how an 'artistic' placeholder is more interesting and joyful to look at than a 'boring' blurry version of the image, but I wouldn't know how to capture that aspect in a metric. Also what is 'joy' and 'surprise' at one point might 'get old' quickly, and the transition from such a 'playful' preview to the actual image is probably harsher, so in terms of 'prevent flash' it may be worse.

So I think it's hard to quantify if and how much LCP-Start contributes towards 'provide joy' – I think it contributes towards 'prevent flash' and 'show progress', and to a very small extent also towards 'preview content', but whether it will 'provide joy' seems a rather subjective thing and hard to measure algorithmically. One could assume though that LCP-End should in any case count towards 'providing joy': seeing the image in all its glory and fine details should bring some joy to the page :)

About 'prevent flash' and layout jank

Regarding the 'prevent flash' aspect, I think besides the rendering of the element itself, there's also the potential problem of layout jank. If the element is rendered nicely, but its position on the page is changing due to the loading of other elements that cause layout changes, then that results in a very annoying UX. I would say LCP-Start can only count if the element is not going to move anymore – if the LCP changes position after any of the three LCP stages (Start,MVP,End), then its LCP time should be the time when it has reached its final position, not the time when it was first rendered (but in the wrong place). But I suppose that's not within the scope of this issue.

Actions to reward and metrics to measure them

Revising @gunta's table a bit to reflect the above comments:

| Question | Action to Reward | Metric |
| --- | --- | --- |
| Is it happening? Has the server responded? | Show Progress | LCP-Start (100%) |
| Is it useful? Has enough content rendered that users can engage? | Preview Content | LCP-MVP (80%), LCP-End (20%) |
| Is it usable? Can users interact with the page, or is it busy? | Show Progress | LCP-Start (30%), LCP-MVP (70%) |
| Are the interactions smooth and natural? | Prevent Flash | LCP-Start |
| Is it delightful? Are the interactions smooth and natural? | Provide Joy | LCP-End |

The 'reward ratio example' you gave seems fine to me. Say we go with those numbers; what would the final weights be?

| Behavior | Reward | Metrics |
| --- | --- | --- |
| Prevent Flash | 15% | LCP-Start |
| Show Progress | 25% | LCP-Start (65%), LCP-MVP (35%) |
| Preview Content | 50% | LCP-MVP (80%), LCP-End (20%) |
| Provide Joy | 10% | LCP-End |

So that would result in a weighted LCP computed as follows:

LCP =
(0.15 + 0.25*0.65) * LCP-Start
+ (0.25*0.35 + 0.50*0.80) * LCP-MVP
+ (0.50*0.20 + 0.10) * LCP-End
= LCP-Start * 31.25% + LCP-MVP * 48.75% + LCP-End * 20%

That's giving more weight to the LCP-Start and less weight to the LCP-End than the weights I proposed before, while the weight for LCP-MVP stays roughly the same. (My previous proposal was LCP = (LCP-Start * 15% + LCP-MVP * 50% + LCP-End * 35%))
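
The same derivation as a tiny script (all numbers are taken from the tables above):

# Hedged sketch: derive the final Start/MVP/End weights from the behavior
# rewards and the per-behavior metric splits in the tables above.
rewards = {'prevent_flash': 0.15, 'show_progress': 0.25,
           'preview_content': 0.50, 'provide_joy': 0.10}
splits = {
    'prevent_flash':   {'start': 1.00},
    'show_progress':   {'start': 0.65, 'mvp': 0.35},
    'preview_content': {'mvp': 0.80, 'end': 0.20},
    'provide_joy':     {'end': 1.00},
}

weights = {'start': 0.0, 'mvp': 0.0, 'end': 0.0}
for behavior, reward in rewards.items():
    for metric, share in splits[behavior].items():
        weights[metric] += reward * share

print({k: round(v, 4) for k, v in weights.items()})
# {'start': 0.3125, 'mvp': 0.4875, 'end': 0.2}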

Non-UX and indirect aspects

I think we've rightfully focused on the UX aspects of this, which I do think are the most important considerations. But "user-centric" is not necessarily the same as "UX-centric". The user might care about other things than just getting the best possible user experience.

For example, there are still cases where network usage is metered and expensive, and total bandwidth consumption is perhaps the main concern of the user. In that respect, LCP-End may be more important (since it says something about the total weight of the image), and adding a redundant LQIP to improve LCP-Start might actually be (slightly) counterproductive since it just adds (some) bytes without changing the final image.

Also the cpu/battery aspect can be important; showing LCP-Start can probably be done cheaply, while LCP-End may require relatively significant decoding effort (and LCP-MVP may also require some extra decode/paint effort), which probably has an effect on battery life and the general responsiveness of the browser; the total decode time is only included in LCP-End. Finally there's the local browser cache effect: since it's the full image file that gets cached, the total file size matters to avoid quickly filling up the browser cache. The smaller the image files, the better the local browser cache can help to improve the UX in general. Again, LCP-End is the main one of the three submetrics that incentivizes smaller total image file sizes.

So I think overall, it is justified to assign a somewhat larger weight to LCP-End than what we would arrive at just from the 'direct UX' analysis.

Proposed weights

Considering all of the above, I would propose to use the following 'round numbers' as weights: LCP = (LCP-Start * 25% + LCP-MVP * 50% + LCP-End * 25%)

mo271 commented 3 years ago

In order to get a better understanding of how good progressive previews are compared to sequential previews (and no previews), I took 700 random progressive JPEGs with file size bigger than 50k and evaluated the PSNR between the completely loaded image and a preview generated from 5%, 10%, 15%, ..., 95% of all bytes. This uses the latest libjpeg-turbo, version 2.1.

[plot: progressive]

I did the same for a set of 700 random sequential JPEGs:

[plot: sequential]

Other than making sure that the images are larger than 50k, I didn't make any specific selection. Here is a scatter plot of the image dimensions of the two sets of 700 images:

[plot: image sizes]

jyrkialakuijala commented 3 years ago

This is great for making decisions more data-driven! Are these 700 JPEGs representative of the sizes and bitrates of images that appear in the LCP role? Do we know the statistics of such images (for example image sizes in pixels, BPP, or kB range)?

What if we limit the analysis to images of more than one million pixels and between 1-4 bpp?

jonsneyers commented 3 years ago

The approach of looking at the difference between the preview and the final image is better than just requiring "1:4 resolution", where it's not clear what the quality of this 1:4 resolution preview actually is (especially if that preview is itself a lossy encoding).

I would propose to define the 'idealized' LCP-MVP as a preview that reaches a PSNR of something like 25 to 30 dB compared to the final image - let's say 28 dB. That kind of difference should be small enough to make the preview "useful" enough to engage with the page. It is probably roughly the same thing as having an accurate 1:4 preview, but it's more agnostic about the actual resolution of the preview (a lower-quality 1:2 preview would also work, for example).

In actual implementations, you probably don't want to actually save the full preview and measure the actual PSNR w.r.t. the final image, but rather translate the 'idealized' format-agnostic definition into a format-specific heuristic based on an analysis like @mo271 did above. For progressive JPEG, for example, you could say that on average, say, 35% of the bytes are enough to reach this point, so you could trigger LCP-MVP when 35% of the bytes are available – this will not be exact for all images, but it will be a good enough approximation on average. For sequential JPEG, you can't really trigger LCP-MVP before LCP-End, or maybe you can when 97% of the bytes have arrived or something like that.

Rephrasing the proposed LCP definition:

Proposed progressive LCP definition for images

LCP-Start

When the first paint happens on the element, the time and a downscaled copy of the painted element, at just 3x3 pixels (9 pixels), are saved. When the LCP-End event happens (image fully loaded), the saved 3x3 image is compared with the final image downscaled to 3x3 pixels; if for these 9 pixels the mean RGB difference per pixel is below 30%, then LCP-Start is retroactively triggered at the time of the first paint. If the pixels don't meet this matching criterion, the first paint doesn't count and LCP-Start coincides with LCP-MVP. (Question: is it a problem that LCP-Start can only be known when LCP-End has happened?)

LCP-MVP

When a paint happens that is expected to reach a PSNR of at least 28 dB compared to the final image, then LCP-MVP is triggered. If there is no paint that satisfies this criterion, LCP-MVP coincides with LCP-End.

LCP-End

When the image is fully decoded and painted, LCP-End is triggered.

LCP

LCP is defined as a weighted sum of the above three times: LCP = (LCP-Start * 25% + LCP-MVP * 50% + LCP-End * 25%).

mo271 commented 3 years ago

This is great for making decisions more data-driven! Are these 700 JPEGs representative of the sizes and bitrates of images that appear in the LCP role? Do we know the statistics of such images (for example image sizes in pixels, BPP, or kB range)?

What if we limit the analysis to images of more than one million pixels and between 1-4 bpp?

If we limit the analysis to only those images (I consider 123 of those here), the progressive plot looks like this:

[plot: progressive, filtered subset]

Not really all that different to considering all images larger than 50k.

jyrkialakuijala commented 3 years ago

Not really all that different to considering all images larger than 50k.

Do you have a data-driven explanation for the variance in the swarming behaviour of temporal PSNR? For example, is it scan-script related or image-type (text/photographic/high-noise) related?

jonsneyers commented 3 years ago

I would drop the very large images (say above 5 MP) from the analysis, those shouldn't be the common case on the web.

Also, instead of looking at random images, you could look at the LCP images @anniesullie extracted (link), and either take only the progressive ones, or use jpegtran -optimize -progressive to make all of them progressive (perhaps it would be good to use both the jpegtran from libjpeg-turbo and the one from mozjpeg, because I suppose they might produce different results).

jyrkialakuijala commented 3 years ago

After looking at the images, their original compression quality, and their scan scripts, I conclude that the most likely explanation for the variance in swarming is the noise (and texture) level in the image. Images with higher noise levels trail behind (lower in dB) in temporal PSNR; images with low noise are ahead (higher in dB) of the mean/average curves.

Another interesting data point to support decision-making could be the progressive fraction at which images containing text become legible without significant extra effort.

mo271 commented 3 years ago

I would drop the very large images (say above 5 MP) from the analysis, those shouldn't be the common case on the web.

Also, instead of looking at random images, you could look at the LCP images @anniesullie extracted (link), and either take only the progressive ones, or use jpegtran -optimize -progressive to make all of them progressive (perhaps it would be good to use both the jpegtran from libjpeg-turbo and the one from mozjpeg, because I suppose they might produce different results).

I'll look into filtering out the progressive images in the list, thanks!

mo271 commented 3 years ago

I ran the same analysis as above with 1000 progressive and 1000 sequential images from the list of @anniesullie; here are the results:

[plot: 1000 progressive images]

[plot: 1000 sequential images]

To get a better understanding of the images in @anniesullie's dataset, I sampled 10,000 of them. Out of those, 26.16% were progressive. A histogram of all of their sizes (with part of the long tail cut off) looks like this:

[histogram: image sizes]

The sizes (in bytes) differ slightly between progressive and sequential:

|             | Mean    | Median  |
| ----------- | ------- | ------- |
| progressive | 210,685 | 106,042 |
| sequential  | 239,380 | 104,178 |
| all         | 231,874 | 104,692 |

jyrkialakuijala commented 3 years ago

Andrew Galloni's and Kornel Lesiński's blog post states that a preview is available at 10–15% of the bytes, and that "at 50% of the data the image looks almost as good as when the whole file is delivered". At around 20% of the bytes, the growth of the mean PSNR curve slows down. At 25% a progressive image can already look pretty good.

Based on the temporal PSNR curves and the common rules of thumb for progressive JPEG (like the ones in the blog post by Galloni and Lesiński), I propose that we look into triggering LCP for progressive JPEGs when 25% or 50% of the bytes have been rendered.

mo271 commented 3 years ago

Here's a GitHub gist that takes 50 random progressive images from the links in @anniesullie's list and for each image shows how it looks when using 25% and 50% of the bytes. This uses an up-to-date version of libjpeg-turbo. For comparison the gist also includes the original JPEG, so each image appears three times:

https://gist.github.com/mo271/f4c6a0807d15ab0b0410078c3687670f

In many cases it is hard to make out a difference between 50% and fully rendered, and often the 25% version is already pretty good, although one can find a few examples where not even the first scan is finished. (In those cases this could be improved by using a different scan script.) Judging from this sample of images, the percentages suggested by @jyrkialakuijala are definitely in the right ballpark.

kornelski commented 3 years ago

In the cases where 15% of the file is not enough to show the first scan, almost always it's because of hugely bloated metadata or large color profiles. Web-optimized files are not supposed to have such baggage, of course.

jonsneyers commented 3 years ago

Based on the above data and examples, I'd suggest, in the case of progressive JPEGs, to trigger LCP-Start at 15% (unless it can already be triggered earlier because of a blurhash or whatever other LQIP approach; probably-full-DC is more than what's needed for LCP-Start), LCP-MVP at 40%, and of course LCP-End at 100%.

For other formats, the format-agnostic definition I proposed earlier can be the guideline to come up with an easy-to-implement heuristic. At the moment, the only other progressive-capable web format would be JPEG 2000 (only in Safari), and I suppose interlaced GIF or PNG could also be counted as progressive, but those might be rather rare and not really worth it – Adam7 PNG interlacing tends to hurt compression quite a bit, especially on the kind of images where PNG is actually a good choice.

yoavweiss commented 3 years ago

Thanks all for the great analysis!!

A few comments:

Going back to the criteria I stated above, a byte-based approach looks promising from the "interoperability" and "performance" perspective. If we properly define image format categories with different cut-offs ("progressive", "sequential with intermediate paints", "renders when image is complete", etc), we may be able to also make it format agnostic.

Finally, as @vsekhar said, we would need to somehow validate the approach and the cut-off rates on actual humans, to ensure the PSNR analysis matches their perception. Might also be useful to run other image comparison metrics and see if they align to similar cut-offs.

jonsneyers commented 3 years ago

Not counting metadata is indeed important. In principle, you could make a 2 MB 'progressive JPEG' that actually turns out to be a CMYK image that starts with a 1 MB ICC profile, and contains proprietary Adobe metadata blobs with clipping paths and stuff like that. Obviously the cut-offs would not make sense on such an image.

Similarly, if there happens to be metadata at the end (say some copyright strings), you could declare LCP-End before that metadata has been received. On the other hand, that's probably not worth the complication – there shouldn't be anything large after the end of the image data, and if there is, we probably want to discourage that rather than encourage it by not counting its download time.

There's another caveat: some metadata influences rendering, in particular ICC profiles and Exif orientation (and Exif-based intrinsic sizes). While I agree to discount metadata when computing the cut-offs (having a lot of Exif camera makernotes is not going to render the image any sooner), it may be worth checking that the render-critical metadata is actually available when showing the progressive previews. You could in principle make a progressive JPEG where the ICC profile and/or Exif orientation is only signaled at the very end of the bitstream. That would of course not be a nice experience. (Then again, I have never seen this in the wild, so maybe this issue is too theoretical to matter.)

The approach of byte-based cut-offs per 'format category' (or rather, 'format implementation category', since even the most progressive format could be implemented only in a "renders when all data arrived" way) seems to be simplest, where cut-offs are based on the actual image payload, excluding metadata that may also be present in the container.

So for example, we could have something like this:

| Format implementation category | LCP-Start cut-off | LCP-MVP cut-off | LCP-End cut-off |
| --- | --- | --- | --- |
| renders only when all data is available | 100% | 100% | 100% |
| renders sequentially | 70% | 90% | 100% |
| renders progressively | 15% | 40% | 100% |

where afaik the current state of e.g. Chrome is this:

| Format | Format implementation category |
| --- | --- |
| sequential JPEG | renders sequentially |
| progressive JPEG | renders progressively |
| WebP | renders sequentially |
| AVIF | renders only when all data is available |
| non-interlaced PNG/GIF | renders sequentially |
| interlaced PNG/GIF | renders progressively (but probably worse cut-offs) |
| SVG | renders sequentially? |

The progressive cut-offs are probably format-dependent, and even within a format, there could be different ways to do progressive. For example, in JPEG you can make a simple scan script that first does DC and then does all AC at once (or rather in 3 scans, because you have to do it component per component), which would result in cut-offs that are more like (15%, 80%, 100%), or you can do a more complicated scan script that has more refinement scans and results in an earlier cut-off for LCP-MVP. The latter is what you currently typically get in the wild, since it's what both MozJPEG and libjpeg-turbo do by default.

We could also add the concept of 'preview images' that may be embedded in some containers, e.g. WebP2 is planning to have it, the AVIF container can do it, and in principle Exif metadata can also store a preview. Such previews would count as an earlier LCP-Start.
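
As a hedged sketch of how such byte-based cut-offs could be applied (the percentages are the ones from the table above; all names are illustrative, not from any spec):

# Hedged sketch: which LCP events have been reached, given the fraction of the
# actual image payload (excluding container metadata) received so far.
CUTOFFS = {
    'renders_when_complete': {'start': 1.00, 'mvp': 1.00, 'end': 1.00},
    'renders_sequentially':  {'start': 0.70, 'mvp': 0.90, 'end': 1.00},
    'renders_progressively': {'start': 0.15, 'mvp': 0.40, 'end': 1.00},
}

def triggered_events(category: str, payload_bytes_received: int,
                     payload_bytes_total: int) -> list:
    fraction = payload_bytes_received / payload_bytes_total
    cut = CUTOFFS[category]
    return [event for event in ('start', 'mvp', 'end') if fraction >= cut[event]]

# e.g. a progressive JPEG with half of its image data received:
print(triggered_events('renders_progressively', 50, 100))  # ['start', 'mvp']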

kornelski commented 3 years ago

BTW, I don't think it's possible to have a JPEG with EXIF or ICC at the end. libjpeg handles app markers before jpeg_start_decompress.

eeeps commented 3 years ago

Also, the CSSWG resolved a while back that implementers should ignore render-impacting metadata that comes after the image data. The relevant spec changes are here.

jyrkialakuijala commented 3 years ago

Yoav Weiss: 'to ensure the PSNR analysis matches their perception'

We are lucky that a lot of research has been done on this already. The general consensus on simple metrics like Y-PSNR and Y-MS-SSIM is that they do indicate image quality in the case where no image-specific Y-PSNR or Y-MS-SSIM optimizations have been done. Some TID2013 'full corpus' SROCC values (higher value is better):

| Metric | SROCC |
| --- | --- |
| SSIM | 0.4636 |
| PSNR | 0.4700 |
| PSNRHVSM | 0.4818 |
| PSNRHVS | 0.5077 |
| MSSIM | 0.6079 |
| Butteraugli ‘3rd norm’ | 0.655 |
| SSIMULACRA | 0.673 |
| DSSIM | 0.861 |

Multi-scale metrics (PSNRHVSM, PSNRHVS, MSSIM, Butteraugli, SSIMULACRA, DSSIM) perform better than single-scale metrics (PSNR and SSIM). Multi-scale metrics give a further boost for progressive since progressive coding sends the lower (more important) frequencies first.

jyrkialakuijala commented 3 years ago

Some food for thought on sending some high frequencies later:

One analogy for progressive encoding that has not yet sent the high-frequency components is the four other common ways of not sending high-frequency components. YUV420 never sends 50% of the coefficients; they are just effectively zeroed. The images look worse, get higher generation loss, etc., but not so much worse that it always matters in common internet use. I believe about 75% of JPEGs on the internet are YUV420.

The second strategy I have seen (possibly Adobe's JPEG encoder) is to zero out the last 28 coefficients (44%) in zigzag order. These coefficients are just not sent (or have such high quantization matrix values that they are always zeroed) in many JPEG images found in the wild.

The third is that in AVIF there is a way to describe a 64x64 block transform with 75 % of values zeroed out. If you use this transform in YUV420 context, you will have 87.5 % of the high frequency data zeroed out.

The fourth is in AV1 (AVIF, too?): there is a way to describe a resampling of the image (x-direction only). If you specify a 4/3 ratio, you will lose another 25% of the high-frequency data. YUV420, x-resampling, and 64x64 blocks together lead to a situation where less than 10% of the original data is kept. These approaches remove high frequencies.

Why do we not care about high frequencies? Because images work well without all the higher frequencies, in both normal image coding and in progressive phases, especially so for photographic content.