paulirish / speedline

Calculate the speed index from devtools performance trace

Indexes view layout instability as positive progress #50

Open patrickhulce opened 7 years ago

patrickhulce commented 7 years ago

see https://github.com/pmdartus/speedline/pull/49#issuecomment-300024141

Classical speed index certainly suffers from this much more than PSI, but PSI fails in similar situations too: it doesn't identify disruptive layout shifts as negative events, treating them as positive progress instead (which is why I say it rewards the jank, even though the PSI value will be inflated because of it). Here's an example gist of a page where multiple elements come in above the primary content to simulate an ad popping in over an article. Because the target is the ultimately disrupted page, any progress toward that target counts as positive in the eyes of PSI, and the SSIM-over-time graph is indistinguishable from a standard progressively enhanced load. You can also see this play out in real-world cases like theverge.com, when a video ad is injected after the header but before the content. While PSI is appropriately inflated by the later structural changes, there's still no signal that some of the progress was disruptive to the user.
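To make the failure mode concrete, here is a minimal sketch of how a target-based perceptual progress curve is computed (assuming grayscale screenshot frames as NumPy arrays; this is not speedline's actual implementation): every frame is scored by its similarity to the final frame, so anything that moves the page toward its final, already-shifted layout reads as positive progress.

```python
# Sketch of target-based visual progress in the spirit of Perceptual Speed Index.
# Not speedline's code; frame loading and helper names are illustrative.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def perceptual_progress(frames: list[np.ndarray]) -> list[float]:
    """Progress of each grayscale frame toward the final (target) frame, 0-100%."""
    target = frames[-1]
    scores = [ssim(frame, target, data_range=255) for frame in frames]
    first, last = scores[0], scores[-1]
    # Normalize so the first frame maps to 0% and the final frame to 100%.
    return [100 * (s - first) / (last - first) for s in scores]

# Because the target frame already contains the injected ad and the pushed-down
# article, the frame where the ad pops in becomes *more* similar to the target,
# so the shift registers as positive progress rather than as a disruptive event.
```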

My primary point is that if you're looking for a signal of layout stability and visual churn throughout the page lifecycle, separate from load time, you need signals beyond speed index today. I've been playing around with something that examines lost edges and frame-to-frame similarity rather than progress toward a target, but if anyone has insights or prior work here I'd love to see them :)
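For anyone curious what a lost-edges signal could look like, here is one possible (hypothetical) reading of that idea, again assuming grayscale frames as NumPy arrays: count edge pixels visible in one frame that have no nearby edge in the next. A purely additive load loses few edges, while a layout shift that pushes content down produces a spike, even though target-based progress keeps rising.

```python
# Hypothetical lost-edges instability signal; not anyone's actual implementation.
import numpy as np
from skimage.feature import canny
from skimage.morphology import binary_dilation, disk

def lost_edge_ratio(prev_frame: np.ndarray, curr_frame: np.ndarray) -> float:
    """Fraction of the previous frame's edge pixels with no nearby edge in the current frame."""
    prev_edges = canny(prev_frame, sigma=2)
    curr_edges = canny(curr_frame, sigma=2)
    if not prev_edges.any():
        return 0.0
    # Dilate so anti-aliasing wobble or 1px movement doesn't count as a lost edge.
    curr_neighborhood = binary_dilation(curr_edges, disk(2))
    lost = prev_edges & ~curr_neighborhood
    return float(lost.sum() / prev_edges.sum())

def instability_signal(frames: list[np.ndarray]) -> list[float]:
    """Frame-to-frame lost-edge ratios; spikes flag visually disruptive transitions."""
    return [lost_edge_ratio(a, b) for a, b in zip(frames, frames[1:])]
```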

GIF of Load timeline

Speedline image

pahammad commented 7 years ago

We could look into this issue in more depth. Any algorithm will have some failure rate, so if we have alternative visual progress measures to test, how do we evaluate whether these new pairwise measures perform better at detecting layout instability than the SSIM-based approach? We may need to collect some good benchmark data for that testing. Any pointers on how to go about this, beyond individual anecdotes?
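On the evaluation question, one way to move beyond anecdotes would be a small labeled benchmark, e.g. filmstrips hand-labeled as "contains a disruptive layout shift" vs. "stable load", plus a harness that checks how cleanly each candidate signal separates the two classes. The sketch below is entirely hypothetical (the dataset, labels, and summary statistic are all assumptions), just to illustrate the shape such a comparison could take.

```python
# Hypothetical harness: score each candidate metric by how well it separates
# traces labeled as containing a disruptive layout shift (1) from stable loads (0).
from sklearn.metrics import roc_auc_score

def evaluate_metrics(labeled_traces, metrics):
    """labeled_traces: list of (frames, label) pairs.
    metrics: mapping of name -> function(frames) -> list of per-frame values."""
    labels = [label for _, label in labeled_traces]
    results = {}
    for name, metric in metrics.items():
        # Summarize each trace by the worst (max) per-frame value of the signal.
        scores = [max(metric(frames)) for frames, _ in labeled_traces]
        results[name] = roc_auc_score(labels, scores)
    return results

# Example: evaluate_metrics(traces, {"lost_edges": instability_signal})
# An AUC near 1.0 means the signal ranks disruptive loads above stable ones.
```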