BigLep opened this issue 1 year ago
Agree with @BigLep, we should add 1 paragraph that defines what we mean by "The time it took to fully load the page.".
I know @guseggert was looking into doing way more than measuring how long the HTML took to transfer, and more than DOMContentLoaded: he was waiting for DOMContentLoaded AND also for all images on a page to load as well. Not sure what ended up being the final metric, but we should document it to remove guesswork/confusion.
Sidenote: If measuring images is too complex/unreliable, DOMContentLoaded is what people usually measure, because it is when the page is in a state that allows for user interaction (and images load async).
+ small nits:
https://en.wikipedia-on-ipfs.org/wiki/
to reduce bias

Great, thanks for the feedback!
- What is the configuration of the node monitoring these sites? For example, is it a stock Chromium phantomas node? (I think we should be explicit that Companion (for intercepting IPFS URLs) is not in the mix.)
We run website measurements every six hours. Each of these measurements we consider a measurement run.
In each run, we start six Kubo nodes around the world in different AWS regions. As soon as their API is reachable, we wait for 10s to let the node settle, and then request these websites one by one. We request each website three times. Then we wait 10 minutes and request the set of websites again. We thought that this may simulate a warm(er) Kubo node that has a "better" routing table. The graphs in the weekly report don't yet distinguish between cold and warm nodes.
Actually, we're not only starting six Kubo nodes but twelve because we want to test the most recent stable Kubo version (v0.18.0) and the most popular according to our network crawls (v0.17.0 - up until last week, now it's also v0.18.0).
Another detail: in each run, we are also requesting every website via plain HTTP without going through the local Kubo node. This means we could compare both protocols.
We can easily change all of the above parameters (4x a day, 6 regions, settle times, 3 retries, 2 Kubo versions, etc.).
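To make the run structure concrete, here is a hypothetical sketch of how one measurement run expands into individual website requests. The function name, region list, and website list below are illustrative assumptions, not the actual measurement code:

```javascript
// Illustrative sketch of one measurement run (not the real code):
// for every AWS region and Kubo version, each website is requested
// `retries` times while the node is cold, and once more after the
// node has had 10 minutes to warm up its routing table.
function planRun(regions, kuboVersions, websites, retries) {
  const requests = [];
  for (const region of regions) {
    for (const version of kuboVersions) {
      for (const phase of ['cold', 'warm']) {
        for (const website of websites) {
          for (let attempt = 1; attempt <= retries; attempt++) {
            requests.push({ region, version, phase, website, attempt });
          }
        }
      }
    }
  }
  return requests;
}

// 6 regions x 2 Kubo versions x 2 phases x 1 website x 3 retries = 72 requests
const plan = planRun(
  ['eu-central-1', 'us-east-1', 'us-west-1', 'ap-southeast-1', 'sa-east-1', 'af-south-1'],
  ['v0.17.0', 'v0.18.0'],
  ['filecoin.io'],
  3
);
console.log(plan.length); // 72
```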
We are running the Kubo nodes on AWS's t3.medium instances and request the websites through Kubo's local gateway. E.g., a website request looks like this: http://127.0.0.1:8080/ipns/filecoin.io. We thought that this would come close to Companion's performance. However, IIUC, if a user browses to, e.g., https://filecoin.io, the x-ipfs-path header gets intercepted and only the remaining resources will be loaded via IPFS. I think, with our experiments, we're simulating the case where a user would directly browse to ipns://filecoin.io.
- Is the cache cleared between each run?
Yes, between each retry we run ipfs repo gc. However, the first website request is likely slower than the subsequent ones because Bitswap will discover the provider immediately in later requests.
- I assume "Page Load" is https://github.com/macbre/phantomas/blob/devel/docs/metrics.md#performancetimingpageload
That's right :+1:
That said, I assume this is the "Load" metric that shows up in one's web inspector (screenshot - red vertical bar).
Not sure which screenshot you're referring to :/
Regardless, I'd love to be more specific than "page load"
Totally agree! Just to clarify what the performanceTimingPageLoad metric measures. I just looked it up (source), and it measures the difference between:

- loadEventStart - representing the time immediately before the current document's load event handler starts. The load event is fired when the whole page has loaded, including all dependent resources such as stylesheets, scripts, iframes, and images. This is in contrast to DOMContentLoaded, which is fired as soon as the page DOM has been loaded, without waiting for resources to finish loading.
- navigationStart - a deprecated feature "representing the moment [...] right after the prompt for unload terminates on the previous document"

@lidel phantomas already measures DOMContentLoaded. So I'll just replace the performanceTimingPageLoad metric with the DOMContentLoaded one?
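A minimal sketch of the two metrics being compared, computed from legacy `performance.timing`-style epoch timestamps. The sample numbers are made up for illustration:

```javascript
// pageLoad: what phantomas' performanceTimingPageLoad reports,
// i.e. loadEventStart minus navigationStart (both epoch milliseconds
// in the legacy performance.timing API).
function pageLoad(timing) {
  return timing.loadEventStart - timing.navigationStart;
}

// domContentLoaded relative to navigation start.
function domContentLoaded(timing) {
  return timing.domContentLoadedEventStart - timing.navigationStart;
}

// Mocked timing object with illustrative values.
const timing = {
  navigationStart: 1678000000000,
  domContentLoadedEventStart: 1678000000800,
  loadEventStart: 1678000002500,
};

console.log(pageLoad(timing));         // 2500
console.log(domContentLoaded(timing)); // 800
```

As the numbers show, the load-based metric is strictly later than the DOMContentLoaded-based one, since it also waits for stylesheets, scripts, iframes, and images.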
Week over week trends - It would be great to have a mechanism to detect if this radically changes week over week. One idea would be to pick a few sites and a few regions and plot the p50 and p90 of time to first byte (since that shouldn't be susceptible to the content of the page).
Yup, I also wanted to have that but couldn't come up with a nice visualization that nicely captures all dimensions (latency, datetime, region, website, data points). What you're suggesting is actually a nice trade-off I think.
Suggestion: take ipfs.tech + en.wikipedia-on-ipfs.org/wiki/ p50 + p90 timeToFirstByte latencies from eu-central-1 and report these numbers every week.
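The weekly aggregation for that suggestion is straightforward; a minimal sketch, using the simple nearest-rank percentile definition and illustrative sample values:

```javascript
// Nearest-rank percentile: sort the samples, take the value at
// rank ceil(p/100 * n). Samples are timeToFirstByte latencies in ms.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical TTFB samples for one site in one region over a week.
const ttfbSamples = [120, 340, 95, 180, 210, 400, 160, 250, 130, 310];

console.log(percentile(ttfbSamples, 50)); // 180 (p50)
console.log(percentile(ttfbSamples, 90)); // 340 (p90)
```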
Other sites I could imagine adding:
Updated list of websites: https://github.com/protocol/probelab-infra/pull/17
@dennis-tra : again, this is awesome, and thanks for the great reply.
A few followups...
However, IIUC if a user browses to, e.g., https://filecoin.io, the x-ipfs-path header gets intercepted and only the remaining resources will be loaded via IPFS. I think, with our experiments, we're simulating the case where a user would directly browse to ipns://filecoin.io.
Yeah, you're right that there are some differences with Companion vs. hitting the local Kubo HTTP gateway, but I think your approach is good/clean/simple. Companion learns over time which domains are available via non-HTTP, so future accesses to those domains are re-routed to the configured IPFS HTTP gateway. (I believe there's more nuance here, but I think what you're capturing here is good/fine.)
http://127.0.0.1:8080/ipns/ipfs.tech, but if the top-level ipfs.tech page has <img src="https://libp2p.io/logo.png"/>, that will be fetched via HTTP since Phantomas doesn't have the intelligence to know that libp2p.io assets can be fetched via non-HTTP. I think this is fine, but I think this caveat should be documented.

Doh, I added the screenshot to my original post - thanks.
load vs DOMContentLoaded - I've got a few things to say around this but am not necessarily laying out the points in the best way.
I like the idea of what you're suggesting but would add a bit:
Sites
Regions:
Modes: (I added this, and maybe we do one graph for HTTP and one for non-http).
Metric
The reason I think we want to plot HTTP vs non-http is:
We have lots of good/important notes about this topic at this point. Where is the durable place it's going to live? While I do think we need some explanatory text in the report, I'm also fine for it to link to a more thorough document for more info. You don't need to block on my review, but please include me when this is available for review.
Again, good stuff, and thanks!
I know @guseggert was looking into doing way more than measuring how long the HTML took to transfer, and more than DOMContentLoaded: he was waiting for DOMContentLoaded AND also for all images on a page to load as well. Not sure what ended up being the final metric, but we should document it to remove guesswork/confusion.
I think Dennis has described what event we're looking at, just wanted to add here that I gave up on this because it was too complicated and seemed unlikely to be maintainable by us (and since it has to infer images above-the-fold it will probably require some maintenance to keep up with browser changes).
(sorry didn't mean to close)
I have expanded the scope of this issue to be feedback on the various website-monitoring reports that have come in during 202302 and 202303. I'll consider this done when we have a first draft that I would feel comfortable sharing with other leaders and not needing to be there to answer/explain it. After that we can develop a separate process for how we report ongoing observations, questions, and suggestions.
Fresh observations from looking at week 9
In practice, here is what I'm suggesting:
Feel free to disagree and I'm up for discussing other options.
Also, what are the thoughts capturing all the meta-details? Per before, we have lots of good/important notes about this topic. I want to make sure we have a durable place for it. This could be in the report itself or listed somewhere else.
I think we also need a way where we can capture notes/investigations that were done. I don't think slack threads will scale since it won't make it easy to see prior answers/investigations. Ideally there is a self-surface way for anyone to see what investigations have done for a given week, or what our callouts/observations are for that week. I think that likely means having an accompanying github issue, page, or Notion doc for each weekly report that we link to from the report. Some content will carry forward between reports and that's ok. We also want to make it self-service for someone to know where they ask questions. (I think it's ideally with a comment in the linked doc.). I'm happy to discuss this more.
Hi @BigLep,
sorry for the late reply. I have been working on improving our measurement setup. I spent some time last week putting our website measurement infrastructure on a new foundation and I'm much more confident about the data we are gathering now. I plan to document the setup in this repository (note that we have created a ProbeLab GitHub organization, so this and other private repositories will eventually be migrated to that org). I have also explained my reasoning for the new setup here.
Because of this new setup, we don't have enough data to report in this week's report.
Some notes regarding the metrics we want to report: Further up in this issue, we focused on the TTFB and domContentLoaded metrics. While working on our website monitoring infrastructure last week, I read up on how to measure website performance and came across this list:
https://developer.mozilla.org/en-US/docs/Learn/Performance/Perceived_performance
To quote the website:
Performance metrics
There is no single metric or test that can be run on a site to evaluate how a user "feels". However, there are a number of metrics that can be "helpful indicators":
First paint The time to start of first paint operation. Note that this change may not be visible; it can be a simple background color update or something even less noticeable.
First Contentful Paint (FCP) The time until first significant rendering (e.g. of text, foreground or background image, canvas or SVG, etc.). Note that this content is not necessarily useful or meaningful.
First Meaningful Paint (FMP) The time at which useful content is rendered to the screen.
Largest Contentful Paint (LCP) The render time of the largest content element visible in the viewport.
Speed index Measures the average time for pixels on the visible screen to be painted.
Time to interactive Time until the UI is available for user interaction (i.e. the last long task of the load process finishes).
I think the relevant metrics on this list for us are First Contentful Paint, Largest Contentful Paint, and Time to interactive. First Meaningful Paint is deprecated (you can see that if you follow the link) and they recommend: "[...] consider using the LargestContentfulPaint API instead.". First paint would include changes that "may not be visible", so I'm not particularly fond of this metric.
Speed index seems to be very much website-specific. By that, I mean that the network wouldn't play a role in this metric; we would only measure the performance of the website itself. I would argue that this is not something we want.
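For reference, a browser exposes these paint metrics as performance entries. A minimal sketch of how the final LCP value is derived from the buffered largest-contentful-paint entries (the entry objects below are mocked, since there is no PerformanceObserver outside a browser):

```javascript
// In a real browser the entries would be collected via:
//   new PerformanceObserver(cb).observe({ type: 'largest-contentful-paint', buffered: true })
// The browser emits a new entry each time a larger element renders;
// the page's LCP is the *last* entry observed.
function finalLcp(entries) {
  if (entries.length === 0) return undefined;
  const last = entries[entries.length - 1];
  // Per the spec, renderTime can be 0 (e.g. cross-origin images without
  // Timing-Allow-Origin); fall back to loadTime in that case.
  return last.renderTime || last.loadTime;
}

// Mocked entries with illustrative millisecond values.
const mockEntries = [
  { renderTime: 350, loadTime: 340 },   // headline text painted first
  { renderTime: 1200, loadTime: 1150 }, // hero image painted later
];

console.log(finalLcp(mockEntries)); // 1200
```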
Besides the above metrics, we should still measure timeToFirstByte. According to https://web.dev/ttfb/, the metric would be the time difference between startTime and responseStart:

In the above graph, you can also see the two timestamps domContentLoadedEventStart and domContentLoadedEventEnd. So I would think that the domContentLoaded metric would just be the difference between the two. However, this seems to only account for the processing time of the HTML (+ deferred JS scripts). We could instead define domContentLoaded as the time difference between startTime and domContentLoadedEventEnd.
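A sketch of these two definitions based on the modern PerformanceNavigationTiming API, where all values are relative to startTime (which is 0 for the navigation entry). In a browser the entry would come from `performance.getEntriesByType('navigation')[0]`; here it is mocked with made-up millisecond values:

```javascript
// TTFB per web.dev: responseStart relative to startTime.
function timeToFirstByte(nav) {
  return nav.responseStart - nav.startTime;
}

// domContentLoaded as proposed above: domContentLoadedEventEnd
// relative to startTime, not just the event's own duration.
function domContentLoadedFromStart(nav) {
  return nav.domContentLoadedEventEnd - nav.startTime;
}

// Mocked PerformanceNavigationTiming-like entry.
const nav = {
  startTime: 0,
  responseStart: 230,
  domContentLoadedEventStart: 780,
  domContentLoadedEventEnd: 810,
};

console.log(timeToFirstByte(nav));          // 230
console.log(domContentLoadedFromStart(nav)); // 810
```

Note the contrast: domContentLoadedEventEnd - domContentLoadedEventStart would only be 30 ms here (HTML processing + deferred scripts), while measuring from startTime captures the network time as well.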
The revised measurement setup currently gathers the following data:

- timeToFirstByte - as defined above
- firstContentfulPaint
- largestContentfulPaint
- the PerformanceNavigationTiming object

We could also include:

- Time to interactive
- domContentLoaded - as defined above

I believe we won't be able to report all of the above metrics, so if I had the choice between only two, I would choose timeToFirstByte and largestContentfulPaint.
Just want to note that the ask for week-over-week graphs was not unheard! I'm also working on this and will come back here when I have news. I'll try to address all your remarks from the last comment.
Also, I don't have a better place to discuss these things right now. Instead of GH, we could use Notion or discuss.ipfs.io. I'll chat with @yiannisbot and @iand and come back here with a proposal.
Explained the new website measurement methodology here: https://github.com/dennis-tra/tiros
Thanks @dennis-tra for the update.
Good call on getting good underpinning.
Thanks for sharing https://github.com/dennis-tra/tiros
I like all the defenses to prevent caching.
The only issue I'm wondering there is about: what advantage is Kubo getting for successive runs because it is already connected with the providers of all the website content. I assume it's effectively getting to bypass content routing via DHT/IPNI and rely on Bitswap's content discovery.
Agreed on TTFB since that does measure HTTP vs. non-HTTP and is not dependent on the site's initial HTML payload. It is comparable across sites.
Per before, "I don't think with these reports that we want to get into the business of helping site creators realize that their sites could be better optimized or are slow compared to other properties." I'm a bit worried we're heading into these waters by talking about firstContentfulPaint, largestContentfulPaint, etc. I think we should message our rationale with something like "We're including this metric because it helps a site owner see the impact of using IPFS protocols over HTTP before their site is interactive. HTTP vs. IPFS protocols have some impact on this metric, but they aren't the only culprit. There have been many tools developed over the last decades of the modern web to help." With a message like that, I guess it makes me think we should also prefer Time to interactive rather than largestContentfulPaint. (I can see arguments either way, but I would give preference to interactivity rather than rendering of largestContentfulPaint because of how annoying the user experience is when you can't interact with the page to scroll or click. Anyways, I'll defer here.)
I like the idea of having a discuss post per week (e.g., https://discuss.ipfs.tech/t/ipfs-measurement-report-calendar-week-10-2023/16114/2 ). A couple of things:
The only issue I'm wondering there is about: what advantage is Kubo getting for successive runs because it is already connected with the providers of all the website content. I assume it's effectively getting to bypass content routing via DHT/IPNI and rely on Bitswap's content discovery.
We (ProbeLab) also discussed this previously and also assume that a subsequent request will likely be served directly via Bitswap. Since we're tracking if it's the first, second, third, etc., request in the Kubo node's lifetime, we could produce a graph that only considers the first requests. The sample size would be very small, though. Alternatively, we could actively disconnect from the content provider after each request. However, I don't think Kubo gives us information from which peer it fetched the data. If that were the case, we could certainly do that. Then we'd always measure the worst-case performance where we'd need to reach out to the DHT (although we could still, by chance, be connected to another providing peer).
On another note, we're complementing these website measurements with DHT performance measurements, where we directly measure the publication and lookup latencies.
I don't think with these reports that we want to get into the business of helping site creators realize that their sites could be better optimized or are slow compared to other properties.
I also think so, and that's exactly why I argued against the Speed index metric. firstContentfulPaint and largestContentfulPaint will include latencies for subsequent requests from, e.g., script, img, and link tags, which may then also be served by Kubo (though I saw that some websites make cross-origin requests, which would then not be served by Kubo). That's why I think these two metrics will depend on Kubo's performance and are worth measuring - but it's certainly muddy.
However, that's also the case for the TTI metric. From the docs:
Time to Interactive (TTI) is a non-standardized web performance 'progress' metric defined as the point in time when the last Long Task finished and was followed by 5 seconds of network and main thread inactivity.
And a "Long Task":
Long tasks that block the main thread for 50ms or more cause, among other issues:
- Delayed Time to interactive (TTI).
- High/variable input latency.
- High/variable event handling latency.
- Janky animations and scrolling.
A long task is any uninterrupted period where the main UI thread is busy for 50ms or longer. Common examples include:
- Long-running event handlers.
- Expensive reflows and other re-renders.
- Work the browser does between different turns of the event loop that exceeds 50 ms.
Especially the list of common examples sounds very website specific to me. I could imagine a SPA spending too much time on the main thread rendering the page. This wouldn't have something to do with Kubo's performance IMO.
I think measuring the TTI definitely won't hurt, so I'll try to track it regardless of whether we will eventually report it.
We should maybe make this its own Discuss category so someone can subscribe to the category for notifications.
Totally! We could also rename "Testing & Experiments" to something like "Measurements" (just to limit the number of categories). Who owns the forum? I believe I don't have the necessary permissions to create new categories.
(nit) but I think it would be good to get an ISO date into the title: "IPFS Measurement Report - 2023-03-12"
Seems like I can't edit the post anymore :/ will do for the next one.
Thanks @dennis-tra:
For now I created a measurements subcategory: https://discuss.ipfs.tech/c/testing-and-experiments/measurements/39
I moved and renamed https://discuss.ipfs.tech/t/ipfs-measurement-report-2023-week-10-2023-03-12/16114
I also gave you moderation access.
Apologies that I haven't come around to addressing your feedback from here yet. Just want to signal that it's not forgotten.
It does hit on that we need to have a way for someone to find out all the persistent callouts/caveats about a given graph (i.e., someone should be able to see our hypothesis that runs 2+ for a given site will be more performant because of existing connections). They don't have to be "in your face", but it should be self-service for someone to learn all the caveats (similar to what we're doing in https://www.notion.so/pl-strflt/IPFS-KPIs-f331f51033cc45979d5ccf50f591ee01 ).
Totally, with our planned ProbeLab website we want to give detailed descriptions of the graphs and their measurement methodology.
I am not clear on the intentions behind tracking these metrics: is it SEO, or just an understanding of how these pages perform in the real world?
Thoughts:
PS: Can someone create a thread summary for a tl;dr?
Thanks for the feedback @whizzzkid! SEO is not a goal here.
If we only want to track performance related to interactions with IPFS, then interactions with the page should be dev's problem
We totally agree that we should not be fiddling around with dev's problems - that's not the goal here. However, we do want to measure performance from the interactions with IPFS. We want to find a metric that would tell us how quickly the website loads from a user's perspective when loading it over IPFS. We are capturing the TTFB, which is an indication, but not the whole story.
I didn't look at "web-vitals" yet, but will do.
I have expanded the scope of this issue to be feedback on the various website-monitoring reports that have come in during 202302 and 202303. I'll consider this done when we have a first draft that I would feel comfortable sharing with other leaders and not needing to be there to answer/explain it. After that we can develop a separate process for how we report ongoing observations, questions, and suggestions.
This concerns https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-7/ipfs/README.md#website-monitoring
First off, thanks for adding this! Good stuff.
A few things that I think would be helpful to document:
What is the configuration of the node monitoring these sites? For example, is it a stock Chromium phantomas node? (I think we should be explicit that Companion (for intercepting IPFS URLs) is not in the mix.)
Is the cache cleared between each run?
I assume "Page Load" is https://github.com/macbre/phantomas/blob/devel/docs/metrics.md#performancetimingpageload . I don't find their docs helpful. There is so much that goes into loading a page. That said, I assume this is the "Load" metric that shows up in one's web inspector (screenshot - red vertical bar). I could imagine it would be better to get DOMContentLoaded (blue vertical) since that isn't as susceptible to the JS processing on the page, I believe (but does capture the network traffic up front fetching JS). (That said, this isn't my expertise and I know there are a lot of intricacies. @lidel will likely have a good suggestion here.) Regardless, I'd love to be more specific than "page load", or at least point people to something like https://developer.mozilla.org/en-US/docs/Web/API/PerformanceNavigationTiming so they have more insight into what that means.
Week over week trends - It would be great to have a mechanism to detect if this radically changes week over week. One idea would be to pick a few sites and a few regions and plot the p50 and p90 of time to first byte (since that shouldn't be susceptible to the content of the page).
Other sites I could imagine adding: