BigLep opened this issue 1 year ago
Agree with @BigLep, we should add 1 paragraph that defines what we mean by "The time it took to fully load the page.".
I know @guseggert was looking into doing way more than measuring how long the HTML took to transfer, and more than DOMContentLoaded: he was waiting for DOMContentLoaded AND also for all images on a page to load as well. Not sure what ended up being the final metric, but we should document it to remove guesswork/confusion.
Sidenote: If measuring images is too complex/unreliable, DOMContentLoaded is what people usually measure, because it is when the page is in a state that allows for user interaction (and images load async).
+ small nits:
https://en.wikipedia-on-ipfs.org/wiki/
to reduce bias

Great, thanks for the feedback!
- What is the configuration of the node monitoring these sites? For example, is it a stock Chromium phantomas node? (I think we should be explicit that Companion (for intercepting IPFS URLs) is not in the mix.)
We run website measurements every six hours. Each of these measurements we consider a measurement run.
In each run, we start six Kubo nodes around the world in different AWS regions. As soon as their API is reachable, we wait for 10s to let the node settle, and then request these websites one by one. We request each website three times. Then we wait 10 minutes and request the set of websites again. We thought that this may simulate a warm(er) Kubo node that has a "better" routing table. The graphs in the weekly report don't yet distinguish between cold and warm nodes.
Actually, we're not only starting six Kubo nodes but twelve because we want to test the most recent stable Kubo version (v0.18.0) and the most popular according to our network crawls (v0.17.0 - up until last week, now it's also v0.18.0).
Another detail: in each run, we are also requesting every website via plain HTTP without going through the local Kubo node. This means we could compare both protocols.
We can easily change all of the above parameters (4x a day, 6 regions, settle times, 3 retries, 2 Kubo versions, etc.).
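To make the run structure concrete, here is a hypothetical sketch of how one measurement run expands into individual website requests. The function name, region list, and website list below are illustrative assumptions, not the actual measurement code:

```javascript
// Illustrative sketch of one measurement run (not the real code):
// for every AWS region and Kubo version, each website is requested
// `retries` times while the node is cold, and once more after the
// node has had 10 minutes to warm up its routing table.
function planRun(regions, kuboVersions, websites, retries) {
  const requests = [];
  for (const region of regions) {
    for (const version of kuboVersions) {
      for (const phase of ['cold', 'warm']) {
        for (const website of websites) {
          for (let attempt = 1; attempt <= retries; attempt++) {
            requests.push({ region, version, phase, website, attempt });
          }
        }
      }
    }
  }
  return requests;
}

// 6 regions x 2 Kubo versions x 2 phases x 1 website x 3 retries = 72 requests
const plan = planRun(
  ['eu-central-1', 'us-east-1', 'us-west-1', 'ap-southeast-1', 'sa-east-1', 'af-south-1'],
  ['v0.17.0', 'v0.18.0'],
  ['filecoin.io'],
  3
);
console.log(plan.length); // 72
```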
We are running the Kubo nodes on AWS's t3.medium instances and request the websites through Kubo's local gateway. E.g., a website request looks like this: http://127.0.0.1:8080/ipns/filecoin.io. We thought that this would come close to Companion's performance. However, IIUC, if a user browses to, e.g., https://filecoin.io, the x-ipfs-path header gets intercepted and only the remaining resources will be loaded via IPFS. I think, with our experiments, we're simulating the case where a user would directly browse to ipns://filecoin.io.
- Is the cache cleared between each run?
Yes, between each retry we run ipfs repo gc. However, the first website request is likely slower than the subsequent ones because Bitswap will discover the provider immediately in later requests.
- I assume "Page Load" is https://github.com/macbre/phantomas/blob/devel/docs/metrics.md#performancetimingpageload
That's right :+1:
That said, I assume this is the "Load" metric that shows up in one's web inspector (screenshot - red vertical bar).
Not sure which screenshot you're referring to :/
Regardless, I'd love to be more specific than "page load"
Totally agree! Just to clarify what the performanceTimingPageLoad metric measures. I just looked it up (source), and it measures the difference between:

- loadEventStart - representing the time immediately before the current document's load event handler starts. The load event is fired when the whole page has loaded, including all dependent resources such as stylesheets, scripts, iframes, and images. This is in contrast to DOMContentLoaded, which is fired as soon as the page DOM has been loaded, without waiting for resources to finish loading.
- navigationStart - a deprecated feature "representing the moment [...] right after the prompt for unload terminates on the previous document"

@lidel phantomas already measures DOMContentLoaded. So I'll just replace the performanceTimingPageLoad metric with the DOMContentLoaded one?
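A minimal sketch of the two metrics being compared, computed from legacy `performance.timing`-style epoch timestamps. The sample numbers are made up for illustration:

```javascript
// pageLoad: what phantomas' performanceTimingPageLoad reports,
// i.e. loadEventStart minus navigationStart (both epoch milliseconds
// in the legacy performance.timing API).
function pageLoad(timing) {
  return timing.loadEventStart - timing.navigationStart;
}

// domContentLoaded relative to navigation start.
function domContentLoaded(timing) {
  return timing.domContentLoadedEventStart - timing.navigationStart;
}

// Mocked timing object with illustrative values.
const timing = {
  navigationStart: 1678000000000,
  domContentLoadedEventStart: 1678000000800,
  loadEventStart: 1678000002500,
};

console.log(pageLoad(timing));         // 2500
console.log(domContentLoaded(timing)); // 800
```

As the numbers show, the load-based metric is strictly later than the DOMContentLoaded-based one, since it also waits for stylesheets, scripts, iframes, and images.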
Week over week trends - It would be great to have a mechanism to detect if this radically changes week over week. One idea would be to pick a few sites and a few regions and plot the p50 and p90 of time to first byte (since that shouldn't be susceptible to the content of the page).
Yup, I also wanted to have that but couldn't come up with a nice visualization that nicely captures all dimensions (latency, datetime, region, website, data points). What you're suggesting is actually a nice trade-off I think.
Suggestion: take ipfs.tech + en.wikipedia-on-ipfs.org/wiki/ p50 + p90 timeToFirstByte latencies from eu-central-1 and report these numbers every week.
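The weekly aggregation for that suggestion is straightforward; a minimal sketch, using the simple nearest-rank percentile definition and illustrative sample values:

```javascript
// Nearest-rank percentile: sort the samples, take the value at
// rank ceil(p/100 * n). Samples are timeToFirstByte latencies in ms.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical TTFB samples for one site in one region over a week.
const ttfbSamples = [120, 340, 95, 180, 210, 400, 160, 250, 130, 310];

console.log(percentile(ttfbSamples, 50)); // 180 (p50)
console.log(percentile(ttfbSamples, 90)); // 340 (p90)
```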
Other sites I could imagine adding:
Updated list of websites: https://github.com/protocol/probelab-infra/pull/17
@dennis-tra : again, this is awesome, and thanks for the great reply.
A few followups...
However, IIUC if a user browses to, e.g., https://filecoin.io, the x-ipfs-path header gets intercepted and only the remaining resources will be loaded via IPFS. I think, with our experiments, we're simulating the case where a user would directly browse to ipns://filecoin.io.
Yeah, you're right that there are some differences with Companion vs. hitting the local Kubo HTTP gateway, but I think your approach is good/clean/simple. Companion learns over time which domains are available via non-HTTP, so future accesses to those domains are re-routed to the configured IPFS HTTP gateway. (I believe there's more nuance here, but I think what you're capturing here is good/fine.)
http://127.0.0.1:8080/ipns/ipfs.tech, but if the top-level ipfs.tech page has <img src="https://libp2p.io/logo.png"/>, that will be fetched via HTTP since Phantomas doesn't have the intelligence to know that libp2p.io assets can be fetched via non-HTTP. I think this is fine, but I think this caveat should be documented.

Doh, I added the screenshot to my original post - thanks.
load vs DOMContentLoaded - I've got a few things to say around this but am not necessarily laying out the points in the best way.
I like the idea of what you're suggesting but would add a bit:
Sites
Regions:
Modes: (I added this, and maybe we do one graph for HTTP and one for non-http).
Metric
The reason I think we want to plot HTTP vs non-http is:
We have lots of good/important notes about this topic at this point. Where is the durable place it's going to live? While I do think we need some explanatory text in the report, I'm also fine for it to link to a more thorough document for more info. You don't need to block on my review, but please include me when this is available for review.
Again, good stuff, and thanks!
I know @guseggert was looking into doing way more than measuring how long the HTML took to transfer, and more than DOMContentLoaded: he was waiting for DOMContentLoaded AND also for all images on a page to load as well. Not sure what ended up being the final metric, but we should document it to remove guesswork/confusion.
I think Dennis has described what event we're looking at, just wanted to add here that I gave up on this because it was too complicated and seemed unlikely to be maintainable by us (and since it has to infer images above-the-fold it will probably require some maintenance to keep up with browser changes).
(sorry didn't mean to close)
I have expanded the scope of this issue to be feedback on the various website-monitoring reports that have come in during 202302 and 202303. I'll consider this done when we have a first draft that I would feel comfortable sharing with other leaders and not needing to be there to answer/explain it. After that we can develop a separate process for how we report ongoing observations, questions, and suggestions.
Fresh observations from looking at week 9
In practice, here is what I'm suggesting:
Feel free to disagree and I'm up for discussing other options.
Also, what are the thoughts capturing all the meta-details? Per before, we have lots of good/important notes about this topic. I want to make sure we have a durable place for it. This could be in the report itself or listed somewhere else.
I think we also need a way where we can capture notes/investigations that were done. I don't think slack threads will scale since it won't make it easy to see prior answers/investigations. Ideally there is a self-surface way for anyone to see what investigations have done for a given week, or what our callouts/observations are for that week. I think that likely means having an accompanying github issue, page, or Notion doc for each weekly report that we link to from the report. Some content will carry forward between reports and that's ok. We also want to make it self-service for someone to know where they ask questions. (I think it's ideally with a comment in the linked doc.). I'm happy to discuss this more.
Hi @BigLep,
sorry for the late reply. I have been working on improving our measurement setup. I spent some time last week putting our website measurement infrastructure on a new foundation and I'm much more confident about the data we are gathering now. I plan to document the setup in this repository (note that we have created a ProbeLab GitHub organization, so this and other private repositories will eventually be migrated to that org). I have also explained my reasoning for the new setup here.
Because of this new setup, we don't have enough data to report in this week's report.
Some notes regarding the metrics we want to report: Further up in this issue, we focused on the TTFB and domContentLoaded metrics. While working on our website monitoring infrastructure last week, I read up on how to measure website performance and came across this list:
https://developer.mozilla.org/en-US/docs/Learn/Performance/Perceived_performance
To quote the website:
Performance metrics
There is no single metric or test that can be run on a site to evaluate how a user "feels". However, there are a number of metrics that can be "helpful indicators":
First paint The time to start of first paint operation. Note that this change may not be visible; it can be a simple background color update or something even less noticeable.
First Contentful Paint (FCP) The time until first significant rendering (e.g. of text, foreground or background image, canvas or SVG, etc.). Note that this content is not necessarily useful or meaningful.
First Meaningful Paint (FMP) The time at which useful content is rendered to the screen.
Largest Contentful Paint (LCP) The render time of the largest content element visible in the viewport.
Speed index Measures the average time for pixels on the visible screen to be painted.
Time to interactive Time until the UI is available for user interaction (i.e. the last long task of the load process finishes).
I think the relevant metrics on this list for us are First Contentful Paint, Largest Contentful Paint, and Time to interactive. First Meaningful Paint is deprecated (you can see that if you follow the link) and they recommend: "[...] consider using the LargestContentfulPaint API instead.". First paint would include changes that "may not be visible", so I'm not particularly fond of this metric.
Speed index seems to be very much website-specific. By that, I mean that the network wouldn't play a role in this metric; we would only measure the performance of the website itself. I would argue that this is not something we want.
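For reference, a browser exposes these paint metrics as performance entries. A minimal sketch of how the final LCP value is derived from the buffered largest-contentful-paint entries (the entry objects below are mocked, since there is no PerformanceObserver outside a browser):

```javascript
// In a real browser the entries would be collected via:
//   new PerformanceObserver(cb).observe({ type: 'largest-contentful-paint', buffered: true })
// The browser emits a new entry each time a larger element renders;
// the page's LCP is the *last* entry observed.
function finalLcp(entries) {
  if (entries.length === 0) return undefined;
  const last = entries[entries.length - 1];
  // Per the spec, renderTime can be 0 (e.g. cross-origin images without
  // Timing-Allow-Origin); fall back to loadTime in that case.
  return last.renderTime || last.loadTime;
}

// Mocked entries with illustrative millisecond values.
const mockEntries = [
  { renderTime: 350, loadTime: 340 },   // headline text painted first
  { renderTime: 1200, loadTime: 1150 }, // hero image painted later
];

console.log(finalLcp(mockEntries)); // 1200
```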
Besides the above metrics, we should still measure timeToFirstByte. According to https://web.dev/ttfb/, the metric would be the time difference between startTime and responseStart:

In the above graph, you can also see the two timestamps domContentLoadedEventStart and domContentLoadedEventEnd. So I would think that the domContentLoaded metric would just be the difference between the two. However, this seems to only account for the processing time of the HTML (+ deferred JS scripts). We could instead define domContentLoaded as the time difference between startTime and domContentLoadedEventEnd.
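A sketch of these two definitions based on the modern PerformanceNavigationTiming API, where all values are relative to startTime (which is 0 for the navigation entry). In a browser the entry would come from `performance.getEntriesByType('navigation')[0]`; here it is mocked with made-up millisecond values:

```javascript
// TTFB per web.dev: responseStart relative to startTime.
function timeToFirstByte(nav) {
  return nav.responseStart - nav.startTime;
}

// domContentLoaded as proposed above: domContentLoadedEventEnd
// relative to startTime, not just the event's own duration.
function domContentLoadedFromStart(nav) {
  return nav.domContentLoadedEventEnd - nav.startTime;
}

// Mocked PerformanceNavigationTiming-like entry.
const nav = {
  startTime: 0,
  responseStart: 230,
  domContentLoadedEventStart: 780,
  domContentLoadedEventEnd: 810,
};

console.log(timeToFirstByte(nav));          // 230
console.log(domContentLoadedFromStart(nav)); // 810
```

Note the contrast: domContentLoadedEventEnd - domContentLoadedEventStart would only be 30 ms here (HTML processing + deferred scripts), while measuring from startTime captures the network time as well.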
The revised measurement setup currently gathers the following data:

- timeToFirstByte - as defined above
- firstContentfulPaint
- largestContentfulPaint
- the PerformanceNavigationTiming object

We could also include:

- Time to interactive
- domContentLoaded - as defined above

I believe we won't be able to report all of the above metrics, so if I had the choice between only two, I would choose timeToFirstByte and largestContentfulPaint.
Just want to note that the ask for week-over-week graphs was not unheard! I'm also working on this and will come back here when I have news. I'll try to address all your remarks from the last comment.
Also, I don't have a better place to discuss these things right now. Instead of GH, we could use Notion or discuss.ipfs.io. I'll chat with @yiannisbot and @iand and come back here with a proposal.
Explained the new website measurement methodology here: https://github.com/dennis-tra/tiros
Thanks @dennis-tra for the update.
Good call on getting good underpinning.
Thanks for sharing https://github.com/dennis-tra/tiros
I like all the defenses to prevent caching.
The only issue I'm wondering there is about: what advantage is Kubo getting for successive runs because it is already connected with the providers of all the website content. I assume it's effectively getting to bypass content routing via DHT/IPNI and rely on Bitswap's content discovery.
Agreed on TTFB since that does measure HTTP vs. non-HTTP and is not dependent on the site's initial HTML payload. It is comparable across sites.
Per before, "I don't think with these reports that we want to get into the business of helping site creators realize that their sites could be better optimized or are slow compared to other properties." I'm a bit worried we're heading into these waters by talking about firstContentfulPaint, largestContentfulPaint, etc. I think we should message our rationale with something like "We're including this metric because it helps a site owner see the impact of using IPFS protocols over HTTP before their site is interactive. HTTP vs. IPFS protocols have some impact on this metric, but they aren't the only culprit. There have been many tools developed over the last decades of the modern web to help." With a message like that, I guess it makes me think we should also prefer Time to interactive rather than largestContentfulPaint. (I can see arguments either way, but I would give preference to interactivity rather than rendering of largestContentfulPaint because of how annoying the user experience is when you can't interact with the page to scroll or click. Anyways, I'll defer here.)
I like the idea of having a discuss post per week (e.g., https://discuss.ipfs.tech/t/ipfs-measurement-report-calendar-week-10-2023/16114/2 ). A couple of things:
The only issue I'm wondering there is about: what advantage is Kubo getting for successive runs because it is already connected with the providers of all the website content. I assume it's effectively getting to bypass content routing via DHT/IPNI and rely on Bitswap's content discovery.
We (ProbeLab) also discussed this previously and also assume that a subsequent request will likely be served directly via Bitswap. Since we're tracking if it's the first, second, third, etc., request in the Kubo node's lifetime, we could produce a graph that only considers the first requests. The sample size would be very small, though. Alternatively, we could actively disconnect from the content provider after each request. However, I don't think Kubo gives us information from which peer it fetched the data. If that were the case, we could certainly do that. Then we'd always measure the worst-case performance where we'd need to reach out to the DHT (although we could still, by chance, be connected to another providing peer).
On another note, we're complementing these website measurements with DHT performance measurements, where we directly measure the publication and lookup latencies.
I don't think with these reports that we want to get into the business of helping site creators realize that their sites could be better optimized or are slow compared to other properties.
I also think so, and that's exactly why I argued against the Speed index metric. firstContentfulPaint and largestContentfulPaint will include latencies for subsequent requests from, e.g., script, img, and link tags, which may then also be served by Kubo (though I saw that some websites make cross-origin requests, which would then not be served by Kubo). That's why I think these two metrics will depend on Kubo's performance and are worth measuring - but it's certainly muddy.
However, that's also the case for the TTI metric. From the docs:
Time to Interactive (TTI) is a non-standardized web performance 'progress' metric defined as the point in time when the last Long Task finished and was followed by 5 seconds of network and main thread inactivity.
And a "Long Task":
Long tasks that block the main thread for 50ms or more cause, among other issues:
- Delayed Time to interactive (TTI).
- High/variable input latency.
- High/variable event handling latency.
- Janky animations and scrolling.
A long task is any uninterrupted period where the main UI thread is busy for 50ms or longer. Common examples include:
- Long-running event handlers.
- Expensive reflows and other re-renders.
- Work the browser does between different turns of the event loop that exceeds 50 ms.
Especially the list of common examples sounds very website specific to me. I could imagine a SPA spending too much time on the main thread rendering the page. This wouldn't have something to do with Kubo's performance IMO.
I think measuring the TTI definitely won't hurt, so I'll try to track it regardless of whether we will eventually report it.
We should maybe make this its own Discuss category so someone can subscribe to the category for notifications.
Totally! We could also rename "Testing & Experiments" to something like "Measurements" (just to limit the number of categories). Who owns the forum? I believe I don't have the necessary permissions to create new categories.
(nit) but I think it would be good to get an ISO date into the title: "IPFS Measurement Report - 2023-03-12"
Seems like I can't edit the post anymore :/ will do for the next one.
Thanks @dennis-tra:
For now I created a measurements subcategory: https://discuss.ipfs.tech/c/testing-and-experiments/measurements/39
I moved and renamed https://discuss.ipfs.tech/t/ipfs-measurement-report-2023-week-10-2023-03-12/16114
I also gave you moderation access.
Apologies that I haven't come around to addressing your feedback from here yet. Just want to signal that it's not forgotten.
It does hit on that we need to have a way for someone to find out all the persistent callouts/caveats about a given graph (i.e., someone should be able to see our hypothesis that runs 2+ for a given site will be more performant because of existing connections). They don't have to be "in your face", but it should be self-service for someone to learn all the caveats (similar to what we're doing in https://www.notion.so/pl-strflt/IPFS-KPIs-f331f51033cc45979d5ccf50f591ee01 ).
Totally, with our planned ProbeLab website we want to give detailed descriptions of the graphs and their measurement methodology.
I am not clear on the intentions behind tracking these metrics: is it SEO, or just an understanding of how these pages perform in the real world?
Thoughts:
PS: Can someone create a thread summary for a tl;dr?
Thanks for the feedback @whizzzkid! SEO is not a goal here.
If we only want to track performance related to interactions with IPFS, then interactions with the page should be dev's problem
We totally agree that we should not be fiddling around with dev's problems - that's not the goal here. However, we do want to measure performance from the interactions with IPFS. We want to find a metric that would tell us how quickly the website loads from a user's perspective when loading it over IPFS. We are capturing the TTFB, which is an indication, but not the whole story.
I didn't look at "web-vitals" yet, but will do.
I have expanded the scope of this issue to be feedback on the various website-monitoring reports that have come in during 202302 and 202303. I'll consider this done when we have a first draft that I would feel comfortable sharing with other leaders and not needing to be there to answer/explain it. After that we can develop a separate process for how we report ongoing observations, questions, and suggestions.
This concerns https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-7/ipfs/README.md#website-monitoring
First off, thanks for adding this! Good stuff.
A few things that I think would be helpful to document:
What is the configuration of the node monitoring these sites? For example, is it a stock Chromium phantomas node? (I think we should be explicit that Companion (for intercepting IPFS URLs) is not in the mix.)
Is the cache cleared between each run?
I assume "Page Load" is https://github.com/macbre/phantomas/blob/devel/docs/metrics.md#performancetimingpageload . I don't find their docs helpful. There is so much that goes into loading a page. That said, I assume this is the "Load" metric that shows up in one's web inspector (screenshot - red vertical bar). I could imagine it would be better to get DOMContentLoaded (blue vertical) since that isn't as susceptible to the JS processing on the page, I believe (but does capture the network traffic up front fetching JS). (That said, this isn't my expertise and I know there are a lot of intricacies. @lidel will likely have a good suggestion here.) Regardless, I'd love to be more specific than "page load", or at least point people to something like https://developer.mozilla.org/en-US/docs/Web/API/PerformanceNavigationTiming so they have more insight into what that means.
Week over week trends - It would be great to have a mechanism to detect if this radically changes week over week. One idea would be to pick a few sites and a few regions and plot the p50 and p90 of time to first byte (since that shouldn't be susceptible to the content of the page).
Other sites I could imagine adding: