sitespeedio / sitespeed.io

sitespeed.io is an open-source tool for comprehensive web performance analysis, enabling you to test, monitor, and optimize your website’s speed using real browsers in various environments.
https://www.sitespeed.io/
MIT License

Questions about metrics and the Compare plugin (Statistical Methods for Regression Analysis) #4241

Open · Edson1337 opened this issue 2 months ago

Edson1337 commented 2 months ago

Your question

Dear sitespeed.io developers, I am a computer science undergraduate student in my final semester. For my final project I chose to do a performance comparison between client-side and server-side rendering: I take the same website in CSR (Client-Side Rendering) and SSR (Server-Side Rendering) and analyze the metric results for both. I am currently using your wonderful tool for the performance evaluation, extracting results for the following metrics: firstContentfulPaint, largestContentfulPaint, cumulativeLayoutShift, pageLoadTime, ttfb, maxPotentialFid and totalBlockingTime.

I'd like to ask if these metrics are suitable for the performance analysis I want to do, or if there are others that would fit better. Also, with the Compare plugin, would I be able to run a test comparing the results of the CSR site with the SSR site?

soulgalore commented 2 months ago

Hi @Edson1337 thanks for reaching out, it sounds interesting. Feel free to reach out on the sitespeed.io Slack channel if you need any help.

I would use some of the visual metrics (we record a video of the screen, then analyse the video and get metrics for when things are painted): FirstVisualChange and other metrics. You can get these metrics from Chrome/Edge/Firefox and Safari on Mac. Some of the metrics you mentioned are Chrome only. You can also measure specific elements (let me know if you can't find the documentation for that), so you can for example measure when the largest H1 is painted on the screen.

The compare plugin should work fine (taking one of them as a baseline). Check out --compare.alternative, where the default is "greater"; you may need to change that depending on how you want to run your test. I use the compare plugin for alerting on regressions, but in your case you want to find all significant changes independent of direction, if I understand it correctly. Let me know how it works out for you, maybe we need to do some tweaks.

Best Peter
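
As a rough illustration of what the alternative setting changes (a sketch only, not the plugin's internal code), here is the same comparison run as a one-sided and a two-sided Mann-Whitney U test with scipy; the millisecond values are invented:

from scipy.stats import mannwhitneyu

# Invented first-contentful-paint samples (ms) for a baseline and a candidate.
baseline = [820, 790, 805, 840, 810, 795, 825, 800, 815, 830]
candidate = [900, 870, 910, 860, 895, 905, 880, 915, 890, 875]

# "greater": is the candidate significantly larger (slower) than the baseline?
_, p_greater = mannwhitneyu(candidate, baseline, alternative="greater")

# "two-sided": is there a significant difference in either direction?
_, p_two_sided = mannwhitneyu(candidate, baseline, alternative="two-sided")

print(f"greater:   p = {p_greater:.4f}")
print(f"two-sided: p = {p_two_sided:.4f}")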

Edson1337 commented 2 months ago

Hi @soulgalore , thanks for your reply. If you prefer, we can continue via Slack, as I have already joined the community there.

I'm currently using Chrome's metrics because I needed at least two browsers that expose the same metrics, and I noticed that Firefox had some different ones. I also noticed that Edge can use mobile device emulation through Chrome's configuration. In addition, I had written a Python script that mapped the metric values from the result JSONs and built my own JSON with the results of the various scenarios I had set up. I then merged the metric results for more than one website into a CSV file, all in order to manipulate the dataframe I created for statistical analysis. However, I recently saw that the Compare plugin already does this.

What I call scenarios are the configurations for testing the web applications, each run for both CSR and SSR. Below are the devices, networks, browsers and metrics they combine:

Devices: Desktop, Moto G4, iPhone 8
Networks: 3G, 4G, Cable
Browsers: Chrome, Edge
Metrics: TBT, Max Potential Fid, FCP, LCP, CLS, Page Load Time

Below is a snippet of the Dataset I was creating:

(screenshot: results_sample)

In the example I only show one application, the CSR version, but there is also the SSR version, with the same metrics in the same scenarios. What I call a benchmark is the application, whether CSR or SSR, on that route.
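
For reference, a minimal sketch of the kind of long-format dataframe such a pipeline could produce; the records and column names are invented placeholders standing in for values pulled from the browsertime run JSONs:

import pandas as pd

# Invented rows; in practice these would be filled from the browsertime.run-*.json files.
records = [
    {"benchmark": "csr_start_page", "device": "Moto G4", "network": "3G",
     "browser": "Chrome", "metric": "FCP", "run": 1, "value_ms": 2140},
    {"benchmark": "csr_start_page", "device": "Moto G4", "network": "3G",
     "browser": "Chrome", "metric": "FCP", "run": 2, "value_ms": 2215},
    {"benchmark": "ssr_start_page", "device": "Moto G4", "network": "3G",
     "browser": "Chrome", "metric": "FCP", "run": 1, "value_ms": 1480},
    {"benchmark": "ssr_start_page", "device": "Moto G4", "network": "3G",
     "browser": "Chrome", "metric": "FCP", "run": 2, "value_ms": 1512},
]

df = pd.DataFrame(records)
# One row per run ("long" format) keeps per-scenario statistics straightforward:
print(df.groupby(["benchmark", "metric"])["value_ms"].describe())
df.to_csv("results_sample.csv", index=False)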

soulgalore commented 2 months ago

I prefer GitHub, but if you have quick questions, Slack works fine too :) Ok, cool, let me know if you need help or if something is hard to understand/strange.

Edson1337 commented 2 months ago

So the metrics I'm currently using aren't very good for seeing a difference in performance between the two versions of the same site? I'm prioritizing the Chrome settings because of the mobile device simulation, and time is a bit tight to also set up the Firefox settings.

Regarding Compare, I just need to define the SSR application as the baseline (in my case I created a CSR version of it) and then compare their results. Each scenario I've created would be a different test for an application, so after the evaluation, do I ask Compare to compare the results using the .har of each one, or do I ask it to generate some format beforehand to do the statistical analysis?

soulgalore commented 2 months ago

So the metrics I'm currently using aren't very good for seeing a difference in performance between the two versions of the same site?

I think they work ok.

Regarding Compare, I

Here is the documentation: https://www.sitespeed.io/documentation/sitespeed.io/compare/ Run one baseline test with X runs (at least 21) and a JSON file will be stored (it has more information than the HAR). Then run your test and point to the baseline test, and it will automatically compare. Not all metrics are used for the statistical analysis though, so you need to check that yours are there. For example, I would only use statistical analysis for timing metrics. CLS, which measures how much content moves around, should give the same value every time for each CSR test versus the SSR (that's my guess).

Edson1337 commented 2 months ago

All right, I appreciate that. I'll run it following the documentation. I just have one question at the moment: is there a reason why there should be at least 21 iterations (runs)?

Edson1337 commented 2 months ago

So I carried out the test, as described in the documentation. First I did the evaluation for the application in CSR, then I did it for the same in SSR.

In both runs I used: --compare.saveBaseline --compare.id start_page

I don't know if these flags were necessary for both executions or just for the first one, which is the baseline; I didn't understand that very well. That said, could you explain?

In addition, there was a metric that gave a significant difference, but I have no idea what it does. Below is a screenshot of the result:

(screenshot: sitespeed_statistical_doubt)

What is this metric?

soulgalore commented 2 months ago

should be at least 21 iterations (runs)?

Sorry, I think I was wrong: at least 20 samples is what I think is recommended (21 if you want to have one run that is the median). You can have smaller sample sizes too, but then I think you need to change the configuration for Mann-Whitney. Check if you can find some good examples out there; it's been some time since I looked into it, so I don't remember exactly.
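
One way to see why small samples are a problem (my own back-of-the-envelope reasoning, not something from the plugin): with n runs per variant, the single most extreme ranking is one of C(2n, n) equally likely ones, so the smallest two-sided p-value an exact Mann-Whitney test can ever produce is 2 / C(2n, n):

from math import comb

for n in (3, 5, 10, 15, 20, 21):
    # Smallest achievable two-sided exact p-value for two groups of n runs each.
    min_p = 2 / comb(2 * n, n)
    print(f"{n:2d} runs per variant -> smallest possible p = {min_p:.2e}")

With only 3 runs the smallest possible p-value is 0.1, so a 0.05 threshold can never be reached; that is one concrete reason for the larger sample sizes recommended above.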

--compare.saveBaseline

This means a JSON file will be stored for that run. You only need to do this the first time in your case, and then that file will be used as the baseline the next time you run your test.

--compare.id start_page

This is the name/id of your test. You can use it if you run many different tests, to make sure you compare against the correct one. For example, if I run tests for 10 different pages, I make sure I give each one a unique id.

cpu-benchmark

You can read about it here: https://www.sitespeed.io/documentation/sitespeed.io/cpu-benchmark/ - it's the time it takes to run a for loop, and it's used to make sure you run on a "stable" machine. You want that metric to stay the same. In your case, do you run the test on a dedicated machine? It could be three things:

  1. Do other things run on the machine that could influence the metric?
  2. The benchmark runs after the test is finished (by default the test is finished at loadEventEnd + 2 seconds), but maybe your page is not finished by then in one of your test cases? If that's the case you can change when to end the test. https://www.sitespeed.io/documentation/sitespeed.io/browsers/#choose-when-to-end-your-test
  3. Maybe there's something wrong when browsertime/sitespeed.io runs the collection of the cpu benchmark, if you can help me reproduce it I can have a look.
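
For intuition, a toy version of the idea behind the CPU benchmark (not browsertime's actual implementation): time a fixed busy loop a few times; on a stable, dedicated machine the numbers should barely move between runs, so drift points at background load or throttling:

import time

def cpu_benchmark(iterations: int = 5_000_000) -> float:
    # Time a fixed amount of work and report it in milliseconds.
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i % 7
    return (time.perf_counter() - start) * 1000

samples = [cpu_benchmark() for _ in range(5)]
print(["%.1f ms" % s for s in samples])
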
Edson1337 commented 2 months ago

Thank you very much for replying.

Let me recap.

For my case, which has 18 scenarios (configurations), each one is a JSON like the example below:

config.json

{
  "browsertime": {
    "iterations": 21,
    "prettyPrint": true,
    "headless": true,
    "connectivity": {
      "profile": "3g"
    },
    "browser": "edge",
    "chrome": {
      "mobileEmulation": {
        "deviceName": "Moto G4"
      }
    }
  },
  "plugins": {
    "add": "analysisstorer",
    "prettyPrint": true
  }
}

For each rendering (CSR or SSR) I will run the 18 scenarios. That said, each scenario is run twice: once for SSR and once for CSR, with the SSR run being the baseline and the CSR run being what I want to compare against it. So when running sitespeed.io with a scenario for the first time, it goes with the 2 flags:

sitespeed.io --config config.json ssr_app_localhost_url --compare.saveBaseline --compare.id start_page

As for what I want to compare, I'll only do it with:

sitespeed.io --config config.json csr_app_localhost_url --compare.id start_page

Will it be like that?
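
If it helps, here is one way the two runs per scenario could be scripted; a sketch with placeholder config names and URLs, reusing the flags from the commands above:

import subprocess

scenario_configs = ["config_motog4_3g_edge.json", "config_desktop_cable_chrome.json"]  # ...18 in total
SSR_URL = "http://localhost:3000/"  # placeholder
CSR_URL = "http://localhost:3001/"  # placeholder

for config in scenario_configs:
    # First run: SSR is the baseline, so the baseline JSON is saved.
    subprocess.run(["sitespeed.io", "--config", config, SSR_URL,
                    "--compare.saveBaseline", "--compare.id", "start_page"], check=True)
    # Second run: CSR, compared against the stored baseline with the same id.
    subprocess.run(["sitespeed.io", "--config", config, CSR_URL,
                    "--compare.id", "start_page"], check=True)

If the different scenarios should not share a baseline, giving each one its own --compare.id (as suggested above for different pages) keeps the comparisons separate.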

Edson1337 commented 2 months ago

Now on to the cpu-benchmark:

So, I ask you: is there anything I can do to ensure consistency in the results, without some other process getting in the way? Or will I only be able to achieve this by running with Docker?

soulgalore commented 2 months ago

That looks ok. For getting stable metrics I think you need to run on a dedicated server or a dedicated Android phone and pin the CPU at a specific speed. There is some information in https://www.sitespeed.io/documentation/sitespeed.io/web-performance-testing-in-practice/#pin-the-cpu-governor on how you can do that. If you run on Android with a Moto G5 or Samsung A51, the CPU pinning is built into the code, so you can do it from the command line as long as your phone is rooted.

Edson1337 commented 2 months ago

Got it, thank you very much. Now I have a question about the comparison data: is it stored in a file, or is it just injected into the results HTML for visualization? For example, when I generated that first dataset, I took the information in "pages/data/" from the per-run JSONs (browsertime.run-1.json, browsertime.run-2.json, and so on), in addition to taking the averages from browsertime.summary-total.json in "data/".

That said, do I need to extract from the HTML?

Or will it be similar to browsertime, where if I put in the configuration below, sitespeed.io returns the JSON?

"plugins": {
   "add": "analysisstorer",
   "prettyPrint": true
}

soulgalore commented 2 months ago

If you need the raw data, there's a couple of different ways you can do it:

  1. Run only Browsertime - then you will get all the data as a JSON that you can read from Python. You will miss out on Mann-Whitney, but you can implement that yourself (see the sketch after this list), it's not so much work.
  2. Create your own plugin - data/result is passed in a message queue so you could get the data/metrics you want (and store it or whatever).
  3. Use the JSON, but I don't remember if you get data per run or only the summary of all runs; I need to check that.
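
The "implement it yourself" part of option 1 really is small. Below is a from-scratch two-sided Mann-Whitney U using the normal approximation (no tie or continuity correction, so only a rough stand-in for what the compare plugin computes), run on invented samples:

from statistics import NormalDist

def mann_whitney_u_two_sided(a, b):
    # Rank the pooled samples (assumes no exact ties), sum the ranks of group a,
    # then use the normal approximation for the U statistic.
    pooled = sorted((value, group) for group, values in enumerate((a, b)) for value in values)
    rank_sum_a = sum(rank for rank, (_, group) in enumerate(pooled, start=1) if group == 0)
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return u, 2 * (1 - NormalDist().cdf(abs(z)))

ssr_fcp = [1480, 1512, 1495, 1470, 1525, 1508, 1490, 1501, 1477, 1519]
csr_fcp = [2140, 2215, 2098, 2180, 2162, 2133, 2190, 2121, 2175, 2149]
print(mann_whitney_u_two_sided(csr_fcp, ssr_fcp))
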
Edson1337 commented 2 months ago

In this case, I can already extract the metric data. What I wanted were the Mann-Whitney and Wilcoxon values obtained from the statistical comparison. That said, if I create my own plugin, would I be able to get these results from compare? Or would I have to go into the HTML to extract these values?

The data from each run is stored in JSONs with the prefix "browsertime.run" in pages/data.

soulgalore commented 2 months ago

If you build your own plugin you will get a message, compare.pageSummary, that holds all the raw data displayed on the result page; you can then cherry-pick the metrics you want from there.

Edson1337 commented 1 month ago

Hi @soulgalore , I apologize for not replying and contacting you again sooner; I had some unforeseen circumstances and I'm now going back to adapting my code to extract the statistical data from compare.pageSummary. However, I realized that the way I was doing it would skew the results of the comparisons by not isolating the applications, tools and even browsers. So I decided to go for the Docker approach, and I wanted to know if I can run my sitespeed.io scenarios for the Edge browser through Docker, because I didn't see in the documentation how to run Edge with Docker. That said, does the Docker execution support Edge? Is it installed in the image?

soulgalore commented 1 month ago

Edge should be in the image. Does it work?

Edson1337 commented 1 month ago

I couldn't find it in the documentation, but Edge is indeed in the image. Thank you very much!

I'm already using it with Docker, and I've isolated my applications in containers too. That benchmark metric no longer appears.

I've made some adjustments to the scenarios (configs). I'm using the SSR applications as the baseline for the comparison, and I'm explicitly setting the configuration for mannwhitneyu:

{
  "compare": {
    "id": "baseline_id",
    "baselinePath": "./baseline_to_statistical",
    "saveBaseline": true,
    "testType": "mannwhitneyu",
    "alternative": "two-sided",
    "mannwhitneyu": {
      "useContinuity": false,
      "method": "auto"
    }
  }
}

For CSR applications I just remove the saveBaseline.

Can you tell me if this configuration makes sense?

soulgalore commented 1 month ago

For where to store the baseline I would follow the example in https://www.sitespeed.io/documentation/sitespeed.io/compare/#run-your-test where you map a volume for the baseline and then use that inside the container. Using ./ makes it harder to follow and understand where it's stored inside the container.

Edson1337 commented 1 month ago

Thanks again for answering. I'm calling sitespeed.io using the container that is created when I give the run command, and I'm not creating any volume to save the results or the baseline, so the results are saved in the default directory where sitespeed.io runs. As for the baseline, my Python code creates the folder, and only with this relative path is it accepted; the baseline is saved in the './baseline_to_statistical' folder it created, and I'm also able to use it for the comparison.

In addition, I have doubts about which testType I should use. I'm generating the results with 'mannwhitneyu' because it works with independent samples, but the apps I'm comparing are an adaptation of the original Server-Side Rendering (SSR) app, whose code I changed to be Client-Side Rendering (CSR). That said, would you advise me to stick with 'mannwhitneyu' or should I use 'wilcoxon'?

soulgalore commented 1 month ago

I haven't tested the difference between the two that much, so I'm not sure. Try both and see if you notice any difference. I think the key issue here is to find the right metrics and user journeys to measure, so that you measure what matters.
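
For what it's worth, the mechanical difference between the two (a sketch, assuming the plugin's test types behave like their scipy namesakes): wilcoxon treats run i of one variant as paired with run i of the other, while mannwhitneyu treats the two sets of runs as independent samples:

from scipy.stats import mannwhitneyu, wilcoxon

ssr = [1480, 1512, 1495, 1470, 1525, 1508, 1490, 1501, 1477, 1519]
csr = [2140, 2215, 2098, 2180, 2162, 2133, 2190, 2121, 2175, 2149]

# Independent samples: only the two distributions of runs matter.
u_stat, p_u = mannwhitneyu(csr, ssr, alternative="two-sided")

# Paired samples: run i of CSR is matched with run i of SSR (requires equal lengths).
w_stat, p_w = wilcoxon(csr, ssr, alternative="two-sided")

print(f"Mann-Whitney U: p = {p_u:.4f}")
print(f"Wilcoxon:       p = {p_w:.4f}")

Whether run i of CSR really corresponds to run i of SSR is the assumption that separates the two, and that is probably the thing to weigh when choosing.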

Edson1337 commented 1 month ago

Hi @soulgalore, thanks for the reply. I'd like to let you know that I'm now testing with testType set to wilcoxon, because I believe it makes more sense for the data I'm testing. So, I have a question: what magnitude thresholds do you use for Cliff's Delta? In some places I saw them given as:

0.147 (small), 0.33 (medium), and 0.474 (large)

Is this the one you're using?

soulgalore commented 1 month ago

I've been using < 0.3 as small, between 0.3 and 0.5 as medium, and larger than 0.5 as large. I've been using it to gauge the effect of a regression, in the sense that if the effect is small it doesn't matter so much.
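
A small sketch of Cliff's delta with those thresholds applied to its absolute value (the samples are invented):

def cliffs_delta(a, b):
    # Fraction of pairs where a wins minus fraction of pairs where b wins.
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))

def magnitude(d):
    d = abs(d)
    if d < 0.3:
        return "small"
    if d <= 0.5:
        return "medium"
    return "large"

csr = [2140, 2215, 2098, 2180, 2162, 2133, 2190, 2121, 2175, 2149]
ssr = [1480, 1512, 1495, 1470, 1525, 1508, 1490, 1501, 1477, 1519]
d = cliffs_delta(csr, ssr)
print(f"delta = {d:.3f} ({magnitude(d)})")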

Edson1337 commented 1 month ago

I see, now I understand better, thank you very much. I wanted to ask what you think of wilcoxon.zeroMethod "wilcox", which discards all zero-difference pairs. Would it be interesting for my purposes, or should I keep the default, "zsplit"?
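
In case it helps to see the difference, a sketch assuming the plugin's zeroMethod maps onto scipy's zero_method of the same name: "wilcox" drops the zero-difference pairs before ranking, while "zsplit" keeps them and splits their ranks between the positive and negative sides. The paired samples below are invented, with two identical pairs:

from scipy.stats import wilcoxon

baseline  = [1480, 1512, 1495, 1470, 1525, 1508, 1490, 1501, 1477, 1519]
candidate = [1480, 1512, 1530, 1462, 1561, 1540, 1515, 1523, 1500, 1549]

for zero_method in ("wilcox", "zsplit"):
    stat, p = wilcoxon(baseline, candidate, zero_method=zero_method)
    print(f"{zero_method}: W = {stat}, p = {p:.4f}")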