harlan-zw / unlighthouse

Scan your entire site with Google Lighthouse in 2 minutes (on average). Open source, fully configurable with minimal setup.
https://unlighthouse.dev
MIT License

Crawling Website does not use extra headers specified in configuration #223

Closed LeoLeal closed 2 weeks ago

LeoLeal commented 2 weeks ago

Problem Statement

In my project, we are building a monitoring workflow that generates Lighthouse reports for the website, so we can track performance across each release.

The website in this case uses Cloudflare Bot Fight Mode, which has a specific user agent on the allow-list for our synthetic tools. For this reason, I am configuring Unlighthouse to use that user agent when making requests to the website.

Expected Behaviour

Upon setting lighthouseOptions.extraHeaders['User-Agent'] in the configuration, requests to sitemap.xml should use this user agent, so Cloudflare doesn't block the crawler.
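
For reference, the configuration described above looks roughly like this (a sketch; the agent string is a placeholder for our allow-listed value):

export default {
  site: 'xxxx.xx',
  lighthouseOptions: {
    extraHeaders: {
      // placeholder for the user agent allow-listed in Cloudflare
      'User-Agent': 'MySyntheticAgent/1.0'
    }
  }
}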

Current Behaviour

Unlighthouse always sends the 'Unlighthouse' user agent on its axios requests. This is caused by the line below: https://github.com/harlan-zw/unlighthouse/blob/8a79f62c6bd2161f0df8ff9e3ab486a632ebcdd4/packages/core/src/util.ts#L139

When spreading objects, the last spread takes precedence where keys conflict. So in this code my custom User-Agent header is ignored, making the sitemap.xml requests fail with a 403, as the sketch below illustrates.
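
A minimal sketch of the spread-order problem (simplified, not the actual util.ts source; the agent string is a placeholder):

// headers supplied via configuration
const userHeaders = { 'User-Agent': 'MySyntheticAgent/1.0' }

// the configured headers are spread first and the hard-coded default
// comes last, so the later key wins and the custom value is discarded
const headers = {
  ...userHeaders,
  'User-Agent': 'Unlighthouse'
}
// headers['User-Agent'] is now 'Unlighthouse', and Cloudflare returns 403

Reversing the order, { 'User-Agent': 'Unlighthouse', ...userHeaders }, would let configured headers take precedence over the default.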

LeoLeal commented 2 weeks ago

I cannot add Unlighthouse as an allowed user agent in Cloudflare, as it poses a security risk.

harlan-zw commented 2 weeks ago

Hi, please configure this using the extraHeaders config rather than modifying lighthouseOptions directly.

https://unlighthouse.dev/guide/guides/authentication#custom-headers-authentication

LeoLeal commented 2 weeks ago

I have set the configuration as you suggested (I am omitting sensitive information).

My configuration file looks like this:

const config = {
  site: 'xxxx.xx',
  ci: {
    budget: {
      performance: 0.9,
      accessibility: 0.9,
      'best-practices': 0.9,
      seo: 0.9
    }
  },
  scanner: {
    exclude: ['/.*?pdf', '.*/amp', 'en-*', '.*?mp4'],
    samples: 1,
    sitemap: true,
    robotsTxt: false,
    throttle: false
  },
  extraHeaders: {
    'User-Agent': 'XXXX'
  },
  debug: true
}

export default config

My GitHub Action (I'm running it in a GitHub workflow) is still failing with the message:

[warn] [Unlighthouse] Request to site xxxx.xx/ threw an unhandled exception. Please check the URL is valid and not blocking crawlers. Request failed with status code 403

I double-checked, and the user agent allow-listed in Cloudflare matches the one in my configuration. The site just isn't receiving this user agent string.

LeoLeal commented 2 weeks ago

New information.

This doesn't happen when I run locally, only in the GitHub CI environment.

LeoLeal commented 2 weeks ago

In the CI environment, though, I notice the requests for sitemap.xml are made from the Node environment using fetch(). These requests are not sending the user agent header. Is it possible that GitHub CI enforces the user agent header?
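
For illustration, a sketch (not Unlighthouse's actual code) of what I mean: with Node's fetch(), a custom user agent is only sent when it is passed explicitly per request, so headers configured elsewhere do not apply automatically.

// hypothetical sitemap request; unless the header is forwarded here,
// the request goes out without the configured user agent
const res = await fetch('https://xxxx.xx/sitemap.xml', {
  headers: { 'User-Agent': 'MySyntheticAgent/1.0' }
})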

harlan-zw commented 2 weeks ago

Hey, I think I caught the bug. Can you try out v0.13.1?