harlan-zw / unlighthouse

Scan your entire site with Google Lighthouse in 2 minutes (on average). Open source, fully configurable with minimal setup.
https://unlighthouse.dev
MIT License

Crawling Website does not use extra headers specified in configuration #223

Closed LeoLeal closed 2 weeks ago

LeoLeal commented 2 weeks ago

Problem Statement

In my project, we are building a monitoring workflow that generates Lighthouse reports for the website, so we can track performance across each release.

The website in this case uses Cloudflare Bot Fight Mode, which has a specific user agent on the allow-list for our synthetic tools. For this reason, I am configuring Unlighthouse to use that user agent when making requests to the website.

Expected Behaviour

Upon setting lighthouseOptions.extraHeaders['User-Agent'] in the configuration, requests to sitemap.xml should use this user agent, so Cloudflare doesn't block the crawler.
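
For reference, the configuration described above looks roughly like this (a sketch; the agent string is a placeholder for our allow-listed value):

export default {
  site: 'xxxx.xx',
  lighthouseOptions: {
    extraHeaders: {
      // placeholder for the user agent allow-listed in Cloudflare
      'User-Agent': 'MySyntheticAgent/1.0'
    }
  }
}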

Current Behaviour

Unlighthouse always sends the 'Unlighthouse' user agent on its axios requests. This is caused by the line below: https://github.com/harlan-zw/unlighthouse/blob/8a79f62c6bd2161f0df8ff9e3ab486a632ebcdd4/packages/core/src/util.ts#L139

When spreading objects, the last spread takes precedence where keys conflict. So in this code my custom User-Agent header is ignored, making the sitemap.xml requests fail with a 403, as the sketch below illustrates.
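
A minimal sketch of the spread-order problem (simplified, not the actual util.ts source; the agent string is a placeholder):

// headers supplied via configuration
const userHeaders = { 'User-Agent': 'MySyntheticAgent/1.0' }

// the configured headers are spread first and the hard-coded default
// comes last, so the later key wins and the custom value is discarded
const headers = {
  ...userHeaders,
  'User-Agent': 'Unlighthouse'
}
// headers['User-Agent'] is now 'Unlighthouse', and Cloudflare returns 403

Reversing the order, { 'User-Agent': 'Unlighthouse', ...userHeaders }, would let configured headers take precedence over the default.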

LeoLeal commented 2 weeks ago

I cannot add Unlighthouse as an allowed user agent in Cloudflare, as it poses a security risk.

harlan-zw commented 2 weeks ago

Hi, please configure this using the extraHeaders config rather than modifying lighthouseOptions directly.

https://unlighthouse.dev/guide/guides/authentication#custom-headers-authentication

LeoLeal commented 2 weeks ago

I have set the configuration as you suggested (I am omitting sensitive information).

My configuration file looks like this:

const config = {
  site: 'xxxx.xx',
  ci: {
    budget: {
      performance: 0.9,
      accessibility: 0.9,
      'best-practices': 0.9,
      seo: 0.9
    }
  },
  scanner: {
    exclude: ['/.*?pdf', '.*/amp', 'en-*', '.*?mp4'],
    samples: 1,
    sitemap: true,
    robotsTxt: false,
    throttle: false
  },
  extraHeaders: {
    'User-Agent': 'XXXX'
  },
  debug: true
}

export default config

My GitHub Action (I'm running it in a GitHub workflow) is still failing with the message:

[warn] [Unlighthouse] Request to site xxxx.xx/ threw an unhandled exception. Please check the URL is valid and not blocking crawlers. Request failed with status code 403

I double-checked, and the user agent allow-listed in Cloudflare matches the one in my configuration. The site just isn't receiving this user agent string.

LeoLeal commented 2 weeks ago

New information.

This doesn't happen when I run locally, only in the GitHub CI environment.

LeoLeal commented 2 weeks ago

In the CI environment, though, I notice the requests for sitemap.xml are made from the Node environment using fetch(). These requests are not sending the user agent header. Is it possible that GitHub CI enforces the user agent header?
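
For illustration, a sketch (not Unlighthouse's actual code) of what I mean: with Node's fetch(), a custom user agent is only sent when it is passed explicitly per request, so headers configured elsewhere do not apply automatically.

// hypothetical sitemap request; unless the header is forwarded here,
// the request goes out without the configured user agent
const res = await fetch('https://xxxx.xx/sitemap.xml', {
  headers: { 'User-Agent': 'MySyntheticAgent/1.0' }
})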

harlan-zw commented 2 weeks ago

Hey, I think I caught the bug. Can you try out v0.13.1?