microsoft / accessibility-insights-service

Accessibility Insights Service
Other
70 stars 50 forks source link

[Bug] Crawler can detect incorrect HTTP status codes when navigations involving redirect chains are partially cached #2413

Closed dbjorge closed 1 year ago

dbjorge commented 1 year ago

Describe the bug

We received a customer report via support for the ADO task (which also repros via the scan CLI package) that in their authentication-based scenario, the crawler would emit the following error when attempting to scan their page:

Web browser failed to open URL(s). Please check error log for details that was saved as C:/temp/aiscan/ai_scan_cli_output/ai-cli-browser-errors-20230331-180401.log
[
  {
    "url": "REDACTED",
    "error": "{\n  errorType: 'HttpErrorCode',\n  statusCode: 302,\n  statusText: 'Found',\n  message: 'Page returned an unsuccessful response code',\n  stack: 'Error\\n' +\n    '    at PageResponseProcessor.getResponseError (C:\\\\repos\\\\accessibility-insights-service\\\\packages\\\\cli\\\\dist\\\\ai-scan-cli.js:8393:40)\\n' +\n    '    at PageNavigationHooks.postNavigation (C:\\\\repos\\\\accessibility-insights-service\\\\packages\\\\cli\\\\dist\\\\ai-scan-cli.js:7807:58)\\n' +\n    '    at PageProcessorBase.postNavigation (C:\\\\repos\\\\accessibility-insights-service\\\\packages\\\\cli\\\\dist\\\\ai-scan-cli.js:5019:48)\\n' +\n    '    at PuppeteerCrawler._executeHooks (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\apify\\\\build\\\\crawlers\\\\basic_crawler.js:592:23)\\n' +\n    '    at PuppeteerCrawler._handleNavigation (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\apify\\\\build\\\\crawlers\\\\browser_crawler.js:405:20)\\n' +\n    '    at runMicrotasks (<anonymous>)\\n' +\n    '    at processTicksAndRejections (node:internal/process/task_queues:96:5)\\n' +\n    '    at async PuppeteerCrawler._handleRequestFunction (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\apify\\\\build\\\\crawlers\\\\browser_crawler.js:343:13)\\n' +\n    '    at async wrap (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\@apify\\\\timeout\\\\index.js:73:27)'\n}"
  }
]

Normally, this error is a symptom of incorrectly configured authentication settings (especially a working service account that lacks permissions for the page under test). However, in this case, we verified that the authentication settings were correct and were able to verify the issue using our own service account. Examining a trace of network traffic suggested that this page was using a standard MSAL-based auth flow:

This is a standard enough auth flow; the page isn't doing anything out of line here.

The issue gets triggered by our postNavigation hook in page-navigation-hooks.ts, which contains a check verifying that the Puppeteer.HTTPResponse we receive from Apify should have a successful HTTP status code (in the 200-299 range). Apify doesn't do a great job of documenting exactly what this response we receive is expected to be in the event that there's a redirect chain while loading the page; I traced through the Apify code and found that it's just passing through the return value of Puppeteer's Page.goto as-is.

Page.goto is documented as returning the last response in the chain for a goto that involves a redirect chain, so what we'd expect to see here is the response representing the cached lookup for the page under test. However, what we actually receive is a spliced response where response.status() is 302, even though the originally cached response was a 200. In analyzing the assorted requests/responses, I noticed that Puppeteer's internal _requestId property is the same for the actual 302 request and the final request that it gives a cached response for; I suspect there might be some sort of object pooling going on that's sharing the response objects incorrectly, causing a leak of the second response's status code.

The two main workarounds I could imagine us considering on the service side are:

I searched through the Puppeteer and Chromium bug trackers and wasn't able to find any existing tracking issue. I didn't dig deeply enough into Puppeteer's implementation to feel confident about whether this is more likely to be an issue in Puppeteer or Chromium. I also haven't attempted to repro this in isolation of the service/apify, though if my understanding of the issue is correct, this should be possible. Before filing an upstream bug, it would be good to do that and verify whether the issue still repros in more recent versions of Puppeteer and/or in Playwright.

Area

To Reproduce

Steps to reproduce the behavior:

  1. Build the repo (I tested as of 012ddf25f4e6b403460ef9f861721bcad35d3f08)
  2. Execute the following command, filling in the REDACTED service account parameters with our usual internal test service account and the URL with the customer URL from internal support thread entitled "RE: Support For Adding Accessibility Insights To CI Deployment ."
    > node --inspect-brk C:\repos\accessibility-insights-service\packages\cli\dist\ai-scan-cli.js --crawl --serviceAccountName 'REDACTED' --serviceAccountPassword 'REDACTED' --authType AAD --url REDACTED --silentMode false --maxUrls 1 --restart
  3. Connect to the node debugger instance in chrome:inspect
  4. Set a breakpoint at postNavigation in packages/scanner-global-library/src/page-navigation-hooks.ts
  5. Start up Fiddler and start tracing
  6. Resume the debugger until it breaks at postNavigation
  7. Observe that the response object at the breakpoint has a status() of 302 but otherwise matches the response expected for the page under test
  8. Compare this to the Fiddler trace and observe that the page under test never returned such a status code

Optionally, set up a page.on('response') handler in preNavigation to log request/response data, and observe that the _responseId properties for the actual 302 and the false 302 are reused

Expected behavior

This page should scan successfully, as it does when caching is disabled in our preNavigation hook

Log files

See description

Screenshots

n/a

Additional context

n/a

dbjorge commented 1 year ago

Definitely seems like it's a puppeteer issue. Here's a self-contained repro (contains both an express server reproducing the required test page redirect behavior and also a self-contained puppeteer/playwright test script demonstrating the issue, see comment at top for instructions).

The issue repros in both Puppeteer 18.1.0 (what service currently uses) and 19.8.3 (latest as of writing). It does not repro in Playwright, but note that this is in part because Playwright's behavior for Page.goto's return value differs (both as-documented and in-practice) in this case. Playwright documents:

In case of multiple redirects, [Page.goto] will resolve with the first non-redirect response.

That is, in the same type of code as we see in the repro, Playwright (intentionally) resolves with the first 200 response. However, even if the repro is modified to prime the cache and then check the result of a direct Page.goto to the second page in line (which responds with a 302 directly), Playwright still doesn't repro the issue.

This is in comparison to Puppeteer, which documents:

In case of multiple redirects, [Page.goto] will resolve with the response of the last redirect.

This wording is a little ambiguous to me about whether the intended behavior is actually to return the 302 here vs the final 200 here, but whichever one is intended, it seems like it's probably a puppeteer bug that you get a different answer based on whether the page is cached or not.

dbjorge commented 1 year ago

Filed upstream issue with Puppeteer at https://github.com/puppeteer/puppeteer/issues/9965. However, it's likely that we'll need to consider a workaround even if they fix it, since Apify locks us into an older version of Puppeteer.

dbjorge commented 1 year ago

This turns out to actually be a bug further upstream in Chromium (https://crbug.com/1340398), but Puppeteer has implemented a workaround on their end and released it in 19.8.4. Playwright also already contains a similar workaround as of this commit, which I think maps to release 1.23.1, though there isn't a patch note for it.

So other options for workarounds on our end now include:

dbjorge commented 1 year ago

Verified that bumping the override of puppeteer to 19.8.4 results in the repro case scanning correctly using the CLI.