[Bug] Crawler can detect incorrect HTTP status codes when navigations involving redirect chains are partially cached

dbjorge commented 1 year ago

Describe the bug

We received a customer report via support for the ADO task (which also repros via the scan CLI package) that in their authentication-based scenario, the crawler would emit the following error when attempting to scan their page:

Web browser failed to open URL(s). Please check error log for details that was saved as C:/temp/aiscan/ai_scan_cli_output/ai-cli-browser-errors-20230331-180401.log

[
  {
    "url": "REDACTED",
    "error": "{\n  errorType: 'HttpErrorCode',\n  statusCode: 302,\n  statusText: 'Found',\n  message: 'Page returned an unsuccessful response code',\n  stack: 'Error\\n' +\n    '    at PageResponseProcessor.getResponseError (C:\\\\repos\\\\accessibility-insights-service\\\\packages\\\\cli\\\\dist\\\\ai-scan-cli.js:8393:40)\\n' +\n    '    at PageNavigationHooks.postNavigation (C:\\\\repos\\\\accessibility-insights-service\\\\packages\\\\cli\\\\dist\\\\ai-scan-cli.js:7807:58)\\n' +\n    '    at PageProcessorBase.postNavigation (C:\\\\repos\\\\accessibility-insights-service\\\\packages\\\\cli\\\\dist\\\\ai-scan-cli.js:5019:48)\\n' +\n    '    at PuppeteerCrawler._executeHooks (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\apify\\\\build\\\\crawlers\\\\basic_crawler.js:592:23)\\n' +\n    '    at PuppeteerCrawler._handleNavigation (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\apify\\\\build\\\\crawlers\\\\browser_crawler.js:405:20)\\n' +\n    '    at runMicrotasks (<anonymous>)\\n' +\n    '    at processTicksAndRejections (node:internal/process/task_queues:96:5)\\n' +\n    '    at async PuppeteerCrawler._handleRequestFunction (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\apify\\\\build\\\\crawlers\\\\browser_crawler.js:343:13)\\n' +\n    '    at async wrap (C:\\\\repos\\\\accessibility-insights-service\\\\node_modules\\\\@apify\\\\timeout\\\\index.js:73:27)'\n}"
  }
]

Normally, this error is a symptom of incorrectly configured authentication settings (especially a working service account that lacks permissions for the page under test). However, in this case, we verified that the authentication settings were correct and were able to verify the issue using our own service account. Examining a trace of network traffic suggested that this page was using a standard MSAL-based auth flow:

Browser requests the page under test
Page under test responds 200
The browser caches this initial response for the page under test
The MSAL script on the page initiates an auth flow by performing a JS redirect pointing window.location to https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize?... with a redirect_uri query parameter pointing exactly back to the page under test
Having previously authenticated at browser launch, this receives an immediate response without user interaction of a 302 with Location pointing back to the page under test (plus an oauth code in the URI fragment)
The browser uses the cached version of the page under test
All further main frame navigation activity is script-driven and doesn't involve further request/response pairs - there is some further URL manipulation, but it's based on the history.pushstate API, not page navigations

This is a standard enough auth flow; the page isn't doing anything out of line here.

The issue gets triggered by our postNavigation hook in page-navigation-hooks.ts, which contains a check verifying that the Puppeteer.HTTPResponse we receive from Apify should have a successful HTTP status code (in the 200-299 range). Apify doesn't do a great job of documenting exactly what this response we receive is expected to be in the event that there's a redirect chain while loading the page; I traced through the Apify code and found that it's just passing through the return value of Puppeteer's Page.goto as-is.

Page.goto is documented as returning the last response in the chain for a goto that involves a redirect chain, so what we'd expect to see here is the response representing the cached lookup for the page under test. However, what we actually receive is a spliced response where response.status() is 302, even though the originally cached response was a 200. In analyzing the assorted requests/responses, I noticed that Puppeteer's internal _requestId property is the same for the actual 302 request and the final request that it gives a cached response for; I suspect there might be some sort of object pooling going on that's sharing the response objects incorrectly, causing a leak of the second response's status code.

The two main workarounds I could imagine us considering on the service side are:

Updating PageResponseProcessor.getResponseError to treat any responses with response.fromCache() === true as having an ok() status would work around this specific case, but it's a bit questionable in general because it's not guaranteed that all cached responses are successful - for example, Chromium will happily cache a 404 response whose response headers allow it.
Disabling caching (eg, by setting page.setCacheEnabled(false); in our preNavigation hook) prevents the issue from occuring and causes the page to be scanned successfully overall, but risks some performance impact.

I searched through the Puppeteer and Chromium bug trackers and wasn't able to find any existing tracking issue. I didn't dig deeply enough into Puppeteer's implementation to feel confident about whether this is more likely to be an issue in Puppeteer or Chromium. I also haven't attempted to repro this in isolation of the service/apify, though if my understanding of the issue is correct, this should be possible. Before filing an upstream bug, it would be good to do that and verify whether the issue still repros in more recent versions of Puppeteer and/or in Playwright.

Area

[ ] Install script
[ ] Update script
[x] Service
[x] Scan CLI

To Reproduce

Steps to reproduce the behavior:

Build the repo (I tested as of 012ddf25f4e6b403460ef9f861721bcad35d3f08)
Execute the following command, filling in the REDACTED service account parameters with our usual internal test service account and the URL with the customer URL from internal support thread entitled "RE: Support For Adding Accessibility Insights To CI Deployment ."
```
> node --inspect-brk C:\repos\accessibility-insights-service\packages\cli\dist\ai-scan-cli.js --crawl --serviceAccountName 'REDACTED' --serviceAccountPassword 'REDACTED' --authType AAD --url REDACTED --silentMode false --maxUrls 1 --restart
```
Connect to the node debugger instance in chrome:inspect
Set a breakpoint at postNavigation in packages/scanner-global-library/src/page-navigation-hooks.ts
Start up Fiddler and start tracing
Resume the debugger until it breaks at postNavigation
Observe that the response object at the breakpoint has a status() of 302 but otherwise matches the response expected for the page under test
Compare this to the Fiddler trace and observe that the page under test never returned such a status code

Optionally, set up a page.on('response') handler in preNavigation to log request/response data, and observe that the _responseId properties for the actual 302 and the false 302 are reused

Expected behavior

This page should scan successfully, as it does when caching is disabled in our preNavigation hook

Log files

See description

Screenshots

n/a

Additional context

n/a

dbjorge commented 1 year ago

Definitely seems like it's a puppeteer issue. Here's a self-contained repro (contains both an express server reproducing the required test page redirect behavior and also a self-contained puppeteer/playwright test script demonstrating the issue, see comment at top for instructions).

The issue repros in both Puppeteer 18.1.0 (what service currently uses) and 19.8.3 (latest as of writing). It does not repro in Playwright, but note that this is in part because Playwright's behavior for Page.goto's return value differs (both as-documented and in-practice) in this case. Playwright documents:

In case of multiple redirects, [Page.goto] will resolve with the first non-redirect response.

That is, in the same type of code as we see in the repro, Playwright (intentionally) resolves with the first 200 response. However, even if the repro is modified to prime the cache and then check the result of a direct Page.goto to the second page in line (which responds with a 302 directly), Playwright still doesn't repro the issue.

This is in comparison to Puppeteer, which documents:

In case of multiple redirects, [Page.goto] will resolve with the response of the last redirect.

This wording is a little ambiguous to me about whether the intended behavior is actually to return the 302 here vs the final 200 here, but whichever one is intended, it seems like it's probably a puppeteer bug that you get a different answer based on whether the page is cached or not.

dbjorge commented 1 year ago

Filed upstream issue with Puppeteer at https://github.com/puppeteer/puppeteer/issues/9965. However, it's likely that we'll need to consider a workaround even if they fix it, since Apify locks us into an older version of Puppeteer.

dbjorge commented 1 year ago

This turns out to actually be a bug further upstream in Chromium (https://crbug.com/1340398), but Puppeteer has implemented a workaround on their end and released it in 19.8.4. Playwright also already contains a similar workaround as of this commit, which I think maps to release 1.23.1, though there isn't a patch note for it.

So other options for workarounds on our end now include:

Update our override of puppeteer to use 19.8.4 instead of the current 18.1.0 (both versions are well above what our current version of apify claims support for).
Use playwright instead of puppeteer
Update from apify v2 to crawlee (the renamed apify v3), whose puppeteer engine currently depends on "puppeteer": "<= 19.x" (which would allow for 19.8.4 without a resolution)
Add a puppeteer patch that implements the NetworkManager.ts update contained in this PR
- Note that this path would require an update to the ADO task to enable the use of patches in its runtime dependency installation step

dbjorge commented 1 year ago

Verified that bumping the override of puppeteer to 19.8.4 results in the repro case scanning correctly using the CLI.

microsoft / accessibility-insights-service