Closed dbjorge closed 1 year ago
Definitely seems like it's a puppeteer issue. Here's a self-contained repro (contains both an express server reproducing the required test page redirect behavior and also a self-contained puppeteer/playwright test script demonstrating the issue, see comment at top for instructions).
The issue repros in both Puppeteer 18.1.0 (what service currently uses) and 19.8.3 (latest as of writing). It does not repro in Playwright, but note that this is in part because Playwright's behavior for Page.goto
's return value differs (both as-documented and in-practice) in this case. Playwright documents:
In case of multiple redirects, [Page.goto] will resolve with the first non-redirect response.
That is, in the same type of code as we see in the repro, Playwright (intentionally) resolves with the first 200 response. However, even if the repro is modified to prime the cache and then check the result of a direct Page.goto
to the second page in line (which responds with a 302 directly), Playwright still doesn't repro the issue.
This is in comparison to Puppeteer, which documents:
In case of multiple redirects, [Page.goto] will resolve with the response of the last redirect.
This wording is a little ambiguous to me about whether the intended behavior is actually to return the 302 here vs the final 200 here, but whichever one is intended, it seems like it's probably a puppeteer bug that you get a different answer based on whether the page is cached or not.
Filed upstream issue with Puppeteer at https://github.com/puppeteer/puppeteer/issues/9965. However, it's likely that we'll need to consider a workaround even if they fix it, since Apify locks us into an older version of Puppeteer.
This turns out to actually be a bug further upstream in Chromium (https://crbug.com/1340398), but Puppeteer has implemented a workaround on their end and released it in 19.8.4. Playwright also already contains a similar workaround as of this commit, which I think maps to release 1.23.1, though there isn't a patch note for it.
So other options for workarounds on our end now include:
"puppeteer": "<= 19.x"
(which would allow for 19.8.4 without a resolution)NetworkManager.ts
update contained in this PR
Verified that bumping the override of puppeteer to 19.8.4 results in the repro case scanning correctly using the CLI.
Describe the bug
We received a customer report via support for the ADO task (which also repros via the scan CLI package) that in their authentication-based scenario, the crawler would emit the following error when attempting to scan their page:
Normally, this error is a symptom of incorrectly configured authentication settings (especially a working service account that lacks permissions for the page under test). However, in this case, we verified that the authentication settings were correct and were able to verify the issue using our own service account. Examining a trace of network traffic suggested that this page was using a standard MSAL-based auth flow:
window.location
tohttps://login.microsoftonline.com/organizations/oauth2/v2.0/authorize?...
with aredirect_uri
query parameter pointing exactly back to the page under testThis is a standard enough auth flow; the page isn't doing anything out of line here.
The issue gets triggered by our
postNavigation
hook inpage-navigation-hooks.ts
, which contains a check verifying that thePuppeteer.HTTPResponse
we receive from Apify should have a successful HTTP status code (in the 200-299 range). Apify doesn't do a great job of documenting exactly what thisresponse
we receive is expected to be in the event that there's a redirect chain while loading the page; I traced through the Apify code and found that it's just passing through the return value of Puppeteer'sPage.goto
as-is.Page.goto
is documented as returning the last response in the chain for agoto
that involves a redirect chain, so what we'd expect to see here is the response representing the cached lookup for the page under test. However, what we actually receive is a spliced response whereresponse.status()
is 302, even though the originally cached response was a 200. In analyzing the assorted requests/responses, I noticed that Puppeteer's internal_requestId
property is the same for the actual 302 request and the final request that it gives a cached response for; I suspect there might be some sort of object pooling going on that's sharing the response objects incorrectly, causing a leak of the second response's status code.The two main workarounds I could imagine us considering on the service side are:
PageResponseProcessor.getResponseError
to treat any responses withresponse.fromCache() === true
as having anok()
status would work around this specific case, but it's a bit questionable in general because it's not guaranteed that all cached responses are successful - for example, Chromium will happily cache a 404 response whose response headers allow it.page.setCacheEnabled(false);
in ourpreNavigation
hook) prevents the issue from occuring and causes the page to be scanned successfully overall, but risks some performance impact.I searched through the Puppeteer and Chromium bug trackers and wasn't able to find any existing tracking issue. I didn't dig deeply enough into Puppeteer's implementation to feel confident about whether this is more likely to be an issue in Puppeteer or Chromium. I also haven't attempted to repro this in isolation of the service/apify, though if my understanding of the issue is correct, this should be possible. Before filing an upstream bug, it would be good to do that and verify whether the issue still repros in more recent versions of Puppeteer and/or in Playwright.
Area
To Reproduce
Steps to reproduce the behavior:
chrome:inspect
postNavigation
inpackages/scanner-global-library/src/page-navigation-hooks.ts
postNavigation
response
object at the breakpoint has astatus()
of 302 but otherwise matches the response expected for the page under testOptionally, set up a
page.on('response')
handler inpreNavigation
to log request/response data, and observe that the_responseId
properties for the actual 302 and the false 302 are reusedExpected behavior
This page should scan successfully, as it does when caching is disabled in our
preNavigation
hookLog files
See description
Screenshots
n/a
Additional context
n/a