puppeteer / puppeteer

JavaScript API for Chrome and Firefox
https://pptr.dev
Apache License 2.0
89k stars 9.09k forks source link

Pre-rendering on pages with 301 stopped working on pptr@>=1.7.0 #3941

Closed rokerkony closed 5 years ago

rokerkony commented 5 years ago

Hi all, we are using pptr for prerendering our SPA. One of the URLs which needs to be prerendered is example.de which should end with 301 with a header location pointing to example.com. We have a working solution below with pptr@1.6.2 but if we update to >=1.7.0 it stopped working because it is returning the fake status code.

Steps to reproduce

Tell us about your environment:

What steps will reproduce the problem?

The code working on 1.6.2

await page.setRequestInterception(true);
const FAKE_HTTP_CODE = 399;

let isRedirect = false;
page.on('request', request => {
    if (isRedirect) {
        // detect redirect and make sure that chrome will not process next request properly by sending fake status code 399
        return request.respond({status: FAKE_HTTP_CODE});
    }

    return request.continue({url: request.url()});
});

page.on('response', response => {
    if (response.request().resourceType() === 'document' && response.status() === 301) {
        isRedirect = true;
    }
});

Response before update:

< HTTP/1.1 301 Moved Permanently
< location: https://www.erento.com/

Response after update:

< HTTP/1.1 399 unknown

The unit test which is not working anymore:

test('301 response code', async () => {
    const response = await Browser.get('https://www.erento.de', {'Accept-Language': 'de-DE'});
    expect(response.statusCode).toBe(301);
    expect(response.headers.location).toBe('https://www.erento.com/');
});

It is important for crawlers to return 301 if necessary though. Is there any way in new pptr versions to not follow redirects but to return the redirect response?

gandhimonik commented 5 years ago

I am facing similar kind of issue with puppeteer@1.12.2 where its not able to detect the URL moved from HTTP to HTTPS.

When I load on headful chrome the URL loads fine and I see 301 Moved Permanently in the response however in the headless chrome it just throws Navigation Timeout Exceeded: 30000ms exceeded.

aslushnikov commented 5 years ago

this is a dupe of #4030, check out this comment

LT;DR: the request.continue({url}) doesn't redirect to a given URL - instead, it loads data from the given url as if it was served from the original one.

In order to do a real redirect, use request.respond.

rokerkony commented 5 years ago

@aslushnikov that is actually what I was already trying but it doesn't help. Please look into a following example:

const puppeteer = require('puppeteer');

(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.setRequestInterception(true);

    let isRedirectStatus = false;
    page.on('request', (request) => {
        if (isRedirectStatus !== false) {
            return request.respond({
                status: isRedirectStatus,
                headers: {
                    location: 'https://www.erento.com'
                },
            });
        }

        return request.continue({url: request.url()});
    });

    page.on('response', (response) => {
        if (response.request().resourceType() === 'document') {
            const status = response.status();
            if ([301, 302].includes(status)) {
                isRedirectStatus = status;
            }
        }
    });

    await page.goto('http://www.erento.de');
    await browser.close();
})();

it ends with ERR_TOO_MANY_REDIRECTS and the expected behaviour would be to return the object in .respond() method (in example the url is hardcoded just for simplicity)

aslushnikov commented 5 years ago

@rokerkony What your code does is redirecting all requests to 'https://www.erento.com' once there's at least one redirect response from server. Given that http://www.erento.de immediately redirects to HTTPS, all your future requests are redirected to the same URL and you naturally get ERR_TOO_MANY_REDIRECTS.

Are you trying to redirect just the navigation request? In this case, use the request.isNavigationRequest() and check for its target URL before issuing a redirect.

rokerkony commented 5 years ago

@aslushnikov What I'm trying to accomplish is to get the same behavior as in version 1.6.2 and lower, I hope it will be still possible somehow.

We are using pptr for our prerendering and we would love to return a response with 301 and location header when google goes to e.g.: erento.de. When we would return a content from followed redirects it would be treated as a duplicate content instead of redirect.

I hope it is well demostrated by a first post in this issue where we used fake http status 399 and pptr returned 301 with location header (see test for it).

aslushnikov commented 5 years ago

@rokerkony so what does Browser.get do under the hood? I'm still not sure what's the end goal here - can you please elaborate more? Maybe I'd be able to suggest something.