microsoft / playwright-java

Java version of the Playwright testing and automation library
https://playwright.dev/java/
Apache License 2.0

[Bug]: ERR_HTTP2_PROTOCOL_ERROR #1592

Closed: yidasanqian closed this issue 1 month ago

yidasanqian commented 1 month ago

Version

1.43.0

Steps to reproduce

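    // Note: userAgent (a String) and safelist (a Jsoup safelist) are defined elsewhere and not shown here.
    String url = "https://www.eet-china.com/mp/a319436.html";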
    try (Playwright playwright = Playwright.create();
         Browser browser = playwright.chromium().launch(new BrowserType.LaunchOptions().setHeadless(true));
    ) {
        BrowserContext context = browser.newContext(
                new Browser.NewContextOptions()
                        .setIgnoreHTTPSErrors(true)
                        .setUserAgent(userAgent)                           
        );
        Page page = context.newPage();      
        try {
            Response response = page.navigate(url, new Page.NavigateOptions().setReferer(url).setWaitUntil(WaitUntilState.LOAD).setTimeout(15000));
            if (response.ok()) {                 
                String htmlContent = response.frame().content();
                Document doc = Jsoup.parse(htmlContent);

                doc.select("div.top, div#top, footer, div#footer, [id*=footer], div.footer, [class*=footer], div.right, div#right").remove();
                String clean = Jsoup.clean(doc.body().html(), safelist);
                System.out.println("clean = " + clean);            
            }
        } catch (PlaywrightException error) {
            System.out.println("error: " + error.getMessage());
        } finally {
            context.close();
        }
    }

Expected behavior

I expect the page to load normally so that its content can be read.

Actual behavior

com.microsoft.playwright.PlaywrightException: Error {
  message='net::ERR_HTTP2_PROTOCOL_ERROR at https://www.eet-china.com/mp/a319436.html
  name='Error
  stack='Error: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.eet-china.com/mp/a319436.html
    at FrameSession._navigate (/tmp/playwright-java-6751399139355531193/package/lib/server/chromium/crPage.js:515:35)
    at async Frame._gotoAction (/tmp/playwright-java-6751399139355531193/package/lib/server/frames.js:534:28)
}
Call log:
- navigating to "https://www.eet-china.com/mp/a319436.html", waiting until "load"

        at com.microsoft.playwright.impl.WaitableResult.get(WaitableResult.java:56)
        at com.microsoft.playwright.impl.ChannelOwner.runUntil(ChannelOwner.java:120)
        at com.microsoft.playwright.impl.Connection.sendMessage(Connection.java:130)
        at com.microsoft.playwright.impl.ChannelOwner.sendMessage(ChannelOwner.java:106)
        at com.microsoft.playwright.impl.FrameImpl.navigateImpl(FrameImpl.java:463)
        at com.microsoft.playwright.impl.PageImpl.lambda$navigate$46(PageImpl.java:870)
        at com.microsoft.playwright.impl.LoggingSupport.withLogging(LoggingSupport.java:47)
        at com.microsoft.playwright.impl.ChannelOwner.withLogging(ChannelOwner.java:89)
        at com.microsoft.playwright.impl.PageImpl.navigate(PageImpl.java:870)
        at com.microsoft.playwright.impl.PageImpl.navigate(PageImpl.java:42)
        at com.toowe.enterprise.core.app.searchapp.service.impl.WebSearchServiceImpl.lambda$spiderWebContent$9(WebSearchServiceImpl.java:804)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: com.microsoft.playwright.impl.DriverException: Error {
  message='net::ERR_HTTP2_PROTOCOL_ERROR at https://www.eet-china.com/mp/a319436.html
  name='Error
  stack='Error: net::ERR_HTTP2_PROTOCOL_ERROR at https://www.eet-china.com/mp/a319436.html
    at FrameSession._navigate (/tmp/playwright-java-6751399139355531193/package/lib/server/chromium/crPage.js:515:35)
    at async Frame._gotoAction (/tmp/playwright-java-6751399139355531193/package/lib/server/frames.js:534:28)
}
Call log:
- navigating to "https://www.eet-china.com/mp/a319436.html", waiting until "load"

        at com.microsoft.playwright.impl.Connection.dispatch(Connection.java:259)
        at com.microsoft.playwright.impl.Connection.processOneMessage(Connection.java:211)
        at com.microsoft.playwright.impl.ChannelOwner.runUntil(ChannelOwner.java:118)
        ... 13 common frames omitted
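
The exception message above carries the Chromium network error as a `net::ERR_...` token. As a minimal sketch (assuming only that message format; `NetErrorCode` and `extract` are hypothetical helper names, not part of Playwright), the code can be pulled out of `PlaywrightException.getMessage()` so the caller can log it or branch on it:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NetErrorCode {
    // Chromium network errors show up in the message as "net::ERR_<NAME>".
    private static final Pattern NET_ERROR = Pattern.compile("net::(ERR_[A-Z0-9_]+)");

    /** Returns the first ERR_* code found in the message, or empty if there is none. */
    public static Optional<String> extract(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = NET_ERROR.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Message text copied from the report above.
        String msg = "Error {\n  message='net::ERR_HTTP2_PROTOCOL_ERROR at "
                + "https://www.eet-china.com/mp/a319436.html\n";
        System.out.println(extract(msg).orElse("unknown")); // prints ERR_HTTP2_PROTOCOL_ERROR
    }
}
```

In the `catch (PlaywrightException error)` block of the reproduction, this would be called as `NetErrorCode.extract(error.getMessage())`.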

Additional context

            // Start tracing before creating / navigating a page.
            context.tracing().start(new Tracing.StartOptions()
                    .setScreenshots(true)
                    .setSnapshots(true)
                    .setSources(true));

            Page page = context.newPage();
            page.navigate(url);
            // navigate() throws immediately, so the code below never runs and no trace.zip file is produced.
            context.tracing()
                    .stop(new Tracing.StopOptions()
                    .setPath(Paths.get("/trace.zip")));
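
One way to still get a trace when `navigate` throws is to stop tracing in a `finally` block. The sketch below shows only that control flow, using plain `Runnable`s so it stays self-contained; `TraceGuard` and `runAndAlwaysStop` are hypothetical names, not Playwright API. In the snippet above, the action would be the `page.navigate(url)` call and the stop step would be the `context.tracing().stop(...)` call.

```java
public class TraceGuard {
    /**
     * Runs the action, then always runs stopStep, even if the action throws.
     * An exception from the action still propagates to the caller afterwards.
     */
    public static void runAndAlwaysStop(Runnable action, Runnable stopStep) {
        try {
            action.run();
        } finally {
            stopStep.run();
        }
    }

    public static void main(String[] args) {
        final boolean[] stopped = {false};
        try {
            runAndAlwaysStop(
                    // Stand-in for page.navigate(url), which throws in this report.
                    () -> { throw new RuntimeException("net::ERR_HTTP2_PROTOCOL_ERROR"); },
                    // Stand-in for context.tracing().stop(new Tracing.StopOptions().setPath(...)).
                    () -> stopped[0] = true);
        } catch (RuntimeException e) {
            System.out.println("navigation failed: " + e.getMessage());
        }
        System.out.println("trace stop executed: " + stopped[0]);
    }
}
```

Because `PlaywrightException` is a `RuntimeException`, the same shape applies directly: the navigation failure is still reported, but the stop step runs first and can write the trace file.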

Environment

    Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
    Maven home: /opt/apache-maven-3.9.6
    Java version: 1.8.0_402, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
    Default locale: en, platform encoding: UTF-8
    OS name: "linux", version: "5.15.0-67-generic", arch: "amd64", family: "unix"

Dockerfile excerpt:

FROM mcr.microsoft.com/playwright/java:v1.43.0-jammy
ENV LANG C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive

RUN rm -rf /usr/lib/jvm/*  \
   && apt-get update  \
   && apt-get install -y --no-install-recommends openjdk-8-jdk \
   && apt-get autoremove -y \
   && apt-get clean -y \
   && rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV JAVA_OPTS="-Xms512m -Xmx8g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/heapdumps/azs_backend_heapdump.hprof -Djava.security.egd=file:/dev/./urandom"

yury-s commented 1 month ago

The web page returns an invalid HTTP/2 response when accessed from a headless browser. This looks like a bot-protection measure. If you believe it is a Playwright bug, please provide a self-contained example, including a web page, that we can run locally.