platonai / exotic-amazon

A complete solution to crawl amazon at scale completely and accurately.

NullPointerException: it must not be null (ChromeLauncher) #22

Closed sskmtm closed 1 year ago

sskmtm commented 1 year ago

With the latest code on the main branch, I get the following error:

22:51:10.574 [r-worker-2] INFO  ai.platon.pulsar.common.AppContext - Version: Linux version 5.15.0-46-generic (buildd@lcy02-amd64-115) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022

22:51:10.586 [r-worker-2] INFO  a.p.pulsar.common.ProcessLauncher - Launching process:
/usr/bin/google-chrome-stable --proxy-server=114.230.104.102:4212 --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --remote-allow-origins=* --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true --user-data-dir=/tmp/pulsar-root/context/cx.1wJI7e8/pulsar_chrome
Exception in thread "Thread-9" java.lang.NullPointerException: it must not be null
    at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.waitForDevToolsServer$lambda-14(ChromeLauncher.kt:193)
    at java.base/java.lang.Thread.run(Thread.java:834)
22:51:10.678 [r-worker-2] ERROR a.p.p.p.b.driver.WebDriverFactory - Can not launch browser pulsar_chrome | launch
22:51:10.678 [r-worker-2] WARN  a.p.p.p.b.e.context.WebDriverContext - 3. Retry task 1 in crawl scope | caused by: launch
22:51:10.692 [5-thread-1] INFO  a.p.p.p.b.e.c.MultiPrivacyContextManager - Maintaining service is started
22:51:10.779 [r-worker-2] INFO  a.p.p.c.component.LoadComponent.Task -   3. 💔 🔃 U for RR got 1601 0 <- 0 in 0s, last fetched 5m32s ago, fc:1/7 Retry(1601) rs: Driver exception, rsp: CRAWL | 1wJI7e8 | file:///tmp/ln/1f6ede83881b702b4a3c5ffa9b01ef51.htm | https://www.amazon.com/s?k=sport+shoes -parse -refresh
22:51:10.780 [r-worker-2] INFO  a.p.p.c.component.LoadComponent.Task - Log explanation: https://github.com/platonai/pulsarr/blob/master/docs/log-format.adoc

It runs fine on my local macOS machine, but fails on the Linux server.

Chrome version on the server:

root@iZgc776ki4oy6i4s7y7o58Z:~/crawl# google-chrome --version
Google Chrome 110.0.5481.96
root@iZgc776ki4oy6i4s7y7o58Z:~/crawl# chromedriver --version
ChromeDriver 110.0.5481.77 (65ed616c6e8ee3fe0ad64fe83796c020644d42af-refs/branch-heads/5481@{#839})

What is going on here? Is something in my environment misconfigured?

platonai commented 1 year ago

We run exotic-amazon in a similar environment and have not seen this problem.

23:08:41.754 [main] INFO  a.p.e.a.s.CrawlStarterKt - Starting CrawlStarterKt v0.3.2-SNAPSHOT using Java 11.0.18 on platonai-20190202-1 with PID 3527511 (/home/vincent/exotic-amazon-pro/0.3.2-SNAPSHOT/target/exotic-amazon-0.3.2-SNAPSHOT.jar started by vincent in /home/vincent/exotic-amazon-pro/0.3.2-SNAPSHOT)

23:09:10.876 [r-worker-8] INFO a.p.p.common.AppContext - Version: Linux version 5.4.0-51-generic (buildd@lcy01-amd64-020) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #56-Ubuntu SMP Mon Oct 5 14:28:49 UTC 2020

vincent@platonai-20190202-1:~/exotic-amazon-pro/0.3.2-SNAPSHOT$ google-chrome -version
Google Chrome 110.0.5481.177 

Also, PulsarR and its derived projects do not depend on chromedriver.

sskmtm commented 1 year ago

My google-chrome and Ubuntu versions:

root@iZgc776ki4oy6i4s7y7o58Z:~/crawl# google-chrome-stable --version
Google Chrome 110.0.5481.96
root@iZgc776ki4oy6i4s7y7o58Z:~/crawl# which google-chrome-stable
/usr/bin/google-chrome-stable
root@iZgc776ki4oy6i4s7y7o58Z:~/crawl# cat /etc/issue
Ubuntu 22.04.1 LTS \n \l

root@iZgc776ki4oy6i4s7y7o58Z:~/crawl#

Following the example above, here is my runtime data:

12:39:05.567 [main] INFO  ai.platon.CrawlApplicationKt - Starting CrawlApplicationKt v1.12 using Java 11.0.13 on iZgc776ki4oy6i4s7y7o58Z with PID 8898 (/root/crawl/exotic-amazon-1.12.jar started by root in /root/crawl)
...
...
...
12:41:28.083 [r-worker-2] INFO  ai.platon.pulsar.common.AppContext - Version: Linux version 5.15.0-46-generic (buildd@lcy02-amd64-115) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022

12:41:28.099 [r-worker-2] INFO  a.p.pulsar.common.ProcessLauncher - Launching process:
/usr/bin/google-chrome-stable --proxy-server=180.105.59.115:4215 --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --remote-allow-origins=* --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true --user-data-dir=/tmp/pulsar-root/context/cx.1duiMK9/pulsar_chrome
Exception in thread "Thread-9" java.lang.NullPointerException: it must not be null
    at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.waitForDevToolsServer$lambda-14(ChromeLauncher.kt:193)
    at java.base/java.lang.Thread.run(Thread.java:834)
12:41:28.190 [r-worker-2] ERROR a.p.p.p.b.driver.WebDriverFactory - Can not launch browser pulsar_chrome | launch
12:41:28.191 [r-worker-2] WARN  a.p.p.p.b.e.context.WebDriverContext - 3. Retry task 1 in crawl scope | caused by: launch
...

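One thing I noticed at runtime: the old version does not open a browser window when run locally, whereas the new version opens a browser window and performs human-like actions.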

I do not know whether this could be what causes the error on the server.

sskmtm commented 1 year ago

The new version launches google-chrome without the --headless flag. Is that what causes this error?

platonai commented 1 year ago

Yes, the missing --headless flag is what causes this error.
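The diagnosis above can be turned into a small pre-flight check. The following is only a sketch: the function names are hypothetical and not part of PulsarR, and it assumes that the Linux server has no graphical session, which the thread does not state explicitly. It inspects a launch command line, such as the one printed after "Launching process:" in the log, for a headless switch, and checks the conventional `DISPLAY` and `WAYLAND_DISPLAY` environment variables.

```kotlin
/** Hypothetical helper: true when the Chrome command line contains a headless switch. */
fun hasHeadlessFlag(commandLine: String): Boolean =
    commandLine.trim().split(Regex("\\s+"))
        .any { it == "--headless" || it.startsWith("--headless=") }

/**
 * Hypothetical helper: true when neither DISPLAY nor WAYLAND_DISPLAY is set,
 * i.e. the process sees no graphical session (assumed to be the case on the server).
 */
fun hasNoDisplay(env: Map<String, String> = System.getenv()): Boolean =
    env["DISPLAY"].isNullOrBlank() && env["WAYLAND_DISPLAY"].isNullOrBlank()

/** A headed launch on a machine without a display is the combination worth flagging. */
fun isRiskyLaunch(commandLine: String, env: Map<String, String> = System.getenv()): Boolean =
    !hasHeadlessFlag(commandLine) && hasNoDisplay(env)
```

Calling `isRiskyLaunch` with the command line from the log and the default environment would report `true` for the launch shown in this issue if the server indeed has no display.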

sskmtm commented 1 year ago

I tried changing some settings in application.properties, but they did not take effect:

# How the browser display, can be one of: GUI, HEADLESS
browser.display.mode=HEADLESS
#browser.display.mode=GUI
browser.launch.with.xvfb=true
browser.driver.headless=true

How should this be configured? Or does pulsar ignore these settings?

platonai commented 1 year ago

I took a look: we override browser.display.mode in the main function:

    val prod = System.getenv("ENV")?.lowercase()
    if (prod == "prod") {
        // production environment: the best speed is required
        additionalProfiles.add("prod")
    } else {
        // development environment: launches a headed browser, which overrides browser.display.mode
        BrowserSettings.privacy(2).maxTabs(8).headed()
    }

Add this line after it:

BrowserSettings.headless()

That will fix it.
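To illustrate why the properties file had no effect and why the extra line helps, here is a minimal, self-contained model of the override order. It is a sketch, not PulsarR code: `DisplayMode`, `DisplaySettings`, and `effectiveMode` are hypothetical stand-ins, and the model assumes that the last call wins, that the added `headless()` call runs after the whole if/else block, and that the `prod` branch leaves the configured value untouched.

```kotlin
enum class DisplayMode { GUI, HEADLESS }

/** Hypothetical stand-in for a settings object in which the last call wins. */
class DisplaySettings(var mode: DisplayMode) {
    fun headed() = apply { mode = DisplayMode.GUI }
    fun headless() = apply { mode = DisplayMode.HEADLESS }
}

/**
 * Models the order of operations described in the thread:
 * 1. start from the value configured in application.properties,
 * 2. outside "prod", main calls headed() and overrides it,
 * 3. optionally, a headless() call added afterwards overrides it again.
 */
fun effectiveMode(configured: DisplayMode, env: String?, forceHeadless: Boolean): DisplayMode {
    val settings = DisplaySettings(configured)
    if (env?.lowercase() != "prod") {
        settings.headed() // mirrors the development branch in main
    }
    if (forceHeadless) {
        settings.headless() // mirrors the added line
    }
    return settings.mode
}
```

In this model, `effectiveMode(DisplayMode.HEADLESS, null, false)` yields `GUI`, which matches the reported behavior where `browser.display.mode=HEADLESS` was ignored, while passing `forceHeadless = true` yields `HEADLESS`.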

sskmtm commented 1 year ago

OK, that does resolve the problem above. Thanks.