internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

brozzle-page Not Working With Recent Version of Google Chrome #256

Open treid003 opened 1 year ago

treid003 commented 1 year ago

When I recently ran a brozzle-page command with a recent version of Google Chrome, I noticed that brozzler did not load the web page that should have been archived.

This results in the web page not being archived successfully.

WARC file: WARCPROX-20230519163909687-00000-0so5t1md.warc

This issue also occurred when trying to archive other web pages:

The commands I used are listed below (video example):

warcprox -p 8081 -d ./warcs/IGN/brozzle_page/2023_05_19 --dedup-db-file /dev/null

export BROZZLER_EXTRA_CHROME_ARGS="--ignore-certificate-errors"

brozzle-page --chrome-exe '/usr/bin/google-chrome' --proxy localhost:8081 'https://www.ign.com/articles/the-last-of-us-season-1-review'

A “WebSocketBadStatusException: Handshake status 403 Forbidden” error occurred when I recently ran these commands on Ubuntu (22.04.2 LTS and 20.04.6 LTS) and macOS (Ventura 13.3.1).

When I used these commands earlier this year, they worked successfully (video).

After noticing this issue, I went through the recent stable versions of Google Chrome and found that the last stable version that worked with the brozzle-page command was version 109.0.5414.119, which was released on January 24, 2023.

chrome.deb URI (109.0.5414.119): https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_109.0.5414.119-1_amd64.deb

Crawling session: https://youtu.be/A-zr6zVTZSo?t=5569

Replay session: https://youtu.be/A-zr6zVTZSo?t=6345

The first stable version of Chrome that did not work with the brozzle-page command was version 111.0.5563.110, which was released on March 21, 2023.

chrome.deb URI (111.0.5563.110): https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_111.0.5563.110-1_amd64.deb

Crawling session: https://youtu.be/A-zr6zVTZSo?t=4903

Replay session: https://youtu.be/A-zr6zVTZSo?t=4992

Chrome release blog post for 111.0.5563.110: https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_21.html

galgeek commented 1 year ago

Thank you, @treid003, for your detailed report!

Current brozzler code is compatible with Chrome up to v.110. This Linux version runs without apparent issue:

https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_110.0.5481.177-1_amd64.deb

I've also run current code on Apple M1/M2 with Thorium v.110, using the branch here: https://github.com/internetarchive/brozzler/pull/255
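
For anyone who wants to guard against this before starting a crawl, here is a minimal sketch of a version check. It assumes the browser's --version output looks like "Google Chrome 110.0.5481.177", and it hard-codes 110 as the newest compatible major version based only on the testing described in this thread. The function names chrome_major_version and is_brozzler_compatible are hypothetical helpers, not part of brozzler.

```shell
# Hypothetical helper (not part of brozzler): print the major version number
# found in a string such as "Google Chrome 110.0.5481.177".
# Prints nothing if no dotted version number is found.
chrome_major_version() {
  printf '%s\n' "$1" | sed -nE 's/^[^0-9]*([0-9]+)\..*$/\1/p'
}

# Hypothetical helper: succeed only if the major version is 110 or lower.
# The limit of 110 is an assumption taken from the testing reported in this thread.
is_brozzler_compatible() {
  major=$(chrome_major_version "$1")
  [ -n "$major" ] && [ "$major" -le 110 ]
}
```

It could be used as `is_brozzler_compatible "$(/usr/bin/google-chrome --version)" || echo "This Chrome version may be too new for brozzle-page"` before running the brozzle-page command from the original report.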

treid003 commented 1 year ago

Thanks for testing v.110.

When going through the different Chrome versions, I accidentally skipped v.110 because the blog post titles were different for the v.110 Chrome releases.