oven-sh / bun

Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
https://bun.sh

crawlee playwright bun : Running crawlee's playwright crawler with bun causes Protocol mismatch error #13296

Open oindrila-b opened 3 months ago

oindrila-b commented 3 months ago

What version of Bun is running?

1.1.22

What platform is your computer?

Linux 6.5.0-45-generic x86_64 x86_64

What steps can reproduce the bug?

Hello Bun Community,

I'm using apify/crawlee in my project to scrape some websites, and I want to do it in a bun environment instead of node environment. The crawler I chose for my project is PlaywrightCrawler from crawlee.

The script section in my package.json of the project looks something like this :

"scripts": {
    "start": "bun run start:dev",
    "start:prod": "node dist/main.js",
    "start:dev": "bun run src/main.ts",
    "migration": "bunx drizzle-kit generate",
    "build": "tsc",
    "test": "echo \"Error: oops, the actor has no tests yet, sad!\" && exit 1",
    "postinstall": "bunx crawlee install-playwright-browsers"
},

Run: bun start:dev

What is the expected behavior?

I expect bun to be able to process the URLs without throwing any errors. When I run the same project in a Node environment, where the scripts section of my package.json is this:

"scripts": {
    "start": "npm run start:dev",
    "start:prod": "node dist/main.js",
    "start:dev": "tsx src/main.ts",
    "build": "tsc",
    "test": "echo \"Error: oops, the actor has no tests yet, sad!\" && exit 1",
    "postinstall": "npx crawlee install-playwright-browsers"
},

it works perfectly. Here's the result I get using npm, which I expect from bun as well:

my-crawler@0.0.1 start
npm run start:dev

my-crawler@0.0.1 start:dev
tsx src/main.ts

INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: enqueueing new URLs
INFO PlaywrightCrawler: Crawlee for Python · Fast, reliable crawlers. {"url":"https://crawlee.dev/python/"}
INFO PlaywrightCrawler: Quick Start | Crawlee · Build reliable crawlers. Fast. {"url":"https://crawlee.dev/docs/quick-start"}
INFO PlaywrightCrawler: Examples | Crawlee · Build reliable crawlers. Fast. {"url":"https://crawlee.dev/docs/examples"}
INFO PlaywrightCrawler: @crawlee/core | API | Crawlee · Build reliable crawlers. Fast. {"url":"https://crawlee.dev/api/core"}
INFO PlaywrightCrawler: Changelog | API | Crawlee · Build reliable crawlers. Fast. {"url":"https://crawlee.dev/api/core/changelog"}
.....

This is what I want in bun as well.

What do you see instead?

When I execute the project using bun start:dev, even though the crawler gets initialised without any issues, when it comes to running the crawler using the crawler.run() method, I encounter this error:

INFO PlaywrightCrawler: Starting the crawler.
50 | catch (error) {
51 | reject(error);
52 | return;
53 | }
54 | }
55 | const fn = origin.startsWith('https:') ? https_1.default.request : http_1.default.request;
     ^
error: Protocol mismatch. Expected: http:. Got: developer.chrome.com:
    at new ClientRequest (node:http:1001:14)
    at node:http:247:22
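The "Got: developer.chrome.com:" part of the message is what you would see if a bare host:port string were parsed as a URL, because the URL parser treats everything before the first colon as the scheme. The sketch below only demonstrates that parsing behavior with the standard global URL class. Whether this is what actually happens inside Bun's node:http here is an assumption and is not confirmed anywhere in this thread. The helper name parseParts is made up for illustration.

```typescript
// Illustration only: how the standard URL parser reads a bare "host:port" string.
// parseParts is a hypothetical helper, not part of crawlee, Playwright, or Bun.
function parseParts(input: string): { protocol: string; hostname: string } {
  const parsed = new URL(input);
  return { protocol: parsed.protocol, hostname: parsed.hostname };
}

// No scheme given: the text before the first ":" is taken as the scheme.
const bare = parseParts("developer.chrome.com:443");
console.log(bare.protocol); // "developer.chrome.com:"

// With an explicit scheme, the host ends up where you expect it.
const full = parseParts("https://developer.chrome.com:443");
console.log(full.protocol, full.hostname); // "https:" "developer.chrome.com"
```

If that is the cause, the value reaching the request call would be a proxy-style host:port target rather than one of the URLs handed to the crawler.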

I tried to fix this by adding a piece of code that makes sure the URLs have a protocol before they are added to the crawler for scraping. This is that code:

this.urls = options.urls.map(url => {
    if (!url.startsWith('http://') && !url.startsWith('https://')) {
        return `https://${url}`;
    }
    Logger.info(url);
    return url;
});

where this.urls is an empty array of strings, which later gets added to the crawler for crawling.
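For anyone who wants to try the same check outside a class, here is a minimal standalone sketch of that normalization. The function names ensureProtocol and normalizeUrls are hypothetical and not part of crawlee. It also runs each result through the standard URL constructor so that malformed input fails early. Note that it only guards the URLs passed to the crawler, and it did not make the error above go away.

```typescript
// Hypothetical helpers, not part of crawlee: make sure every start URL has a scheme.
function ensureProtocol(url: string, defaultProtocol: "http" | "https" = "https"): string {
  const trimmed = url.trim();
  // Already has an http:// or https:// prefix: leave it untouched.
  if (/^https?:\/\//i.test(trimmed)) {
    return trimmed;
  }
  // Protocol-relative form such as "//example.com/path".
  if (trimmed.startsWith("//")) {
    return `${defaultProtocol}:${trimmed}`;
  }
  return `${defaultProtocol}://${trimmed}`;
}

// Validate with the URL constructor, which throws a TypeError on malformed input.
function normalizeUrls(urls: string[]): string[] {
  return urls.map((url) => new URL(ensureProtocol(url)).href);
}

console.log(normalizeUrls(["crawlee.dev/docs/quick-start", "http://example.com/a"]));
```

The resulting array can be passed to crawler.run() in place of the raw list.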

From what I see, the default setup uses tsx src/main.ts to run the main file, and that runs perfectly. Using bun run src/main.ts instead produces the protocol mismatch error.

Additional information

No response

soupman99 commented 3 months ago

@oindrila-b did you find a work around or solution for this? I'm having the same issue.

ImBIOS commented 3 months ago

I'm also hitting this roadblock, which forced me to revert to using node for this project.

UPDATE 1: LOL, using tsx gave a different error for another library. I'm going to try to also make an issue in Crawlee, to make sure both parties know it had an error.

@oindrila-b Can you please share how to implement this workaround?

this.urls = options.urls.map(url => {
    if (!url.startsWith('http://') && !url.startsWith('https://')) {
        return `https://${url}`;
    }
    Logger.info(url);
    return url;
});


UPDATE 2: I'm currently using tsx and npm until this bug is fixed.

braincomb commented 3 weeks ago

Oddly enough, with Bun v1.1.34 I now get a different error: NS_ERROR_UNKNOWN_HOST. The behavior is the same, though: the Firefox browser instance started via crawlee is unable to access any HTTPS website, while HTTP works.

The Chromium instance reports net::ERR_TUNNEL_CONNECTION_FAILED.

al6x commented 1 week ago

Same error with Bun 1.1.36. To reproduce:

npx crawlee create my-crawler # <== Choose the TypeScript Example
bun run src/main.ts

The error

INFO  PlaywrightCrawler: Starting the crawler.
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session... 
goto: net::ERR_TUNNEL_CONNECTION_FAILED at https://crawlee.dev/
Call log:
  - navigating to "https://crawlee.dev/", waiting until "load"

    at processTicksAndRejections (//projects/my-crawler/native:7:39)