mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl, and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

Locally hosted firecrawl stuck when called from dify #464

Open kouyakamada opened 1 month ago

kouyakamada commented 1 month ago

I want to call a Firecrawl instance hosted on our company network from a Dify instance hosted on the same network. It registers with Dify without problems, but when I run a crawl, it gets stuck with no response from Firecrawl. The docker compose logs appear to show an error. What is the cause of this?

Expected Behavior: Firecrawl can be called from a Dify instance hosted on the company network.

Environment: Based on the docker compose setup in the repository.


docker compose logs

~~~
api-1                 | [2024-07-26T04:18:38.474Z]INFO - Number of CPUs: 64 available
api-1                 | [2024-07-26T04:18:38.475Z]INFO - Worker 201 listening on port 3002
api-1                 | [2024-07-26T04:18:38.475Z]INFO - For the Queue UI, open: http://0.0.0.0:3002/admin/@/queues
api-1                 | [2024-07-26T04:18:38.477Z]INFO - Connected to Redis Session Rate Limit Store!
api-1                 | [2024-07-26T04:18:38.482Z]INFO - Web scraper queue created
api-1                 | [2024-07-26T04:18:38.485Z]INFO - Worker 66 started
api-1                 | [2024-07-26T04:18:38.488Z]INFO - Number of CPUs: 64 available
api-1                 | [2024-07-26T04:18:38.495Z]INFO - Web scraper queue created
api-1                 | [2024-07-26T04:18:38.499Z]INFO - Worker 171 started
api-1                 | [2024-07-26T04:18:38.499Z]INFO - Worker 66 listening on port 3002
api-1                 | [2024-07-26T04:18:38.500Z]INFO - For the Queue UI, open: http://0.0.0.0:3002/admin/@/queues
api-1                 | [2024-07-26T04:18:38.502Z]INFO - Connected to Redis Session Rate Limit Store!
api-1                 | [2024-07-26T04:18:38.513Z]INFO - Worker 171 listening on port 3002
api-1                 | [2024-07-26T04:18:38.513Z]INFO - For the Queue UI, open: http://0.0.0.0:3002/admin/@/queues
api-1                 | [2024-07-26T04:18:38.513Z]INFO - Number of CPUs: 64 available
api-1                 | [2024-07-26T04:18:38.516Z]INFO - Connected to Redis Session Rate Limit Store!
api-1                 | [2024-07-26T04:18:38.521Z]INFO - Web scraper queue created
api-1                 | [2024-07-26T04:18:38.525Z]INFO - Worker 391 started
api-1                 | [2024-07-26T04:18:38.542Z]INFO - Worker 391 listening on port 3002
api-1                 | [2024-07-26T04:18:38.542Z]INFO - For the Queue UI, open: http://0.0.0.0:3002/admin/@/queues
api-1                 | [2024-07-26T04:18:38.545Z]INFO - Connected to Redis Session Rate Limit Store!
api-1                 | [2024-07-26T04:18:45.372Z]WARN - You're bypassing authentication
api-1                 | [2024-07-26T04:18:45.373Z]WARN - You're bypassing authentication
api-1                 | [2024-07-26T04:18:45.379Z]ERROR - Attempted to access Supabase client when it's not configured.
api-1                 | [2024-07-26T04:18:45.380Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
api-1                 | [2024-07-26T04:18:45.471Z]ERROR - Attempted to access Supabase client when it's not configured.
api-1                 | [2024-07-26T04:18:45.471Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
api-1                 | [2024-07-26T04:18:45.478Z]ERROR - Attempted to access Supabase client when it's not configured.
api-1                 | [2024-07-26T04:18:45.479Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
api-1                 | [2024-07-26T04:18:45.487Z]ERROR - Attempted to access Supabase client when it's not configured.
api-1                 | [2024-07-26T04:18:45.487Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
~~~

Proxy settings added to the Dockerfile:

~~~dockerfile
ENV HTTPS_PROXY=http://xxx.xxx.xxx:3128
ENV HTTP_PROXY=http://xxx.xxx.xxx:3128
ENV https_proxy=http://xxx.xxx.xxx:3128
ENV http_proxy=http://xxx.xxx.xxx:3128
ENV NO_PROXY=127.0.0.1,localhost,redis,api,worker,playwright-service
ENV no_proxy=127.0.0.1,localhost,redis,api,worker,playwright-service

RUN npm -g config set proxy http://xxx.xxx.xxx:3128
RUN npm -g config set https-proxy http://xxx.xxx.xxx:3128
~~~
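One common pitfall with these settings is that `NO_PROXY` must list every hostname the containers use to reach each other (as above); anything missing is routed to the proxy and the request hangs. As a rough sketch of the usual matching convention (a hypothetical helper for illustration, not Firecrawl code):

```typescript
// Hypothetical helper: decide whether a host should bypass the proxy,
// following the common NO_PROXY convention (comma-separated hosts/suffixes).
function shouldBypassProxy(host: string, noProxy: string): boolean {
  return noProxy
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0)
    .some((entry) => {
      const h = host.toLowerCase();
      // Exact match, or domain-suffix match for entries like ".internal.example.com"
      return h === entry || h.endsWith(entry.startsWith(".") ? entry : "." + entry);
    });
}

const noProxy = "127.0.0.1,localhost,redis,api,worker,playwright-service";
console.log(shouldBypassProxy("redis", noProxy)); // true: internal service, no proxy
console.log(shouldBypassProxy("www.firecrawl.dev", noProxy)); // false: goes through the proxy
```

Note that real implementations vary (some also honor `NO_PROXY=*` or CIDR ranges), so treat this as an approximation of the convention rather than a spec.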

Result of calling the API from the command line:

~~~shell
curl -X POST http://xx.xx.xx.xx:3002/v0/scrape -H 'Content-Type: application/json' -d '{"url":"https://www.firecrawl.dev/"}'
{"success":true,"error":"No page found","returnCode":200,"data":{"content":"","markdown":"","html":"","linksOnPage":[],"metadata":{"sourceURL":"https://www.firecrawl.dev/","pageStatusCode":400,"pageError":"Bad Request"}}}
~~~
nickscamara commented 1 month ago

@kouyakamada were you able to figure this out? We have published a lot more improvements to our self-hosting guide. Let us know if that helps.

tak-s commented 1 month ago

I also encountered an issue where crawling did not work in a proxy environment.

I was able to resolve it by modifying the code around axios.get() in apps/api/src/scraper/WebScraper/scrapers/fetch.ts as shown below.

~~~typescript
// Add the tunnel module at the top of the file
import * as tunnel from 'tunnel';

try {
  // Choose a proxy agent based on the URL scheme and the proxy env vars
  let agent;
  const httpProxy = process.env.HTTP_PROXY || null;
  const httpsProxy = process.env.HTTPS_PROXY || null;

  if (url.startsWith('https://') && httpsProxy) {
    const httpsProxyUrl = new URL(httpsProxy);
    agent = tunnel.httpsOverHttp({
      proxy: {
        host: httpsProxyUrl.hostname,
        port: parseInt(httpsProxyUrl.port, 10),
      },
    });
    Logger.info(`Using tunnel agent with HTTPS proxy: ${httpsProxy}`);
  } else if (url.startsWith('http://') && httpProxy) {
    const httpProxyUrl = new URL(httpProxy);
    agent = tunnel.httpOverHttp({
      proxy: {
        host: httpProxyUrl.hostname,
        port: parseInt(httpProxyUrl.port, 10),
      },
    });
    Logger.info(`Using tunnel agent with HTTP proxy: ${httpProxy}`);
  } else {
    Logger.info(`No proxy settings found or not required for the URL: ${url}. Proceeding without a proxy.`);
  }

  // Replaces the original axios.get(url, { headers, timeout, transformResponse }) call
  const response = await axios.get(url, {
    headers: {
      "Content-Type": "application/json",
    },
    timeout: universalTimeout,
    transformResponse: [(data) => data], // Prevent axios from parsing JSON automatically
    ...(agent ? { httpsAgent: agent, proxy: false } : {}), // Use the tunnel agent and disable axios's built-in proxy handling
  });
~~~
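The core of the patch is the scheme-based agent selection. Distilled into a standalone, hypothetical helper (the function name, return type, and proxy URL below are illustrative, not part of the Firecrawl codebase):

```typescript
// Hypothetical helper mirroring the patch's logic: pick which proxy (if any)
// a request should tunnel through, based on the URL scheme and env-style vars.
type ProxyChoice = { kind: "https" | "http" | "none"; host?: string; port?: number };

function chooseProxy(
  url: string,
  env: { HTTP_PROXY?: string; HTTPS_PROXY?: string }
): ProxyChoice {
  if (url.startsWith("https://") && env.HTTPS_PROXY) {
    const p = new URL(env.HTTPS_PROXY);
    return { kind: "https", host: p.hostname, port: parseInt(p.port, 10) };
  }
  if (url.startsWith("http://") && env.HTTP_PROXY) {
    const p = new URL(env.HTTP_PROXY);
    return { kind: "http", host: p.hostname, port: parseInt(p.port, 10) };
  }
  return { kind: "none" }; // no proxy configured or not required
}

const choice = chooseProxy("https://www.firecrawl.dev/", { HTTPS_PROXY: "http://proxy.local:3128" });
console.log(choice.kind, choice.host, choice.port); // https proxy.local 3128
```

The `proxy: false` in the axios options above matters: it disables axios's own env-based proxy handling so only the tunnel agent (which issues a proper CONNECT for HTTPS) is in play.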

Add the tunnel module to apps/api/Dockerfile:

~~~dockerfile
RUN pnpm install
RUN pnpm add tunnel    # add
RUN pnpm run build
~~~

Build with docker compose and start the container:

~~~shell
docker compose build && docker compose up -d
~~~

I hope this helps.

artificialzjy commented 1 month ago

May I ask: did you register FireCrawl in Dify using an API key? If so, how did your locally deployed FireCrawl obtain an API key, and how did you set up Authorization?

NuerSir commented 4 weeks ago

@artificialzjy TEST_API_KEY
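For context (hedged; based on the self-hosting `.env` example in the repository at the time, so check SELF_HOST.md for current variable names): when database authentication is disabled, the value of `TEST_API_KEY` serves as the API key you enter in Dify. The values below are placeholders, not real keys:

```
# .env sketch for a self-hosted instance (variable names assumed from .env.example)
USE_DB_AUTHENTICATION=false
TEST_API_KEY=fc-your-local-key
```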

kouyakamada commented 3 weeks ago

@tak-s Sorry for the late reply. We are now able to crawl within our internal network and call it from Dify! Thank you!

danialcheung commented 3 weeks ago

@tak-s Great solution! ~~When I get to the `RUN pnpm add tunnel` part I'm met with the following:~~

~~~
 WARN  deprecated @devil7softwares/pos@1.0.2: This package has been renamed to `fast-tag-pos`
 WARN  3 deprecated subdependencies found: glob@7.2.3, inflight@1.0.6, superagent@8.1.2
Packages: +1
+
Progress: resolved 1050, reused 1037, downloaded 1, added 1, done

dependencies:
+ tunnel 0.0.6

 WARN  Issues with peer dependencies found
.
├─┬ langchain 0.2.8
│ └── ✕ unmet peer puppeteer@^19.7.2: found 22.12.1
└─┬ @hyperdx/node-opentelemetry 0.8.1
  └─┬ @opentelemetry/auto-instrumentations-node 0.46.1
    ├─┬ @opentelemetry/instrumentation-http 0.51.1
    │ └─┬ @opentelemetry/core 1.24.1
    │   └── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
    └─┬ @opentelemetry/sdk-node 0.51.1
      ├── ✕ unmet peer @opentelemetry/api@">=1.3.0 <1.9.0": found 1.9.0
      ├─┬ @opentelemetry/sdk-trace-base 1.24.1
      │ ├── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
      │ └─┬ @opentelemetry/resources 1.24.1
      │   └── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
      ├─┬ @opentelemetry/exporter-trace-otlp-proto 0.51.1
      │ └─┬ @opentelemetry/otlp-transformer 0.51.1
      │   ├── ✕ unmet peer @opentelemetry/api@">=1.3.0 <1.9.0": found 1.9.0
      │   ├─┬ @opentelemetry/sdk-logs 0.51.1
      │   │ └── ✕ unmet peer @opentelemetry/api@">=1.4.0 <1.9.0": found 1.9.0
      │   └─┬ @opentelemetry/sdk-metrics 1.24.1
      │     └── ✕ unmet peer @opentelemetry/api@">=1.3.0 <1.9.0": found 1.9.0
      └─┬ @opentelemetry/sdk-trace-node 1.24.1
        ├── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
        ├─┬ @opentelemetry/context-async-hooks 1.24.1
        │ └── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
        ├─┬ @opentelemetry/propagator-b3 1.24.1
        │ └── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
        └─┬ @opentelemetry/propagator-jaeger 1.24.1
          └── ✕ unmet peer @opentelemetry/api@">=1.0.0 <1.9.0": found 1.9.0
~~~

~~Any tips on how to resolve this? Thank you kindly!~~

~~EDIT: To clarify, I believe the stage it fails at is `RUN pnpm run build`, as it can't identify what 'tunnel' is. But I just want to be sure it has nothing to do with the warnings.~~

EDIT 2: Got it working now, not really sure what happened but running it a second time did the trick!