mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
17.81k stars 1.31k forks

[BUG] Self-Hosted USE_DB_AUTHENTICATION evaluated inconsistently #526

Closed rhyswynn closed 1 month ago

rhyswynn commented 2 months ago

The USE_DB_AUTHENTICATION environment variable is evaluated inconsistently. single_url.ts and scrape-events.ts treat it as a boolean: they expect a boolean true, or, in single_url.ts, apparently undefined rather than any set value. As a result, playwright is never invoked when the value is set to false in the .env file. Leaving it blank or commenting it out, however, causes errors in supabase.ts, which looks for the string value false.
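The mismatch described above can be illustrated with a minimal sketch. The three check functions below are hypothetical patterns, not the actual Firecrawl code; they only show how the same string value from an env file can give different answers depending on how it is tested.

```typescript
// Hypothetical illustration only: these are NOT the checks in the Firecrawl source.
// Environment values are always strings (or undefined), never real booleans.
type Env = Record<string, string | undefined>;

// Pattern A: truthiness. Any non-empty string passes, including "false".
const truthyCheck = (env: Env): boolean =>
  Boolean(env["USE_DB_AUTHENTICATION"]);

// Pattern B: "anything except the string false". Blank or unset counts as enabled.
const notFalseStringCheck = (env: Env): boolean =>
  env["USE_DB_AUTHENTICATION"] !== "false";

// Pattern C: strict comparison against the string "true".
const isTrueStringCheck = (env: Env): boolean =>
  env["USE_DB_AUTHENTICATION"] === "true";

// Run all three patterns against one raw value.
function evaluateAll(value: string | undefined): [boolean, boolean, boolean] {
  const env: Env = { USE_DB_AUTHENTICATION: value };
  return [truthyCheck(env), notFalseStringCheck(env), isTrueStringCheck(env)];
}
```

With the value "false", pattern A still reports the feature as enabled, while a blank value makes pattern B report it as enabled. Only a strict comparison against "true" treats both cases as disabled.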

To Reproduce

Steps to reproduce the issue:

  1. Configure the .env file with USE_DB_AUTHENTICATION=

  2. Run docker compose up

  3. Errors will display that Supabase environment variables aren't configured correctly.

  4. Line 12 of supabase.ts is expecting a string value https://github.com/mendableai/firecrawl/blob/5a778f2c22a451f1eead5eb9733bcd462d3cd081/apps/api/src/services/supabase.ts#L12

  5. Line 20 of supabase.ts is returning the error ERROR - Supabase environment variables aren't configured correctly. Supabase client will not be initialized. Fix ENV configuration or disable DB authentication with USE_DB_AUTHENTICATION env variable

  6. Configure the .env file with USE_DB_AUTHENTICATION=false

  7. Run docker compose up

  8. Run a scrape request with pageOptions/waitFor configured with a positive number

  9. Several lines in single_url.ts related to scraper selection check whether the property is undefined, so playwright is never used: https://github.com/mendableai/firecrawl/blob/5a778f2c22a451f1eead5eb9733bcd462d3cd081/apps/api/src/scraper/WebScraper/single_url.ts#L91

  10. scrape-events.ts raises an error when it tries to use Supabase, because it evaluates the property as a boolean: ERROR - Attempted to access Supabase client when it's not configured. https://github.com/mendableai/firecrawl/blob/5a778f2c22a451f1eead5eb9733bcd462d3cd081/apps/api/src/lib/scrape-events.ts#L39

Expected Behavior

  1. Configure .env with USE_DB_AUTHENTICATION=

  2. Run docker compose up with no errors

  3. Run a scrape request with pageOptions/waitFor configured with a positive number

  4. playwright is invoked with no errors

Environment (please complete the following information):

Logs

I provided the specific lines of code in the files where the errors were being generated.

rafaelsideguide commented 2 months ago

@rhyswynn good catch! #516 fixes some of the evaluations. I'm gonna add a commit with the ones that are missing.

rafaelsideguide commented 2 months ago

hey @rhyswynn I just committed all validations for USE_DB_AUTHENTICATION. Could you please check if #516 resolves this issue?

rhyswynn commented 2 months ago

scraper/WebScraper/single_url.ts also needs to be updated as the different scraper selection methods are evaluating the environment variable, not the new boolean variable.
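One way to derive a single boolean that every module can share is to parse the raw string once. This is only a sketch: `parseBooleanEnv` is a hypothetical helper name, and the rule that only "true" enables the feature is an assumption, not necessarily what PR #516 implements.

```typescript
// Hypothetical helper: converts a raw env string into a real boolean once,
// so callers never compare against strings or rely on truthiness.
function parseBooleanEnv(raw: string | undefined, fallback: boolean = false): boolean {
  // Unset variable: use the fallback.
  if (raw === undefined) return fallback;

  const normalized = raw.trim().toLowerCase();

  // Blank value (e.g. `USE_DB_AUTHENTICATION=`): use the fallback.
  if (normalized === "") return fallback;

  // Assumption: only the literal string "true" enables the feature.
  return normalized === "true";
}

// Example usage with a plain object standing in for the process environment.
const exampleEnv: Record<string, string | undefined> = { USE_DB_AUTHENTICATION: "false" };
const useDbAuthentication: boolean = parseBooleanEnv(exampleEnv["USE_DB_AUTHENTICATION"]);
```

Every module would then import the parsed boolean instead of reading the environment variable directly, which avoids the string-versus-boolean mismatch.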

kevinswiber commented 2 months ago

@rhyswynn @rafaelsideguide Some of those scraping variables are also impacted by #531. For example, the ScrapingBee scraper will attempt to run in Docker, because its value is actually set to the end-of-line comment instead of being blank.
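The end-of-line comment problem can be sketched as follows, assuming the env loader passes everything after the `=` through as the value. `stripInlineComment` is a hypothetical helper for illustration; it does not handle quoted values, and it is not part of Firecrawl or Docker.

```typescript
// Hypothetical helper: removes a trailing "# comment" from a raw env value.
// Assumption: a "#" starts a comment only at the start of the value or after whitespace.
function stripInlineComment(raw: string): string {
  const trimmed = raw.trim();

  // The whole value is a comment, e.g. `SOME_KEY=# optional`.
  if (trimmed.startsWith("#")) return "";

  // Find the first "#" that is preceded by whitespace.
  const commentStart = trimmed.search(/\s#/);
  return commentStart === -1 ? trimmed : trimmed.slice(0, commentStart).trimEnd();
}

// Without stripping, a comment-only value is a non-empty string and therefore truthy.
const rawValue = "# optional key";
const looksConfiguredBefore = Boolean(rawValue);
const looksConfiguredAfter = Boolean(stripInlineComment(rawValue));
```

In this sketch the unstripped value looks like a configured key, which is how a scraper that should be disabled could still be selected.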

rafaelsideguide commented 2 months ago

I'm adding to the scraping orders the following (PR #516):

export const baseScrapers = [
  useFireEngine ? "fire-engine" : undefined,
  useFireEngine ? "fire-engine;chrome-cdp" : undefined,
  useScrapingBee ? "scrapingBee" : undefined,
  useDatabaseAuth ? undefined : "playwright",
  useScrapingBee ? "scrapingBeeLoad" : undefined,
  "fetch",
].filter(Boolean);

let defaultOrder = [
  useFireEngine ? "fire-engine" : undefined,
  useFireEngine ? "fire-engine;chrome-cdp" : undefined,
  useScrapingBee ? "scrapingBee" : undefined,
  useScrapingBee ? "scrapingBeeLoad" : undefined,
  useDatabaseAuth ? undefined : "playwright",
  "fetch",
].filter(Boolean);

@kevinswiber @rhyswynn let me know if this solves the issue
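To see what the proposed default order produces for a given configuration, the snippet can be wrapped in a function that takes the three flags as parameters. This is a sketch that mirrors the snippet above; `ScraperFlags` and `buildDefaultOrder` are hypothetical names, and the real code may consider further conditions.

```typescript
// Hypothetical wrapper around the default order from the snippet above.
interface ScraperFlags {
  useFireEngine: boolean;
  useScrapingBee: boolean;
  useDatabaseAuth: boolean;
}

function buildDefaultOrder(flags: ScraperFlags): string[] {
  return [
    flags.useFireEngine ? "fire-engine" : undefined,
    flags.useFireEngine ? "fire-engine;chrome-cdp" : undefined,
    flags.useScrapingBee ? "scrapingBee" : undefined,
    flags.useScrapingBee ? "scrapingBeeLoad" : undefined,
    flags.useDatabaseAuth ? undefined : "playwright",
    "fetch",
    // Type-guard filter: drops the undefined entries and narrows the type to string[].
  ].filter((name): name is string => name !== undefined);
}

// Self-hosted setup with no external scrapers and DB authentication disabled.
const selfHostedOrder = buildDefaultOrder({
  useFireEngine: false,
  useScrapingBee: false,
  useDatabaseAuth: false,
});
```

For the self-hosted case shown, the order is playwright followed by fetch. With useDatabaseAuth set to true and the other flags false, only fetch remains.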

rhyswynn commented 2 months ago

Yes, #516 looks like it will take care of everything. Thank you!