mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] Issue with crawl going beyond Limit #435

Closed · calebpeffer closed this issue 1 month ago

calebpeffer commented 1 month ago

From Janice in Discord:

I set a limit of 500 for my crawl, but it keeps crawling beyond 500 pages and I have to interrupt it on my end. My code is below. Am I doing something wrong?

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"
params = {
    "crawlerOptions": {
        "limit": 500,
        "maxDepth": 2,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": {
        "onlyMainContent": True,
        "parsePDF": True,
        "removeTags": ["script", "style", "nav", "header", "footer",
                       ".advertisement", ".sidebar", ".nav", ".menu",
                       "#comments", "img", "svg", "iframe", "video",
                       "audio"]
    },
}
urls = []
# Start the crawl asynchronously and keep the returned job handle
job_id = app.crawl_url(crawl_url, params=params, wait_until_done=False)

According to Sachin, it might be an issue with the maxDepth parameter.

calebpeffer commented 1 month ago

Update: it doesn't look like the maxDepth parameter had any impact here. The crawl still didn't respect the limit.

rafaelsideguide commented 1 month ago

@calebpeffer we were experiencing issues with this page because it contains thousands of images in the sitemap. This was causing the servers to run out of memory during the crawl.
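
For context, here is a minimal sketch of the kind of pre-filtering that avoids this, assuming a standard sitemap.xml and a hand-picked list of image extensions (both are assumptions for illustration, not Firecrawl's internal logic): drop image entries before they ever reach the crawl queue.

import requests
import xml.etree.ElementTree as ET

# Hypothetical extension list; adjust to whatever asset types should be skipped.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}

def page_urls_from_sitemap(sitemap_url):
    """Return sitemap <loc> entries that do not point at image files."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]
    return [u for u in urls
            if not any(u.lower().endswith(ext) for ext in IMAGE_EXTENSIONS)]

# Hypothetical sitemap path, used only for illustration:
# pages = page_urls_from_sitemap("https://www.lsu.edu/sitemap.xml")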

jgluck-eab commented 1 month ago

I just tried this code again and the limit is still not being respected. Could you have another look, please?

rafaelsideguide commented 1 month ago

@calebpeffer I've just submitted PR #485 which filters .inc files and checks for capitalized file extensions.
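
Roughly, the check described above amounts to a case-insensitive extension filter, something like the sketch below (the extension list and helper name are illustrative, not the exact code in the PR):

from urllib.parse import urlparse

# Illustrative exclusion list; the real PR targets .inc files and capitalized extensions.
EXCLUDED_EXTENSIONS = {".inc", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".css", ".js"}

def is_crawlable(url):
    """Reject URLs whose path ends in an excluded extension, regardless of case."""
    path = urlparse(url).path.lower()  # lowercasing catches .INC, .PNG, etc.
    return not any(path.endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(is_crawlable("https://www.lsu.edu/majors"))           # True
print(is_crawlable("https://www.lsu.edu/header.INC"))       # False
print(is_crawlable("https://www.lsu.edu/images/logo.PNG"))  # False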

In this PR, I conducted tests across various limits (0, 10, 100, 200, 500, and "not set") and maxDepths (0, 2, 5, and "not set"). Here are the results:

Cell values are the number of URLs crawled.

limit   | maxDepth: 0 | maxDepth: 2 | maxDepth: 5 | maxDepth: not set
--------|-------------|-------------|-------------|------------------
0       | 0           | 0           | 0           | 0
10      | 0           | 8           | 10          | 10
100     | 0           | 8           | 100         | 100
200     | 0           | 8           | 142         | 142
500     | 0           | 8           | 142         | 142
not set | 0           | 8           | 142         | 142

@jgluck-eab We plan to merge this PR in four hours. Could you rerun your code and check if you encounter any unexpected results? Please share any discrepancies and specify the results you were expecting.

Below is the code I used for testing:

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"

limits = [0, 10, 100, 200, 500, None]
maxDepths = [0, 2, 5, None]
urls_crawled = []

for maxDepth in maxDepths:
    for limit in limits:
        params = {
            "crawlerOptions": {
                "ignoreSitemap": False,
                "ignoreRobots": False,
            },
            "pageOptions": {
                "onlyMainContent": True,
                "removeTags": ["script", "style", "nav", "header", "footer",
                               ".advertisement", ".sidebar", ".nav", ".menu",
                               "#comments", "img", "svg", "iframe", "video",
                               "audio"]
            },
        }

        # Only set limit/maxDepth when the test case defines them ("not set" -> None)
        if limit is not None:
            params['crawlerOptions']['limit'] = limit
        if maxDepth is not None:
            params['crawlerOptions']['maxDepth'] = maxDepth

        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)
        job_id = crawl_request['jobId']

        # Poll until the crawl is no longer active
        status = app.check_crawl_status(job_id)
        while status['status'] == 'active':
            status = app.check_crawl_status(job_id)
            time.sleep(2)

        time.sleep(5)  # wait for the data to be saved in the db
        status = app.check_crawl_status(job_id)
        urls_crawled.append({
            "limit": limit,
            "maxDepth": maxDepth,
            "num_urls": len(status['data'])
        })

for result in urls_crawled:
    print(result)

jgluck-eab commented 1 month ago

I tried this again today and now the limit is working. This can be closed. Thank you!

rafaelsideguide commented 1 month ago

Closing this.