mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] Issue with crawl going beyond Limit #435

Closed · calebpeffer closed this issue 1 month ago

calebpeffer commented 1 month ago

From Janice in Discord:

I set a limit of 500 for my crawl, but it keeps crawling beyond 500 pages and I have to interrupt it on my end. My code is below. Am I doing something wrong?

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"
params = {
    "crawlerOptions": {
        "limit": 500,
        "maxDepth": 2,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": {
        "onlyMainContent": True,
        "parsePDF": True,
        "removeTags": ["script", "style", "nav", "header", "footer",
                       ".advertisement", ".sidebar", ".nav", ".menu",
                       "#comments", "img", "svg", "iframe", "video",
                       "audio"]
    },
}
urls = []
# Start the crawl asynchronously and keep the returned job handle
job_id = app.crawl_url(crawl_url, params=params, wait_until_done=False)

According to Sachin, it might be an issue with the maxDepth parameter.

calebpeffer commented 1 month ago

Update: it doesn't look like the maxDepth parameter had any impact here. The crawl still didn't respect the limit.

rafaelsideguide commented 1 month ago

@calebpeffer we were experiencing issues with this page because it contains thousands of images in the sitemap. This was causing the servers to run out of memory during the crawl.
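
For context, here is a minimal sketch of the kind of pre-filtering that avoids this, assuming a standard sitemap.xml and a hand-picked list of image extensions (both are assumptions for illustration, not Firecrawl's internal logic): drop image entries before they ever reach the crawl queue.

import requests
import xml.etree.ElementTree as ET

# Hypothetical extension list; adjust to whatever asset types should be skipped.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}

def page_urls_from_sitemap(sitemap_url):
    """Return sitemap <loc> entries that do not point at image files."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]
    return [u for u in urls
            if not any(u.lower().endswith(ext) for ext in IMAGE_EXTENSIONS)]

# Hypothetical sitemap path, used only for illustration:
# pages = page_urls_from_sitemap("https://www.lsu.edu/sitemap.xml")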

jgluck-eab commented 1 month ago

I just tried this code again and the limit is still not being respected. Could you have another look, please?

rafaelsideguide commented 1 month ago

@calebpeffer I've just submitted PR #485 which filters .inc files and checks for capitalized file extensions.
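
Roughly, the check described above amounts to a case-insensitive extension filter, something like the sketch below (the extension list and helper name are illustrative, not the exact code in the PR):

from urllib.parse import urlparse

# Illustrative exclusion list; the real PR targets .inc files and capitalized extensions.
EXCLUDED_EXTENSIONS = {".inc", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".css", ".js"}

def is_crawlable(url):
    """Reject URLs whose path ends in an excluded extension, regardless of case."""
    path = urlparse(url).path.lower()  # lowercasing catches .INC, .PNG, etc.
    return not any(path.endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(is_crawlable("https://www.lsu.edu/majors"))           # True
print(is_crawlable("https://www.lsu.edu/header.INC"))       # False
print(is_crawlable("https://www.lsu.edu/images/logo.PNG"))  # False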

In this PR, I conducted tests across various limits (0, 10, 100, 200, 500, and "not set") and maxDepths (0, 2, 5, and "not set"). Here are the results:

Cell values are the number of URLs crawled.

limit   | maxDepth: 0 | maxDepth: 2 | maxDepth: 5 | maxDepth: not set
--------|-------------|-------------|-------------|------------------
0       | 0           | 0           | 0           | 0
10      | 0           | 8           | 10          | 10
100     | 0           | 8           | 100         | 100
200     | 0           | 8           | 142         | 142
500     | 0           | 8           | 142         | 142
not set | 0           | 8           | 142         | 142

@jgluck-eab We plan to merge this PR in four hours. Could you rerun your code and check if you encounter any unexpected results? Please share any discrepancies and specify the results you were expecting.

Below is the code I used for testing:

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"

limits = [0, 10, 100, 200, 500, None]
maxDepths = [0, 2, 5, None]
urls_crawled = []

for maxDepth in maxDepths:
    for limit in limits:
        params = {
            "crawlerOptions": {
                "ignoreSitemap": False,
                "ignoreRobots": False,
            },
            "pageOptions": {
                "onlyMainContent": True,
                "removeTags": ["script", "style", "nav", "header", "footer",
                               ".advertisement", ".sidebar", ".nav", ".menu",
                               "#comments", "img", "svg", "iframe", "video",
                               "audio"]
            },
        }

        # Only set limit/maxDepth when the test case defines them ("not set" -> None)
        if limit is not None:
            params['crawlerOptions']['limit'] = limit
        if maxDepth is not None:
            params['crawlerOptions']['maxDepth'] = maxDepth

        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)
        job_id = crawl_request['jobId']

        # Poll until the crawl is no longer active
        status = app.check_crawl_status(job_id)
        while status['status'] == 'active':
            status = app.check_crawl_status(job_id)
            time.sleep(2)

        time.sleep(5)  # wait for the data to be saved in the db
        status = app.check_crawl_status(job_id)
        urls_crawled.append({
            "limit": limit,
            "maxDepth": maxDepth,
            "num_urls": len(status['data'])
        })

for result in urls_crawled:
    print(result)

jgluck-eab commented 1 month ago

I tried this again today and now the limit is working. This can be closed. Thank you!

rafaelsideguide commented 1 month ago

Closing this.