mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

Is it possible to get the intermediate pages returned by firecrawl's crawling #209

Open GildeshAbhay opened 3 weeks ago

GildeshAbhay commented 3 weeks ago

Problem Description If Firecrawl fails in the middle of a crawl, it returns nothing.

Proposed Feature It would be great if we could get the intermediate results (everything crawled/scraped successfully up to the point of failure).

Alternatives Considered Individual scraping won't work, as we would have to rely on third-party sources for the endpoints/sitemap.

Implementation Suggestions None yet

Use Case Any data is better than no data

Additional Context None yet

rafaelsideguide commented 3 weeks ago

@GildeshAbhay I think what you're asking for might already be covered, as Firecrawl can provide partial data through the status route if a crawl gets interrupted. You can check on https://docs.firecrawl.dev/api-reference/endpoint/status

partial_data

Partial documents returned as the site is being crawled (streaming). When a page is ready, it will be appended to the partial_data array, so there is no need to wait for the whole website to be crawled.

Could you check if this works for your situation?
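A minimal sketch of reading partial results from the status route, assuming the v0 API shape described in the docs linked above (the base URL, header, and exact field names here are assumptions, not confirmed API details):

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v0"  # assumed v0 base URL


def extract_pages(status_json: dict) -> list:
    """Return whatever pages a crawl-status response contains.

    A finished crawl is assumed to put documents in "data"; an in-progress
    or interrupted crawl streams completed pages into "partial_data".
    """
    return status_json.get("data") or status_json.get("partial_data") or []


def poll_crawl(job_id: str, api_key: str) -> list:
    """Fetch the status of a crawl job and return the pages ready so far."""
    req = urllib.request.Request(
        f"{API_BASE}/crawl/status/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return extract_pages(json.load(resp))
```

Even if the crawl later fails, anything already appended to `partial_data` at the time of the last poll is kept.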

GildeshAbhay commented 3 weeks ago

What if we need each page scraped and stored separately, via the crawl method? (Because we need Firecrawl's sitemap loader functionality.)

rafaelsideguide commented 3 weeks ago

@GildeshAbhay I'm not sure if I fully understand your problem. Just to clarify, partial_data includes content, markdown, and metadata for each page. If you're referring to a specific package or SDK, could you please provide more details?

GildeshAbhay commented 3 weeks ago

Right now, you create one job ID for one crawl job (which might include 25 pages, for example). What I want is those 25 pages scraped and stored one by one. You might say I can use Firecrawl's scrape method, but then I would lose Firecrawl's sitemap extraction functionality (which is available in the crawl method). I need Firecrawl's crawl method to return complete/partial data page by page rather than all at once.

rafaelsideguide commented 3 weeks ago

@GildeshAbhay Thanks for explaining what you need! You can actually run the crawl with crawlerOptions.returnOnlyUrls = true to get just the URLs from the sitemap. Then, you can scrape each URL individually with the scrape method. This way, you still use the sitemap extraction and get the data per page. Does this work for you?
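The two-step workflow suggested above (crawl with `returnOnlyUrls` for sitemap discovery, then scrape each URL individually) can be sketched like this. The endpoint paths, request shape, and the assumption that the URL list comes back in `data` are all unverified assumptions about the v0 API, and `store` is a hypothetical persistence callback:

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v0"  # assumed v0 base URL


def build_url_only_crawl(url: str) -> dict:
    """Request body for a crawl that returns only sitemap URLs, no content."""
    return {"url": url, "crawlerOptions": {"returnOnlyUrls": True}}


def post_json(endpoint: str, payload: dict, api_key: str) -> dict:
    """POST a JSON payload to the API and decode the JSON response."""
    req = urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def scrape_site_page_by_page(site: str, api_key: str, store) -> None:
    # Step 1: use the crawl endpoint only for sitemap/URL discovery.
    # (In practice you would poll the status route for the job to finish;
    # here we assume the URL list is returned in "data".)
    job = post_json("crawl", build_url_only_crawl(site), api_key)
    for page_url in job.get("data", []):
        # Step 2: scrape and persist each page independently, so a failure
        # on page N does not lose pages 1..N-1.
        doc = post_json("scrape", {"url": page_url}, api_key)
        store(page_url, doc)
```

The key design point is that each page becomes its own scrape request, so partial progress is persisted by `store` as you go instead of being tied to a single all-or-nothing crawl job.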

rafaelsideguide commented 1 week ago

@GildeshAbhay It’s been 14 days since you added your last comment here. Could you let us know if you're still experiencing the problem, or if there's anything else we can help you with?