mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[Feat] Better dev experience when accessing stream results on /crawl #384

Open calebpeffer opened 2 weeks ago

calebpeffer commented 2 weeks ago

Currently, the stream on /crawl sends back information in the partial_data array, which is limited to 50 items according to the docs. Newer items replace older ones in partial_data.
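
For reference, this is roughly how partial_data gets consumed today (a sketch against the v0 REST endpoints, POST /v0/crawl and GET /v0/crawl/status/:jobId; field names follow the docs, but treat the exact response shapes as assumptions):

```python
import os
import time

import requests

API = "https://api.firecrawl.dev/v0"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Start a crawl job; the response contains a jobId to poll.
job = requests.post(f"{API}/crawl", headers=HEADERS,
                    json={"url": "https://example.com"}).json()
job_id = job["jobId"]

seen = set()
while True:
    status = requests.get(f"{API}/crawl/status/{job_id}", headers=HEADERS).json()
    # partial_data is a rolling window of ~50 items: older entries get evicted,
    # so a consumer that falls behind can miss pages entirely.
    for doc in status.get("partial_data") or []:
        url = (doc.get("metadata") or {}).get("sourceURL")
        if url and url not in seen:
            seen.add(url)
            # ...index doc["markdown"] incrementally here...
    if status.get("status") == "completed":
        break
    time.sleep(2)
```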

@evanboyle suggested we give him a "cursor", basically one that lets him paginate through whichever part of the stream he wants to access. Another option would be an iterator (call .next() to get the next results).
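
Roughly what a cursor could look like from the client side (purely hypothetical: the /results route, the cursor parameter, and the next_cursor field don't exist today; this is just to illustrate the shape of the devx):

```python
# Hypothetical cursor-based pagination over crawl results (illustrative only).
# The client keeps an opaque cursor and the server returns everything scraped
# since that cursor, so nothing is lost even if the consumer falls behind.
import os
import time

import requests

API = "https://api.firecrawl.dev/v0"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

def iter_crawl_results(job_id):
    cursor = None
    while True:
        params = {"cursor": cursor} if cursor else {}
        page = requests.get(f"{API}/crawl/{job_id}/results",  # hypothetical route
                            headers=HEADERS, params=params).json()
        for doc in page["data"]:
            yield doc
        cursor = page.get("next_cursor", cursor)
        if page.get("status") == "completed" and not page["data"]:
            return
        time.sleep(1)  # poll again for results past the cursor

# Usage: documents stream out as they are scraped, with no gaps,
# e.g. `for doc in iter_crawl_results(job_id): index(doc["markdown"])`.
```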

Whatever approach is best, we can improve the devx here.

nickscamara commented 2 weeks ago

100% I believe it is already planned for v1. ccing @rafaelsideguide

EvanBoyle commented 2 weeks ago

For a little more color, I'm scraping a user's website as a part of the onboarding experience for my product. Users get dropped into a blog/content editor and scraping runs in the background. I need scraping results to get incrementally indexed into our relational + vector storage layer so that the user can start using our product (which relies on grounding/RAG over the content) immediately rather than waiting until the crawl completes.

Since partial_data doesn't give me control over pagination (it is possible I fall behind the window provided by partial_data), I'm currently using the /scrape endpoint instead of /crawl. This means fetching the sitemap on my own and then submitting each URL into a Kinesis stream, with a Lambda worker that calls /scrape (sketched below, after this list). The benefits of this today are:

  1. I can stay under the scrape rate limit (my workload is inconsistent and "bursty"), and error handling is automatic (retry policy set up on Kinesis/Lambda)
  2. I can control concurrency so that scraping a sitemap happens as fast as possible. Maybe a crawl can do 500 pages/minute, but I haven't tried it
  3. Content gets incrementally indexed and made available to RAG during user onboarding.
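
Roughly, the Lambda side of that pipeline looks like this (simplified sketch; the record fields and the indexing call are specific to my setup, and the /scrape response shape is per the v0 docs, so treat the details as assumptions):

```python
# Sketch of the Lambda consumer in the Kinesis -> Lambda -> /scrape pipeline
# described above. Retries are handled by the Kinesis/Lambda retry policy:
# raising on a failed scrape causes the record batch to be redelivered.
import base64
import json
import os

import requests

API = "https://api.firecrawl.dev/v0"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        resp = requests.post(f"{API}/scrape", headers=HEADERS,
                             json={"url": payload["url"]})
        resp.raise_for_status()  # let Kinesis/Lambda retry on failure
        markdown = resp.json()["data"]["markdown"]
        index_document(payload["org_id"], payload["url"], markdown)

def index_document(org_id, url, markdown):
    # placeholder for the relational + vector indexing step
    ...
```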

If the throughput of /crawl is substantially slower than calling /scrape directly, I'll probably stick with that route. Even if I use /crawl, I'll still have to run some sort of background task to process the results of a crawl as they are made available.

One thing that might be worth considering in the future is adding a webhook that the crawler can call with each result. Then I could configure Firecrawl to call my indexing API directly with the markdown from every URL. I'd just need to be able to configure a few custom parameters: a parameterized URL for the webhook/API being called, an API token to authenticate with my API, and some metadata internal to my system about the customer/org doing the crawl.
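
Something like this on my receiving end (hypothetical: the payload shape, the X-Org-Id header, and the bearer-token check are my assumptions about how those parameters would flow through, not an existing Firecrawl contract):

```python
# Hypothetical receiver for the per-result webhook proposed above.
import os

from flask import Flask, abort, request

app = Flask(__name__)

@app.post("/firecrawl/webhook")
def ingest():
    # Token configured on the crawl so Firecrawl can authenticate against my API.
    if request.headers.get("Authorization") != f"Bearer {os.environ['MY_WEBHOOK_TOKEN']}":
        abort(401)
    body = request.get_json()
    org_id = request.headers.get("X-Org-Id")  # crawl-level metadata passed through
    index_document(org_id, body["url"], body["markdown"])
    return "", 204

def index_document(org_id, url, markdown):
    # placeholder for the relational + vector indexing step
    ...
```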

jgluck-eab commented 2 weeks ago

This would be helpful for me too.