Open calebpeffer opened 2 weeks ago
100% I believe it is already planned for v1. ccing @rafaelsideguide
For a little more color, I'm scraping a user's website as a part of the onboarding experience for my product. Users get dropped into a blog/content editor and scraping runs in the background. I need scraping results to get incrementally indexed into our relational + vector storage layer so that the user can start using our product (which relies on grounding/RAG over the content) immediately rather than waiting until the crawl completes.
Since the `partial_data` doesn't give me control over pagination (it is possible I fall behind processing the window provided by `partial_data`) I'm currently using the `/scrape` endpoint instead of `/crawl`. This means fetching the sitemap on my own, and then submitting each URL into a kinesis stream with a lambda worker that calls `/scrape`. The benefits of this today are:
If throughput of using crawl is substantially slower than calling scrape directly I'll probably stick with that route. Even if I use crawl, I'll still have to run some sort of background task to process the results of a crawl as they are made available.
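For anyone following along, the sitemap-then-scrape workaround described above can be sketched roughly like this. It only covers the sitemap parsing and fan-out; `enqueue` is a hypothetical stand-in for whatever puts a URL on the stream (a Kinesis put in my case), and the worker that actually calls `/scrape` is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List

# Standard namespace used by sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def fan_out(urls: Iterable[str], enqueue: Callable[[str], None]) -> int:
    """Hand each URL to `enqueue` (hypothetical: e.g. a stream producer).

    Returns the number of URLs submitted.
    """
    count = 0
    for url in urls:
        enqueue(url)
        count += 1
    return count
```

Each queued URL is then processed independently by a worker, which is what lets results be indexed one page at a time instead of waiting on the whole crawl.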
One thing that might be worth considering in the future is adding a webhook that the crawler can call with each result. Then I could configure firecrawl to call my indexing API directly with the markdown from every URL. I'd just need to be able to configure a few custom parameters (parameterized URL for the webhook/api I'm calling, API token to authenticate with my API, some metadata internal to my system about the customer/org doing the crawl).
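To make the webhook idea a bit more concrete, here is a rough sketch of the three parameters I mean and the request the crawler might send per page. Everything named here (`WebhookConfig`, `build_webhook_request`, the `{crawl_id}` placeholder, the payload fields) is hypothetical and not an existing firecrawl option.

```python
import json
import urllib.request
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WebhookConfig:
    """Hypothetical per-crawl webhook settings supplied by the caller."""
    url_template: str   # parameterized URL, e.g. ".../index/{crawl_id}"
    api_token: str      # token used to authenticate with the caller's API
    metadata: Dict[str, str] = field(default_factory=dict)  # caller's own customer/org info


def build_webhook_request(
    cfg: WebhookConfig, crawl_id: str, page_url: str, markdown: str
) -> urllib.request.Request:
    """Build (but do not send) the POST for one crawled page."""
    body = json.dumps(
        {"url": page_url, "markdown": markdown, "metadata": cfg.metadata}
    ).encode("utf-8")
    return urllib.request.Request(
        cfg.url_template.format(crawl_id=crawl_id),
        data=body,
        headers={
            "Authorization": f"Bearer {cfg.api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the metadata back untouched is the important part for me, since that is how my indexing API would know which customer/org a page belongs to.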
This would be helpful for me too.
Currently, the stream on crawl sends back information in the partial data array, which is limited to 50 items according to the docs. The newer items replace older items in partial data.
@evanboyle suggested we give him a "cursor", basically a position marker that lets him paginate through the stream and choose which part of it to read. Another option would be an iterator (call .next() to get the data).
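As a rough illustration of the difference from the rolling 50-item window, both options could sit on top of an append-only result log. This is only a sketch of the idea under that assumption; `ResultLog`, `read`, and `iterate` are hypothetical names, not the current or planned API.

```python
from typing import Any, Iterator, List, Tuple


class ResultLog:
    """Append-only store of crawl results, addressed by integer position."""

    def __init__(self) -> None:
        self._items: List[Any] = []

    def append(self, item: Any) -> None:
        self._items.append(item)

    def read(self, cursor: int = 0, limit: int = 50) -> Tuple[List[Any], int]:
        """Cursor option: return up to `limit` items starting at `cursor`,
        plus the cursor to pass on the next call."""
        batch = self._items[cursor:cursor + limit]
        return batch, cursor + len(batch)

    def iterate(self, cursor: int = 0) -> Iterator[Any]:
        """Iterator option: yield items one at a time via next()."""
        while cursor < len(self._items):
            yield self._items[cursor]
            cursor += 1
```

With a cursor, a consumer that falls behind simply resumes from its last position, so older results are never overwritten before it gets to them.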
Whatever approach is best, we can improve the devx here.