Hello, and thank you for building this amazing library.
I'm using crawl4ai in a production environment with up to 50 concurrent requests in a FastAPI application. The problem I have is memory usage. I'm building with Docker, and this is my Dockerfile:
I tried two methods for handling crawl4ai. The first uses the FastAPI lifespan to create a global crawler:
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI
from crawl4ai import AsyncWebCrawler

# Global AsyncWebCrawler instance
crawler = None

@asynccontextmanager
async def lifespan(app_start: FastAPI):
    # Startup: create and initialize the AsyncWebCrawler
    global crawler
    crawler = AsyncWebCrawler(verbose=False, always_by_pass_cache=True)
    await crawler.__aenter__()
    yield
    # Shutdown: tear down the crawler's browser session
    if crawler:
        await crawler.__aexit__(None, None, None)

app = FastAPI(lifespan=lifespan)
scraping_semaphore = asyncio.Semaphore(10)
```
With this approach, memory usage keeps increasing indefinitely, and I have to reboot the server every three days to keep it running smoothly, even with the semaphore set to 10.
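For reference, the semaphore gates each crawl roughly like this (a simplified sketch; `crawl_with_limit` is an illustrative name, not my exact code):

```python
# Simplified sketch of how scraping_semaphore caps concurrency in the
# global-crawler setup (illustrative helper name, not the exact code I run):
async def crawl_with_limit(url: str) -> str:
    async with scraping_semaphore:  # at most 10 crawls run at once
        result = await crawler.arun(url=url, verbose=False, bypass_cache=True)
        return result.markdown
```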
Alternatively, I've tried using the crawler without a global instance. With this approach I see memory spikes, but usage eventually returns to normal. Additionally, with 10 concurrent requests running on a server with 4 vCPUs and 16 GB of RAM, the response time averages around 20 seconds.
```python
from pydantic import BaseModel

# Request body model (simplified to what the handler actually uses)
class ScrapeRequest(BaseModel):
    urls: list[str]

@app.post("/crawl_urls")
async def crawl_urls(request: ScrapeRequest):
    try:
        #print(f"Received {request.urls} urls to scrape")
        if not request.urls:
            return []
        tasks = [process_url(url) for url in request.urls]
        results = await asyncio.gather(*tasks)
        return results
    except Exception as e:
        #print(f"Error in scrape_urls: {e}")
        return []

async def process_url(url):
    try:
        # is_pdf is a small helper of mine that checks the content type
        if await is_pdf(url):
            return ''
        #start_time = time.time()
        result = await crawl_url(url)
        return result
    except Exception as e:
        #print(f"Error processing {url}: {e}")
        return ''

async def crawl_url(url):
    try:
        # A fresh crawler per URL instead of the global instance
        async with AsyncWebCrawler(verbose=False, always_by_pass_cache=True) as crawler:
            result = await crawler.arun(url=url, verbose=False, bypass_cache=True)
            #print(result.markdown)
            return result.markdown
    except Exception as e:
        print(f"error in crawl4ai {e}")
        return ''

# I'm bypassing the cache to test concurrent requests
```
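For context, the endpoint gets called roughly like this (illustrative host and URLs; `httpx` here just stands in for the actual caller):

```python
# Illustrative client call against the endpoint above
# (hypothetical host and URLs)
import httpx

resp = httpx.post(
    "http://localhost:8000/crawl_urls",
    json={"urls": ["https://example.com", "https://example.org/page"]},
    timeout=120,
)
print(resp.json())  # list of markdown strings, '' for failed URLs
```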
I’m not sure if there are specific settings I can adjust to improve performance and reduce memory usage. Any advice on optimizing this setup would be greatly appreciated.
P.S.: I also tried using arun_many, but it didn’t result in any performance improvement.
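That attempt looked roughly like this (a sketch from memory; `crawl_urls_batch` is an illustrative name, with the same cache-bypass flags as above):

```python
# Rough sketch of the arun_many attempt (from memory; same flags as the
# per-URL arun calls above, illustrative function name)
async def crawl_urls_batch(urls: list[str]) -> list[str]:
    async with AsyncWebCrawler(verbose=False, always_by_pass_cache=True) as crawler:
        results = await crawler.arun_many(urls=urls, bypass_cache=True)
        return [r.markdown for r in results]
```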