unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
1.14k stars 113 forks

Container Crashes with Multiple Concurrent Requests #30

Open FractalMind opened 2 days ago

FractalMind commented 2 days ago

Description:

When using Crawl4AI, I noticed that the container breaks when handling multiple requests simultaneously. The backend inside the container works fine with a single request. However, if I send three requests concurrently, the container crashes and becomes unresponsive. The only way to get it working again is to restart the container.

Steps to Reproduce:

1. Start the Crawl4AI container.
2. Send a single request to the backend and observe that it works correctly.
3. Send three concurrent requests to the backend.
4. Observe that the container crashes and becomes unresponsive.

Expected Behavior:

The container should handle multiple concurrent requests without crashing.

Actual Behavior:

The container crashes and becomes unresponsive when handling multiple concurrent requests. It requires a restart to function again.
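
The reproduction steps above can be sketched as a small script. This is only a sketch under assumptions: the `POST /crawl` route and the three URLs come from the logs below, but the host and port (`localhost:8000`) and the JSON payload shape (`{"urls": [...]}`) are guesses that may need adjusting to the actual deployment. The helper names `post_crawl` and `send_concurrently` are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: host and port are not stated in the issue; adjust as needed.
CRAWL_ENDPOINT = "http://localhost:8000/crawl"

# The three sites that appear in the logs.
URLS = [
    "https://www.victorytrainingcenter.ca/en/",
    "https://alten.ca",
    "https://dantech.ca",
]


def post_crawl(url, endpoint=CRAWL_ENDPOINT, timeout=60):
    """POST one crawl request and return the HTTP status code."""
    # Assumption: the payload shape is a guess, not taken from the issue.
    body = json.dumps({"urls": [url]}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def send_concurrently(urls, send=post_crawl):
    """Submit every URL at once; collect a result or an error string per URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        futures = {url: pool.submit(send, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going so every request is reported
                results[url] = f"error: {exc}"
    return results


if __name__ == "__main__":
    for url, outcome in send_concurrently(URLS).items():
        print(url, "->", outcome)
```

In the logs below the server answers `200 OK` even for crawls that failed, so the status code alone may not reveal the problem; the container logs are the more reliable signal.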

Environment:

Crawl4AI Version: f8a11779 unclecode unclecode@kidocode.com on 6/26/24 at 4:48 AM
Docker Version: Docker version 24.0.7, build 24.0.7-0ubuntu4
Host OS: Ubuntu 24.04 LTS

Logs:

[LOG] 🚀 Extraction done for https://www.victorytrainingcenter.ca/en/, time taken: 0.1942427158355713 seconds.
Error caching URL: table crawled_data has no column named links
INFO: 172.26.0.1:45036 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://www.victorytrainingcenter.ca/en/']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://www.victorytrainingcenter.ca/en/ using LocalSeleniumCrawlerStrategy...
[LOG] ✅ Crawled https://www.victorytrainingcenter.ca/en/ successfully!
[LOG] 🚀 Crawling done for https://www.victorytrainingcenter.ca/en/, success: True, time taken: 0.6984555721282959 seconds
DEBUG:root:[LOG] Crawl request for URL: ['https://alten.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://alten.ca using LocalSeleniumCrawlerStrategy...
[LOG] 🚀 Content extracted for https://www.victorytrainingcenter.ca/en/, success: True, time taken: 0.19995403289794922 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.victorytrainingcenter.ca/en/, Strategy: NoExtractionStrategy
[LOG] 🚀 Extraction done for https://www.victorytrainingcenter.ca/en/, time taken: 0.20115089416503906 seconds.
Error caching URL: table crawled_data has no column named links
INFO: 172.26.0.1:53162 - "POST /crawl HTTP/1.1" 200 OK
[LOG] ✅ Crawled https://alten.ca successfully!
[LOG] 🚀 Crawling done for https://alten.ca, success: True, time taken: 1.1569640636444092 seconds
[LOG] 🚀 Content extracted for https://alten.ca, success: True, time taken: 0.08790349960327148 seconds
[LOG] 🔥 Extracting semantic blocks for https://alten.ca, Strategy: NoExtractionStrategy
[LOG] 🚀 Extraction done for https://alten.ca, time taken: 0.0891120433807373 seconds.
Error caching URL: table crawled_data has no column named links
INFO: 172.26.0.1:53166 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://dantech.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://dantech.ca using LocalSeleniumCrawlerStrategy...
[LOG] ✅ Crawled https://dantech.ca successfully!
[LOG] 🚀 Crawling done for https://dantech.ca, success: True, time taken: 0.6926393508911133 seconds
[LOG] 🚀 Content extracted for https://dantech.ca, success: True, time taken: 0.27173662185668945 seconds
[LOG] 🔥 Extracting semantic blocks for https://dantech.ca, Strategy: NoExtractionStrategy
[LOG] 🚀 Extraction done for https://dantech.ca, time taken: 0.2736227512359619 seconds.
//!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error caching URL: table crawled_data has no column named links
//!!!!!!!!!!! <= START CONCURRENT REQUESTS HERE
INFO: 172.26.0.1:40742 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://www.victorytrainingcenter.ca/en/']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://www.victorytrainingcenter.ca/en/ using LocalSeleniumCrawlerStrategy...
DEBUG:root:[LOG] Crawl request for URL: ['https://alten.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://alten.ca using LocalSeleniumCrawlerStrategy...
DEBUG:root:[LOG] Crawl request for URL: ['https://dantech.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://dantech.ca using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://www.victorytrainingcenter.ca/en/, error: Failed to crawl https://www.victorytrainingcenter.ca/en/: unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed (Session info: chrome-headless-shell=126.0.6478.126)
INFO: 172.26.0.1:59356 - "POST /crawl HTTP/1.1" 200 OK
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: localhost. Connection pool size: 1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: localhost. Connection pool size: 1
[ERROR] 🚫 Failed to crawl https://dantech.ca, error: Failed to crawl https://dantech.ca: invalid session id
[ERROR] 🚫 Failed to crawl https://alten.ca, error: Failed to crawl https://alten.ca: invalid session id
INFO: 172.26.0.1:59366 - "POST /crawl HTTP/1.1" 200 OK
INFO: 172.26.0.1:59368 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://www.victorytrainingcenter.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://www.victorytrainingcenter.ca using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://www.victorytrainingcenter.ca, error: Failed to crawl https://www.victorytrainingcenter.ca: invalid session id
INFO: 172.26.0.1:60476 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://dantech.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://dantech.ca using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://dantech.ca, error: Failed to crawl https://dantech.ca: invalid session id
INFO: 172.26.0.1:60486 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://alten.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://alten.ca using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://alten.ca, error: Failed to crawl https://alten.ca: invalid session id
INFO: 172.26.0.1:60492 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://dantech.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://dantech.ca using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://dantech.ca, error: Failed to crawl https://dantech.ca: invalid session id
INFO: 172.26.0.1:38250 - "POST /crawl HTTP/1.1" 200 OK
DEBUG:root:[LOG] Crawl request for URL: ['https://dantech.ca']
DEBUG:root:[LOG] Loading extraction and chunking strategies...
DEBUG:root:[LOG] Running the WebCrawler...
Error retrieving cached URL: no such column: links
[LOG] 🕸️ Crawling https://dantech.ca using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://dantech.ca, error: Failed to crawl https://dantech.ca: invalid session id
INFO: 172.26.0.1:51836 - "POST /crawl HTTP/1.1" 200 OK
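
One possible reading of the log, not confirmed anywhere in this thread, is that the three concurrent requests shared a single browser session: the first crash deletes the session, and every later request then fails with `invalid session id`. The `Connection pool size: 1` warnings would be consistent with several threads talking to one driver at once. If that guess is right, serializing access to the shared session is one way to test it. The sketch below illustrates the idea with a hypothetical `FakeDriver` stand-in; it is not Crawl4AI's code and does not use Selenium.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a browser session that is not thread-safe."""

    def __init__(self):
        self._busy = False
        self.overlaps = 0  # times get() was entered while another call ran

    def get(self, url):
        if self._busy:
            self.overlaps += 1
        self._busy = True
        time.sleep(0.01)  # simulate a page load
        self._busy = False
        return f"<html>{url}</html>"


class SerializedCrawler:
    """Lets only one thread use the shared driver at a time."""

    def __init__(self, driver):
        self._driver = driver
        self._lock = threading.Lock()

    def crawl(self, url):
        # The lock makes concurrent callers wait instead of interleaving.
        with self._lock:
            return self._driver.get(url)


def crawl_all(crawler, urls):
    """Call crawler.crawl for every URL from separate threads, in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return list(pool.map(crawler.crawl, urls))


if __name__ == "__main__":
    driver = FakeDriver()
    crawl_all(SerializedCrawler(driver), ["https://alten.ca", "https://dantech.ca"])
    print("overlapping calls:", driver.overlaps)
```

A lock trades throughput for safety, since requests queue behind each other. Creating a separate browser session per request would be the alternative, at the cost of more memory per request.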

FractalMind commented 2 days ago

I have rebuilt with 3 workers instead of 1 and I get the same behaviour.

unclecode commented 2 days ago

@FractalMind Have you tried running it without the container to see if it gives the same result? I want to know whether this issue occurs only when we run it in a container. Thx