mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

Strange behaviors in concurrency #538

Closed samiabdelhamid closed 1 week ago

samiabdelhamid commented 2 months ago

The issue has two parts.

Part 1, case 1: Set both the Docker Compose concurrency and workers to 10, send 10 concurrent requests to /crawl, then send 10 concurrent requests, one for each returned jobId. Result: 8 jobs are active with current_step: "scraping", and 2 jobs are waiting.

Part 1, case 2: Set both the Docker Compose concurrency and workers to 12, send 10 concurrent requests to /crawl, then send 10 concurrent requests, one for each returned jobId. Result: 8 jobs are active with current_step: "scraping", and 2 jobs are waiting.

Part 2: The 2 waiting jobs switch to status 'active' but never report a current_step. They stay that way indefinitely, and every resend of the request returns the same response: {"status":"active","data":null,"partial_data":[]}
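To make the Part 2 symptom easier to spot while polling, here is a minimal Python sketch that flags a response as stuck when it reports 'active' but carries no current_step, no data, and no partial_data. The function name `looks_stuck` is hypothetical and not part of Firecrawl; the field names come from the response quoted above.

```python
import json


def looks_stuck(response_body: str) -> bool:
    """Return True when a job reports 'active' but exposes no progress at all.

    Hypothetical helper: it only inspects the fields seen in this thread.
    """
    payload = json.loads(response_body)
    return (
        payload.get("status") == "active"
        and not payload.get("current_step")   # step missing or empty
        and not payload.get("data")           # null or empty data
        and not payload.get("partial_data")   # empty partial results
    )
```

A job that is active and reports current_step: "scraping" would not be flagged, so the check separates the 8 healthy jobs from the 2 affected ones.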

rafaelsideguide commented 2 months ago

Hey @samiabdelhamid,

For Part 1: It seems you might be reaching the limit of threads available on your machine, which is why you can only support a maximum of 8 concurrent workers. Could you please share the configuration of the machine you're using?

For Part 2: When the workers pick up the job, is the data being populated? Or is it just the current_step variable that isn’t being filled?

samiabdelhamid commented 2 months ago

@rafaelsideguide Thanks for your reply

Part 1: I'm using an EC2 "t3.large" instance. The t3.large has 2 vCPUs, and each vCPU can handle 2 threads. Therefore, a t3.large instance can handle 4 threads in total.

Part 2: No data is populated at all. The response only shows status active and never changes: {"status":"active","data":null,"partial_data":[]}

rafaelsideguide commented 2 months ago

Hey @samiabdelhamid,

Given your t3.large instance can handle 4 threads, I recommend setting your worker limit to 4. This should keep things running smoothly without overloading your system.
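As a rough illustration of that recommendation, the sketch below caps a requested worker count at the number of threads the machine reports. The helper name `recommended_worker_limit` is hypothetical, and relying on `os.cpu_count()` as the thread count is an assumption, not something Firecrawl does for you.

```python
import os
from typing import Optional


def recommended_worker_limit(requested: int, available_threads: Optional[int] = None) -> int:
    """Cap the requested worker count at the available thread count.

    Hypothetical helper: falls back to os.cpu_count() when no count is given.
    """
    if available_threads is None:
        available_threads = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available_threads))
```

For example, `recommended_worker_limit(10, 4)` yields 4, matching the limit suggested for a machine with 4 threads.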

Also, since we use Bull for queue management in Firecrawl, you might find this Bull GitHub discussion helpful for understanding how concurrency works with our setup.

Hope this helps! Let me know if there's anything else you need.

samiabdelhamid commented 2 months ago

@rafaelsideguide Thank you for your reply. The cause of Part 1 was NUM_OF_QUEUES being set to 8. I followed your suggestion and changed the worker limit to 4 so things run smoothly, but the issue in Part 2 remains: if a job gets a waiting status, then once it goes active, no data or step is returned.
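The Part 1 observation (8 active jobs whether workers is 10 or 12) is consistent with a second, smaller limit winning. The Python sketch below simulates that with a semaphore; it is not Firecrawl or Bull code. The name `peak_active_jobs` and the assumption that the effective cap is the smaller of the worker count and a queue cap such as NUM_OF_QUEUES are illustrative only.

```python
import asyncio


async def peak_active_jobs(num_jobs: int, workers: int, queue_cap: int) -> int:
    """Simulate jobs competing for slots; return the most jobs active at once."""
    # Assumption: the smaller of the two limits decides how many jobs run.
    slots = asyncio.Semaphore(min(workers, queue_cap))
    active = 0
    peak = 0

    async def run_job() -> None:
        nonlocal active, peak
        async with slots:              # jobs beyond the cap wait here
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for scraping work
            active -= 1

    await asyncio.gather(*(run_job() for _ in range(num_jobs)))
    return peak


if __name__ == "__main__":
    for workers in (10, 12):
        print(workers, asyncio.run(peak_active_jobs(10, workers, queue_cap=8)))
```

With a cap of 8, the simulated peak is 8 for both worker settings, and the remaining 2 jobs wait until a slot frees up.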

rafaelsideguide commented 2 months ago

@samiabdelhamid, recently (in PR #459) we switched from Bull to BullMQ, which offers better management of job concurrency. Could you update your repo, install the new packages, and test if the data problem is still happening?

rafaelsideguide commented 1 week ago

Closing this one for now, as we've made several improvements to the concurrency system. Feel free to reopen if you continue to experience any problems.