teamcapybara / capybara

Acceptance test framework for web applications
http://teamcapybara.github.io/capybara/
MIT License

Issue with Capybara Gem - Scraping Blocked on Indeed Site #2735

Closed ror-web-expert closed 4 months ago

ror-web-expert commented 4 months ago

Problem: I'm facing an issue with my Rails application that involves scraping data from different sites using the Capybara gem. Everything works fine for most sites, but I'm encountering a problem specifically with Indeed.

Description: When I attempt to scrape data from Indeed with the headless option set to true, I get blocked. However, when I set the headless option to false, the scraping works fine. Upon inspecting the screenshot generated by @session.save_screenshot, it clearly indicates that I've been blocked.

[Screenshot: capybara-202401171752145506410800]

Steps to Reproduce:

1. Set headless: true in browser options.
2. Attempt to scrape data from Indeed.
3. Observe the blocking issue.

Expected Behavior: Scraping should work seamlessly with headless mode enabled, just as it does for other sites.
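
For reference, the headless toggle described in these steps might be wired up roughly as follows. This is only a sketch: the report does not say which driver is in use, so Selenium with Chrome is assumed, and the helper names `chrome_arguments` and `register_chrome_driver`, the driver name, and the window size are hypothetical.

```ruby
# Hypothetical helper: builds the Chrome command-line arguments for a run.
# Assumption: the report never names its driver; Selenium with Chrome is used here.
def chrome_arguments(headless:)
  args = ['--window-size=1280,800']        # illustrative value, not from the report
  args << '--headless=new' if headless     # the only difference between the two runs
  args
end

# Hypothetical helper: registers a Capybara driver with the given headless setting.
# Calling it needs the capybara and selenium-webdriver gems and a local Chrome.
def register_chrome_driver(name, headless:)
  require 'capybara'
  require 'selenium-webdriver'

  Capybara.register_driver(name) do |app|
    options = Selenium::WebDriver::Chrome::Options.new
    chrome_arguments(headless: headless).each { |arg| options.add_argument(arg) }
    Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
  end
end
```

Under these assumptions, calling `register_chrome_driver(:chrome_headless, headless: true)` and then `Capybara::Session.new(:chrome_headless)` would give the headless session from step 1, and passing `headless: false` would give the configuration reported as working.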

Environment:

Rails Version: 7
Capybara Version: 3.39.2
Nokogiri Version: 1.15.4-x86_64-linux

Additional Information:

Adding a proxy service did not resolve the issue. The problem seems specific to the interaction between Indeed and Capybara with headless mode.

Workaround:

Setting headless: false resolves the blocking issue, but this is not an ideal solution.

Request for Assistance: I'm seeking guidance on potential solutions or workarounds to enable headless scraping for Indeed without being blocked. Any insights or recommendations would be greatly appreciated.

Thank you for your assistance!

twalpole commented 4 months ago

I fail to see how this is an issue with Capybara. Capybara is a tool for testing web apps, not a scraping tool that actively hides itself from sites. The fact that you're using it to abuse the terms of service of Indeed is not something we can help you with.