Closed ror-web-expert closed 4 months ago
I fail to see how this is an issue with Capybara. Capybara is a tool for testing web apps, not a scraping tool that actively hides itself from sites. The fact that you're using it to abuse the terms of service of Indeed is not something we can help you with.
Problem: I'm facing an issue with my Rails application that involves scraping data from different sites using the Capybara gem. Everything works fine for most sites, but I'm encountering a problem specifically with Indeed.
Description: When I attempt to scrape data from Indeed with the headless option set to true, I get blocked. However, when I set the headless option to false, the scraping works fine. Upon inspecting the screenshot generated by @session.save_screenshot, it clearly indicates that I've been blocked.
Steps to Reproduce:
1. Set headless: true in browser options.
2. Attempt to scrape data from Indeed.
3. Observe the blocking issue.
Expected Behavior: Scraping should work seamlessly with headless mode enabled, just as it does for other sites.
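For reference, the setup described in the steps above might look roughly like the sketch below. The report does not say which driver is in use, so this assumes Selenium with Chrome; the helper names chrome_arguments and register_scraper_driver, the driver name :scraper_chrome, and the window size are all hypothetical and only illustrate where the headless toggle would sit.

```ruby
# Hypothetical helper: turns the headless toggle into Chrome command-line flags.
def chrome_arguments(headless:)
  args = ["--window-size=1280,800"] # assumed window size, not from the report
  args << "--headless=new" if headless
  args
end

# Hypothetical helper: registers a Capybara driver using the flags above.
# Assumes the capybara and selenium-webdriver gems are installed; they are
# only loaded when this method is called.
def register_scraper_driver(headless:)
  require "capybara"
  require "selenium-webdriver"

  options = Selenium::WebDriver::Chrome::Options.new
  chrome_arguments(headless: headless).each { |arg| options.add_argument(arg) }

  Capybara.register_driver(:scraper_chrome) do |app|
    Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
  end
end
```

With the driver registered, a session created via Capybara::Session.new(:scraper_chrome) can call save_screenshot to capture what the page returned in each mode, which is how the block page was observed.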
Environment:
Rails Version: 7
Capybara Version: 3.39.2
Nokogiri Version: 1.15.4-x86_64-linux
Additional Information:
Adding a proxy service did not resolve the issue. The problem seems specific to the interaction between Indeed and Capybara with headless mode.
Workaround:
Setting headless: false resolves the blocking issue, but this is not an ideal solution.
Request for Assistance: I'm seeking guidance on potential solutions or workarounds to enable headless scraping for Indeed without being blocked. Any insights or recommendations would be greatly appreciated.
Thank you for your assistance!