"Please enable JS and disable any ad blocker" message returned

pavlovtech / WebReaper

Web scraper, crawler and parser in C#. Designed as a simple, declarative and scalable web scraping solution.

GNU General Public License v3.0

109 stars, 26 forks

var x = new ScraperEngineBuilder()
    .GetWithBrowser("https://shift.com/cars/", actions => actions
        .ScrollToEnd()
        .Build())
    .Parse(new() { new("action", "html") })
    .WriteToJsonFile(@"c://Oxford//output123.json")
    .LogToConsole()
    .Build()
    .Run();

Thank you for trying out my library! This message is shown because the page is not fully rendered when scraping starts.

I slightly modified your code to add a delay:

.GetWithBrowser("https://shift.com/cars/", actions => actions
                .Wait(milliseconds: 10000) // <----------
                .ScrollToEnd()
                .Build())

Now, if I run it without headless mode, I see the following page:

This site has protection against crawlers and web scrapers. The same page shows up if I open it in my personal Chrome profile. This is not a limitation of the library: overcoming captchas and other protection mechanisms is out of its scope.
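As a rough illustration, the scraped HTML can be checked for the block message before it is trusted as real content. The sketch below is plain .NET and does not use any WebReaper API; the `BlockPageCheck` class and `LooksBlocked` method are hypothetical names, and the marker text is simply the message reported in this issue, so other sites will likely need a different string.

```csharp
using System;

// Hypothetical helper, not part of WebReaper.
public static class BlockPageCheck
{
    // Message text taken from this issue; an assumption for any other site.
    private const string BlockMarker = "Please enable JS and disable any ad blocker";

    // Returns true when the HTML appears to be the anti-bot page
    // rather than the real content.
    public static bool LooksBlocked(string html) =>
        !string.IsNullOrEmpty(html) &&
        html.Contains(BlockMarker, StringComparison.OrdinalIgnoreCase);
}
```

A check like this only detects the protection page; it does nothing to get past it.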

Overcoming this protection requires additional research. If you work it out, you can certainly integrate the solution into the library.

Please enable JS and disable any ad blocker message returned #15