pavlovtech / WebReaper

Web scraper, crawler and parser in C#. Designed as simple, declarative and scalable web scraping solution.
GNU General Public License v3.0

Please enable JS and disable any ad blocker message returned #15

Closed MattComb closed 1 year ago

MattComb commented 1 year ago

This is a very nice package, thank you. When I use it to count the cars on the page below, I get back the message "Please enable JS and disable any ad blocker".

Is this a limitation of the code, or is there something I can change with the headless browser?

Steps to reproduce the behavior:

var x = new ScraperEngineBuilder()
    .GetWithBrowser("https://shift.com/cars/", actions => actions
        .ScrollToEnd()
        .Build())
    .Parse(new()
    {
        new("action", "html")
    })
    .WriteToJsonFile(@"c://Oxford//output123.json")
    .LogToConsole()
    .Build()
    .Run();
pavlovtech commented 1 year ago

Thank you for trying out my library! This message is shown because the page is not fully rendered before you start scraping.

I slightly modified your code to add a delay:

.GetWithBrowser("https://shift.com/cars/", actions => actions
    .Wait(milliseconds: 10000) // <-- added: wait 10 seconds before scrolling
    .ScrollToEnd()
    .Build())

Now if I run it without headless mode I see the following page:

[image: screenshot of the page shown in the non-headless browser]

This site has protection against crawlers and web scrapers. The same page shows up if I open it in my personal Chrome profile. This is not a limitation of the library: overcoming captchas and other protection mechanisms is out of its scope.

If you want to overcome this protection, it will require additional research. If you work it out, you are welcome to embed the solution into the library.
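
Until such a solution exists, it may help to at least detect when the scraped HTML is the protection page rather than the real listing. Below is a minimal sketch in plain C# that does not use WebReaper at all. `LooksBlocked` and `blockMarkers` are hypothetical names introduced here for illustration, and the only marker is the message quoted in this issue; other sites will likely use different wording.

```csharp
using System;
using System.Linq;

// Marker text taken from the message quoted in this issue.
// Assumption: extend this list for other sites as needed.
string[] blockMarkers = { "Please enable JS and disable any ad blocker" };

// Hypothetical helper (not part of WebReaper): returns true when the
// scraped HTML contains one of the known block-page markers.
bool LooksBlocked(string html) =>
    !string.IsNullOrEmpty(html) &&
    blockMarkers.Any(m => html.Contains(m, StringComparison.OrdinalIgnoreCase));

// A block page is flagged, a normal listing is not.
Console.WriteLine(LooksBlocked("<p>Please enable JS and disable any ad blocker</p>")); // True
Console.WriteLine(LooksBlocked("<ul><li>2019 Honda Civic</li></ul>"));                 // False
```

A check like this could be run on the `html` value written to the JSON output, so a blocked run is reported as a failure instead of being mistaken for real data.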