pavlovtech / WebReaper

Web scraper, crawler and parser in C#. Designed as simple, declarative and scalable web scraping solution.
GNU General Public License v3.0

Enhancement - Ability to Parse a list #28

Open Marcel0024 opened 3 months ago

Marcel0024 commented 3 months ago

Hi, I've been looking at this library and it's really promising. It saves a lot of time writing boilerplate. But I'm missing one feature that I need before I can use it for my use case.

Is your feature request related to a problem? Please describe.

The issue I'm running into is that I don't need to open each link to scrape it. My first page is the listings page itself, and it is paginated.

For example:

Page 1


[Listing 1]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 2]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 3]
    [Name]
    [Amount]
    [Rating]
    [Link]

 pages [1] 2 3 4 5 6 ... 234  

Page 2


[Listing 1]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 2]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 3]
    [Name]
    [Amount]
    [Rating]
    [Link]

 pages 1 [2] 3 4 5 6 ... 234  

The way the library is set up, I have to .Follow(...) each link and .Parse(...) each opened page. But in my case I don't need to: the data I need is already on the listings page.

Describe the solution you'd like

The ability to parse a list, perhaps using a JArray for the object returned in the entity.

Describe alternatives you've considered

I didn't find a workaround. I did try something like this:

.Parse([..Enumerable.Range(0, 10).Select(x =>
    {
        return new Schema($"Listing{x}")
        {
            new SchemaElement("Name", "div.min-w-0 > a > h2"),
            new SchemaElement("Amount", "div.min-w-0 > p.font-semibold")
        };
    })])

But all the listings come out the same, since the query selector just grabs the first match: https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Parser/Concrete/AngleSharpContentParser.cs#L85
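To illustrate the first-match behaviour, here is a minimal sketch using AngleSharp directly. The markup, the `div.listing > h2` selector and the `SelectorDemo` class are made up for illustration and are not part of WebReaper; the point is only the difference between `QuerySelector` (first match) and `QuerySelectorAll` (every match).

```csharp
using System.Linq;
using AngleSharp.Html.Parser;

public static class SelectorDemo
{
    // Hypothetical listings page with three entries (illustrative markup only).
    public const string Html =
        "<div class=\"listing\"><h2>First</h2></div>" +
        "<div class=\"listing\"><h2>Second</h2></div>" +
        "<div class=\"listing\"><h2>Third</h2></div>";

    // QuerySelector returns only the first element matching the selector,
    // so repeating the same selector always yields the same listing.
    public static string FirstName()
    {
        var document = new HtmlParser().ParseDocument(Html);
        return document.QuerySelector("div.listing > h2")?.TextContent;
    }

    // QuerySelectorAll returns every matching element, in document order.
    public static string[] AllNames()
    {
        var document = new HtmlParser().ParseDocument(Html);
        return document.QuerySelectorAll("div.listing > h2")
                       .Select(element => element.TextContent)
                       .ToArray();
    }
}
```

With this markup, `FirstName()` always gives the first listing's name no matter how many schemas reuse the selector, while `AllNames()` gives one entry per listing.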

Additional context

To keep backwards compatibility, I think this needs to be implemented on SchemaElement with a new property, maybe IsList or IsArray.

In FillOutput() https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Parser/Concrete/AngleSharpContentParser.cs#L43, inside the try block, we can differentiate whether the element is a list or not. If it is, a GetListData() method returns a list of values to add to a JArray.
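A rough sketch of that idea, written outside of WebReaper so it stays self-contained. `ListElement` is a hypothetical stand-in for SchemaElement with the proposed IsList flag, and `ListAwareParser.Parse` is an assumption about how FillOutput() could branch; neither is the library's actual code. It uses AngleSharp for the selectors and Newtonsoft.Json for the JObject/JArray output.

```csharp
using AngleSharp.Html.Parser;
using Newtonsoft.Json.Linq;

// Hypothetical stand-in for SchemaElement with the proposed IsList flag.
public sealed class ListElement
{
    public string Field { get; }
    public string Selector { get; }
    public bool IsList { get; }

    public ListElement(string field, string selector, bool isList = false)
    {
        Field = field;
        Selector = selector;
        IsList = isList;
    }
}

public static class ListAwareParser
{
    // Fills a JObject from the HTML: list elements become a JArray holding
    // every match, other elements keep the existing first-match behaviour.
    public static JObject Parse(string html, params ListElement[] schema)
    {
        var document = new HtmlParser().ParseDocument(html);
        var output = new JObject();

        foreach (var element in schema)
        {
            if (element.IsList)
            {
                var values = new JArray();
                foreach (var node in document.QuerySelectorAll(element.Selector))
                {
                    values.Add(node.TextContent.Trim());
                }
                output[element.Field] = values;
            }
            else
            {
                // Missing elements are stored as a JSON null.
                output[element.Field] =
                    document.QuerySelector(element.Selector)?.TextContent.Trim();
            }
        }

        return output;
    }
}
```

Note that each list field comes back as its own array here, so Name and Amount would be parallel arrays rather than one object per listing. Grouping the fields per listing would need a container selector on top of this.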

I'm willing to work on a PR with some guidance/approval.

Marcel0024 commented 3 months ago

I just realized you would have to change the Job implementation as well:

https://github.com/pavlovtech/WebReaper/blob/988ea8cff8ce7bcbc08e193f96350874022c24cb/WebReaper/Domain/Job.cs#L17

Because every page would have to become a TargetPage.

Damn, there's no way to override this. I thought a custom IContentParser would do the trick, but ran into this.