josephleblanc / web_crawler

Crawl a given royalroad seed page for story text

How to generalize across different websites? #4

Open josephleblanc opened 2 years ago

josephleblanc commented 2 years ago

I'm having trouble writing a struct and methods that identify the target content and the next link from a config file in a way that generalizes well across websites. Any advice appreciated.

The goal of this program is to take a seed page, scrape data, and follow a link to the next page, then repeat. Currently this program is quite ad hoc. There is one struct for one website, which contains fields which will not generalize well to other websites the program should be able to scrape.
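To make the seed, scrape, follow, repeat cycle concrete, here is a minimal sketch of that loop using only the standard library. The names `Page`, `crawl`, and `fetch_and_parse` are hypothetical and not taken from this repository; the closure stands in for whatever fetches a URL and extracts the content and next link. The page cap and the visited set are additions for safety, not something the current program is known to do.

```rust
use std::collections::HashSet;

/// Hypothetical result of scraping a single page.
pub struct Page {
    /// The scraped target content (e.g. chapter text).
    pub content: String,
    /// Address of the next page, if one was found.
    pub next_url: Option<String>,
}

/// Start at `seed`, scrape each page, and follow the next link until
/// there is no next link, a page repeats, or `max_pages` is reached.
pub fn crawl<F>(seed: &str, max_pages: usize, mut fetch_and_parse: F) -> Vec<String>
where
    F: FnMut(&str) -> Option<Page>,
{
    let mut visited = HashSet::new();
    let mut contents = Vec::new();
    let mut current = Some(seed.to_string());

    while let Some(url) = current.take() {
        // Stop at the page limit, or if a "next" link loops back.
        if contents.len() >= max_pages || !visited.insert(url.clone()) {
            break;
        }
        match fetch_and_parse(&url) {
            Some(page) => {
                contents.push(page.content);
                current = page.next_url;
            }
            // Fetching or parsing failed: stop crawling.
            None => break,
        }
    }
    contents
}
```

Keeping the fetch-and-parse step behind a closure means the loop itself does not need to know anything about a particular website.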

There are three pieces of information required to scrape a given website:

Ideally we would want a struct with only that information for each target for this scraper. Also ideally, we would want a config file with templates for each website (royalroad.com, wuxiaworld.com) which can provide the identifiers of content and next page.
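One possible shape for such a per-site template is sketched below. Everything here is an assumption for illustration: the struct name `SiteProfile`, its fields, and the `profile_for` lookup are hypothetical, and the selector values are placeholders except for the royalroad next-link selector quoted later in this issue. A constant table stands in for the config file so the sketch stays self-contained.

```rust
/// Hypothetical per-site profile; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct SiteProfile {
    /// Domain the profile applies to, without a leading "www.".
    pub domain: &'static str,
    /// CSS selector for the target content (placeholder values below).
    pub content_selector: &'static str,
    /// CSS selector matching the candidate navigation links.
    pub next_link_selector: &'static str,
}

/// Stand-in for profiles that would be loaded from a config file.
pub const PROFILES: &[SiteProfile] = &[
    SiteProfile {
        domain: "royalroad.com",
        // Assumed placeholder, not taken from the issue.
        content_selector: "div.chapter-content",
        next_link_selector: r#"a[class="btn btn-primary col-xs-12"]"#,
    },
    SiteProfile {
        domain: "wuxiaworld.com",
        // Both selectors are placeholders for illustration.
        content_selector: "div.placeholder-content",
        next_link_selector: "a.placeholder-next",
    },
];

/// Find the profile whose domain matches the host part of `url`.
pub fn profile_for(url: &str) -> Option<&'static SiteProfile> {
    // Drop the scheme ("https://") if there is one.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    // The host is everything up to the first '/'.
    let host = rest.split('/').next().unwrap_or("");
    let host = host.strip_prefix("www.").unwrap_or(host);
    PROFILES.iter().find(|p| p.domain == host)
}
```

With a table like this, supporting a new website would mean adding an entry rather than writing a new struct, provided the site fits the same fields.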

The main difficulty is writing a method that will grab the next link for different websites with only data provided in a config file. The crate scraper seems good at grabbing certain tags, but it is not always easy to identify the next link. For example, the page from a 2nd chapter in a story from royalroad.com has the following html:

<div class="row nav-buttons">
    <div class="col-xs-6 col-md-4 col-lg-3 col-xl-2">
            <a class="btn btn-primary col-xs-12" href="/fiction/21220/mother-of-learning/chapter/301778/1-good-morning-brother">
                <i class="far fa-chevron-double-left mr-3"></i> Previous <br class="visible-xs-block" />Chapter
            </a>
    </div>
    <div class="col-xs-6 col-md-4 col-md-offset-4 col-lg-3 col-lg-offset-6">

            <a class="btn btn-primary col-xs-12" href="/fiction/21220/mother-of-learning/chapter/301784/3-the-bitter-truth">
                Next <br class="visible-xs-block" />Chapter <i class="far fa-chevron-double-right ml-3"></i>
            </a>
    </div>
</div>

There are two links in the above html, one going to the previous page and one going to the next page. Because the two links are within tags that are exactly the same, I can only see two ways of differentiating them - either by the order in which they appear or by their .inner_html() value. I use the following method to get the next link by the order in which they appear:

pub fn addr_next_chapter<'a>(html: &'a Html, selector: &'a Selector) -> Option<&'a str> {
    html
        .select(selector)
        .nth(1)?
        .value()
        .attr("href")
}

where the selector used in this case is provided by a config file to the struct, and is a[class="btn btn-primary col-xs-12"].

The problem with this method is that it will not generalize well, because it relies on the next link being the second of two similar links leading to the previous and next pages. On the other hand, if I instead use inner_html() to identify the next link, I need to give the method another parameter, which might not be used for websites other than royalroad.com. Not to mention that this website uses relative links, which must be combined with the base website URL before reqwest can fetch the next link, while other websites may use absolute links.
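One way to soften both concerns is to make the extra parameter optional and to normalize links in a separate step. The sketch below is std-only and makes several assumptions: the candidate links have already been extracted as `(href, visible text)` pairs (with scraper these could come from `.value().attr("href")` and `.text()`), and the names `pick_next` and `resolve_href` are hypothetical. `resolve_href` handles only absolute and root-relative links; for the general case, `Url::join` from the `url` crate would be a more robust choice.

```rust
/// Choose the next-page href from candidate `(href, text)` pairs.
/// If `marker` is given, take the first link whose text contains it
/// (case-insensitive); otherwise fall back to a positional index.
pub fn pick_next<'a>(
    candidates: &[(&'a str, &str)],
    marker: Option<&str>,
    fallback_index: usize,
) -> Option<&'a str> {
    match marker {
        Some(m) => {
            let m = m.to_lowercase();
            candidates
                .iter()
                .find(|(_, text)| text.to_lowercase().contains(m.as_str()))
                .map(|(href, _)| *href)
        }
        None => candidates.get(fallback_index).map(|(href, _)| *href),
    }
}

/// Turn an href into an absolute URL. Returns `None` for forms this
/// sketch does not handle (e.g. "chapter-3" or "../x").
pub fn resolve_href(base: &str, href: &str) -> Option<String> {
    if href.starts_with("http://") || href.starts_with("https://") {
        // Already absolute: use as-is.
        return Some(href.to_string());
    }
    if href.starts_with('/') {
        // Root-relative: keep only the scheme and host of `base`.
        let start = base.find("://").map_or(0, |i| i + 3);
        let end = base[start..].find('/').map_or(base.len(), |j| start + j);
        return Some(format!("{}{}", &base[..end], href));
    }
    None
}
```

A site profile that leaves the marker unset would behave like the positional `.nth(1)` approach, while one that sets it (for example to "next") would no longer depend on link order.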

tl;dr I'm having trouble writing a struct and methods that identify the target content and the next link from a config file in a way that generalizes well across websites. Any advice appreciated.

josephleblanc commented 2 years ago

Though it is rather ad hoc, I may end up simply minimizing the current struct and then writing a new struct for another website, to see how much work is involved and which similarities can be gleaned from the experience. Then moving forward I can make a more informed decision about how possible a generalized struct+methods are vs. how much work it takes to write new structs+methods for new websites. After all, this is just a simplified version of my next project, which will be a scraper that grabs financial advice articles.

josephleblanc commented 2 years ago

For now I am tentatively satisfied with the level of generality in the struct WebNovel and its methods.

It now uses two points of data to find the appropriate link to the next page for each page template. I'll see how well it works going forward, after adding a few more page profiles.