josephleblanc / web_crawler

Crawl a given royalroad seed page for story text

Consider using a library for XML/HTML parsing and selection #2

Closed guygastineau closed 2 years ago

guygastineau commented 2 years ago

I looked on crates.io, and I found https://crates.io/crates/scraper. This looks like an excellent library for parsing and querying XML/HTML. The current code is specific to one website, and extending it to work with other websites would require duplicating most of the work in lib.rs for each target. A better architecture would use some sort of config language letting the user define different sites and selector queries for retrieving the raw data, which is then handled by a common routine. To accommodate such a design you would need a tested HTML parser and some way to parse selectors for queries. Luckily scraper does both of these things, and it is actively maintained. I know this is a lot to think about at once. If that all sounds too daunting to do in one go, then after eliminating the warning pointed out in #1 I would suggest just rewriting your backend with the scraper crate before implementing some sort of config.
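
To give a sense of the API, here is a minimal sketch of parsing a page and pulling text out with scraper; the HTML string and the div.chapter-content p selector are placeholders I am assuming for illustration, not code from this project:

use scraper::{Html, Selector};

fn main() {
    // Placeholder HTML standing in for a fetched chapter page.
    let html = r#"<div class="chapter-content"><p>Once upon a time...</p></div>"#;

    let document = Html::parse_document(html);
    // Selector::parse returns a Result; unwrap is acceptable for a hard-coded selector.
    let selector = Selector::parse("div.chapter-content p").unwrap();

    for paragraph in document.select(&selector) {
        // text() yields the element's text nodes, collected here into one String.
        let text: String = paragraph.text().collect();
        println!("{text}");
    }
}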

guygastineau commented 2 years ago

We will probably hit some interesting road-blocks in pursuit of this more general design, but the reward will be well worth it. Ad hoc implementations of actions such as scraping don't suit Rust very well imo. The programs feel sort of throwaway, and we can end up with a lot of duplicated code across many projects all doing almost the same thing. This is common for Python, and if that is the desired path then Python or Scheme are both great fits; Rust, however, like most compiled languages and especially languages with explicit concern for memory, is not well suited to that approach.

guygastineau commented 2 years ago

I won't make a new issue of it yet, but after exploring a library for HTML parsing and querying, the door will open for a more general architecture that:

  1. Will make the project easier to understand and test.
  2. Will give the program a larger set of work to do via configuration files and generality, letting you explore the impact of using an async framework with a thread pool to do a bunch of parallel work if you don't mind hogging system resources to get the scraping done (see the sketch after this list). Of course, making the resource usage configurable via config or CLI flags would then be a good idea.
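
As a rough illustration of point 2, here is a minimal sketch of bounded parallel fetching, assuming the tokio and futures crates; fetch_page and the example URLs are hypothetical placeholders rather than anything from this project:

use futures::stream::{self, StreamExt};

// Placeholder for the real fetch-and-scrape routine.
async fn fetch_page(url: String) -> String {
    format!("<html>fetched {url}</html>")
}

#[tokio::main]
async fn main() {
    let urls: Vec<String> = (1..=8).map(|i| format!("https://example.com/{i}")).collect();
    let max_in_flight = 4; // could come from a CLI flag or the config file

    // buffer_unordered keeps at most max_in_flight requests running at once.
    let pages: Vec<String> = stream::iter(urls)
        .map(fetch_page)
        .buffer_unordered(max_in_flight)
        .collect()
        .await;

    println!("fetched {} pages", pages.len());
}
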
josephleblanc commented 2 years ago

I'm going to follow your advice in the first comment on this issue, and rewrite the backend with scraper before implementing some sort of config.

I like the idea of making the program more general, and that trying to do so will present some interesting challenges and opportunities to learn.

josephleblanc commented 2 years ago

A few things come to mind when I consider making a config file, such as:

  1. This may also be a good opportunity to use enum or struct?
  2. keeping a record of which addresses for which sites had already been scraped, so work is not duplicated

josephleblanc commented 2 years ago

OK it has been refactored to use the scraper crate!

That took more work than I thought it would, but I am learning a lot about using crates and which parts of the program would benefit from being generalized further.

On the downside, there are now a lot of uses of unwrap.

guygastineau commented 2 years ago

> This may also be a good opportunity to use enum or struct?

Yes, you will likely want to use some kind of struct. We might need enums somewhere along the way, but the top-level representation of a config will probably be a struct.

> keeping a record of which addresses for which sites had already been scraped, so work is not duplicated

Well, we should have all the "profiles" for each site from the config in some kind of iterable, from whatever library we use to read whatever spec/format/language we choose for the config. So we can just traverse the entries, which will keep us from repeating work. If you want to deduplicate entries, you could collect the profiles from the config into a set such as a HashSet.
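
For concreteness, a minimal sketch of what that could look like, assuming the serde and toml crates; the SiteProfile and Config names and their fields are hypothetical, not a proposed final schema:

use std::collections::HashSet;

use serde::Deserialize;

#[derive(Debug, Deserialize, PartialEq, Eq, Hash)]
struct SiteProfile {
    name: String,
    seed_url: String,
    text_selector: String,
}

#[derive(Debug, Deserialize)]
struct Config {
    sites: Vec<SiteProfile>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Inline TOML standing in for a config file read from disk.
    let raw = r#"
        [[sites]]
        name = "royalroad"
        seed_url = "https://www.royalroad.com/fiction/12345"
        text_selector = "div.chapter-content p"
    "#;
    let config: Config = toml::from_str(raw)?;

    // A HashSet over the profiles drops duplicates so no site is scraped twice.
    let unique: HashSet<&SiteProfile> = config.sites.iter().collect();
    println!("{} unique site profile(s)", unique.len());
    Ok(())
}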


> On the downside, there are now a lot of uses of unwrap.

Yeah, I checked your unwrap usage with this grep -R unwrap src/ | wc -l which won't count multiple unwraps on a single line. The total was 20. I am surprised clippy didn't complain... Oh well, I'll take a real look at the code, but the basic answer for this is to handle the return values instead of unwrapping them. There are various ways to do this. You can use match to scrutinize the sum types (enums) Result<T,E> and Option<T>. That is probably the simplest way; just some good old pattern matching on tagged unions. Rust also has if let ... syntax, and and_then, which is like a poor man's bind from Haskell:

-- In Haskell
(>>=) :: Monad m => m a -> (a -> m b) -> m b
// In Rust

impl<T> Option<T> {
    pub const fn and_then<U, F: FnOnce(T) -> Option<U>>(self, f: F) -> Option<U> {
        match self {
            Some(x) => f(x),
            None => None,
        }
    }
}

// Or the more complicated and_then for Result...
impl<T, E> Result<T, E> {
    pub fn and_then<U, F: FnOnce(T) -> Result<U, E>>(self, op: F) -> Result<U, E> {
        match self {
            Ok(t) => op(t),
            Err(e) => Err(e),
        }
    }
}

Using and_then you can string together introductions (functions taking a plain value and returning another value wrapped up in Result or Option respectively) to compose a pipeline that short-circuits on early failures. It is a nice way to control the flow without being as tedious as matching explicitly everywhere. In Rust we can also use ? at the end of an expression (still before the ;) to return early with the Err or None, depending on the return type. This might seem like too much right now, but I figured I should mention it.
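
As a small illustration of those three styles (match, and_then, and ?) on plain standard-library types, with hypothetical parse_port helpers rather than code from this project:

use std::num::ParseIntError;

// 1. Explicit pattern matching on the Result.
fn parse_port_match(s: &str) -> Result<u16, ParseIntError> {
    match s.trim().parse::<u16>() {
        Ok(port) => Ok(port),
        Err(e) => Err(e),
    }
}

// 2. and_then chains another fallible step onto an earlier Option.
fn parse_port_and_then(s: &str) -> Option<u16> {
    s.split(':').nth(1).and_then(|p| p.parse::<u16>().ok())
}

// 3. ? returns early with the Err and keeps the happy path flat.
fn parse_port_question(s: &str) -> Result<u16, ParseIntError> {
    let port = s.trim().parse::<u16>()?;
    Ok(port)
}

fn main() {
    assert_eq!(parse_port_match("8080"), Ok(8080));
    assert_eq!(parse_port_and_then("localhost:8080"), Some(8080));
    assert!(parse_port_question("not a port").is_err());
}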


Okay, I'll take a look at the code now, and I will post examples of fixing the unwraps.

guygastineau commented 2 years ago

Oh, I see you learned about ?

josephleblanc commented 2 years ago

Hey, yeah, the ? was something I just used, I think for the first time. Mostly the errors are propagated up out of lib.rs now, and I guess I'm just not exactly sure what to do with them.

What I'm used to, coming from C, is having a program fail catastrophically whenever something unexpected happens, and then unwinding the program to find the problem. That is one of the reasons I want to focus a bit on error handling: it is something I don't really grok yet.

I also went ahead and made a first pass at creating a struct for the page being crawled. I think my next step will be:

Those are my first thoughts at next steps, but I'm also curious if there are any other improvements or directions you would suggest.

guygastineau commented 2 years ago

Failing catastrophically in an executable is fine, especially if you have good error messages. You might like to use expect to have more control over what message is passed to the user during a crash in main.
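
For example, a minimal sketch of a controlled crash in main; load_config and the sites.toml file name are hypothetical placeholders:

use std::fs;

fn load_config(path: &str) -> std::io::Result<String> {
    fs::read_to_string(path)
}

fn main() {
    // expect panics with this message (plus the underlying error) if the
    // config cannot be read, which is friendlier than a bare unwrap.
    let config = load_config("sites.toml")
        .expect("could not read sites.toml; does the file exist?");
    println!("loaded {} bytes of config", config.len());
}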