We will probably hit some interesting road-blocks in pursuit of this more general design, but the reward will be well worth it. Ad hoc implementations of actions such as scraping don't suit Rust very well, in my opinion. The programs feel sort of throwaway, and we can end up with a lot of duplicated code across many projects that all do almost the same thing. That is common for Python, and if that is the desired path then Python or Scheme are both great fits. Rust, however, like most compiled languages and especially languages with an explicit concern for memory, is not well suited to this approach.
I won't make a new issue of it yet, but after exploring a library for HTML parsing and querying, the door will open for a more general architecture that:
I'm going to follow your advice in the first comment on this issue, and rewrite the backend with `scraper` before implementing some sort of config.
I like the idea of making the program more general, and I think trying to do so will present some interesting challenges and opportunities to learn.
A few things come to mind when I consider making a config file, such as: `enum` or `struct`?

OK, it has been refactored to use the `scraper` crate!
That took more work than I thought it would, but I am learning a lot about using crates and which parts of the program would benefit from being generalized further.
On the downside, there are now a lot of uses of `unwrap`.
This may also be a good opportunity to use an `enum` or `struct`?
Yes, you will likely want to use some kind of `struct`. We might need `enum`s somewhere along the way, but the top-level representation of a config will probably be a `struct`.
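As a rough illustration (these field names and the per-site "profile" idea are my own assumptions, not anything from the repo), a top-level config might look something like this:

```rust
// Hypothetical sketch of a top-level config; all field names here are
// assumptions for illustration only.
struct SiteProfile {
    name: String,          // label for the site
    base_url: String,      // address to start scraping from
    item_selector: String, // CSS selector for the data we care about
}

struct Config {
    output_dir: String,         // where scraped results get written
    profiles: Vec<SiteProfile>, // one entry per site
}
```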
> keeping a record of which addresses for which sites had already been scraped, so work is not duplicated
Well, we should have all the "profiles" for each site from the config in some kind of iterable, coming from whatever library we use to read whatever spec/format/language we choose for the config. So we can just traverse the entries, which will keep us from repeating work. If you want to deduplicate entries, you could use a set such as a `HashSet` on the entries/profiles from the config.
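A minimal sketch of that idea, using a `HashSet` to drop duplicate addresses (the function and variable names here are just made up):

```rust
use std::collections::HashSet;

// Hypothetical: given the addresses collected from the config, keep only
// the first occurrence of each so no address is scraped twice in one run.
fn dedup_addresses(addresses: Vec<String>) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    addresses
        .into_iter()
        .filter(|addr| seen.insert(addr.clone())) // insert returns false for duplicates
        .collect()
}

fn main() {
    let addrs = vec![
        "https://example.com/a".to_string(),
        "https://example.com/b".to_string(),
        "https://example.com/a".to_string(), // duplicate entry
    ];
    println!("{:?}", dedup_addresses(addrs));
    // ["https://example.com/a", "https://example.com/b"]
}
```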
> On the downside, there are now a lot of uses of `unwrap`.
Yeah, I checked your `unwrap` usage with `grep -R unwrap src/ | wc -l`, which won't count multiple unwraps on a single line. The total was 20. I am surprised clippy didn't complain... Oh well, I'll take a real look at the code, but the basic answer for this is to handle the return values instead of unwrapping them. There are various ways to do this. You can use `match` to scrutinize the sum types (enums) `Result<T, E>` and `Option<T>`. That is probably the simplest way; just some good old pattern matching on tagged unions. Rust also has `if let ...` syntax, and `and_then` is like a poor man's bind from Haskell:
```haskell
-- In Haskell
(>>=) :: Monad m => m a -> (a -> m b) -> m b
```

```rust
// In Rust
impl<T> Option<T> {
    pub const fn and_then<U, F: FnOnce(T) -> Option<U>>(self, f: F) -> Option<U> {
        match self {
            Some(x) => f(x),
            None => None,
        }
    }
}

// Or the more complicated and_then for Result...
impl<T, E> Result<T, E> {
    pub fn and_then<U, F: FnOnce(T) -> Result<U, E>>(self, op: F) -> Result<U, E> {
        match self {
            Ok(t) => op(t),
            Err(e) => Err(e),
        }
    }
}
```
Using `and_then` you can string together introductions (functions taking a type and returning another type wrapped up in `Result` or `Option` respectively) to compose a pipeline that short-circuits on early failures. It is a nice way to control the flow without being as tedious. In Rust we can also use `?` at the end of statements (still before the `;`) to return the `Err` or `None` early, depending on the return type. This might seem like too much right now, but I figured I should mention it.
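To make that concrete, here is a small self-contained sketch (the `parse_port`/`bump_port` names are made up for illustration) showing the same fallible step handled three ways: with `match`, with `and_then`, and with `?`:

```rust
use std::num::ParseIntError;

// Hypothetical helper: parse a port number out of a string.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.trim().parse::<u16>()
}

// 1. Plain pattern matching on the Result.
fn bump_port_match(s: &str) -> Result<u16, ParseIntError> {
    match parse_port(s) {
        Ok(port) => Ok(port + 1),
        Err(e) => Err(e),
    }
}

// 2. and_then chains the next step and short-circuits on the first Err.
fn bump_port_and_then(s: &str) -> Result<u16, ParseIntError> {
    parse_port(s).and_then(|port| Ok(port + 1))
}

// 3. ? returns the Err early and leaves the happy path readable.
fn bump_port_question(s: &str) -> Result<u16, ParseIntError> {
    let port = parse_port(s)?;
    Ok(port + 1)
}

fn main() {
    println!("{:?}", bump_port_match("8080"));    // Ok(8081)
    println!("{:?}", bump_port_and_then("8080")); // Ok(8081)
    println!("{:?}", bump_port_question("oops")); // Err(ParseIntError { .. })
}
```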
Okay, I'll take a look at the code now, and I will post examples of fixing the unwraps.
Oh, I see you learned about `?`.
Hey, yeah, the `?` was something I just used, I think for the first time. Mostly the errors are propagated up out of `lib.rs` now, and I guess I'm just not exactly sure what to do with them.
What I'm used to (coming from C) is having a program fail catastrophically whenever something unexpected happens, and then unwinding the program to find the problem. That is one of the reasons why I want to focus a bit on error handling, because it is something I don't really grok.
I also went ahead and made a first pass at creating a struct for the page being crawled. I think my next step will be:
Those are my first thoughts on next steps, but I'm also curious whether there are any other improvements or directions you would suggest.
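For context, a page struct along these lines might look something like the following (purely a hypothetical sketch, not the actual code):

```rust
// Purely hypothetical sketch of what a struct for a crawled page might
// hold; the actual fields in the repo may differ.
struct Page {
    url: String,           // address the page was fetched from
    title: Option<String>, // page title, if one was found
    body: String,          // raw HTML that the selectors run against
}
```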
Failing catastrophically in an executable is fine, especially if you have good error messages. You might like to use `expect` to have more control over what message is passed to the user during a crash in `main`.
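For example, something like this (the file name is just a placeholder) replaces a bare `unwrap` with a message the user will actually see:

```rust
use std::fs;

fn main() {
    // expect behaves like unwrap, but on failure the panic message says
    // what went wrong instead of just pointing at a line number.
    let config = fs::read_to_string("sites.toml")
        .expect("could not read config file sites.toml");
    println!("loaded {} bytes of config", config.len());
}
```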
I looked on crates.io, and I found https://crates.io/crates/scraper. This looks like an excellent library for parsing and querying XML/HTML. The current code is specific to one website, and extending it to work with other websites would require duplicating most of the work in `lib.rs` for each target. A better architecture would use some sort of config language letting the user define different sites and selector queries for retrieving the raw data, which is then handled by a common routine. In order to accommodate such a design you would need a tested HTML parser and some way to parse selectors for queries. Luckily, `scraper` does both of these things, and it is actively maintained. I know this is a lot to think about at once. After eliminating the warning pointed out in #1, I would suggest just rewriting your backend with the `scraper` crate before implementing some sort of config, if that all sounds too daunting to do at once.
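To give a feel for it, here is a minimal sketch of how `scraper` parses a document and runs a CSS selector query (the HTML snippet and selector are made up for illustration):

```rust
use scraper::{Html, Selector};

fn main() {
    // A made-up document standing in for whatever a target site returns.
    let html = r#"<ul><li class="item">foo</li><li class="item">bar</li></ul>"#;

    // Parse the document and the CSS selector; in a config-driven design
    // the selector string would come from the per-site config entry.
    let document = Html::parse_document(html);
    let selector = Selector::parse("li.item").expect("invalid CSS selector");

    for element in document.select(&selector) {
        let text: String = element.text().collect();
        println!("{}", text); // prints "foo" then "bar"
    }
}
```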