Closed Mews closed 1 week ago
Oh right I guess this new "body" field doesn't match the type hint for crawl_result :P
Should it be Dict[str, Dict[str, Union[List[str], str]]]?
@indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/
self.crawl_result: Dict[str, Dict[str, Union[List[str], str]]] = {}
Does this not work?
If the type hint for crawl_result is not working, we can just set it to a basic dict or override/suppress checking that case and move on.
Alright I'm on my phone right now but I'll get to it when I get home :+1:
> self.crawl_result: Dict[str, Dict[str, Union[List[str], str]]] = {}
> Does this not work?
Nope, that's what raised the error on the CI. I'll open an issue about it so that it can be dealt with later.
@indrajithi Ok I just introduced a temporary fix, I set the type hint to Dict[str, Any], so you can rerun the CI and merge if everything passes. I'll open the issue now.
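For reference, here is a minimal sketch of the typing question discussed in this thread. It assumes each entry in crawl_result holds a list of links under a "urls" key and the page body under a "body" key; the "urls" key name and the PageResult class are hypothetical and not taken from the project. One likely reason the Union-based hint fails type checking is that list methods such as append are not valid on the str member of the Union. A TypedDict is one possible stricter alternative to Dict[str, Any], not necessarily what the project will adopt.

```python
from typing import Any, Dict, List, TypedDict


class PageResult(TypedDict, total=False):
    """Hypothetical per-page entry; total=False makes every key optional."""
    urls: List[str]  # assumed key name for the links found on the page
    body: str        # only present when the body is requested


# Temporary fix from this thread: loose typing that accepts any value shape.
loose_result: Dict[str, Any] = {}
loose_result["https://example.com"] = {"urls": [], "body": "<html></html>"}

# Possible stricter alternative: each key gets its own precise type.
typed_result: Dict[str, PageResult] = {}
typed_result["https://example.com"] = {"urls": []}
# Type checks because "urls" is known to be a List[str], not a Union.
typed_result["https://example.com"]["urls"].append("https://example.com/about")
typed_result["https://example.com"]["body"] = "<html></html>"
```

With the TypedDict version a type checker can tell that "urls" is a list and "body" is a string, so no Union narrowing or suppression should be needed.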
Closes #8
Changes
- include_body argument in the Spider class. This is a boolean that defaults to false. When set to true, the body of the crawled pages will be included in crawl_result.
- Spider.crawl: added the code to add the body to crawl_result from soup.html.

Right now the body of the page is added regardless of whether it finds links inside it or not! This just felt like the most expected behavior, but let me know if I should change it. Also there are no verbose prints, I didn't find it necessary but let me know if I should add some 👍
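The behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: record_page is a hypothetical helper (the PR puts this logic inside Spider.crawl), the "urls" key name is assumed, and the HTML is passed in as a plain string rather than taken from soup.html so the example has no third-party dependencies.

```python
from typing import Any, Dict, List


def record_page(
    crawl_result: Dict[str, Any],
    url: str,
    links: List[str],
    html: str,
    include_body: bool = False,
) -> None:
    """Hypothetical helper: store one crawled page in crawl_result."""
    entry: Dict[str, Any] = {"urls": list(links)}  # assumed key name
    if include_body:
        # The body is stored even when no links were found on the page.
        entry["body"] = html
    crawl_result[url] = entry


result: Dict[str, Any] = {}
# Page without links: the body is still recorded when include_body is True.
record_page(result, "https://example.com", [], "<html></html>", include_body=True)
# Default behavior: no "body" key is added.
record_page(result, "https://example.com/about", ["https://example.com"], "<html></html>")
```

Leaving the "body" key out entirely when include_body is false keeps the default output the same shape as before this change.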