Closed: datawhores closed this 1 month ago
Hi. I'm not part of the review process, but I don't think using JSON or YAML is a good idea for this kind of cache. You cannot append to either of them, so every time you want to update the cache (every new request) the entire thing needs to be rewritten to disk.
That only gets worse over time, given that this is essentially a perpetual cache: it only grows and never expires. I suggest using SQLite or even a shelve.
There is also this package, which adds caching to aiohttp requests: https://github.com/requests-cache/aiohttp-client-cache
The size will be bigger, but it will be easier to manage, faster, and it can be used on any site if wanted. Taking a worst-case approach, downloading the entire OF section of the forum (18.7K threads, 451.9K posts) would take about 22K requests, assuming 20 posts per page. At around 300 KB per page, that would be a ~6.5 GB cache, which I think is a pretty reasonable size for such an unreasonable download case. Most users won't ever come close to that.
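For reference, a minimal sketch of what that package provides, assuming its SQLite backend (the file name and URL are illustrative only):

```python
import asyncio

from aiohttp_client_cache import CachedSession, SQLiteBackend


async def main() -> None:
    # Each response is appended to a single SQLite file instead of
    # rewriting a JSON/YAML blob on every new request.
    cache = SQLiteBackend(cache_name="cyberdrop_cache.db")
    async with CachedSession(cache=cache) as session:
        async with session.get("https://example.com") as response:
            print(response.status)


asyncio.run(main())
```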
I agree, JSON definitely isn't an ideal way to cache the website. That package definitely looks like the easiest way to implement a cache, and I like that we could use it for any website. Large coomer pages usually take a while to scrape, and it can be annoying if you are re-scraping something you have already done before.
@NTFSvolume That package looks great!
Took a quick look at implementing it and it should be really easy to do.
Made a branch for it: https://github.com/jbsparrow/CyberDropDownloader/tree/aiohttp-client-caching
In here, when `next_page` does not exist, it means the scraper is currently at the last page of the thread. When that happens, the last request needs to be invalidated and redone with the cache disabled, or the crawler will never catch new posts on the thread.
It could also work to move the `next_page` check to the beginning of the while loop and do the request without the cache when there is no `next_page`, as in the sketch below.
Something similar needs to happen in the other forum crawlers.
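A rough sketch of that second option, imagined as a method on the forum crawler; the method names, the XenForo next-page selector, and the `skip_cache` flag (proposed further down) are all assumptions, not the project's actual code:

```python
from yarl import URL


async def scrape_thread(self, thread_url: URL, session) -> None:
    """Walk a thread page by page; never trust the cache for the last page."""
    page_url = thread_url
    while page_url is not None:
        soup = await self.get_BS4(self.domain, page_url, session)
        next_link = soup.select_one("a[class*=pageNav-jump--next]")  # XenForo "next page" link
        if next_link is None:
            # Last page of the thread: redo the request with the cache skipped
            # so newly added posts are always picked up.
            soup = await self.get_BS4(self.domain, page_url, session, skip_cache=True)
        await self.handle_posts(soup)  # hypothetical post-processing helper
        # Relative hrefs would need joining against the thread URL in real code.
        page_url = URL(next_link["href"]) if next_link else None
```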
The existing get_BS4 method can be changed to:
```python
async def get_BS4(self, domain: str, url: URL, client_session: ClientSession, skip_cache: bool = False) -> BeautifulSoup:
    """Returns a BeautifulSoup object from the given URL"""
    if skip_cache:
        async with client_session.disabled().get(url, headers=self._headers, ssl=self.client_manager.ssl_context, proxy=self.client_manager.proxy) as response:
            ...
```
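If `client_session` here is an aiohttp-client-cache `CachedSession`, the `skip_cache` branch could also be written with its `disabled()` context manager; this is my reading of the library docs, not necessarily how the branch will end up:

```python
    if skip_cache:
        # Temporarily bypass the cache for this single request, then parse as usual.
        async with client_session.disabled():
            async with client_session.get(url, headers=self._headers, ssl=self.client_manager.ssl_context, proxy=self.client_manager.proxy) as response:
                return BeautifulSoup(await response.text(), "html.parser")
```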
We can also set the cache expiry time. I was thinking 7 days for every website, and we could also add an option to configure it for individual websites.
@NTFSvolume The `filter_fn` parameter can be used to determine whether or not we should cache a response, without having to request twice or move the last-page code around. We can just turn the last-page checks into functions and use them that way.
https://aiohttp-client-cache.readthedocs.io/en/stable/user_guide.html#custom-response-filtering
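A hedged sketch of that idea, assuming `filter_fn` takes the response and returns whether it should be cached (per the linked docs); the last-page check itself, and whether the body is still readable at filter time, are assumptions to verify:

```python
import asyncio

from aiohttp_client_cache import CachedSession, SQLiteBackend


async def not_last_page(response) -> bool:
    """Only cache thread pages that still have a 'next page' link,
    so the newest page is always re-fetched."""
    html = await response.text()
    return "pageNav-jump--next" in html  # XenForo's next-page link marker


async def main() -> None:
    cache = SQLiteBackend("cyberdrop_cache.db", filter_fn=not_last_page)
    async with CachedSession(cache=cache) as session:
        await session.get("https://example.com/threads/some-thread/")


asyncio.run(main())
```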
> can be changed to `async def get_BS4(self, domain: str, url: URL, client_session: ClientSession, skip_cache: bool = False) -> BeautifulSoup: ...`
> We can also set the cache expiry time, I was thinking 7 days for every website, and we can also add the option to configure it for individual websites.
That works. If there is no `next_page`, `skip_cache` will be passed as `True`. As for the expire time, honestly I think even 1 month would be fine for forums, because realistically the content of old posts does not change frequently. The newest posts are the most likely to be edited for typos or to get added links, but those would already be handled by always fetching the last page from the forum itself.
1 month for forums and 7 days for file hosts / everything else. That would be my recommendation. But yes, the user should be able to change them.
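A sketch of those defaults, assuming the backend's `expire_after` / `urls_expire_after` options behave as described in the library docs (the URL pattern is illustrative; in practice both values would come from user config):

```python
from datetime import timedelta

from aiohttp_client_cache import SQLiteBackend

cache = SQLiteBackend(
    "cyberdrop_cache.db",
    expire_after=timedelta(days=7),  # default: file hosts / everything else
    urls_expire_after={
        # Forums: old posts rarely change, so keep them for a month.
        "simpcity.su/threads/*": timedelta(days=30),
    },
)
```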
> @NTFSvolume The `filter_fn` parameter can be used to determine whether or not we should cache a response without having to request twice or move the last page code around. Can just turn the last page checks into functions and use them that way.
> https://aiohttp-client-cache.readthedocs.io/en/stable/user_guide.html#custom-response-filtering
Excellent. That would be the best way. Do not cache a response if it is the last page.
Committed the check for SimpCity. Set everything to expire after 7 days, but I agree that 30 is a good idea. Will add more websites and stuff tomorrow.
Added caching support for coomer (not to say that it wasn't there before, but now it won't cache the last page of a coomer profile).
Also cleaned up code, among some other fixes.
This adds a cache to the SimpCity scraping process to reduce load.