Closed: datawhores closed this 1 month ago
Hi. I'm not part of the review process, but I don't think using JSON or YAML is a good idea for this kind of cache. You cannot append to either of them, so every time you want to update the cache (every new request) the entire thing needs to be rewritten to disk.
That only gets worse over time, given that this is essentially a perpetual cache: it only grows and never expires. I suggest using SQLite or even a shelve.
There is also this package, which adds caching to aiohttp requests: https://github.com/requests-cache/aiohttp-client-cache
The size will be bigger, but it will be easier to manage, faster, and it can be used on any site if wanted. Taking a worst-case approach, downloading the entire OF section of the forum (18.7K threads, 451.9K posts) would take about 22K requests, assuming 20 posts per page. At around 300 KB per page, that would be a ~6.5 GB cache, which I think is a pretty reasonable size for such an unreasonable download case. Most users won't ever come close to that.
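For reference, a minimal sketch of what that package provides, assuming its SQLite backend (the file name and URL are illustrative only):

```python
import asyncio

from aiohttp_client_cache import CachedSession, SQLiteBackend


async def main() -> None:
    # Each response is appended to a single SQLite file instead of
    # rewriting a JSON/YAML blob on every new request.
    cache = SQLiteBackend(cache_name="cyberdrop_cache.db")
    async with CachedSession(cache=cache) as session:
        async with session.get("https://example.com") as response:
            print(response.status)


asyncio.run(main())
```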
I agree, JSON definitely isn't an ideal way to cache the website. That package definitely looks like the easiest way to implement a cache, and I like that we could use it for any website. Large coomer pages usually take a while to scrape, and it can be annoying if you are re-scraping something you have already done before.
@NTFSvolume That package looks great!
Took a quick look at implementing it and it should be really easy to do.
Made a branch for it: https://github.com/jbsparrow/CyberDropDownloader/tree/aiohttp-client-caching
In here, when `next_page` does not exist, it means the scraper is currently at the last page of the thread. When that happens, the last request needs to be invalidated and redone with the cache disabled, or the crawler will never catch new posts on the thread.
It could also work to move the `next_page` check to the beginning of the while loop and do the request without the cache when there is no `next_page`, as in the sketch below.
Something similar needs to happen in the other forum crawlers.
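A rough sketch of that second option, imagined as a method on the forum crawler; the method names, the XenForo next-page selector, and the `skip_cache` flag (proposed further down) are all assumptions, not the project's actual code:

```python
from yarl import URL


async def scrape_thread(self, thread_url: URL, session) -> None:
    """Walk a thread page by page; never trust the cache for the last page."""
    page_url = thread_url
    while page_url is not None:
        soup = await self.get_BS4(self.domain, page_url, session)
        next_link = soup.select_one("a[class*=pageNav-jump--next]")  # XenForo "next page" link
        if next_link is None:
            # Last page of the thread: redo the request with the cache skipped
            # so newly added posts are always picked up.
            soup = await self.get_BS4(self.domain, page_url, session, skip_cache=True)
        await self.handle_posts(soup)  # hypothetical post-processing helper
        # Relative hrefs would need joining against the thread URL in real code.
        page_url = URL(next_link["href"]) if next_link else None
```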
The existing get_BS4 method can be changed to:
```python
async def get_BS4(self, domain: str, url: URL, client_session: ClientSession, skip_cache: bool = False) -> BeautifulSoup:
    """Returns a BeautifulSoup object from the given URL"""
    if skip_cache:
        async with client_session.disabled().get(url, headers=self._headers, ssl=self.client_manager.ssl_context, proxy=self.client_manager.proxy) as response:
            ...
```
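If `client_session` here is an aiohttp-client-cache `CachedSession`, the `skip_cache` branch could also be written with its `disabled()` context manager; this is my reading of the library docs, not necessarily how the branch will end up:

```python
    if skip_cache:
        # Temporarily bypass the cache for this single request, then parse as usual.
        async with client_session.disabled():
            async with client_session.get(url, headers=self._headers, ssl=self.client_manager.ssl_context, proxy=self.client_manager.proxy) as response:
                return BeautifulSoup(await response.text(), "html.parser")
```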
We can also set the cache expiry time. I was thinking 7 days for every website, and we could also add an option to configure it for individual websites.
@NTFSvolume The `filter_fn` parameter can be used to determine whether or not we should cache a response, without having to request twice or move the last-page code around. We can just turn the last-page checks into functions and use them that way.
https://aiohttp-client-cache.readthedocs.io/en/stable/user_guide.html#custom-response-filtering
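A hedged sketch of that idea, assuming `filter_fn` takes the response and returns whether it should be cached (per the linked docs); the last-page check itself, and whether the body is still readable at filter time, are assumptions to verify:

```python
import asyncio

from aiohttp_client_cache import CachedSession, SQLiteBackend


async def not_last_page(response) -> bool:
    """Only cache thread pages that still have a 'next page' link,
    so the newest page is always re-fetched."""
    html = await response.text()
    return "pageNav-jump--next" in html  # XenForo's next-page link marker


async def main() -> None:
    cache = SQLiteBackend("cyberdrop_cache.db", filter_fn=not_last_page)
    async with CachedSession(cache=cache) as session:
        await session.get("https://example.com/threads/some-thread/")


asyncio.run(main())
```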
> can be changed to `async def get_BS4(self, domain: str, url: URL, client_session: ClientSession, skip_cache: bool = False) -> BeautifulSoup: ...`
> We can also set the cache expiry time, I was thinking 7 days for every website, and we can also add the option to configure it for individual websites.
That works. If there is no `next_page`, `skip_cache` will be passed as `True`. As for the expire time, honestly I think even 1 month would be fine for forums, because realistically the content of old posts does not change frequently. The newest posts are the most likely to be edited for typos or to get added links, but those would already be handled by always fetching the last page from the forum itself.
1 month for forums and 7 days for file hosts / everything else. That would be my recommendation. But yes, the user should be able to change them.
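A sketch of those defaults, assuming the backend's `expire_after` / `urls_expire_after` options behave as described in the library docs (the URL pattern is illustrative; in practice both values would come from user config):

```python
from datetime import timedelta

from aiohttp_client_cache import SQLiteBackend

cache = SQLiteBackend(
    "cyberdrop_cache.db",
    expire_after=timedelta(days=7),  # default: file hosts / everything else
    urls_expire_after={
        # Forums: old posts rarely change, so keep them for a month.
        "simpcity.su/threads/*": timedelta(days=30),
    },
)
```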
> @NTFSvolume The `filter_fn` parameter can be used to determine whether or not we should cache a response without having to request twice or move the last page code around. Can just turn the last page checks into functions and use them that way.
> https://aiohttp-client-cache.readthedocs.io/en/stable/user_guide.html#custom-response-filtering
Excellent. That would be the best way. Do not cache a response if it is the last page.
Committed the check for SimpCity. Set everything to expire after 7 days, but I agree that 30 is a good idea. Will add more websites and stuff tomorrow.
Added caching support for coomer (not to say that it wasn't there before, but now it won't cache the last page of a coomer profile).
Also cleaned up code, among some other fixes.
This adds a cache to the SimpCity scraping process to reduce load.