gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Pluggable cache system? #103

Open jimsmart opened 6 years ago

jimsmart commented 6 years ago

Hi,

Thanks for Colly! :)

I have a task with hundreds of thousands of pages, so obviously I am using Colly's caching, but it's basically 'too much' for the filesystem (wasted disk space, a pain to manage, slow to back up, etc.).

I'd like to propose a pluggable cache system, similar to how you've made other Colly components.

Perhaps with an API like this:-

type Cache interface {
    Init() error
    Get(url string) (*colly.Response, error)
    Put(url string, r colly.Response) error
    Remove(url string) error
}

...or...

type Cache interface {
    Init() error
    Get(url string) ([]byte, error)
    Put(url string, data []byte) error
    Remove(url string) error
}

The first one won't be possible if you then wish to implement FileSystemCache in a subpackage of Colly, though: the subpackage would need to import colly for *colly.Response while colly imports the subpackage, creating an import cycle.

The reason I also need a Remove method is that one project targets a site that sometimes serves maintenance pages, and whilst I can detect these, Colly currently has no way of saying stop, or of rejecting a page after processing. Obviously, the last thing I want part way through a crawl is to have my cache poisoned. But that's probably a separate issue, and one I can live with if I can remove the bad pages myself.

If pluggable caches were to be implemented, I have existing code from another project with a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space), that I would happily port over. This could either become part of Colly, or a separate thing on my own GitHub.

I did start implementing this myself, but ran into a problem with how I went about it. (I followed the pattern you have established of keeping the separate components in subpackages, and then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir; I was trying to preserve existing behaviour / API compatibility. Maybe that's not an issue. Maybe these bits shouldn't be subpackages. Once I started thinking along those lines, I figured it was best to discuss before/instead of proceeding any further.)

— Your thoughts?

jimsmart commented 6 years ago

Another option is to pass/return io.Reader across the interface, instead of []byte
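
For example, something along these lines (purely illustrative; io here is the standard library package):

type Cache interface {
    Init() error
    // Get returns a reader over the cached body; the caller closes it.
    Get(url string) (io.ReadCloser, error)
    // Put consumes r and stores its contents under url.
    Put(url string, r io.Reader) error
    Remove(url string) error
}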

asciimoo commented 6 years ago

@jimsmart thanks for this great idea and for the detailed and clear specification.

I'm currently working on a storage interface (04d96ab7c5bbf879de632fcc683af592be4ba0b8) which follows pretty much the same concept. It allows us to use any kind of backend to store visited URLs and cookies (http://go-colly.org/docs/best_practices/storage/). Let me describe its architecture; perhaps it can help in designing a more sophisticated cache functionality.

The storage interface is a standalone module in colly, https://github.com/gocolly/colly/blob/master/storage/storage.go, and colly imports it. The storage module also contains the default naive implementation of the interface, to stay backward compatible. I decided to create separate packages for the different implementations of the Storage interface, to reduce colly's dependencies; the Redis storage implementation can be found at https://github.com/gocolly/redisstorage.
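
For reference, the interface currently looks roughly like this (paraphrased from the linked storage.go; see the repo for the authoritative version):

type Storage interface {
    // Init initializes the storage backend
    Init() error
    // Visited records that the request with the given ID has been visited
    Visited(requestID uint64) error
    // IsVisited reports whether the request with the given ID was visited before
    IsVisited(requestID uint64) (bool, error)
    // Cookies returns the stored cookies for the given host
    Cookies(u *url.URL) string
    // SetCookies stores cookies for the given host
    SetCookies(u *url.URL, cookies string)
}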

I think, a similar concept could work with the 2nd cache interface from your specification. What do you think?

Also, I'd suggest an additional Close() error function in the interface to be able to terminate connections.

jimsmart commented 6 years ago

Yes. That's what I was thinking, similar to package storage.

And have a SetCache(c cache.Cache) error method on Collector.
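
Wiring it up might then look something like this. Purely illustrative: neither cache.Cache nor Collector.SetCache exist today, and newSQLiteCache is a made-up constructor.

// Hypothetical wiring sketch; none of the cache names below exist in Colly yet.
c := colly.NewCollector()

var store cache.Cache = newSQLiteCache("cache.sqlite") // made-up constructor
if err := c.SetCache(store); err != nil {
    log.Fatal(err)
}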

I don't need Close() myself... Get(), Put() and Remove() should all be atomic operations, and each method would simply obtain a database connection on execution and close it before returning. Go's database/sql maintains a connection pool behind the scenes anyhow, so it's a good pattern to follow: it has minimal overhead and is easy to reason about. It's better to do that than to have an open db connection hanging about for hours/days; releasing it back to the pool after every use lets the driver deal with broken or stale connections, etc.

My Init() method here wasn't for creating a connection; it was to create the db / db table if it doesn't already exist (or, for the file system, to create the cache dir).

Your redisstorage.Storage could then also implement Get(), Put() and Remove() and would satisfy both storage.Storage and cache.Cache interfaces.

— Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have colly.Storage as the interface in the method calls.

asciimoo commented 6 years ago

And have a SetCache(c cache.Cache) error method on Collector.

Yes.

I don't need Close() myself... Get(), Put() and Remove() should all be atomic operations

You're right, I've removed it from Storage too, thanks.

Your redisstorage.Storage could then also implement Get(), Put() and Remove() and would satisfy both storage.Storage and cache.Cache interfaces.

Storage is responsible for storing both visited urls and cookies, so a simple Get(), Put() isn't enough in this case. Also, it is planned to add robots.txt handling to the Storage interface, which makes the interface even more complex.

Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have colly.Storage as the interface in the method calls.

I agree, it is a bit redundant, but the package doesn't use the internals of the colly package, which is quite big. I'm not against moving the Storage/Cache interfaces into colly, especially because it would allow us to use the first proposed interface.

jimsmart commented 6 years ago

Storage is responsible for storing both visited urls and cookies, so a simple Get(), Put() isn't enough in this case.

Of course not! :) I wasn't suggesting it would be, I was suggesting that specific implementations of Storage could (optionally) also implement the cache methods (Get/Put/Remove) as well as the methods on the Storage interface — this would then give the user the option of using that store as both a cache and a store for state persistence, or just one or the other.
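
Something like this, as a purely hypothetical sketch (stubbed bodies only, using the second cache interface from my first comment; a real backend would talk to its database in each method):

package dualbackend

import "net/url"

// DualBackend is a hypothetical backend implementing both the existing
// storage.Storage methods and the proposed cache methods.
type DualBackend struct {
    // db handle, etc.
}

// storage.Storage methods: visited request IDs and cookies.
func (b *DualBackend) Init() error                              { return nil }
func (b *DualBackend) Visited(requestID uint64) error           { return nil }
func (b *DualBackend) IsVisited(requestID uint64) (bool, error) { return false, nil }
func (b *DualBackend) Cookies(u *url.URL) string                { return "" }
func (b *DualBackend) SetCookies(u *url.URL, cookies string)    {}

// Proposed cache methods: response bodies keyed by URL.
func (b *DualBackend) Get(rawurl string) ([]byte, error)    { return nil, nil }
func (b *DualBackend) Put(rawurl string, data []byte) error { return nil }
func (b *DualBackend) Remove(rawurl string) error           { return nil }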

vosmith commented 6 years ago

@jimsmart what exactly are you getting/putting in the cache? My imagination is having a hard time going beyond image and CSS files, which aren't downloaded through Colly in any case. If it's just HTML, isn't the IsVisited feature enough? I was initially excited about the idea of a caching backend as well, but I couldn't find a clear enough use case.

jimsmart commented 6 years ago

I have several hundred thousand web pages I am processing to extract data.

Currently, if the task fails part way through, I just resume later and rely on the cached data (in the current file system cache) being reprocessed, so the crawl simply picks up from where it got to.

But there are so many files, and most are hundreds of KB each. The cache folder is >6 GB after a crawl. Compressed and in SQLite, that same data is a single file of <1.5 GB. It's fast. It's easy to manage.

I've pushed two repos with working prototype code; run go test in each project folder to execute the tests.

https://github.com/jimsmart/collysqlite - implements a cache using SQLite.
https://github.com/jimsmart/collyzstandard - implements a compressing cache wrapper using Zstandard.

Both of these use []byte for the type of data being cached — it was the simplest solution to get a prototype up and running.
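
The core of the SQLite cache is small. Roughly this idea, though not the actual collysqlite code verbatim (and assuming github.com/mattn/go-sqlite3 as the driver):

package sqlitecache

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

// Cache is a minimal SQLite-backed key-value cache for response bodies.
type Cache struct {
    db *sql.DB
}

func New(path string) (*Cache, error) {
    db, err := sql.Open("sqlite3", path)
    if err != nil {
        return nil, err
    }
    return &Cache{db: db}, nil
}

// Init creates the cache table if it does not already exist.
func (c *Cache) Init() error {
    _, err := c.db.Exec(`CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, data BLOB)`)
    return err
}

func (c *Cache) Get(url string) ([]byte, error) {
    var data []byte
    err := c.db.QueryRow(`SELECT data FROM cache WHERE url = ?`, url).Scan(&data)
    return data, err
}

func (c *Cache) Put(url string, data []byte) error {
    _, err := c.db.Exec(`INSERT OR REPLACE INTO cache (url, data) VALUES (?, ?)`, url, data)
    return err
}

func (c *Cache) Remove(url string) error {
    _, err := c.db.Exec(`DELETE FROM cache WHERE url = ?`, url)
    return err
}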

jimsmart commented 6 years ago

Basically, if I'm going to invest the time and resources to download that much data, I'd most certainly like to keep it around, for later reruns.

jimsmart commented 6 years ago

Scrapy, the leading Python web scraper, has pluggable caches. Maybe take a look?

jimsmart commented 6 years ago

FWIW: some log output from collyzstandard tests showing compression stats...

2018/02/12 02:40:51 compressed 12930/5251, ratio 2.46, in 2.757465ms https://google.com/
2018/02/12 02:40:51 decompressed in 91.872µs https://google.com/
2018/02/12 02:40:53 compressed 351734/71301, ratio 4.93, in 78.481158ms https://facebook.com/
2018/02/12 02:40:53 decompressed in 1.093598ms https://facebook.com/
2018/02/12 02:40:54 compressed 148439/29126, ratio 5.10, in 54.710252ms https://twitter.com/
2018/02/12 02:40:54 decompressed in 338.833µs https://twitter.com/
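
The wrapper behind those numbers is essentially just compress-on-Put, decompress-on-Get. A rough sketch of the idea (not the collyzstandard code verbatim; here assuming github.com/klauspost/compress/zstd for the codec):

package zstdcache

import "github.com/klauspost/compress/zstd"

// Cache is the []byte cache interface from the proposal above.
type Cache interface {
    Init() error
    Get(url string) ([]byte, error)
    Put(url string, data []byte) error
    Remove(url string) error
}

// Wrapper compresses bodies before handing them to the inner cache,
// and decompresses them again on the way out.
type Wrapper struct {
    inner Cache
    enc   *zstd.Encoder
    dec   *zstd.Decoder
}

func Wrap(inner Cache) (*Wrapper, error) {
    enc, err := zstd.NewWriter(nil)
    if err != nil {
        return nil, err
    }
    dec, err := zstd.NewReader(nil)
    if err != nil {
        return nil, err
    }
    return &Wrapper{inner: inner, enc: enc, dec: dec}, nil
}

func (w *Wrapper) Init() error { return w.inner.Init() }

func (w *Wrapper) Put(url string, data []byte) error {
    return w.inner.Put(url, w.enc.EncodeAll(data, nil))
}

func (w *Wrapper) Get(url string) ([]byte, error) {
    compressed, err := w.inner.Get(url)
    if err != nil {
        return nil, err
    }
    return w.dec.DecodeAll(compressed, nil)
}

func (w *Wrapper) Remove(url string) error { return w.inner.Remove(url) }
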
asciimoo commented 6 years ago

@jimsmart impressive! Could you please open a PR?

jimsmart commented 6 years ago

Sure, I'll get something together ASAP, I'm just on the tail-end of a bit of a coding marathon, so it won't be today.

I don't think my code will integrate cleanly yet, it needs some more work. But I'm on with that already.

I'm also concerned about 'breaking changes' / whether you have any API compatibility guarantees, or anything similar?

jimsmart commented 6 years ago

In my local repo I also have an implementation of 99% of the 'storage' API... for SQLite.

asciimoo commented 6 years ago

Sure, I'll get something together ASAP, I'm just on the tail-end of a bit of a coding marathon, so it won't be today. I don't think my code will integrate cleanly yet, it needs some more work. But I'm on with that already.

Awesome, there is no rush. Thanks!

I'm also concerned about 'breaking changes' / whether you have any API compatibility guarantees, or anything similar?

Does this change introduce any backward incompatibility?

jimsmart commented 6 years ago

Um, maybe. But I want to minimise that, obviously. In an ideal world there'd be no breaking changes. That's partly why the code needs more work.

I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.

asciimoo commented 6 years ago

I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.

Perfect! The project is relatively young, so we don't have strict rules about api changes. Try to avoid 'breaking changes' if possible and be gentle otherwise as you mentioned.

jonathaningram commented 5 years ago

I wonder if a good solution is to use an interface like that in go-cloud: https://github.com/google/go-cloud/tree/master/blob

Check out https://github.com/google/go-cloud/blob/master/blob/example_test.go for a "local/file" cache. Similarly, check out the GCS or S3 tests. In this way, I could pass my own bucket to colly and it would work with non-local file systems.
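
As a rough sketch, a bucket-backed implementation of the []byte cache interface proposed earlier in this thread could look like this (assuming the gocloud.dev/blob API):

package blobcache

import (
    "context"

    "gocloud.dev/blob"
    _ "gocloud.dev/blob/fileblob" // registers the file:// scheme; swap in gcsblob/s3blob as needed
)

// Cache stores response bodies in any bucket supported by go-cloud.
type Cache struct {
    bucket *blob.Bucket
}

func New(ctx context.Context, bucketURL string) (*Cache, error) {
    b, err := blob.OpenBucket(ctx, bucketURL) // e.g. "file:///tmp/colly-cache" or "s3://my-bucket"
    if err != nil {
        return nil, err
    }
    return &Cache{bucket: b}, nil
}

func (c *Cache) Init() error { return nil }

func (c *Cache) Get(url string) ([]byte, error) {
    return c.bucket.ReadAll(context.Background(), url)
}

func (c *Cache) Put(url string, data []byte) error {
    return c.bucket.WriteAll(context.Background(), url, data, nil)
}

func (c *Cache) Remove(url string) error {
    return c.bucket.Delete(context.Background(), url)
}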

vosmith commented 5 years ago

@jonathaningram I think the interface used in go-cloud/blob looks like a good model to follow! Having it as a hard dependency would bloat this project, but I think emulating their framework would be a good move.

@asciimoo @jimsmart , Thoughts?

michilu commented 5 years ago

FYI, my use case: I want to cache the responses of POST requests to an API. Here is an example of a customizable cache handler: https://github.com/gocolly/colly/compare/master...michilu:customizable-cache-handler

florinutz commented 4 years ago

I just wanted to drop this here https://www.nayuki.io/page/self-encrypted-cache-structure

andreacab commented 4 years ago

Hey, I am thinking of picking this up, but I'm just wondering where it's at?

I should probably describe my use case. I ran into a blocker because we use Heroku and their dynos' file systems are ephemeral. If you're not using a cloud storage mechanism (which you can, but it's quite limited under the free tier), then you'll lose the cache and have at most a day's worth of caching...

jimsmart commented 4 years ago

Hi, the repos mentioned in https://github.com/gocolly/colly/issues/103#issuecomment-364816457 (above) contain my prototype code. Feel free to open issues on those repos if you have any questions — happy to help, as best I can.

I recall there were also changes needed with Colly to make these work, but I don't believe I ever pushed those, sorry. But IIRC the interface was fairly straightforward.

hsinhoyeh commented 2 years ago

Hi, just want to catch up on this issue.

@jimsmart I don't see any advantage to using SQLite as a cache here; you could even tar files on top of the filesystem, which could deliver similar performance to SQLite.

@asciimoo If you want the changes to stay backward compatible with the existing API, maybe you could replace the filesystem access (currently using pkg "os") with https://github.com/spf13/afero, so we could easily switch to an in-memory fs with tar/gzip functionality on top of it.
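
A rough sketch of what I mean (names are made up; this is not Colly's actual cache code):

package aferocache

import (
    "path/filepath"

    "github.com/spf13/afero"
)

// Cache writes cached bodies through an afero.Fs, so the same code can target
// the real filesystem, an in-memory fs, or any other afero backend.
type Cache struct {
    fs  afero.Fs
    dir string
}

// New builds a cache over the given filesystem, e.g.
// New(afero.NewOsFs(), "cache") or New(afero.NewMemMapFs(), "cache").
func New(fs afero.Fs, dir string) *Cache {
    return &Cache{fs: fs, dir: dir}
}

// Init creates the cache directory if it does not already exist.
func (c *Cache) Init() error {
    return c.fs.MkdirAll(c.dir, 0o750)
}

func (c *Cache) Put(name string, data []byte) error {
    return afero.WriteFile(c.fs, filepath.Join(c.dir, name), data, 0o640)
}

func (c *Cache) Get(name string) ([]byte, error) {
    return afero.ReadFile(c.fs, filepath.Join(c.dir, name))
}

func (c *Cache) Remove(name string) error {
    return c.fs.Remove(filepath.Join(c.dir, name))
}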

any thoughts?

jimsmart commented 2 years ago

@hsinhoyeh

For our use case, scraping many hundreds of thousands of HTML files (with a big variety in sizes), there is certainly a benefit to storing the files in SQLite: it's actually faster than using raw files on the file system, and it also takes much less disk space. There are docs on the SQLite website to support this, and our own benchmarks and metrics confirm it, without doubt.

I'm not sure why you say you don't see any advantage here. Have you actually tried using SQLite as a storage system in such cases, and compared the metrics? It's easy to flippantly dismiss something as a bad idea if one has not tried it, but as the old adage goes, "to measure is to know". Perhaps your dismissal is better reasoned than it first appears and you have made some decent measurements/comparisons, in which case I would certainly be interested to see your benchmarks/stats, and to know some details about the system they were produced on (e.g. the underlying file system used, etc.).

Furthermore we also compress each page body individually before storing it into SQLite - we tried quite a few compression algorithms, including tar/gzip, but the benchmarks/metrics showed that Zstandard was in fact best for our use case, both speed-wise and size-wise. Naturally, this works very well on top of the aforementioned SQLite storage layer, but also produced improvements when used on top of the native file system.

Unfortunately, we no longer use Colly to build our scrapers; ours now have a completely different architecture, for a variety of reasons, and Colly was no longer the best fit for us, although I still think it is a great project. Because of this, I've not pursued any further development of these storage-related plugins, although the original prototype code is still available in my GitHub repos.

Hope that helps.

slinso commented 2 years ago

I really like the colly API for scraping, but having just the filesystem for storage is not nice. I came back to this issue a few times over the years and always went back to using my own simple crawler, which basically just uses http.Client and goquery. But I always add caching with compression in some form when the number of pages gets > 10,000.

That's why I would like to bring this topic up again. Is it worth investigating this issue further? I think @jimsmart did some nice prototyping. Is there any chance of getting this included if we put together a complete pull request, @hsinhoyeh? If so, I would like to work on this.

unjello commented 1 year ago

Bumping this up. I'd love to see a more pluggable caching/storage system, but I don't believe my Go-fu is strong enough to attempt the refactoring. I may need to try anyway, because I would really use it, so I wonder if there are people at least willing to do some solid peer review.