let4be / crusty

Broad Web Crawler
GNU General Public License v3.0
83 stars 3 forks source link

Implement a first approximation of PageRank for Domains #19

Closed let4be closed 3 years ago

let4be commented 3 years ago

Right now this broad crawler is completely emtpy, I think it would be cool if we had something to show off ;) A good candidate for such task could be page rank...

Now calculating URL page rank is a whole mega-task in it's own, proper implementation of which(that scales) could take months because of requirements on throughput, memory, speed, scalability. Such system most likely needs a sophisticated URL -> ID mapping

On the other hand we could easily calculate Domain PageRank Ad hoc, 1) collect all outbound domains for a given Job 2) convert Job's Domain into a Second Level Domain (super-blog.tumblr.com -> tumblr.com) 3) convert all outbound domains into unique Second Level Domains as well 4) store all this in RedisGraph(this will work because there's only very limited N of second level domains and RedisGraph uses sparse matrixes)

https://oss.redislabs.com/redisgraph/ https://github.com/RedisGraph/RedisGraph/issues/398

Depending on the underlying hardware results may vary. However, inserting a new relationship is done in O(1). RedisGraph is able to create over 1 million nodes under half a second and form 500K relations within 0.3 of a second.

RedisGraph has PageRank built-in

let4be commented 3 years ago

While having a true pagerank might be really cool from my brief explaration it clearly goes out of Crusty 's scope ...

The rough and cheap redisgraph based implementation turned out to be painfully slow - all inserts take way-way too much time and redis goes to 100% cpu usage even on my 100mbit channel(now imagine if we wanted to scale this to at least 100gbit/s) Insert speed might have been improved by using their bulk loading(binary protocol) but I don't see how this can be scaled at least in 100 times - write speed basically doesn't scale, can't scale vertically - writes are single threaded and it does not scale horizontally.

Single instance clearly isn't enough even for toy-like inserts...

I decided to drop pagerank for now, proper implementations are of course possible - but this would require some full fledged graph database cluster(big one) capable with keeping up with Crusty data production capabilities...

For now to get some insight on "heavy-hitters" I will simply use excellent probabilistic TOP-K from redisbloom https://oss.redislabs.com/redisbloom/TopK_Commands/ this should give us some insight on what's most popular

TOP-K operations are constant time and the whole structure has finite memory footprint - perfect for estimations in high bandwidth scenarios where only limited resources are availabe, this should scale far outside of a typical Crusty top production and could be further optimized by compressing heavy-hitters on Crusty side and submitting in batches(by using INCRBY instead of ADD)

let4be commented 3 years ago

Aaaand LIVE!