microsoft / ghcrawler

Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

Get some measure of current progress and completion #127

Open stuartlangridge opened 6 years ago

stuartlangridge commented 6 years ago

Is there any way to get a sense of how much work the crawler has left to do, and how much it's done? The dashboard shows which requests are queued, but getting a sense of "we've done 250 requests and there are 8000 still to go" would be very useful. I appreciate that this might not actually be a knowable figure -- it's possible that all we know is what's currently queued, and each of those queued requests might spawn another million queued requests once they've been fetched and processed. However, at the moment it's a very shot-in-the-dark affair; it's very hard to get a sense of how long one should wait before there'll be data available in Mongo to process, and whether the data in there is roughly OK or wildly incomplete.

jeffmcaffer commented 6 years ago

Sorry for the delay in responding here. Progress is a real challenge in crawler-like systems. As you say, you never really know what you'll find, or whether you've already seen/processed it. I've literally queued one request and ended up with two solid days of processing across 16 cores, millions of requests processed.

The dashboard gives you some insights (as you point out). If you are using RabbitMQ, you can get pretty much the same insights from its management UI.
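
If polling by hand gets old, the RabbitMQ management plugin also exposes an HTTP API you can query for queue depths. A minimal sketch; the host, port, and guest/guest credentials are the plugin defaults and are assumptions about your deployment:

```js
// Poll RabbitMQ's management HTTP API for queue depths.
// Assumes the management plugin is enabled on localhost:15672 with the
// default guest/guest credentials -- adjust for your deployment.
const http = require('http');

function getQueues() {
  return new Promise((resolve, reject) => {
    http.get(
      { host: 'localhost', port: 15672, path: '/api/queues', auth: 'guest:guest' },
      res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body)));
      }
    ).on('error', reject);
  });
}

getQueues().then(queues => {
  for (const q of queues) {
    // "messages" is current depth (ready + unacked), not total work remaining.
    console.log(`${q.name}: ${q.messages} messages (${q.messages_ready} ready)`);
  }
});
```

Keep in mind this only tells you what is queued right now, not how much more will be discovered.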

One thing we are playing with in the ClearlyDefined crawler is a webhook that pings on every new write. This is structured like the DeltaLog store in GHCrawler but is simply configured to POST to a given host with the blob it is about to write. This way you can watch for "markers" to tell if some particular data is available.
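
To give a rough idea of the receiving end, here is a minimal sketch of a listener that flags when a blob you care about shows up. The port, path, and the shape of the posted blob (a `_metadata.url` field) are assumptions, not the actual payload contract:

```js
// Hypothetical webhook receiver: the crawler POSTs each blob it is about
// to write, and we watch for "marker" documents we care about.
const express = require('express');
const app = express();
app.use(express.json({ limit: '5mb' }));

// Assumed: the posted blob carries the URL of the processed entity.
const markers = new Set(['https://api.github.com/repos/someorg/somerepo']);

app.post('/crawler-webhook', (req, res) => {
  const blob = req.body;
  const url = blob && blob._metadata && blob._metadata.url;
  if (markers.has(url)) {
    console.log(`Marker seen: ${url} at ${new Date().toISOString()}`);
  }
  res.sendStatus(200);
});

app.listen(4000, () => console.log('Listening for crawler write notifications'));
```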

However, there are many caveats to that approach.

Relatedly, you could hook into the logging system to see when something of interest is processed. Many of the same caveats apply, but you will see every request, not just the ones that cause a write.
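
As a rough illustration, assuming your deployment lets you hand the crawler a winston-style logger (how you wire the logger in depends on your setup, so treat this as a sketch), a custom transport could watch for a URL of interest:

```js
// Hypothetical logging hook: a winston custom transport that flags log
// entries mentioning a request we are waiting on.
const Transport = require('winston-transport');

class WatchTransport extends Transport {
  constructor(opts = {}) {
    super(opts);
    // Assumed: the substring we expect to appear in processing log messages.
    this.watchFor = opts.watchFor || 'repos/someorg/somerepo';
  }

  log(info, callback) {
    const text = `${info.message || ''} ${JSON.stringify(info)}`;
    if (text.includes(this.watchFor)) {
      console.log(`Saw activity for ${this.watchFor}: ${info.message}`);
    }
    callback();
  }
}

// Wherever you construct the logger you give to the crawler:
// const winston = require('winston');
// const logger = winston.createLogger({ transports: [new WatchTransport({ watchFor: '...' })] });
```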

Another approach is to get insight into the queues themselves. Unfortunately, at this scale that can be hard, and most queuing systems don't really expose the list of pending items. In our case, there is a Redis table of items in the queue with a configurable TTL. We use this to avoid "rapid fire" duplicate requests; I think the default definition of "rapid" is one hour. The crawler goes through considerable pain to ensure that within that TTL window there are no duplicate requests. One could access Redis and point-query for a request of interest.
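
A rough sketch of such a point query with ioredis follows; the key format is an assumption, so inspect your instance (e.g. with SCAN) to see how the dedup entries are actually named before relying on this:

```js
// Hypothetical point query against the crawler's Redis dedup entries.
// The key naming scheme below is an assumption for your configuration.
const Redis = require('ioredis');
const redis = new Redis('redis://localhost:6379');

async function check(url) {
  const key = url; // assumed: keys are derived from the request URL
  const ttl = await redis.ttl(key); // -2 if the key does not exist
  if (ttl > 0) {
    console.log(`${url} was queued/processed recently (entry expires in ${ttl}s)`);
  } else {
    console.log(`${url} has no recent entry`);
  }
  await redis.quit();
}

check('https://api.github.com/repos/someorg/somerepo');
```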

More caveats apply to that approach as well.

Hate to be so negative on this, but it's a hard problem using standard queuing. If we step away from queuing systems and use a database-based queue approach, then things change: the database could reasonably be queried and could retain some info to avoid some of the negative elements cited above. At that point, however, we are into a whole new area of development (unless you know of such a system we could just use).
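
Just to illustrate the appeal: if pending work lived in, say, a MongoDB collection (purely hypothetical, GHCrawler does not work this way today, and the database/collection/field names below are made up), progress would reduce to a couple of count queries:

```js
// Purely hypothetical: if pending requests lived in a MongoDB collection,
// "how much is left" becomes a pair of count queries.
const { MongoClient } = require('mongodb');

async function progress() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const queue = client.db('crawler').collection('requestQueue'); // assumed names
  const pending = await queue.countDocuments({ state: 'pending' });
  const done = await queue.countDocuments({ state: 'done' });
  console.log(`${done} processed, ${pending} still queued`);
  await client.close();
}

progress();
```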