stuartlangridge opened this issue 6 years ago
Sorry for the delay in responding here. Progress is a real challenge in crawler-like systems. As you say, you never really know what you'll find, or whether you've already seen/processed what you find. I've literally queued one request and ended up with 2 solid days of processing across 16 cores, millions of requests processed.
The dashboard gives you some insights (as you point out). If you are using RabbitMQ, you can get pretty much the same insights from its management UI.
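For illustration, a rough sketch of pulling queue depths straight from the RabbitMQ management plugin's HTTP API. The host, credentials, and default vhost below are placeholders, not anything the crawler configures for you:

```typescript
// Sketch: poll the RabbitMQ management HTTP API for queue depths.
// Assumes the management plugin is enabled; URL and credentials are placeholders.
const MGMT_URL = "http://localhost:15672";
const AUTH = "Basic " + Buffer.from("guest:guest").toString("base64");

interface QueueInfo {
  name: string;
  messages: number;                // total messages in the queue
  messages_ready: number;          // waiting to be delivered
  messages_unacknowledged: number; // delivered but not yet acked
}

async function queueDepths(): Promise<void> {
  // %2F is the URL-encoded default vhost "/"
  const res = await fetch(`${MGMT_URL}/api/queues/%2F`, {
    headers: { Authorization: AUTH },
  });
  const queues = (await res.json()) as QueueInfo[];
  for (const q of queues) {
    console.log(
      `${q.name}: ${q.messages} total ` +
      `(${q.messages_ready} ready, ${q.messages_unacknowledged} in flight)`
    );
  }
}

queueDepths().catch(console.error);
```

Summing `messages_ready` across the crawler's queues gives a rough "work still queued" number, with the usual caveat that any one of those requests can fan out into many more.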
One thing we are playing with in the ClearlyDefined crawler is a webhook that pings on every new write. This is structured like the DeltaLog store in GHCrawler but is simply configured to POST the blob it is about to write to a given host. This way you can watch for "markers" to tell whether some particular data is available.
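For illustration, a minimal sketch of what a receiver for that webhook could look like. The `/webhook` path, the port, and the shape of the posted blob are assumptions made for the sketch; the only real contract described above is "POST the blob that is about to be written":

```typescript
// Sketch: a tiny receiver for the "ping on every new write" webhook.
// The /webhook path, port, and blob shape are illustrative assumptions.
import express from "express";

const app = express();
app.use(express.json({ limit: "5mb" }));

// Requests we are waiting on; cross them off once their data has been written.
const markers = new Set<string>(["https://github.com/someorg/somerepo"]);

app.post("/webhook", (req, res) => {
  const blob = req.body;
  // Assumed field: the URL of the entity the crawler just wrote.
  const url: string | undefined = blob?._metadata?.url ?? blob?.url;
  if (url && markers.has(url)) {
    console.log(`Marker hit: data for ${url} is now available`);
    markers.delete(url);
  }
  res.sendStatus(200);
});

app.listen(4000, () => console.log("Webhook receiver listening on :4000"));
```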
However, there are many caveats to the webhook approach.
Relatedly, you could hook into the logging system to see when something of interest is processed. Many of the same caveats apply, but you will see every request, not just the ones that cause a write.
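For illustration, a rough sketch of watching a line-oriented log stream for requests you care about. It assumes the crawler's logs are piped in as text on stdin and that each processed request's URL shows up somewhere in its log line; adjust to whatever log format you actually emit:

```typescript
// Sketch: flag log lines that mention requests of interest.
// Usage (assumed): some-crawler-process | node watch-log.js
import * as readline from "node:readline";

const interesting = [
  "https://api.github.com/repos/someorg/somerepo",
];

const rl = readline.createInterface({ input: process.stdin });

rl.on("line", (line) => {
  for (const url of interesting) {
    if (line.includes(url)) {
      console.log(`Saw activity for ${url}: ${line}`);
    }
  }
});
```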
Another approach is to get insight into the queues themselves. Unfortunately, at this scale that can be hard, and most queuing systems don't really expose the list of pending items. In our case, there is a Redis table of queued items with a configurable TTL. We use this to avoid "rapid-fire" duplicate requests; I think the default definition of "rapid" is 1 hour. The crawler goes to considerable pains to ensure that there are no duplicate requests within that TTL window. One could access Redis and point-query for a request of interest.
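For illustration, a sketch of such a point query. The key format here is a made-up stand-in; the real layout of the dedup table is an internal detail of the crawler:

```typescript
// Sketch: point-query the crawler's Redis instance for a request of interest.
// The key shape is hypothetical; only the "entry with a TTL per recent request"
// idea comes from the description above.
import { createClient } from "redis";

async function checkRequest(requestUrl: string): Promise<void> {
  const client = createClient({ url: "redis://localhost:6379" });
  await client.connect();

  const key = `crawler:request:${requestUrl}`; // assumed key format
  const exists = await client.exists(key);
  if (exists) {
    const ttl = await client.ttl(key); // seconds until the dedup entry expires
    console.log(`${requestUrl} was seen recently (entry expires in ${ttl}s)`);
  } else {
    console.log(`${requestUrl} is not in the dedup window`);
  }

  await client.quit();
}

checkRequest("https://api.github.com/repos/someorg/somerepo").catch(console.error);
```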
More caveats apply to this approach as well.
Hate to be so negative on this, but it's a hard problem using standard queuing. If we step away from queuing systems and use a database-backed queue, things change: the database could reasonably be queried and could retain some information to avoid some of the negative elements cited above. At that point, however, we are into a whole new area of development (unless you know of such a system we could just use).
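To make that concrete, a sketch of what a queryable, database-backed queue might look like. None of this exists in the crawler today; the schema and the choice of better-sqlite3 are purely illustrative:

```typescript
// Sketch of a database-backed queue that can answer "done vs. remaining".
// Not part of the crawler; schema and library are illustrative only.
import Database from "better-sqlite3";

const db = new Database("queue.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS requests (
    url       TEXT PRIMARY KEY,
    state     TEXT NOT NULL DEFAULT 'pending',  -- pending | processing | done
    queued_at TEXT NOT NULL DEFAULT (datetime('now'))
  )
`);

// Enqueue, ignoring duplicates (the role the Redis TTL table plays today).
const enqueue = db.prepare("INSERT OR IGNORE INTO requests (url) VALUES (?)");
enqueue.run("https://api.github.com/repos/someorg/somerepo");

// The progress query that plain queuing systems can't answer:
const progress = db
  .prepare("SELECT state, COUNT(*) AS n FROM requests GROUP BY state")
  .all();
console.log(progress); // e.g. [ { state: 'pending', n: 8000 }, { state: 'done', n: 250 } ]
```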
Is there any way to get a sense of how much work the crawler has left to do, and how much it has done? The dashboard shows which requests are queued, but getting a sense of "we've done 250 requests and there are 8,000 still to go" would be very useful. I appreciate that this might not actually be a knowable figure -- it's possible that all we know is what's currently queued, and each of those queued requests might spawn another million queued requests once they've been fetched and processed. However, at the moment it's very much a shot in the dark; it's hard to get a sense of how long one should wait before there will be data available in Mongo to process, and whether the data in there is roughly OK or wildly incomplete.