Add support for crawling a GitHub User’s various contributions, independent of a specific org or repo

danisyellis commented 6 years ago

Our goal is to track contributions by our employees to any open-source project on GitHub. So we'll need to look at each employee’s commits, pull requests, issues, etc. We can do this through the User’s Events.

I have some questions about how to do this: 1) Is there anything in the current constraints of ghcrawler that will make this an exceptionally difficult task?

2) How do I say “traverse the Events for a given User”? Where is an example of similar code doing something similar?

Based on this discussion https://github.com/Microsoft/ghcrawler/issues/94 I thought it would be in the GitHub processor. Inside of that file, my understanding is that this code in user() this._addCollection(request, ‘repos', ‘repo’) should tell it to look at a user’s repos and add those repos to the mongodb repo collection. But currently, as far as I can tell, it processes the user, but doesn’t even hit the repo function. Because I care most about events right now, I also tried this._addCollection(request, 'events', 'null’); and this._addCollection(request, 'events', ‘events’); but neither seemed to do anything.

3) Will this require an advanced traversal policy? I think that I can use the default traversal policy for now and refine it with an advanced one later to grab fewer things from user, if desired, like using graphQL to do a query. Is that right?

geneh commented 6 years ago

When referring to user's events, do you mean the following API: list-public-events-that-a-user-has-received?

AFAIK there is no way to set up public webhooks to trigger events for specific users. According to github webhooks documentation:

Webhooks can be installed on an organization or a specific repository. Once installed, the webhook will be triggered each time one or more subscribed events occurs.

ghcrawler is more targeted towards collecting both public and private data by listening to webhook events even though it is possible to queue up different types of requests manually.

The following project may be of interest. It collects all GitHub publicly available data: http://ghtorrent.org/.

jeffmcaffer commented 6 years ago

The user processing in GHCrawler could be enhanced to follow the events for the user via https://developer.github.com/v3/activity/events/#list-public-events-performed-by-a-user but in normal circumstances this would only be triggered (at best) when GHCrawler is told about an event involving the user. As @geneh mentions, we can only get webhooks for repos and orgs so by definition, if one of your team ventures afield, you will not notice.

The other option is to periodically trigger a refresh of each team member's data. That would recrawl their events and potentially the entities related to those events. That might work though there are a couple caveats:

if the user event API is anything like the others, it will only return a limited number of events (e.g., 300). So you need to poll frequently enough to ensure events are not missed. (300 is a big number for a user but you'd need to verify the count)
I'm not sure if etags will help you here. That is, if the user's event list changes, does the user's object change. If so, you are in luck. If not, you will have to blindly fetch and traverse the events every time regardless.
Depending on how many people you want to track, this could get expensive from a token point of view

As to your specific questions

Not really
you are on the right path with the code. What is missing is the visitor map update. The code says what is possible to traverse, the map says what to actually traverse. That allows you to run a partial traversal. see https://github.com/Microsoft/ghcrawler/blob/develop/lib/visitorMap.js#L251. I am surprised that you are not seeing the repos being traversed. Perhaps they are being filtered out. See https://github.com/Microsoft/ghcrawler/blob/develop/lib/crawler.js#L574
exactly. For example, if you do this and your team really fans out across the open source world (awesome), you will end up crawling quite extensively. For example, if I touch the microsoft/vscode repo and you crawl that event and the related entities, you'll crawl the vscode repo. if you do that, you'll crawl all the commits and all the related users and ... In the current use of GHCrawler we can limit this by having a list of orgs to traverse. Anything outside that list is cut off (see the filter code cited above). That works as we know the set of interesting orgs. In your scenario, you don't.

As an aside, you might also consider https://www.gharchive.org/ which has the events for all of GitHub. That data is surfaced in Big Query and you can query that for activity related to your users. If you don't otherwise need GHCrawler, that might work well, If you are running GHCrawler anyway, ...

danisyellis commented 6 years ago

Thanks so much for the tips and info. This is still a WIP, but I wanted to write an update.

It's a bummer that GitHub doesn't have webhooks for users/{username}/events, but we're going to use crawler for this anyway and just queue it up manually on a regular interval. Mostly because, in the future, we'll also care about our org and the org's repos and it will be nice to have one tool for all GitHub crawling.

Currently, I am able to crawl a user's events, and will be working on trying to limit the crawling to only new events. (I'll check out etags, like you suggested.)

Incidentally, you were right that _shouldFilter was keeping me from getting a user's repos. Thanks.

danisyellis commented 6 years ago

I have a couple more questions (since they're sort of general, I'll be happy to throw the answers into the docs/wiki once legal approves my signing of the CLA) :

When I run the CIABatta :grin: I get containers for Mongo, Redis, RabbitMQ, and Metabase. If I run the crawler in-memory, I understand that there's no persistence of data because there's no Mongo. So my question is: is the in-memory crawler also running without Redis and RabbitMQ? Are they not necessary for very basic crawling that doesn't look at stored data but just crawls everything fresh?

I don't know much about RabbitMQ, but given that it's 'for queueing" I expected it to be necessary for crawler to function. If Redis and Rabbit aren't being used when running the crawler in-memory, what are they being used to do?

geneh commented 6 years ago

Please see https://github.com/Microsoft/ghcrawler/blob/develop/README.md#running-in-memory: Note that since you are running in memory, if you kill the crawler process, all work will be lost. This mode is great for playing around with the crawler or testing. RabbitMQ is one of supported queueing technologies: The crawler can be configured to use a variety of different queuing technologies (e.g., AMQP 1.0 and AMQP 0.9 compatible queues like Azure ServiceBus and Rabbit MQ, respectively), and storage systems (e.g., Azure Blob and MongoDB). You can create your own infrastructure plugins to use different technologies.

danisyellis commented 6 years ago

Hi Gene, I've read that documentation but it doesn't actually answer my questions. I'm a pretty new engineer so maybe there's a lot of information that 'running in memory' and ' queueing technologies' tell someone who's a more experienced engineer, but I don't have that context yet so, to me, this documentation is pretty sparse. It was enough for me to get the crawler running, but not enough to understand how it's working.

I'll try to elaborate on my questions- RabbitMQ is for queuing. When I run the crawler with Docker, RabbitMQ is there doing some sort of queueing for something. But when I run the crawler in-memory, I'm still able to queue things up to get entered into the crawler. Does that mean that RabbitMQ is running? (And, if so, where/how?) Or is the queueing that RabbitMQ does unrelated to the basic crawler queueing- maybe something more involved?

Similarly, is Redis running when I start the crawler up in-memory? What is Redis being used for? I know it's "for caching" but caching what? And when?

I understand that when you kill the crawler process running in-memory all work is lost. I assumed that was because there's no MongoDB running. But I don't understand exactly what Redis and RabbitMQ are doing and when.

jeffmcaffer commented 6 years ago

Hey @danisyellis, the basic point here is that the crawler is configurable with providers. There are providers for queuing, storage, ... We have providers for many different queuing technologies (Rabbit, AMQP 1.0, ServiceBus, Memory, ...). These providers just implement an API however they want and the rest of the system is unaffected.

You can, in general, mix and match providers. The classic production setup for us is to have Rabbit for queuing, Azure blob for storage, ... You may be using Mongo for storing docs. In-memory setups are used for testing and generally use in-memory or local file system setups for queuing, storage, ...

Redis is used pervasively in the system to coordinate and track multiple instances of the crawler as well as various ratelimiting and caching features. Redis is not used at all for the standard in-memory setup as there is only one crawler running and it is local etc.

Check out how the system is configured by following the code at https://github.com/Microsoft/ghcrawler/blob/develop/lib/crawlerFactory.js#L143

microsoft / ghcrawler

Add support for crawling a GitHub User’s various contributions, independent of a specific org or repo #146