Open stuartlangridge opened 7 years ago

It would be useful to fetch only some data from GitHub; for example, if I only need issues, pull_requests, and repos, to not have to fetch commits or issue_comments, so as to reduce how much I need to hit GitHub. Is this possible? The docs on policies seem to suggest that it might be doable, but I don't think I understand well enough how to do it; perhaps a doc clarification? (Or alternatively, saying "you can't" would be OK here too.)
You can, using the visitor map concept. Essentially you create an object that has every node and edge you want to traverse. Take a look at the end of visitorMap.js; there you will see the full list of everything the crawler can currently traverse.

For your case, you would create a scenario (e.g., `minimal`) and then add a map for what to do at each entity you care about. Then you can use that scenario when queuing a request to traverse. Something like the following allows the crawler to traverse only repos, users, issues, and pull requests. The key to remember is that the `_type` is used by the crawler to find the map for a given entity it encounters, and the properties of the map identify the edges out of the entity that the crawler is allowed to traverse. The value of a property in a map is the map to use when you reach the other end of the edge. (Drawing it out helps...)
```js
// Each map lists the edges the crawler may follow out of an entity.
// `self`, `neighbors`, and `collection` are helpers defined in visitorMap.js.
// `user` and `issue` are declared first so the later maps can reference them.
const user = {
  _type: 'user'
};
const issue = {
  _type: 'issue',
  user: self,
  repo: self,
  assignee: self,
  closed_by: self
};
const repo = {
  _type: 'repo',
  owner: self,
  issues: collection(issue)
};
const pull_request = {
  _type: 'pull_request',
  user: self,
  merged_by: self,
  assignee: self,
  head: self,
  base: self,
  issue: issue
};
const minimal = {
  self: self,
  neighbors: neighbors,
  repo: repo,
  issue: issue,
  pull_request: pull_request
};
mapList.minimal = minimal;
```
From there you can reference the `minimal` scenario when queuing a request. Check out the policy spec doc.
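For example, a queued request might look like the following. This is only a sketch that assumes the `policyName:map/path` policy form that comes up later in this thread; check the policy spec doc for the exact syntax.

```json
{
  "type": "repo",
  "url": "https://api.github.com/repos/Microsoft/ghcrawler",
  "policy": "default:minimal/repo"
}
```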
Fully get that this is not as easy as one might hope. Ideally the set of maps is something that comes from a configuration file or some such. We could totally do that and a PR to that effect would certainly be welcomed.
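To make the idea concrete, here is a purely illustrative sketch of what such a configuration-file loader could look like; nothing like this exists in ghcrawler today, and the token names (`"self"`, `"neighbors"`, `{"collection": ...}`) are assumptions. It would have to live in visitorMap.js so the existing helpers and `mapList` are in scope.

```js
// Hypothetical: build visitor maps from a plain JSON description, e.g.
// { "minimal": { "self": "self", "repo": "repo" },
//   "repo": { "_type": "repo", "owner": "self",
//             "issues": { "collection": "issue" } }, ... }
function loadScenario(name, config) {
  const maps = {};
  // First pass: create one object per named map so edges can reference each other.
  for (const mapName of Object.keys(config)) {
    maps[mapName] = {};
  }
  // Second pass: translate each JSON value into the visitorMap.js helpers.
  for (const [mapName, spec] of Object.entries(config)) {
    for (const [key, value] of Object.entries(spec)) {
      if (key === '_type') maps[mapName]._type = value;
      else if (value === 'self') maps[mapName][key] = self;
      else if (value === 'neighbors') maps[mapName][key] = neighbors;
      else if (value && value.collection) maps[mapName][key] = collection(maps[value.collection]);
      else maps[mapName][key] = maps[value]; // an edge to another named map
    }
  }
  mapList[name] = maps[name];
}
```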
A follow-up on this issue: I've been looking into this with the intention of using new visitor maps. Once I've worked it out, I'll put together something that allows supplying a custom visitor map via a JSON configuration file or similar, which will be nice. However, I don't fully understand it yet, and so I have questions. Let's imagine that I plan to use a custom map as follows, by dropping this code into visitorMap.js:
```js
const sil_issue = { _type: 'issue', user: self, repo: self, closed_by: self, assignee: self };
const sil_repo = { _type: 'repo', owner: self, organization: self, issues: collection(issue) };
const sil = { self: self, issue: sil_issue, repo: sil_repo };
mapList.sil = sil;
```
How do I correctly queue a request to use that map?
From the dashboard (or via the API) I can queue a request by passing an object with `type`, `url`, and `policy` keys. So, if I want to queue a particular repository (say, Microsoft/ghcrawler!), I pass `{"type": "repo", "url": "https://api.github.com/repos/Microsoft/ghcrawler", "policy": "???"}`, but I'm very unsure what to put in the `policy` key. I would think it would be something like `default:sil/repo`, to use the `default` policyName, my custom visitor map, and a repo fetch, but that fetches just the repo details into the Mongo `repo` collection and doesn't fetch any issues at all. I would have expected the crawler to follow my edge to the `sil_issue` map and fetch all the associated issues as well, but it doesn't. Is this because I've got the policy spec wrong, or because I'm specifying the visitor map wrong, or something else? Happy to hear any guidance you may have here.
Phew, I'm going to have to dig into the code on this one. The feature (being able to spec maps) is there but has seen relatively little use. I'm pretty sure it's possible but will need to look carefully. You are on the right path (no pun intended) but there is likely a subtlety to the way the map path is spec'd in the request.