Strategical Crawler Planning

moedn commented 4 years ago

Based on this data flow diagram and undocumented feedback I got from the community, I see 3 main roads for the crawler at the moment:

call GitHub API → easy (not a real crawler, but meeting the project goal)
brute force crawling GitHub → certainly expensive
'crawling' manually curated lists like people/organisations do in software (also not a real crawler, but meeting the project goal)
- e.g. a list for lab equipment, a list for agricultural machines etc.
- each list has at least one maintainer
- people can register their OSH there or the maintainer finds or hears about it and updates the list on his-/herself
- → projects would get better connected in (technology-)specific fields

@hoijui @penyuan @dubsnipe any opinions on that? :)

hoijui commented 4 years ago

yesterday I looked at the github API and tried some things to search for repos containing a specific filename, but I failed. It is really strange though, as there are a lot of search possibilities available that are much more expensive for github (like searching for code in all repos). I found some things, but none of them worked. I think it would be the best solution, and I find it hard to believe that nobody did it yet. The question is just, who and where is it!

... And of course, this would have to be found out for each platform separately. Thus it might make sense to do a brute force, HTTP/HTML based crawler that works on all platforms, and try to do this more eloquent and efficient approach just for github and gitlab at first.

penyuan commented 4 years ago

As you know I've been wrestling with the GitHub API (v3 REST and v4 GraphQL) for the past few months.

Some operations are indeed easy as you described, but others are surprisingly in-efficient. Two examples:

I wanted to get every single commit that has been made to a GitHub repository. The GraphQL API forces me query every branch, get the list of commits from each, de-duplicate commits that belong to multiple branches, then combine the data. And since the GraphQL API is paginated to only give you 100 results per query, it can take a long time when a repository has thousands of commits across 5-6 branches... -_-''' OTOH, if I use the REST API, I can get the complete list of commits with just a couple of queries, but they returned tons of unrelated data and the commits only make up a small fraction of it. I've raised this issue in the GitHub Community forums, and to their credit an actual GitHub employee responded to my concerns but without a better solution or any commitment on improving the API. Oh well.
Unsurprisingly, I also want a list of the files that were changed by each commit. This is (directly confirmed by GitHub) not doable with the GraphQL API and with the older REST API, doable but also comes with a mountain of unrelated data to weed through.

That said, I've have no experience with brute-force scraping. My suspicion is that brute-force is computationally expensive and more prone to breakage while using the API is more time-intensive (e.g. waiting for multiple API queries and rate-limiting).

For WP2.2's dashboard I'm still using the GitHub API. And since I also have to scrape Wikifactory, I hope their API is more friendly. :smile:

I actually kind of like curated lists and semi-regularly visit a few of them. Were you thinking of all those "awesome lists" that people maintain on GitLab/GitHub? If those lists have relatively consistent layouts/markup, then maybe a crawler for them wouldn't be so bad? My concern would be that those lists usually just link to projects, and once you get to the projects you will still need to mine their actual repositories, many of which are on GitHub et al. and you get back to the API vs brute-force problem?

hoijui commented 4 years ago

hmmm :/ ok, what you tried to do was more.. getting lots of info about a single repo, while we want rather a tiny bit of info about a lot of repos; basically just: does path "/okh.yml" exist in latest main branch? but yeah... seems the API is not good for both; though I would not trust my research so far to have been exhaustive of the possibilities.

curated lists... what about a kind of combined approach:

we get a list of all repos on github that have the hardware tag/topic, and regularly check through them, whether they have the meta-data file, or whether it changed. thus this is kind of an automatically "curated" list. in less frequent intervals, we check if new repos using the hardware tag have been added. additionally, we supply a web-form to manually request re-visiting a specific repo, in case somebody is eager.

The requirement of the hardware tag helps because:

it allows to separate hardware- from non-hardware-repos
it makes crawling much easier, as we only have to fetch and check through a tiny percentage of repos

... it also excludes repos that do not have the tag, but people already need to know about adding the meta-data file anyway, so that is a small additional hurdle, and even apart from requireing the tag for our crawler, it makes sense that hardware related repos have this tag.

both GitLab and GitHub hav repo topics (essentially tags). on other sites that might not have them, we could just fall back onto crawling all repos.

... looking for the meta-data file would happen through brute force crawling, if not possible through the API, then through HTTP/HTML.

does that make sense?

moedn commented 4 years ago

That sounds like a good compromise :+1: Especially just checking whether a certain URL exists seems much less expensive. That may not be crawling in its original sense, but much more like how we need it. This could also be applied for sites like Appropedia

Tags are a great idea, however does anyone use hardware tags on GitHub/GitLab? Haven't seen them so far; @hoijui do you have an example?

hoijui commented 4 years ago

tag/topic:

GitHub reports 2000 for hardware
- 12 for osh
GitLab reports no number, but lists some anyway for hardware
- 0 for osh

The idea is, that we tell the people adding the meta-data files to also add the tags 9which is a much smaller task). ... But while it is possible to make a pull-request for adding the meta-data file, the tag can only be suggested in there or in an issue, should the meta-data file already be in the repo.

penyuan commented 4 years ago

I also like the idea of filtering for tags. Questions:

Exactly which tags should the crawler look for? Just hardware? Or hw, osh, open-source-hardware, etc. etc. etc.?
I actually used the tags hardware and open-source-hardware for the dashboard repo, but it's clearly not a hardware development project. :stuck_out_tongue_closed_eyes: It is just about hardware. Would repositories like this be a problem for the crawler?

ahane commented 4 years ago

The Github API supports this: curl -u $USERNAME https://api.github.com/search/code?q=filename:okh.yml (you will be asked for your password)

Docs: https://docs.github.com/en/free-pro-team@latest/github/searching-for-information-on-github/searching-code

Same search in the web interface: https://github.com/search?p=4&q=filename%3Aokh.yml&type=Code

there are currently 38 results (includes a few false positives).

interestingly enough most people seem to use a filename like this: okh-$PROJECTNAME.yml luckly the github search doesnt seem to mind.

iop-alliance / OpenKnowHow

Strategical Crawler Planning #122