iop-alliance / OpenKnowHow

A metadata specification to enable the collection of distributed, standardised metadata of open source hardware designs
GNU General Public License v3.0

Software architecture of the crawler? #121

Closed penyuan closed 3 years ago

penyuan commented 3 years ago

I confess I haven't read all the documentation yet, so sorry if I missed something. :sweat_smile:

As you know I'm trying to mine/crawl the commit histories and issues of GitHub repositories in a learning-by-doing kind of way.

During our meeting yesterday, @mkampik made the great point that rather than keeping this part in the dashboard backend, we could incorporate it into the crawler. That sounds like a neater/tidier implementation than essentially maintaining two crawlers (one just for commits and issues tied to the dashboard, and another for everything else). Another advantage is that this might make implementing the data ontology easier.

I suppose this would depend on the architecture and modularity of the crawler. Right now I just have a couple of Python scripts that call the GitHub (and Wikifactory) APIs. Would such incorporation be desirable or feasible? Would the crawler accept a plugin?
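For concreteness, here is a minimal sketch of the kind of script I currently have, pulling commits and issues for one repository via the GitHub REST API. The repo name and token handling are placeholders, not the actual scripts:

```python
import os
import requests

GITHUB_API = "https://api.github.com"
REPO = "some-org/some-repo"  # placeholder: any owner/repo to mine
HEADERS = {"Accept": "application/vnd.github.v3+json"}
token = os.environ.get("GITHUB_TOKEN")  # optional personal access token for higher rate limits
if token:
    HEADERS["Authorization"] = f"token {token}"

def fetch(endpoint, params=None):
    """Fetch one page of results from a GitHub REST endpoint."""
    response = requests.get(f"{GITHUB_API}{endpoint}", headers=HEADERS, params=params)
    response.raise_for_status()
    return response.json()

commits = fetch(f"/repos/{REPO}/commits", {"per_page": 100})
issues = fetch(f"/repos/{REPO}/issues", {"state": "all", "per_page": 100})
print(f"Fetched {len(commits)} commits and {len(issues)} issues (first page only)")
```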

I am not super opinionated on this, but curious what others think.

penyuan commented 3 years ago

This comment is an update: after our recent meetings, it sounds like we'll continue our independent efforts to develop the dashboard and the crawler, but the dashboard will include a module dedicated to moving data (two-way) between the dashboard and the crawler/Wikibase stack.

We will continue to coordinate closely on formulating the section of the ontology dedicated to the data needs of the dashboard.

moedn commented 3 years ago

@penyuan (I'm cleaning up the issues) The crawler will be published here: https://github.com/OPEN-NEXT/LOSH-krawler (developed by @ahane). It's all Python, and its building blocks aim to be as simple as possible. As shown in the README, the crawler will pull data from selected APIs (which are specified in the crawler), map it onto the ontology published in this repo, create a large RDF graph, convert it into Wikibase-specific JSON, and push it to the selected Wikibase instance.
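As a rough illustration of those stages, here is a schematic sketch using rdflib; the namespace, field names, and functions below are placeholders for illustration only, not the actual LOSH-krawler code:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Placeholder namespace; the real OKH ontology is the one published in this repo.
OKH = Namespace("https://example.org/okh#")

def map_to_ontology(record):
    """Map one raw API record onto the ontology as RDF triples."""
    g = Graph()
    subject = URIRef(record["repo_url"])
    g.add((subject, OKH.name, Literal(record["name"])))
    return g

def to_wikibase_json(graph):
    """Convert the mapped RDF into a minimal Wikibase-style item payload."""
    name = next(graph.objects(predicate=OKH.name))
    return {"labels": {"en": {"language": "en", "value": str(name)}}}

# Pipeline: fetch (via the crawler's API clients) -> map -> convert -> push.
record = {"name": "Example OSH project", "repo_url": "https://example.org/project"}
item_json = to_wikibase_json(map_to_ontology(record))
print(item_json)
# Pushing item_json to the Wikibase API is sketched in the example below.
```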

To move data from and to Wikibase, I'd recommend pushing to Wikibase's API directly. @hoijui developed a module that can push RDF ontologies to Wikibase using the old API (link), while @addshore is working on a new API which would very much simplify that process (link). Of course, there would also be a way to include the dashboard data as RDF and then have it converted by the crawler. However, I do think submitting to Wikibase directly may be easier :D
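For what it's worth, a direct push to a Wikibase instance's Action API could look roughly like this; the URL is a placeholder, authentication is omitted, and the item payload is just a minimal example:

```python
import json
import requests

API_URL = "https://wikibase.example.org/w/api.php"  # placeholder Wikibase instance
session = requests.Session()

# Authentication (action=login or OAuth) is skipped here; an anonymous CSRF
# token only works if the instance allows anonymous edits.
csrf = session.get(API_URL, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

item_data = {"labels": {"en": {"language": "en", "value": "Example OSH project"}}}

response = session.post(API_URL, data={
    "action": "wbeditentity",  # Wikibase API module for creating/editing entities
    "new": "item",
    "data": json.dumps(item_data),
    "token": csrf,
    "format": "json",
})
print(response.json())
```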

Hope that makes it more or less clear :) For any specific questions, feel free to re-open this issue or email me or the developers directly.