Closed: mishu- closed this issue 10 years ago
cc @mihneadb
Worth mentioning that the way to bring data in either:
1. requires an action from someone who has access rights to the repo (even if it's a public repo), or
2. doesn't.

Can you please provide the documentation links for both 1 and 2?
Part of the email conversation:
Hey Mihnea,
> How do you suggest I get the events that I haven't seen so far? Using a timestamp? The github archive scraper collects lots of events, I'm guessing more than 300 so there should be a way, right?
As I mentioned before, we can only provide a history of up to 300 events currently. If you need to collect more than that, the only way to do it is to periodically fetch events from the API and store them locally. I'm guessing that the (Unofficial) GitHub Archive project is doing exactly that - polling our API with a high frequency to pick up all events. If you need to go further back in history and need to do it now - there is no workaround for that except querying the archive project.
> By doing a simple check (cat | sort | uniq | wc -l) I found that indeed there are just 300 unique events. However, your API didn't reply with "last" as a page number, it let my script keep polling.
Ooops, sorry about that! I noticed that our documentation says that we will return a "last" link, and in fact we aren't. I'll see if we can do something to correct that - thanks for the report!
Glad you were able to figure out what was going on! Let me know if you have any other questions or feedback.
Cheers,
Ivan
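To make the polling approach Ivan describes concrete, here is a minimal sketch of a standalone poller that fetches the public events feed on a fixed interval and leaves the storage step as a stub. The endpoint is the real `https://api.github.com/events`, but the interval, headers, and dedup strategy are assumptions, not anything that exists in this repo yet (unauthenticated requests are rate-limited, so a real poller would need a token).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EventsPoller {
    public static void main(String[] args) throws Exception {
        // The API only keeps roughly the latest 300 events, so the poll
        // interval has to be short enough to avoid gaps between polls.
        while (true) {
            URL url = new URL("https://api.github.com/events?per_page=100");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/vnd.github.v3+json");
            // GitHub rejects requests without a User-Agent header.
            conn.setRequestProperty("User-Agent", "events-poller-example");

            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }

            // Store the JSON payload locally (file, queue, or ES index),
            // deduplicating on each event's "id" field before indexing.
            System.out.println(body.length() + " bytes fetched");

            Thread.sleep(60_000); // poll once a minute (placeholder interval)
        }
    }
}
```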
Found some useful resources for this (there doesn't seem to be an already-implemented gh river):
https://github.com/elasticsearch/elasticsearch-river-twitter/blob/master/src/main/java/org/elasticsearch/river/twitter/TwitterRiver.java
http://blog.trifork.com/2013/01/10/how-to-write-an-elasticsearch-river-plugin/
Need
Right now the implementation relies on parsing dump files updated via a cron job; it would be nice to have a cleaner way to bring GitHub data into the ES database.
Proposed Solution
Create an ES river (http://www.elasticsearch.org/blog/the-river/) that pulls data from GitHub into ES directly.
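As a starting point, a rough skeleton of what such a river could look like, modeled on the TwitterRiver linked above. The class name, thread name, and the index/type names in the comment are placeholders; the actual GitHub fetching logic and the plugin registration boilerplate are left out.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.river.AbstractRiverComponent;
import org.elasticsearch.river.River;
import org.elasticsearch.river.RiverName;
import org.elasticsearch.river.RiverSettings;

public class GitHubRiver extends AbstractRiverComponent implements River {

    private final Client client;
    private volatile Thread pollerThread;
    private volatile boolean closed = false;

    @Inject
    public GitHubRiver(RiverName riverName, RiverSettings settings, Client client) {
        super(riverName, settings);
        this.client = client;
    }

    @Override
    public void start() {
        // Background thread that periodically polls the GitHub API and
        // indexes each event into ES, mirroring the TwitterRiver structure.
        pollerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                while (!closed) {
                    // TODO: fetch events and index them, e.g.
                    // client.prepareIndex("github", "event", eventId)
                    //       .setSource(eventJson).execute().actionGet();
                    try {
                        Thread.sleep(60000); // placeholder poll interval
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }, "github_river_poller");
        pollerThread.start();
    }

    @Override
    public void close() {
        closed = true;
        if (pollerThread != null) {
            pollerThread.interrupt();
        }
    }
}
```

This would replace the cron-driven dump parsing with a component that lives inside ES itself and keeps the index up to date on its own schedule.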
Notes
This issue is a stub.