Closed: mishu- closed this issue 10 years ago
cc @mihneadb
Worth mentioning that the way to bring data in either:
1. requires an action from someone who has access rights to the repo (even if it's a public repo), or
2. doesn't.

Can you please provide the documentation links for both 1 and 2?
Part of the email conversation:
Hey Mihnea,
> How do you suggest I get the events that I haven't seen so far? Using a timestamp? The github archive scraper collects lots of events, I'm guessing more than 300 so there should be a way, right?
As I mentioned before, we can only provide a history of up to 300 events currently. If you need to collect more than that, the only way to do it is to periodically fetch events from the API and store them locally. I'm guessing that the (Unofficial) GitHub Archive project is doing exactly that - polling our API with a high frequency to pick up all events. If you need to go further back in history and need to do it now - there is no workaround for that except querying the archive project.
> By doing a simple check (cat | sort | uniq | wc -l) I found that indeed there are just 300 unique events. However, your API didn't reply with "last" as a page number, it let my script keep polling.
Ooops, sorry about that! I noticed that our documentation says that we will return a "last" link, and in fact we aren't. I'll see if we can do something to correct that - thanks for the report!
Glad you were able to figure out what was going on! Let me know if you have any other questions or feedback.
Cheers,
Ivan
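To make the polling approach Ivan describes concrete, here is a minimal sketch of a standalone poller that fetches the public events feed on a fixed interval and leaves the storage step as a stub. The endpoint is the real `https://api.github.com/events`, but the interval, headers, and dedup strategy are assumptions, not anything that exists in this repo yet (unauthenticated requests are rate-limited, so a real poller would need a token).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EventsPoller {
    public static void main(String[] args) throws Exception {
        // The API only keeps roughly the latest 300 events, so the poll
        // interval has to be short enough to avoid gaps between polls.
        while (true) {
            URL url = new URL("https://api.github.com/events?per_page=100");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/vnd.github.v3+json");
            // GitHub rejects requests without a User-Agent header.
            conn.setRequestProperty("User-Agent", "events-poller-example");

            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }

            // Store the JSON payload locally (file, queue, or ES index),
            // deduplicating on each event's "id" field before indexing.
            System.out.println(body.length() + " bytes fetched");

            Thread.sleep(60_000); // poll once a minute (placeholder interval)
        }
    }
}
```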
Found some useful resources for this (there doesn't seem to be an already-implemented gh river):
https://github.com/elasticsearch/elasticsearch-river-twitter/blob/master/src/main/java/org/elasticsearch/river/twitter/TwitterRiver.java
http://blog.trifork.com/2013/01/10/how-to-write-an-elasticsearch-river-plugin/
Need
Right now the implementation relies on parsing dump files updated via a cron job; it would be nice to have a cleaner way to bring GitHub data into the ES database.
Proposed Solution
Create an ES river (http://www.elasticsearch.org/blog/the-river/) that pulls data from GitHub into ES directly.
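As a starting point, a rough skeleton of what such a river could look like, modeled on the TwitterRiver linked above. The class name, thread name, and the index/type names in the comment are placeholders; the actual GitHub fetching logic and the plugin registration boilerplate are left out.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.river.AbstractRiverComponent;
import org.elasticsearch.river.River;
import org.elasticsearch.river.RiverName;
import org.elasticsearch.river.RiverSettings;

public class GitHubRiver extends AbstractRiverComponent implements River {

    private final Client client;
    private volatile Thread pollerThread;
    private volatile boolean closed = false;

    @Inject
    public GitHubRiver(RiverName riverName, RiverSettings settings, Client client) {
        super(riverName, settings);
        this.client = client;
    }

    @Override
    public void start() {
        // Background thread that periodically polls the GitHub API and
        // indexes each event into ES, mirroring the TwitterRiver structure.
        pollerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                while (!closed) {
                    // TODO: fetch events and index them, e.g.
                    // client.prepareIndex("github", "event", eventId)
                    //       .setSource(eventJson).execute().actionGet();
                    try {
                        Thread.sleep(60000); // placeholder poll interval
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }, "github_river_poller");
        pollerThread.start();
    }

    @Override
    public void close() {
        closed = true;
        if (pollerThread != null) {
            pollerThread.interrupt();
        }
    }
}
```

This would replace the cron-driven dump parsing with a component that lives inside ES itself and keeps the index up to date on its own schedule.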
Notes
This issue is a stub.