tenex / opensourcecontributors

Find all contributions for a user through the GitHub Archive

Use event_id as ID where possible #52

Open hut8 opened 8 years ago

hut8 commented 8 years ago

This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes. However, the new ones all do. Maybe I could replace the MongoDB _id attribute with event_id for the post-2015 events. Dropping that index would likely result in a huge increase in insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them in memory is what really slows things to a crawl.

Thoughts, @joshjordan ?
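
A minimal sketch of the idea above, in Python, assuming events are plain dictionaries and that the field is stored as `_event_id` (the name used later in this thread). The function name `use_event_id_as_primary_key` is hypothetical and not taken from the project. Because MongoDB always maintains a unique index on `_id`, promoting the event ID to `_id` would let that mandatory index do the job of a separate unique index.

```python
from typing import Any, Dict


def use_event_id_as_primary_key(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `event` prepared for insertion.

    Assumption: post-2015 events carry an `_event_id` field and pre-2015
    (timeline) events do not. When the field is present it is moved to
    `_id`; otherwise the copy is returned unchanged, so MongoDB would
    generate an ObjectId for it at insert time.
    """
    doc = dict(event)  # shallow copy, the caller's dict is not mutated
    event_id = doc.pop("_event_id", None)
    if event_id is not None:
        doc["_id"] = event_id
    return doc
```

With this shape, a second insert of the same event would be rejected by the built-in `_id` index, and pre-2015 documents would keep their generated IDs.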

joshjordan commented 8 years ago

I think that is definitely worthwhile. I didn't realize Mongo was trying to keep 4 indexes in memory. Is it also possible to specify which indexes should be on disk vs in memory?

(Sent by email in reply to the comment above, which Liam posted on Mon, Feb 1, 2016 at 9:29 AM.)

s2t2 commented 8 years ago

I just came across a few event objects missing an _event_id attribute and was wondering what was going on. Regardless of how you decide to handle this in mongo on the back-end, as an API consumer of these events, it would be confusing to expect an integer _event_id and instead get a string representation of the _id attribute.

hut8 commented 8 years ago

The _event_id attribute is only present in events that came from the "Event API", which covers events from January 1, 2015 onward. Prior to that, the GitHub Archive was using the Timeline API, which didn't have an "Event ID" per se. The main reason I'm actually using an index on the _event_id field (or dealing with that field at all) is to work around the fact that you can't atomically load thousands of documents in MongoDB, so a unique index on it guarantees duplicates aren't inserted. I should probably document that better :smile:
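
The duplicate-guard pattern described above can be sketched with pymongo roughly as follows. This is an illustration under assumptions, not the project's actual loader: the helper names are hypothetical, the collection is assumed to already have a unique index on `_event_id`, and making that index sparse (so events without the field are not indexed) is also an assumption.

```python
from typing import Any, Dict, Iterable, List

DUPLICATE_KEY = 11000  # MongoDB server error code for a duplicate-key violation


def non_duplicate_errors(details: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the write errors from a bulk result that are NOT duplicate-key errors."""
    return [
        err for err in details.get("writeErrors", [])
        if err.get("code") != DUPLICATE_KEY
    ]


def insert_skipping_duplicates(collection: Any, docs: Iterable[Dict[str, Any]]) -> int:
    """Bulk-insert `docs`, ignoring events that were already loaded.

    Assumes the index was created beforehand, for example with
    collection.create_index("_event_id", unique=True, sparse=True).
    Returns the number of documents actually inserted.
    """
    from pymongo.errors import BulkWriteError  # lazy import: only needed when inserting

    batch = list(docs)
    if not batch:
        return 0  # insert_many rejects an empty list
    try:
        # ordered=False keeps inserting the rest of the batch after a failure
        result = collection.insert_many(batch, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        if non_duplicate_errors(exc.details):
            raise  # something other than a duplicate went wrong
        return exc.details.get("nInserted", 0)
```

If a load is interrupted partway and rerun, the documents that made it in the first time are rejected by the unique index and skipped, which is what makes a non-atomic bulk load safe to repeat.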