Open hut8 opened 8 years ago
I think that is definitely worthwhile. I didn't realize Mongo was trying to keep 4 indexes in memory. Is it also possible to specify which indexes should be on disk vs in memory?
On Mon, Feb 1, 2016 at 9:29 AM Liam notifications@github.com wrote:
This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes. However, the new ones all do. Maybe I could replace the MongoDB _id attribute with event_id for the post-2015 events. Dropping that index would likely result in a huge increase in insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them in memory is what really slows things to a crawl.
Thoughts, @joshjordan https://github.com/joshjordan ?
I just came across a few event objects missing an _event_id attribute and was wondering what was going on. Regardless of how you decide to handle this in mongo on the back-end, as an API consumer of these events, it would be confusing to expect an integer _event_id and instead get a string representation of the _id attribute.
The _event_id attribute is only present in events that were from the "Event API", which includes "events" from January 1, 2015 on. Prior to that, the GitHub Archive was using the Timeline API, which didn't have an "Event ID" per se. The main reason I'm actually using an index on the _event_id field (or dealing with that field at all) is to work around the fact that you can't atomically load thousands of documents in MongoDB, so a unique index on it guarantees duplicates aren't inserted. I should probably document that better :smile:
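For anyone reading along, here is a rough sketch of the dedupe-by-unique-index idea described above, written against PyMongo. The function names are made up for illustration, and `sparse=True` is an assumption on my part: without it, a unique index treats a missing _event_id as null, so only one pre-2015 document could be stored. This is not necessarily how the loader in this repo does it.

```python
def ensure_event_id_index(collection):
    """Create a unique index on _event_id so duplicate events are rejected.

    sparse=True (an assumption) leaves documents without the field,
    such as the pre-2015 Timeline API events, out of the index.
    """
    return collection.create_index([("_event_id", 1)], unique=True, sparse=True)


def insert_batch(collection, events):
    """Insert a batch of events and return how many were actually written.

    Hypothetical helper: duplicates rejected by the unique index are
    skipped, any other write error is re-raised.
    """
    if not events:
        return 0  # insert_many refuses an empty list

    from pymongo.errors import BulkWriteError  # requires pymongo

    try:
        # ordered=False keeps going past a failed document instead of
        # aborting the rest of the batch.
        result = collection.insert_many(events, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        errors = exc.details.get("writeErrors", [])
        # 11000 is MongoDB's duplicate-key error code.
        if any(err.get("code") != 11000 for err in errors):
            raise
        return exc.details.get("nInserted", 0)
```

With something like this, re-running a partially loaded batch is safe: documents already stored fail with a duplicate-key error and the rest go in.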
This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes. However, the new ones all do. Maybe I could replace the MongoDB _id attribute with event_id for the post-2015 events. Dropping that index would likely result in a huge increase in insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them in memory is what really slows things to a crawl.
Thoughts, @joshjordan ?
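To make the proposal concrete, a minimal sketch of what the document mapping could look like. The helper name is hypothetical, and I'm assuming the stored field is called `_event_id` (the thread uses both spellings). Post-2015 events get their event ID as `_id`, so the built-in `_id` index does the duplicate check; pre-2015 events are left without an `_id` so MongoDB assigns an ObjectId as usual.

```python
def use_event_id_as_primary_key(event):
    """Return a copy of the event with _id set from _event_id when present.

    The _event_id field itself is kept, so API consumers that read it
    still see the integer value.
    """
    doc = dict(event)  # shallow copy; the caller's dict is not modified
    event_id = doc.get("_event_id")
    if event_id is not None:
        doc["_id"] = event_id
    return doc
```

For example, `use_event_id_as_primary_key({"_event_id": 123, "type": "PushEvent"})` yields a document whose `_id` is 123, while a Timeline API event without the field passes through unchanged.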