We might be able to simplify and streamline our sync with a sync API. Currently, for StackExchange sites, we fetch users ordered by reputation (descending!) and push them into a database. The sync happens in scheduled batches, and every time we resume we first look in the working table of our database to find the user with the lowest (!) reputation so that we know where we left off.
This pattern can actually be made more generic. Let's assume we have something like this:

    var syncService = new ninya.api.SyncService({
      // compares two users to decide which one comes later in the sync order
      comesLaterPredicate: function(userA, userB) { ... }
    });

    syncService
      .resumeSync('apiKey')
      .then(function(lastUser) {
        // figure out where we left off with the sync last time based on the `lastUser`

        // this would be in a loop
        syncService.addOrUpdate({
          id: 'stackOverflow_4711',
          location: 'Hannover, Germany',
          name: 'Some Guy',
          ...
        });
      });

If we assume that any sync always has to maintain some sort of order, then all we are really interested in is which user was the last one added during the previous sync run. It may be the user with the lowest reputation, or the one with the fewest followers, etc. However, we can't just memorize the most recently added user, because we should not assume that users are added in sequential order (we may want to run `addOrUpdate` in parallel). That shouldn't be a tough challenge, though: we can feed the `syncService` a `comesLaterPredicate` which compares each newly pushed user with the one we currently hold as the last user and only overwrites it if the new user matches our criteria (e.g. has a lower reputation). The cool thing is that we only have to maintain this last known user per sync target, instead of a huge table with all the users which we need to scan for the user with the lowest reputation. It also means that we can push users directly into Elasticsearch and drop Postgres entirely.
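To make the idea a bit more concrete, here is a minimal in-memory sketch of how the last known user could be tracked per sync target. `createLastUserTracker`, `observe` and `getLastUser` are made-up names for illustration, and the argument order of the predicate (new user first, current last user second) is an assumption, not something we have decided yet:

```javascript
// Hypothetical helper: remembers only the "last" user per sync target,
// no matter in which order users are pushed.
function createLastUserTracker(comesLaterPredicate) {
  // One entry per sync target (keyed by API key here, which is an assumption).
  var lastUserByTarget = Object.create(null);

  return {
    // Called for every pushed user; may be called in any order.
    observe: function (apiKey, user) {
      var current = lastUserByTarget[apiKey];
      // Overwrite only if there is no last user yet or the new one comes later.
      if (current === undefined || comesLaterPredicate(user, current)) {
        lastUserByTarget[apiKey] = user;
      }
    },
    getLastUser: function (apiKey) {
      return lastUserByTarget[apiKey] || null;
    }
  };
}

// For the StackExchange sync, "comes later" means "has a lower reputation".
var tracker = createLastUserTracker(function (userA, userB) {
  return userA.reputation < userB.reputation;
});

// Users arrive out of order, e.g. because addOrUpdate runs in parallel.
tracker.observe('apiKey', { id: 'stackOverflow_1', reputation: 900 });
tracker.observe('apiKey', { id: 'stackOverflow_3', reputation: 700 });
tracker.observe('apiKey', { id: 'stackOverflow_2', reputation: 800 });

console.log(tracker.getLastUser('apiKey').id); // logs "stackOverflow_3"
```

In a real implementation the last user would have to be persisted somewhere, and the compare-and-overwrite step would need to be atomic if several workers push at the same time.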
Another thing the `syncService` would take care of is preventing a sync from overwriting entities that were created by a sync with a different API key. This is not only to keep us from shooting ourselves in the foot, but also because we might want to make this API public in the future to make it easier for the community to come up with new ninya integrations.
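The ownership check could look roughly like the following sketch. It is an in-memory stand-in only: `createOwnedStore`, the `ownerApiKey` field and passing the API key to `addOrUpdate` are assumptions for illustration, not a proposal for the final signature:

```javascript
// Hypothetical in-memory store that remembers which API key created an entity.
function createOwnedStore() {
  var entities = Object.create(null);

  return {
    addOrUpdate: function (apiKey, entity) {
      var existing = entities[entity.id];
      // Refuse to overwrite entities that were created by a different sync.
      if (existing && existing.ownerApiKey !== apiKey) {
        throw new Error('Entity ' + entity.id + ' was created by a different API key');
      }
      entities[entity.id] = { ownerApiKey: apiKey, data: entity };
    },
    get: function (id) {
      return entities[id] ? entities[id].data : null;
    }
  };
}

var store = createOwnedStore();
store.addOrUpdate('keyA', { id: 'stackOverflow_4711', name: 'Some Guy' });
// Same key: updating is fine.
store.addOrUpdate('keyA', { id: 'stackOverflow_4711', name: 'Some Other Guy' });

try {
  // Different key: rejected.
  store.addOrUpdate('keyB', { id: 'stackOverflow_4711', name: 'Intruder' });
} catch (e) {
  console.log(e.message);
}
```

Throwing is just one option here; rejecting a promise or silently skipping the entity would work as well, depending on how strict we want the API to be.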
We would also need a method `syncService.resetSync('apiKey')` to start over once we have reached our sync target (e.g. the top 150k users of StackOverflow).
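As a rough sketch of how `resumeSync` and `resetSync` could relate to each other: resetting simply forgets the last known user for that API key, so the next `resumeSync` resolves with nothing and the sync starts from the top again. `createSyncCursor` and `remember` are hypothetical names, and keeping the state in memory is only for illustration:

```javascript
// Hypothetical state holder for the "where did we leave off" information.
function createSyncCursor() {
  var lastUserByApiKey = Object.create(null);

  return {
    // Stand-in for whatever addOrUpdate would record internally.
    remember: function (apiKey, user) {
      lastUserByApiKey[apiKey] = user;
    },
    // Resolves with the last known user, or null if there is none.
    resumeSync: function (apiKey) {
      return Promise.resolve(lastUserByApiKey[apiKey] || null);
    },
    // Forgets the last known user so the next run starts over.
    resetSync: function (apiKey) {
      delete lastUserByApiKey[apiKey];
      return Promise.resolve();
    }
  };
}

var cursor = createSyncCursor();
cursor.remember('apiKey', { id: 'stackOverflow_4711', reputation: 1200 });

cursor
  .resetSync('apiKey')
  .then(function () {
    return cursor.resumeSync('apiKey');
  })
  .then(function (lastUser) {
    console.log(lastUser); // logs null, i.e. start from the beginning
  });
```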
Ok, this is just off the top of my head and might sound like gibberish to anybody else, but I'm still dragging @kyjan @PascalPrecht in here ;-)