topcoder-platform / tc-elasticsearch-feeder-service


Improvement - Use Redis As The Unique Queue For Cronjob To Populate Index #6

Open skyhit opened 6 years ago

skyhit commented 6 years ago

I would like to improve the service to use Redis as a unique queue for the cronjob that populates the index.

The refactoring would look like this:

  1. Redis supports duplicate detection and can work as a unique queue, so if the same challenge or match needs an update in different periods, we only need to update it once; see https://redis.io/commands/sadd and https://redis.io/commands/spop (a sketch follows this list).
  2. the endpoints will push the challenge id or match id into Redis as a candidate for aggregating data and populating the index.
  3. there are cronjobs which will run periodically to find the changed challenge ids and match ids.
  4. there will be running threads which monitor Redis, pop challenge ids and match ids, do the real aggregation, and populate the index.
  5. for the initial load, we can have endpoints to trigger this on purpose at any time: just load every challenge id and match id and add them into the Redis set, and the running threads will take care of the rest.
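
To illustrate the unique-queue semantics, here is a minimal sketch assuming the Jedis client; the key name `challenge:ids` and the ids are just examples, not taken from the current codebase:

```java
import redis.clients.jedis.Jedis;

public class UniqueQueueSketch {

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Producer side: SADD ignores duplicates, so if the same challenge
            // needs an update in several periods it is still queued only once.
            jedis.sadd("challenge:ids", "30051234");
            jedis.sadd("challenge:ids", "30051234"); // no-op, already queued
            jedis.sadd("challenge:ids", "30059999");

            // Consumer side: a background thread pops one id at a time and does
            // the aggregation / indexing. SPOP removes the member atomically,
            // so two workers never process the same id at the same time.
            String challengeId;
            while ((challengeId = jedis.spop("challenge:ids")) != null) {
                System.out.println("aggregate and index challenge " + challengeId);
            }
        }
    }
}
```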

In the current architecture, we need to adjust the environment variables in order to do the initial load.

@sushilshinde @ajefts Let me know your thoughts about this approach.

sushilshinde commented 6 years ago

I think we should stick to Elasticsearch because it has a REST-based interface and is tuned for search, which is what we need most.

For duplicates, there are many practices to avoid them: https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch
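
One practice from that article is to use a deterministic document id, so re-indexing the same challenge overwrites the existing document instead of duplicating it. A rough sketch, assuming the Elasticsearch high-level REST client; the index name, mapping type and client setup are assumptions, not the feeder's actual code:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class DeterministicIdSketch {

    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            String challengeId = "30051234";
            String json = "{\"id\": 30051234, \"name\": \"Sample Challenge\"}";

            // Using the challenge id as the document _id makes indexing
            // idempotent: re-running the feeder overwrites this document
            // rather than creating a duplicate.
            IndexRequest request = new IndexRequest("challenges", "_doc", challengeId)
                    .source(json, XContentType.JSON);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}
```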

For the initial load, the code should handle the case where the index is already populated.

@cwdcwd your comments?

skyhit commented 6 years ago

@sushilshinde you misunderstand my approach; the final goal is unchanged, it is still to populate the Elasticsearch indexes.

What I am suggesting concerns how the Elasticsearch indexes get populated:

  1. the main part will be a thread which runs all the time, picks up the changed challenge ids or match ids from the Redis cache, and populates the indexes.

  2. but there can be different ways to push the challenge ids and match ids into the Redis cache, like an endpoint which can be used on purpose if we see that some data in the index is not updated; we can force an update by pushing the challenge id into the Redis cache, and the thread in (1) will pick it up and populate the indexes (see the sketch below).
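
Such a force-update endpoint could be as small as this; a sketch assuming a JAX-RS resource and a Jedis pool, with an illustrative path and key name:

```java
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

@Path("/elasticsearch/challenges")
public class ChallengeRefreshResource {

    private final JedisPool jedisPool;

    public ChallengeRefreshResource(JedisPool jedisPool) {
        this.jedisPool = jedisPool;
    }

    // Force a refresh of one challenge: just enqueue its id; the always-running
    // thread from point (1) will pop it and rebuild the index document.
    @PUT
    @Path("/{challengeId}")
    public Response refresh(@PathParam("challengeId") String challengeId) {
        try (Jedis jedis = jedisPool.getResource()) {
            jedis.sadd("challenge:ids", challengeId);
        }
        return Response.accepted().build();
    }
}
```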

It can also be a cronjob that monitors for changes.

Or it can be any other mechanism that just pushes the challenge ids and match ids into the Redis cache.
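
A cronjob variant could be a scheduled task that pushes whatever ids it finds into the same set; `findChallengeIdsChangedSince` below is hypothetical, only there to show the shape of the job:

```java
import java.time.Instant;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class ChangedChallengeJob {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final JedisPool jedisPool = new JedisPool("localhost", 6379);

    public void start() {
        // Every five minutes, look up the ids changed since the last run and
        // enqueue them; SADD makes repeated pushes of the same id harmless.
        scheduler.scheduleAtFixedRate(() -> {
            List<String> changedIds = findChallengeIdsChangedSince(Instant.now().minusSeconds(300));
            try (Jedis jedis = jedisPool.getResource()) {
                for (String id : changedIds) {
                    jedis.sadd("challenge:ids", id);
                }
            }
        }, 0, 5, TimeUnit.MINUTES);
    }

    // Hypothetical lookup: in the real service this would query the challenge
    // database for records modified after the given instant.
    private List<String> findChallengeIdsChangedSince(Instant since) {
        return Collections.emptyList();
    }
}
```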