
[System Design and Scalability] interview questions #5 #124

Closed rygh4775 closed 5 years ago

rygh4775 commented 5 years ago

Cache: Imagine a web server for a simplified search engine. This system has 100 machines to respond to search queries, which may then call out using processSearch(string query) to another cluster of machines to actually get the result. The machine which responds to a given query is chosen at random, so you cannot guarantee that the same machine will always respond to the same request. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Be sure to explain how you would update the cache when data changes.

btakeya commented 5 years ago

Requirements

  1. Distributed search processing: requests go to a single computing cluster, and the node that processes a given request can differ each time.
  2. Caching results: processSearch() is an expensive operation, so reduce its cost by introducing a cache.
    • A cache update strategy needs to be defined.
    • (Assumption) The same keyword always produces the same result.

Design

  3. Communicate via a single queue: To satisfy Requirement #1, each request is enqueued onto a request queue, and whichever computing node (called a Worker) is ready takes the next request from the queue.
  4. Introduce a distributed cache: To satisfy Requirement #2, the Workers in the cluster need some way to share keyword-to-result mappings.
    • Since we assume the same keyword always yields the same result, a Worker that receives an already-processed request (whose result is in the cache) should not execute processSearch() but read from the cache: look up the cache first, and only on a miss execute processSearch() and store the result.
    • The size of the cache is limited, of course, so a cache eviction policy is needed. LRU (Least Recently Used) is a good fit, I think: when the cache hits its size limit, discard the least recently used entry. This means keywords that frequently trigger processSearch() requests stay in the cache.
  5. Design for an item in the cache:

     case class CacheItem(keyword: String, result: String, var lastAccess: Timestamp, createdAt: Timestamp)
    • A priority queue is a suitable structure for choosing the item to discard, with lastAccess as the priority (ascending order). If the priority queue is implemented with a heap, a min-heap keyed on lastAccess would work, though a min-heap also needs a way to update lastAccess in place, so it may not be the best approach.
    • To pick up changed data, a scheduler runs at a fixed interval, finds items that have exceeded their TTL, and discards them.
    • To consider: when the TTL expires on an item whose underlying data has not actually changed, the next request for that keyword triggers a processSearch() that was not strictly necessary.
  6. Scale issue: when the cache grows beyond 10B entries (or more), which single system can bear it? 😫 We can solve this by introducing something like DB sharding (e.g. hashing on the keyword).
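The single-queue dispatch in #3 could be sketched roughly like this (the stub result of processSearch and the object/method names are mine; a real system would use an RPC call or a message broker between machines, not an in-process queue):

```scala
import java.util.concurrent.LinkedBlockingQueue

object SearchCluster {
  // Stand-in for the expensive downstream call from the question;
  // the returned string is just a placeholder.
  def processSearch(query: String): String = s"result-for-$query"

  // Shared request queue: whichever Worker is free takes the next
  // request, so no query is tied to a particular machine.
  val requestQueue = new LinkedBlockingQueue[String]()

  // One Worker iteration: block until a request arrives, then process it.
  // A real Worker would loop on this and send the result back to the caller.
  def workerStep(): String = processSearch(requestQueue.take())
}
```

In production each Worker is a separate machine consuming from a shared broker, but the property we want is the same: any ready Worker can serve any request.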
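For #4 and #5, here is a minimal single-node LRU sketch (class and method names are mine, not from the original). Instead of a min-heap on lastAccess, it leans on Java's LinkedHashMap in access order, which updates recency on every get and evicts the least recently used entry automatically:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// LRU cache for query results. accessOrder = true keeps entries ordered by
// last access, so the eldest entry is the least recently used one;
// removeEldestEntry evicts it once capacity is exceeded.
class QueryCache(capacity: Int) {
  private val map = new JLinkedHashMap[String, String](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
      size() > capacity
  }

  // Look up the cache first; only on a miss run the expensive search.
  def getOrCompute(keyword: String)(search: String => String): String = {
    val cached = map.get(keyword)
    if (cached != null) cached
    else {
      val result = search(keyword)   // processSearch() happens only here
      map.put(keyword, result)
      result
    }
  }

  def contains(keyword: String): Boolean = map.containsKey(keyword)
}
```

This sidesteps the awkwardness of updating lastAccess inside a heap; in the distributed version the same policy would live in the shared cache tier rather than in each Worker.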
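The fixed-interval TTL sweep in #5 might look like this as a pure function (timestamps here are epoch millis rather than the Timestamp type above, and sweepExpired is a name I made up):

```scala
// Mirrors the CacheItem shape from the design, with Long timestamps.
case class CacheItem(keyword: String, result: String,
                     var lastAccess: Long, createdAt: Long)

// Keep only items still within their TTL; expired items are discarded,
// so changed data gets re-fetched on the next request for that keyword.
def sweepExpired(items: List[CacheItem], ttlMillis: Long, now: Long): List[CacheItem] =
  items.filter(item => now - item.createdAt <= ttlMillis)
```

A scheduler would call this (or an in-place equivalent) every interval; the trade-off noted in #5 remains, since unchanged items past their TTL are evicted too.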
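For the sharding idea in #6, the simplest keyword-hash routing is a modulo over the shard count (the shard count is illustrative; consistent hashing would reduce data movement when shards are added or removed):

```scala
// Route each keyword to one of numShards cache shards, so no single
// machine has to hold the whole cache. The same keyword always maps
// to the same shard, which is what makes the lookup deterministic.
def shardFor(keyword: String, numShards: Int): Int =
  math.abs(keyword.hashCode % numShards)
```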