
[System Design and Scalability] interview questions #5 #124

Closed rygh4775 closed 5 years ago

rygh4775 commented 5 years ago

Cache: Imagine a web server for a simplified search engine. This system has 100 machines to respond to search queries, which may then call out using processSearch(string query) to another cluster of machines to actually get the result. The machine which responds to a given query is chosen at random, so you cannot guarantee that the same machine will always respond to the same request. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Be sure to explain how you would update the cache when data changes.

btakeya commented 5 years ago

Requirements

  1. Distributed search processing: requests go to a single computing cluster, and the node that processes a given request can differ each time.
  2. Caching results: processSearch() is an expensive operation, so reduce its cost by introducing a cache.
    • A cache update strategy needs to be defined.
    • (Assumption) The same keyword always produces the same result.

Design

  3. Communicate via a single queue: To satisfy Requirement #1, each request is enqueued onto a request queue, and whichever computing node (called a Worker) is ready takes the next request from the queue.
  4. Introduce a distributed cache: To satisfy Requirement #2, the Workers in the cluster need some way to share keyword-to-result mappings.
    • Since we assume the same keyword always yields the same result, a Worker that receives an already-processed request (whose result is in the cache) should not execute processSearch() but read from the cache: look up the cache first, and only on a miss execute processSearch() and store the result.
    • The size of the cache is limited, of course, so a cache eviction policy is needed. LRU (Least Recently Used) is a good fit, I think: when the cache hits its size limit, discard the least recently used entry. This means keywords that frequently trigger processSearch() requests stay in the cache.
  5. Design for an item in the cache:

     case class CacheItem(keyword: String, result: String, var lastAccess: Timestamp, createdAt: Timestamp)
    • A priority queue is a suitable structure for choosing the item to discard, with lastAccess as the priority (ascending order). If the priority queue is implemented with a heap, a min-heap keyed on lastAccess would work, though a min-heap also needs a way to update lastAccess in place, so it may not be the best approach.
    • To pick up changed data, a scheduler runs at a fixed interval, finds items that have exceeded their TTL, and discards them.
    • To consider: when the TTL expires on an item whose underlying data has not actually changed, the next request for that keyword triggers a processSearch() that was not strictly necessary.
  6. Scale issue: when the cache grows beyond 10B entries (or more), which single system can bear it? 😫 We can solve this by introducing something like DB sharding (e.g. hashing on the keyword).
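The single-queue dispatch in #3 could be sketched roughly like this (the stub result of processSearch and the object/method names are mine; a real system would use an RPC call or a message broker between machines, not an in-process queue):

```scala
import java.util.concurrent.LinkedBlockingQueue

object SearchCluster {
  // Stand-in for the expensive downstream call from the question;
  // the returned string is just a placeholder.
  def processSearch(query: String): String = s"result-for-$query"

  // Shared request queue: whichever Worker is free takes the next
  // request, so no query is tied to a particular machine.
  val requestQueue = new LinkedBlockingQueue[String]()

  // One Worker iteration: block until a request arrives, then process it.
  // A real Worker would loop on this and send the result back to the caller.
  def workerStep(): String = processSearch(requestQueue.take())
}
```

In production each Worker is a separate machine consuming from a shared broker, but the property we want is the same: any ready Worker can serve any request.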
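For #4 and #5, here is a minimal single-node LRU sketch (class and method names are mine, not from the original). Instead of a min-heap on lastAccess, it leans on Java's LinkedHashMap in access order, which updates recency on every get and evicts the least recently used entry automatically:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// LRU cache for query results. accessOrder = true keeps entries ordered by
// last access, so the eldest entry is the least recently used one;
// removeEldestEntry evicts it once capacity is exceeded.
class QueryCache(capacity: Int) {
  private val map = new JLinkedHashMap[String, String](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
      size() > capacity
  }

  // Look up the cache first; only on a miss run the expensive search.
  def getOrCompute(keyword: String)(search: String => String): String = {
    val cached = map.get(keyword)
    if (cached != null) cached
    else {
      val result = search(keyword)   // processSearch() happens only here
      map.put(keyword, result)
      result
    }
  }

  def contains(keyword: String): Boolean = map.containsKey(keyword)
}
```

This sidesteps the awkwardness of updating lastAccess inside a heap; in the distributed version the same policy would live in the shared cache tier rather than in each Worker.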
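The fixed-interval TTL sweep in #5 might look like this as a pure function (timestamps here are epoch millis rather than the Timestamp type above, and sweepExpired is a name I made up):

```scala
// Mirrors the CacheItem shape from the design, with Long timestamps.
case class CacheItem(keyword: String, result: String,
                     var lastAccess: Long, createdAt: Long)

// Keep only items still within their TTL; expired items are discarded,
// so changed data gets re-fetched on the next request for that keyword.
def sweepExpired(items: List[CacheItem], ttlMillis: Long, now: Long): List[CacheItem] =
  items.filter(item => now - item.createdAt <= ttlMillis)
```

A scheduler would call this (or an in-place equivalent) every interval; the trade-off noted in #5 remains, since unchanged items past their TTL are evicted too.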
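For the sharding idea in #6, the simplest keyword-hash routing is a modulo over the shard count (the shard count is illustrative; consistent hashing would reduce data movement when shards are added or removed):

```scala
// Route each keyword to one of numShards cache shards, so no single
// machine has to hold the whole cache. The same keyword always maps
// to the same shard, which is what makes the lookup deterministic.
def shardFor(keyword: String, numShards: Int): Int =
  math.abs(keyword.hashCode % numShards)
```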