nelson-yeh-fy / Adventure

0 stars 0 forks source link

SD News Feed (Instagram) #48

Open sync-by-unito[bot] opened 4 years ago

sync-by-unito[bot] commented 4 years ago

Other News Feed Applications: Facebook, Twitter, Flickr, Pinterest, TikTok


Discussion/Presentation Flow ( R-E-A-D-M-E ):

-R-E- (Requirement and Estimate) 5-10 mins

FRs - Prioritize it and identify expectations

  1. Users can upload, download photos and videos
  2. Users can follow other users
  3. System generates News Feed consisting top photos from people user follows, considers most recent/most like/most popular user's photos.

NFRs - Latency&Throughput / Availability&Consistency / Read&Write ratio

  1. Latency is 200ms for News Feed generation
  2. High Availability, eventual consistency to see photos later is acceptable.
  3. Read most application 1000:1
  4. Back-of-envelope estimation, understand bottlenecks.

(Not in scope):

  1. Users can search photos/videos based on its title
  2. Users can like, comment others' photos/videos/comments
  3. Users can add photo tags, or tag users into photos

-A-D- (API, Data modeling) 15-20 mins

Use case diagram, sketch design

  1. A few box to represent functional interactions.
  2. Separate photos and metadata.

Take FRs/NRFs into account

  1. NFR (Low latency and Read most), 1-1: Memory cache is a candidate (Redis/memCached). 1-2: Read performance is important for follower service, therefore it makes sense to use a key-value cache, or other NoSQL stores.
  2. NFR (High availability), 2-1: Replica, Master-Slave is a candidate. 2-2: Distributed wide column store is a candidate since they have replicas to offer reliability (Cassandra, Dynamo, HBase).
  3. FR (News Feed), Feeds are generated as time passes by, so HBase / Cassandra’s timestamp index is a great fit for this use case.

API: POST/PUT/GET/DELETE /api/REST/1.0/assets/image/content "Content-Type: multipart/form-data; boundary=ExampleFormBoundary", post the image as binary data. { id: int, fullImageUrl: varchar(256), createdAt: datetime, createdBy: int, description: varchar(24) } POST/GET/DELETE /api/REST/1.0/assets/user/{user_id}/followUser { followUserId: int }

Data modeling discuss the flow among various components, e.g.:

  1. Photos: Key: PhotoID(int), Value: UserID(int), PhotoPath(varchar256), Lat(int), Lon(int), , CreateDate(datetime);
  2. Users: Key: UserID(int), Value: Name(varchar50), Email(varchar32), CreationDate(datetime), LastLogon(datetime);
  3. Relationship(user/photo): Key: UserID, Value: PhotoID; or Wide-column stores, discussed later.
  4. Relationship(follower): Key: UserID, Value: FollowUserID, or Graph stores

-M-E- (Main system, Extended system) 10-20 mins

_Divide and Conquer. _

  1. Divide and extend the system architecture, separate it into micro-services
  2. Prevent single point of failure.

Extended system 5-15 mins

  1. S3 to store pictures, and cloudfront to do CDN
  2. Pre-generate news-feed: 1.Pre-fetching, 2.Push instead of pull, 3.Cache first
  3. Sharding
  4. support the celebrity and non-celebrity use-case differently, to reduce I/O and computation costs.
  5. certain photos are so popular that the majority of people read them. This dictates that we can try caching 20% of daily read volume of photos and metadata.

┆Issue is synchronized with this Trello card by Unito ┆Attachments: Instagram Ranking to generate News Feed | image.jpeg | image.jpeg

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

Although a relational database can do almost all the storage work, please remember do not save a blob, like a photo, into a relational database, and choose the right database for the right service. For example, read performance is important for follower service, therefore it makes sense to use a key-value cache.

Feeds are generated as time passes by, so HBase / Cassandra’s timestamp index is a great fit for this use case. Users have relationships with other users or objects, so a relational database is our choice by default in an user profile service.

Column-oriented Store The abstraction of a column-oriented store is like a giant nested map: ColumnFamily<RowKey, Columns<Name, Value, Timestamp>>. The main reason we want to use a column-oriented store is that it is distributed, highly-available, and optimized for write.

Out-of-box choices: Cassandra, HBase, Hypertable, Amazon SimpleDB, etc.

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

[Old one, revised in new comment]

Hey guys, I hope you're doing well and staying safe. I'm working on LC and SD examples recently and doing my homework to prepare future interviews. Covid and the job market is unclear but I want to get myself ready soon. Welcome and Appreciate your suggestions if any, thanks!!

SD Example#1 - Design Instagram:

Requirement is the key I believe, I'd use acronym README to go through the process as below, I guess I should try to lead the discussion, ask questions and narrow down design scope, identify interviewer's expectations asap.

other note: put NFRs (non-functional requirements, especially the scalability) on top of white board for keeping scaling in mind (because scaling can't be an after-thought / at least identify the main scaling problem we're about to have), and then prioritize FRs (functional requirements , 3 items), estimate the exceptions to calibrate NFRs. I'd try to lead the presentation flow if possible.

I have experiences in team/project managements and can talk a few factors about how I discuss and prioritize FRs and NFRs scope, but I feel for the SD part that doesn't need to spend too much time on it I'd think.

Discussion/Presentation Flow ( R-E-A-D-M-E ):

-R- (Requirements) FRs - Prioritize and focus on three items: a. User can upload photos/videos b. User can follow other users c. User's news-feed wall shows most recent/likeness/popular photos.

-E- (Estimate) NFRs - Latency&Throughput / Availability&Consistency / Read&Write ratio a. Latency less then 200ms for fetching news-feed (others' photos/videos) b. High Availability, all photos can't be lost, eventual consistent is acceptable. c. Back-of-envelope estimation, calibrate NRFs.

-A- (API, Naive architecture and APIs) a. A few black-box to represent functional interactions (Use Case diagram maybe) b. API (Class diagram maybe)

-D- (Data Model, to discuss data flow among various components) a. Naive modeling including key-value stores: photos: Key: PhotoID, Document: UserID, Lat, Lon, likeness, upload_date; users: Key: UserID, Document: First Name, Last Name, email, last_logon_date,..; relationship: Key: UserID, Value: PhotoID; Key:UserID, Value: FollowerUserID. b. Check NFRs and refine data modeling, for example:

-M- (Main system) a. Low latency: Key-Value store like Redis/memCached are candidates. b. Highly available: b-1: Replica, Master-Slave, Master-Master and LB b-2: Distributed stores like Cassandra, Dynamo, HBase. c. Wide-column store like Cassandra store timestamps, which is suitable for news feed feature because people read latest news.

-E- (Extended system) a. ELB, S3 to store pictures, cloudfront to do CDN b. HBase, Casandra for high availability and for newsfeed. c. Microservice, divide and conquer. d. Redis extensively, Memcached for caching e. Scoring and Ranking system can be added. Pre-generate news-feed. Story: 1.Pre-fetching, 2.Push instead of pull, 3.Cache first https://instagram-engineering.com/making-instagram-com-faster-part-3-cache-first-6f3f130b9669

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

Deep Dive: Scoring and Ranking system can be added. Pre-generate news-feed.

Story: 1.Pre-fetching, 2.Push instead of pull, 3.Cache first https://instagram-engineering.com/making-instagram-com-faster-part-3-cache-first-6f3f130b9669

Screenshot - 7_28_2020 , 9_3440 PM.png ( https://trello-attachments.s3.amazonaws.com/5f1e4f876a5b017435b16478/5f1f4784600cab8ce9fcfdb9/e6ba40c5467599441c785c7e734e5be6/Screenshot-_7_282020%2C_9_34_40_PM.png )

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

1.4 (Optional) Step 4. Back-of-the-envelope Calculation The final step, estimating how many machines are required, is optional, because time is probably up after all the discussions above and three common topics below. In case we run into this topic, we’d better prepare for it as well. It is a little tricky… we need come up with some variables and a function first, and then make some guesses for the values of those variables, and finally calculate the result.

The cost is a function of CPU, RAM, storage, bandwidth, number and size of the images uploaded each day.

N users 10^10 i images / (user * day) 10 s size in bytes / image 10^6 viewed v times / image 100 d days h requests / sec 10^4 (bottleneck) b bandwidth (bottleneck) Server cost: $1000 / server Storage cost: $0.1 / GB Storage = Nisd Remember the two bottlenecks we mentioned in section 1.3.3 Web Tier? – requests per second (rps) and bandwidth. So the final expression would be

Number of required servers = max(Niv/h, Nivs/b)

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

Reference: https://techtakshila.com/system-design-interview/chapter-4 https://www.youtube.com/watch?v=1sPgogJlKWM https://www.youtube.com/watch?v=_BfMH4GQWnk https://instagram-engineering.com/making-instagram-com-faster-part-1-62cc0c327538 https://instagram-engineering.com/making-instagram-com-faster-part-2-f350c8fba0d4 https://instagram-engineering.com/making-instagram-com-faster-part-3-cache-first-6f3f130b9669

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

REST API for uploading photo/video, remember to use multipart/form-data, this way we don't need to include ampersand(&) and equal symbol to represent name, value pairs. https://stackoverflow.com/questions/4007969/application-x-www-form-urlencoded-or-multipart-form-data

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

Other News Feed Applications: Facebook, Twitter, Flickr, Pinterest, TikTok

Discussion/Presentation Flow ( R-E-A-D-M-E ):

-R-E- (Requirement and Estimate) 5-10 mins1. FRs - Prioritize it and identify expectations a. Users can upload, download photos and videos b. Users can follow other users c. System generates News Feed consisting top photos from people user follows, considers most recent/most like/most popular user's photos.

  1. NFRs - Latency&Throughput / Availability&Consistency / Read&Write ratio a. Latency is 200ms for News Feed generation b. High Availability, eventual consistency to see photos later is acceptable. c. Read most application 1000:1 d. Back-of-envelope estimation, understand bottlenecks.

(Not in scope): a. Users can search photos/videos based on its title b. Users can like, comment others' photos/videos/comments c. Users can add photo tags, or tag users into photos

-A-D-M- (API, Data model, Main system) 15-25 mins3. High-level design a. A few box to represent functional interactions. b. Separate photos and metadata. c. Basic data modeling, understand the flow among various components, e.g.: Photos: Key: PhotoID(int), Value: UserID(int), PhotoPath(varchar256), Lat(int), Lon(int), , CreateDate(datetime); Users: Key: UserID(int), Value: Name(varchar50), Email(varchar32), CreationDate(datetime), LastLogon(datetime); Relationship(user/photo) Key: UserID, Value: PhotoID; or Wide-column stores, discussed later. Relationship(follower) Key: UserID, Value: FollowUserID, or Graph stores d. Basic APIs for the flow. e.g.: REST APIs

  1. Take NRFs into account and revise design a. NFR (Low latency and Read most), a-1: Memory cache is a candidate (Redis/memCached). a-2: Read performance is important for follower service, therefore it makes sense to use a key-value cache, or other NoSQL stores. b. NFR (High availability), b-1: Replica, Master-Slave is a candidate. b-2: Distributed wide column store is a candidate since they have replicas to offer reliability (Cassandra, Dynamo, HBase). c. FR (News Feed), Feeds are generated as time passes by, so HBase / Cassandra’s timestamp index is a great fit for this use case.

5 Divide and Conquer. a. Divide and extend the system architecture, separate it into micro-services b. Prevent single point of failure.

-E- (Extended system, optimization) 5-15 minsa. S3 to store pictures, and cloudfront to do CDN b. Pre-generate news-feed: 1.Pre-fetching, 2.Push instead of pull, 3.Cache first c. Sharding d. support the celebrity and non-celebrity use-case differently, to reduce I/O and computation costs. f. certain photos are so popular that the majority of people read them. This dictates that we can try caching 20% of daily read volume of photos and metadata.

sync-by-unito[bot] commented 4 years ago

➤ Nelson 3513 commented:

receive the message via long poll, web socket, or server-side event