Discussion/Presentation flow ( using acronym R-E-A-D-M-E ):
-R-E- (Requirement and Estimate)5-10 mins
FRs - Prioritize and identify expectations
Given a URL, system generates a short URL (7 digit) for it.
The shortened URL must be unique.
Redirect user to the original URL.
User is able to setup expiration date.
NFRs - Latency&Throughput / Availability&Consistency / Read&Write ratio
High Availability.
Low Latency and Read most (1000:1)
-A-D- (API, Data modeling)5-10 mins
Use case diagram, take FRs & NFRs into accountAPI:
POST/PUT/GET/DELETE /api/REST/1.0/user//shortURL
{
shorten_URL_decoded_id: int
original_URL: varchar(512),
created_at: datatime,
expired_at: datatime
}
Data modeling discuss the flow among various components,
a. shortened URL has no relationship with other URLs, using KV stores is fine.
b. NFR (Low latency and Read most) => KV stores and KV cache.
c. NFR (High availability) => Distributed KV stores
Key: recordID(int), Value: originalURL(varchar(512), created_at: datetime, expired_at: datetime
-M-E- (Main system, Extended system)10-20 mins
Algorithms to generate URL1.Using MD5/SHA, and then Base62 encoding for displaying.
Pros: not rely on a central server to calculate, less complexity; easier scaling.
Cons: the same URL leads to the same MD5, reach a DB becomes a must to check it's existing or not.
Cons: MD5 collision vulnerabilities, we can't use it as DB key because it's not unique. Base62 encoding to take first 8digits for displaying, leads to more possible collisions.
2.Using counter to generate keys,
Cons: single point of failure in that counter service
Cons: only RDBMS can guarantee it'll be unique, NoSQL's BASE can't guarantee, but RDBMS is hard to scale.
3.Creating KGS (Key Generation Service) and Base62 encoding.
Note: using usual incremental DB keys, encode keys as shortened URL.
Pros: No collision, Base62^7 represents 3.5 trillion URLs. (62^8=218 Tri)
Cons: additional complexity & concurrency problem.
Solve the concurrency problem by loading keys into service cache, and move them to 'used keys' immediately.
Extended services and cache and purge expired data
Shard them by the range or by hash (more evenly),
security about guessing the next shortened URL? seems not an issue because URL is supposed to be public. If need, we add random unique seed to make it un-predictable.
**3. client-aside cache and purge expired data, worker service to delete and retrieve keys back.
Analytics, cache to record hit count and batch sync to backend.**
5.How to prevent visiting restricted sites?
a. maintaining a blacklist of hostnames in a KV store. b. scaling techniques like bloom filter.
I ask myself why do people care URL shortener if it's not just an interview question? the reason may includes these:
(My google drawing): https://docs.google.com/drawings/d/1HZM267GFam1retVr4nbwdiLfsQTw8lguiJmPuaK0cqI/edit
Discussion/Presentation flow ( using acronym R-E-A-D-M-E ):
-R-E- (Requirement and Estimate) 5-10 mins
FRs - Prioritize and identify expectations
NFRs - Latency&Throughput / Availability&Consistency / Read&Write ratio
-A-D- (API, Data modeling) 5-10 mins
Use case diagram, take FRs & NFRs into account API: POST/PUT/GET/DELETE /api/REST/1.0/user//shortURL
{
shorten_URL_decoded_id: int
original_URL: varchar(512),
created_at: datatime,
expired_at: datatime
}
Data modeling discuss the flow among various components,
a. shortened URL has no relationship with other URLs, using KV stores is fine.
b. NFR (Low latency and Read most) => KV stores and KV cache.
c. NFR (High availability) => Distributed KV stores
Key: recordID(int), Value: originalURL(varchar(512), created_at: datetime, expired_at: datetime
-M-E- (Main system, Extended system) 10-20 mins
Algorithms to generate URL 1.Using MD5/SHA, and then Base62 encoding for displaying. Pros: not rely on a central server to calculate, less complexity; easier scaling. Cons: the same URL leads to the same MD5, reach a DB becomes a must to check it's existing or not. Cons: MD5 collision vulnerabilities, we can't use it as DB key because it's not unique. Base62 encoding to take first 8digits for displaying, leads to more possible collisions.
2.Using counter to generate keys, Cons: single point of failure in that counter service Cons: only RDBMS can guarantee it'll be unique, NoSQL's BASE can't guarantee, but RDBMS is hard to scale.
3.Creating KGS (Key Generation Service) and Base62 encoding. Note: using usual incremental DB keys, encode keys as shortened URL. Pros: No collision, Base62^7 represents 3.5 trillion URLs. (62^8=218 Tri) Cons: additional complexity & concurrency problem. Solve the concurrency problem by loading keys into service cache, and move them to 'used keys' immediately.
Extended services and cache and purge expired data
Bloom filter https://llimllib.github.io/bloomfilter-tutorial/#:~:text=A%20Bloom%20filter%20is%20a,may%20be%20in%20the%20set.
Ref: https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR https://medium.com/swlh/how-to-build-a-tiny-url-service-that-scales-to-billions-f6fb5ea22e8c https://tianpan.co/notes/84-designing-a-url-shortener https://blog.codinghorror.com/url-shortening-hashes-in-practice/ https://buffer.com/library/url-shorteners/ https://www.youtube.com/watch?v=fMZMm_0ZhK4 Why URL Shortener: https://blog.rebrandly.com/utm-parameters-made-simple-guide-for-marketers/ https://buffer.com/library/utm-guide/
┆Issue is synchronized with this Trello card by Unito ┆Attachments: Tushar Roy - Design a service like TinyUrl | Tech Dummies - tinyurl | bitly system design | image.jpeg | Screenshot - 10_28_2020 , 4_17_01 PM.png