mikeizbicki / cmc-csci143

big data course materials

Final assignment for non-graduating students #352

Closed mikeizbicki closed 7 months ago

mikeizbicki commented 1 year ago

Due date: Sunday, 14 May @ midnight

Background: There are many videos linked below. Each video describes how a major company uses the technology we've covered in class, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to get the gist of them all.

Instructions: The assignment is out of 8 points. To get a point: watch a video and write 3 facts that you learned from the video. Submit the assignment by replying to this post with your facts for all of the videos in a single reply.

NOTE: I realize that many of you have a lot going on right now, and so I won't be offended if you decide to "punt" this assignment. The point-value is intentionally small so that it will have a minimal impact on your grade if you're not able to complete it. That said, I think this is one of the more interesting assignments in this class and so I'd recommend you find time to watch the videos. You can get partial credit on this assignment for watching only some of the videos, and you can get extra credit for watching more than 8.

Videos:

About postgres:

  1. Scaling Instagram Infrastructure

    https://www.youtube.com/watch?v=hnpzNAPiC0E

  2. The Evolution of Reddit.com's Architecture

    https://www.youtube.com/watch?v=nUcO7n4hek4

  3. Postgres at Pandora

    https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38

  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    https://www.youtube.com/watch?v=BgcJnurVFag

  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    https://www.youtube.com/watch?v=4GB7EDxGr_c

  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    https://www.youtube.com/watch?v=eZhSUXxfEu0

  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    https://www.youtube.com/watch?v=kd-F0M2_F3I

  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6

  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

    https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

About miscellaneous devops stuff:

  1. Mastering Chaos - a guide to microservices at netflix

    https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s

  2. Why Google Stores Billions of Lines of Code in a Single Repository

    https://www.youtube.com/watch?v=W71BTkUbdqE

    (If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)

  3. Kubernetes (abbreviated k8s) is like docker-compose on steroids. The documentary covers the history of docker and k8s and some of the technical differences between the two. The k8s documentary has 2 parts, and you must watch both parts for it to count as a single video.

    https://www.youtube.com/watch?v=BE77h7dmoQU

    https://www.youtube.com/watch?v=318elIq37PE

Aser-Abdelfatah commented 1 year ago

1. Why Google Stores Billions of Lines of Code in a Single Repository

2. Mastering Chaos - a guide to microservices at netflix

3. Scaling Instagram Infrastructure

4. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

5. The Evolution of Reddit.com's Architecture

6. Kubernetes

7. Data modeling, the secret sauce of building & managing a large scale data warehouse

8. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

emjuliet commented 1 year ago

1. The Evolution of Reddit.com's Architecture: 1 point

KaranGoel1 commented 1 year ago
  1. Scaling Instagram Infrastructure
  2. Reddit’s infrastructure
  3. PostgreSQL at Pandora
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
  6. Breaking PostgreSQL at Scale
  7. Citus: Postgres at any Scale
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
  10. Microservices at Netflix
  11. Why Google Stores Billions of Lines of Code in a Single Repository
ifreer23 commented 1 year ago
  1. Scaling Instagram Infrastructure

    • In 2017 when this was filmed, Instagram had 400 million daily users, 4 billion likes per day, and 100 million photos/videos uploaded per day, and it had scaled 4x over the prior three years.
    • Guo explains how adding indexes to a denormalized table sped up counting likes by orders of magnitude compared to the previous query (see the sketch after this list).
    • When implementing continuous profile collection, there is a tradeoff between the cost of collecting the data and the visibility Instagram gets into their code base. It greatly improves their productivity when debugging performance issues.
  2. The Evolution of Reddit.com’s Architecture

    • In 2017 when this was filmed, Reddit had 320 million users and, every day, 1 million posts, 5 million comments, 75 million votes, and 70 million searches.
    • r2 is a big Python “blob” that is the oldest single component of Reddit, started in 2008. It is currently being split up into various backend services.
    • Having many rows in the data table per “Thing” object allows Reddit to make changes to the site without having to alter a table in production.
  3. PostgreSQL at Pandora

    • Pandora originally used Oracle, but Oracle wanted too much money, so one Friday they closed as an Oracle shop and the following Monday they opened as a Postgres shop. Wanting to go open source was another motivating factor for them too.
    • The Music Meta Data class of Pandora is a massively redundant set of read-only databases.
    • Over the hundreds of pg_dump runs they have done, they've had a 99% success rate. The failures are typically caused by running out of space.
  4. PostgreSQL at 20 TB and Beyond

    • Adjust is a market leader in mobile advertisement attribution, acting as a referee in pay-per-install advertising.
    • In 2017, Adjust handled 100,000 requests per second, tracked over 2 trillion data points, and had over 400 TB of data being analyzed.
    • The Materializer is essentially an incremental map-reduce job: it aggregates new events, copies the aggregations to the shards, runs every few minutes, and uses only new data.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud

    • If someone comes to you and says Postgres doesn't scale out, they're lying!
    • Leboncoin has 28.1 million unique visits, 27 million classified ads online, and 800,000 new ads every day.
    • One of their hardware servers, made by Hewlett Packard, has 3 TB of RAM.
  6. Breaking Postgres at Scale

    • PostgreSQL can handle databases of any size
    • The largest community PostgreSQL database Christophe Pettus has worked on was multiple petabytes.
    • On a small database in PostgreSQL (about 10 GB), it is hard to go wrong, as nearly everything will run fast, even using sequential scans for everything.
  7. Citus: Postgres at any Scale

    • Citus provides distributed database superpowers to PostgreSQL through distributed tables with co-location, reference tables, query routing, and more.
    • Marco Slot stated that it is best to use Citus when managing 100GB and beyond.
    • Also, Citus is useful for multi-tenant applications (any SaaS) or real-time analytics dashboards.
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse

    • Microsoft uses Citus to provide insights into customer experience, like Windows diagnostic data from an audience of 1.5 billion devices.
    • A measure is time-series data about a specific use case executed on a Windows 10 or Windows 11 device.
    • Some technical challenges with device-centric aggregation are the large number of concurrent queries (10 million queries, 2 million distinct queries per day) and the rich dimension and metric data types.
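
The denormalized like-counter idea from the Instagram talk (referenced in item 1 above) can be sketched in plain SQL. The talks don't share real schemas, so the table and column names below are hypothetical:

```sql
-- Hypothetical raw likes table: counting likes here means scanning many rows.
CREATE TABLE likes (
    media_id bigint NOT NULL,
    user_id  bigint NOT NULL,
    liked_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (media_id, user_id)
);

-- Denormalized counter: one row per media object, kept up to date on write,
-- so reads become a single indexed lookup instead of a count over many rows.
CREATE TABLE media_like_counts (
    media_id   bigint PRIMARY KEY,
    like_count bigint NOT NULL DEFAULT 0
);

-- On each new like, bump the counter (a sketch; a real system also handles unlikes).
INSERT INTO media_like_counts (media_id, like_count)
VALUES (1234, 1)
ON CONFLICT (media_id) DO UPDATE
SET like_count = media_like_counts.like_count + 1;

-- Reading the count is now a single primary-key lookup:
SELECT like_count FROM media_like_counts WHERE media_id = 1234;
```

The write path touches two tables per like, but the hot read path becomes a single-row lookup instead of a count over millions of rows.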
sophiahuangg commented 1 year ago

PostgreSQL at Pandora

  1. Out of all the pg_dump commands they've run, they've had about a 99% success rate, which is extremely impressive.
  2. They've had scaling issues with vacuums, which can run for more than a week per table.
  3. A ton of data: they have more than 1,650 Postgres clusters and 335 TB of data.

Scaling Instagram Infrastructure

  1. 400 million users visit Instagram each day, and the top account has 110+ million followers (2017).
  2. Instagram had scaled 4x compared to three years earlier.
  3. Unit testing is important for checks and balances; it determines whether they roll something out.

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. They chose Citus because it has a good SQL engine, integrated into Postgres.
  2. The speaker works in the Windows data engineering group, collecting data from 1.5 billion devices to build dashboards and learn about customer experience.
  3. They use a "measure" data schema that is time-series data.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. As of January 2015, the Google repo had 1 billion files, 2 billion lines of code, 45k commits per workday, and a history of 35 million commits.
  2. The advantage of having one repository is "one source of truth": you never need to wonder where the authority comes from, and there are lots of useful libraries that don't need to be reinvented.
  3. A single repository may make it too easy to add dependencies; some teams forget to think about the implications, which leads to problems that end up breaking things.

The Evolution of Reddit.com's Architecture

  1. Reddit needed to improve its migration process and add more peer reviews. They can't use the same autoscaler technology for stateful services as for stateless services.
  2. The architecture seems super interesting. Their diagram is still a work in progress (as of 5 years ago), and at the middle is r2, which has been Reddit since 2008.
  3. Frontend engineers at Reddit built out more modern frontend apps in Node, which act as API clients themselves.

Breaking PostgreSQL at Scale

  1. More memory does not help write performance. The "rule" is to try to fit the largest 1-3 indexes in memory.
  2. Not all indexes will be good/helpful. Adding 7 indexes to a table will slow insert times by about a factor of 15.
  3. You can partition your data (for example, time-series data) to divide the table into more manageable chunks. The data needs a strong partitioning key: a relatively invariant value that almost never changes but is used frequently in queries (sketch below).
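
A minimal sketch of the time-series partitioning idea in the last point, using Postgres declarative range partitioning; the `events` table and its columns are made up for illustration:

```sql
-- Hypothetical time-series table partitioned by month; the partition key
-- (created_at) is set once at insert time and essentially never changes.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (event_id, created_at)   -- the PK must include the partition key
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2023_04 PARTITION OF events
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');
CREATE TABLE events_2023_05 PARTITION OF events
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');

-- Queries that filter on created_at only touch the relevant partitions.
SELECT count(*) FROM events
WHERE created_at >= '2023-05-01' AND created_at < '2023-05-08';
```

Expiring a month of old data then becomes dropping one partition rather than running a slow bulk DELETE.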

Mastering Chaos - A Netflix Guide to Microservices

  1. A stateless service is not a cache or database: you're not storing data, and instances can auto-scale like mitosis.
  2. Netflix has a production-ready checklist with some sort of automation behind every task, such as staging deployments, so you don't push out bad code.
  3. Interesting because they mention their issues arose from the codebase being monolithic, yet this approach seems to work well for Google.

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

  1. All growth was done while running on Postgres.
  2. Scaling at an early stage with Heroku was super easy; it was mostly a matter of creating the right primary indexes.
  3. In the beginning, worry about creating the right product and finding product-market fit before devoting time to optimization and scaling.
kanaluM commented 1 year ago

Scaling Instagram Infrastructure

The Evolution of Reddit.com's Architecture

Postgres at Pandora

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

Citus: Postgres at any Scale (Citus)

Data modeling, the secret sauce of building & managing a large scale data warehouse

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

Mastering Chaos - a guide to microservices at netflix

Why Google Stores Billions of Lines of Code in a Single Repository

Kubernetes Documentary (both parts)

laurenleadbetter commented 1 year ago

Scaling Instagram Infrastructure:

The Evolution of Reddit.com's Architecture

Mastering Chaos - A Netflix Guide to Microservices

Why Google Stores Billions of Lines of Code in a Single Repository

PostgreSQL at Pandora

Breaking PostgreSQL at Scale

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018

kushpetal commented 1 year ago

(11)

1. Scaling Instagram Infrastructure

2. The Evolution of Reddit.com’s Architecture

3. Postgres at Pandora

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

5. Large Databases, Lots of servers, on Premises, in the Cloud - Get Them All!

6. Breaking Postgres at Scale

7. Citus: Postgres at any scale

8. Data Modeling, the secret sauce of building & managing a large scale data warehouse

9. Lessons learned scaling our SaaS on Postgres to 8+ billion events

10. Mastering Chaos - a guide to mastering microservices at Netflix

11. Why Google Stores Billions of Lines of Code in a Single Repository

JSanders24 commented 1 year ago
  1. Scaling Instagram Infrastructure
  2. The Evolution of Reddit.com's Architecture
  3. Postgres at Pandora
  4. PostgreSQL at 20TB and Beyond
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
  6. Breaking PostgreSQL at Scale
  7. Citus: Postgres at any Scale
  8. Data Modeling, the secret sauce of building and managing a large scale data warehouse
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
  10. Mastering Chaos
  11. Why Google stores billions of lines of code in one repository
  12. Kubernetes
BITEEE0308 commented 1 year ago

(4)

  1. Scaling Instagram Infrastructure
  2. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
  3. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
  4. The Evolution of Reddit.com's Architecture
pmukneam commented 1 year ago
  1. Scaling Instagram Infrastructure

    • Instagram uses Postgres to store user, media, and friendship data. They use Cassandra (NoSQL) to store user feeds, activities, etc. On the computing side, they use the Django framework, RabbitMQ with Celery, and memcached.
    • They use a denormalized table to reduce the runtime of queries such as counting the number of likes on a post. It still took too much time, so they then used memcache leases to address the issue.
    • Interestingly, they develop on a single master branch rather than multiple branches.
  2. The Evolution of Reddit.com's Architecture

    • Reddit is mostly written in Python, using common libraries. They also use Node.js for front-end development. Additionally, they use Cassandra/Postgres for storage.
    • There was a problem with the queue system: adding more processors made the queue for the lock longer and longer, so with the same number of processors they 'partition' the queue instead.
    • They use a denormalized listing of a comment tree to store the parent comment and child comments in one place, so they don't have to look up the whole tree, only the specific parts they want to display.
  3. Postgres at Pandora

    • They have an alert system to monitor 'out of norm' behaviors in their tables.
    • The original replication was really slow and done sequentially. Now they are using streaming replication. Note that this can only be done between servers running the same version of Postgres.
    • They implemented their own system, called Clustr, to investigate their problems; this architecture is used for sharded databases. They use it when migrating/updating their database without Postgres downtime. However, some features, such as JOIN operations, won't be available, and there is additional space overhead.
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    • Apparently, there are not many people who have both a deep understanding of databases and practical skills, which poses challenges for the industry.
    • It's a really challenging process to widen integer columns (e.g., from 32-bit to 64-bit) in your database, especially if you are dealing with big data.
    • Small errors or things you overlooked can become significant when dealing with big data. For example, wasting 10 percent of your space might not sound like much, but if it's 10 percent of 400 TB, that's 40 TB, which is not good.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    • They automate their servers in the cloud via netbox, AWS CloudFormation, Colins, and Puppet (post-install automation).
    • Good hardware is essential for a database. Pay for good hardware to avoid data corruption and other problems.
    • They use streaming replication. When using pg_dump, you need to make sure there are no errors in other parts of your system, such as the automation system.
  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    • Postgres can handle any size of database. With a small database of around 10 GB, the whole database should just fit in memory. You can use pg_dump to back up your database. And when doing an upgrade, just do it; the longer you wait, the more complicated it gets.
    • When the database gets bigger, beyond comparing the size of the database to memory, try to fit the top 3 largest indexes in memory. Note that more memory doesn't improve write speed. At this point, pg_dump is not efficient enough; one option is pgBackRest.
    • When the database is 1 TB and above, you can't fit all the data in memory anymore, so just get as much memory as possible. VACUUM takes a really long time, but do wait for it to complete, and DO NOT turn off autovacuum.
  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    • Citus is a Postgres extension that helps scale a Postgres database. Citus parallelizes many regular Postgres commands, such as COPY. Consider using Citus when the database is bigger than 100 GB. It uses hash partitioning: it calculates a hash for incoming data and distributes the data to the appropriate shard.
    • Note that if a query cannot be sped up with parallelism, Citus won't help much and will just do what Postgres normally does.
    • To migrate to Citus, just add tenant ID columns to the large tables and use reference tables for the other tables, then use the tenant ID to filter queries (see the sketch after this list).
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    • Windows Data Mesh uses VeniceDB, which is built on Postgres.
    • Windows needs to handle over 10 million queries per day, which poses a technical challenge.
    • They also use partial covering indexes; they normally accumulate 50+ indexes per table.
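
A rough sketch of that tenant-ID migration path (referenced in item 7 above), assuming the citus extension is installed; the `events` and `plans` tables and their columns are hypothetical:

```sql
-- Every large table gets a tenant_id column and is distributed on it,
-- while small shared tables become reference tables.
CREATE TABLE events (
    tenant_id  bigint NOT NULL,
    event_id   bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb,
    PRIMARY KEY (tenant_id, event_id)   -- must include the distribution column
);

CREATE TABLE plans (
    plan_id bigint PRIMARY KEY,
    name    text NOT NULL
);

-- Citus functions, available once CREATE EXTENSION citus; has been run:
SELECT create_distributed_table('events', 'tenant_id');  -- hash-distributed on tenant_id
SELECT create_reference_table('plans');                   -- replicated to every node

-- Queries that filter on tenant_id are routed to a single shard:
SELECT count(*) FROM events WHERE tenant_id = 42;
```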
verynicocool commented 1 year ago

The Evolution of Reddit.com's Architecture

  1. Reddit's architecture includes a monolithic application called "r2," which has been the core of Reddit since around 2008. R2 is a large Python codebase and acts as the original monolithic application that powers Reddit.

  2. Reddit's front-end engineers have been building modern front-end applications using Node.js. These applications act as API clients and communicate with the APIs provided by Reddit's API gateway or r2 itself.

  3. Reddit uses a combination of technologies for data storage and caching. The core data model of Reddit, called "thing," is stored in PostgreSQL with memcache in front of it. Additionally, Reddit heavily utilizes Cassandra, a distributed database, for storing new features and ensuring high availability.

Scaling Instagram Infrastructure

  1. Instagram initially started running on AWS but later moved its service completely inside Facebook's data center to take advantage of scaling services. However, when Facebook conducted drills called storms, Instagram was not able to operate in multiple data centers.

  2. Instagram focused on optimizing CPU demand to make each server more efficient. They collected data on CPU instruction usage by different endpoints and implemented tools to monitor and analyze this data.

  3. Instagram identified that code itself was a significant contributor to memory usage. They took steps to reduce the amount of code in memory by optimizing and removing dead code.

PostgreSQL at Pandora

  1. Pandora initially used Oracle as their database system but switched to PostgreSQL due to financial constraints. They wanted to go open source and scale out, which was not compatible with the costly Oracle solution.

  2. To manage their PostgreSQL instances and clusters effectively, Pandora implemented a set of predefined DBMS classes. Each class was defined by its PostgreSQL configuration and the authentication file.

  3. Pandora uses replication for high availability. They have local replicas and remote disaster recovery replicas. The standard procedure is to failover to the local replica, and only in cases where both the primary and local replica are affected, they switch to the disaster recovery replica.

Large Databases, Lots of Servers, on Premises, in the Cloud

  1. The speaker mentions that Postgres has been able to scale out since 2008-2009, and they have successfully scaled Postgres using external tools like Slony for replication and load balancing. They highlight that Postgres can handle large databases and lots of servers, supporting web-scale applications.

  2. The speaker discusses the infrastructure of Leboncoin, mentioning that they have 70 database servers distributed across two data centers and the cloud. They indicate that they have a 3-terabyte live database, which is one of the largest databases at Leboncoin, along with other larger databases.

  3. The speaker emphasizes the importance of backups and high availability for database systems. They mention using pg_dump for nightly backups, encrypting the dumps and storing them off-site in the cloud using AWS Storage Gateway.

Citus PostgreSQL at any Scale

  1. Citus is an open-source extension to PostgreSQL that transforms it into a distributed database. It allows you to create distributed tables where data is spread across multiple Postgres servers, enabling you to handle tables of any size. Users have successfully managed tables of up to a petabyte in size with Citus.

  2. Citus introduces the concept of reference tables, which are replicated to every server in the Citus cluster. Although writes to reference tables can be slower due to the need to update all servers, they offer benefits such as the ability to create foreign keys and perform joins with distributed tables.

  3. Citus provides several superpowers to PostgreSQL, including the ability to route queries to the appropriate node, high-performance parallelism for commands like COPY, SELECT, INSERT, UPDATE, and DELETE, scaling out stored procedure calls, and support for distributed transactions.

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. The data schema used in the data warehouse follows a dimensional modeling approach. This approach organizes data into dimensions and facts, representing the relationships between various data elements.

  2. To optimize query performance, the data warehouse utilizes partial covering indexes. These indexes are created on the reporting tables and are designed to handle the high dimensionality of the data.

  3. The data warehouse leverages extended data types provided by the Citus extension for Postgres. Examples of these extended data types mentioned include HyperLogLog and T-Digest.

Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. Sharding tables by customer ID and distributing the data across multiple database nodes is key for maintaining high performance at scale. The Citus Postgres extension makes this process easy and is recommended for scaling applications with analytics on Postgres.

  2. Clearing out old stale data that is infrequently accessed from hot storage can optimize margins and improve report performance. By periodically removing historical event data that is no longer frequently accessed, ConvertFlow was able to reduce storage needs and work with less data in hot storage.

  3. As ConvertFlow scaled to billions of events, they encountered the limitation of the 2.1 billion max value for Postgres' integer column, which was used as the default ID primary key. To overcome this, they worked with Citus to migrate their analytical tables' primary keys to the big integer column type, ensuring continued scalability for their database.
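
A small sketch of the integer-overflow fix described in point 3; the `events` table here is hypothetical, and on a multi-billion-row production table this rewrite would have to be planned carefully rather than run as a single blocking statement:

```sql
-- A serial (32-bit integer) primary key tops out at 2,147,483,647 values.
CREATE TABLE events (
    id      serial PRIMARY KEY,   -- int4; will eventually overflow
    payload jsonb
);

-- Widening the column and its sequence removes the 2.1 billion limit.
ALTER TABLE events ALTER COLUMN id TYPE bigint;
ALTER SEQUENCE events_id_seq AS bigint;
```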

SybelFrancois commented 1 year ago

Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

  1. The Windows Data Engineering group uses Citus, a scalable PostgreSQL extension that handles high-cardinality data and can run complex queries.
  2. The group prefers dynamic computation because they run large-scale queries that require high computational and storage resources; pre-computing or materializing the query results would use excessive compute time and storage.
  3. The use of special data structures, such as HyperLogLog, can efficiently aggregate data across hundreds of millions of devices. In addition, for data storage and organization, JSON and JSONB are efficient for staging tables due to their flexibility.

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

  1. Adopting a rollup strategy for the most commonly accessed reports in a dashboard can be beneficial. Incrementing lifetime statistics for campaign-use, conversion, and conversion-rate reports lets you fetch a pre-calculated report to visualize the totals for a customer's campaigns on a single page without making dozens of separate queries against the events table (see the sketch after this list).
  2. Clearing out data from hot storage optimizes margins and improves the speed of reports, since less data sits in hot storage. Relatedly, mitigating index bloat by reindexing tables (concurrently) can reduce storage impact, make maintenance operations easier, improve margins, and avoid locking up production tables.
  3. Horizontal scaling is a strategy for handling big data; the Hyperscale (Citus) offering in Azure can add more worker nodes to a Postgres database as it grows, enabling it to continue processing billions of events.
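
A minimal sketch of the rollup idea referenced in point 1, with hypothetical table and column names; the real ConvertFlow rollups are surely more elaborate:

```sql
-- Pre-aggregated counters per campaign, incremented as events arrive,
-- so the dashboard never has to recompute totals from the raw events table.
CREATE TABLE campaign_rollups (
    campaign_id bigint PRIMARY KEY,
    views       bigint NOT NULL DEFAULT 0,
    conversions bigint NOT NULL DEFAULT 0
);

-- Run per event (or periodically in batches):
INSERT INTO campaign_rollups (campaign_id, views, conversions)
VALUES (7, 1, 0)
ON CONFLICT (campaign_id) DO UPDATE
SET views       = campaign_rollups.views + EXCLUDED.views,
    conversions = campaign_rollups.conversions + EXCLUDED.conversions;

-- The dashboard reads one pre-calculated row per campaign:
SELECT campaign_id, views, conversions,
       CASE WHEN views > 0 THEN conversions::numeric / views ELSE 0 END AS conversion_rate
FROM campaign_rollups
WHERE campaign_id = 7;
```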

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! — Flavio Gurgel

  1. Postgres is a highly scalable database system, which they have used to scale out from 2009 to the present, accommodating a growing number of servers, users, and data. It can handle web-scale operations and can be expanded to accommodate even more servers if necessary.
  2. To ensure the reliability and integrity of a large database, they use servers with 3 TB of RAM, keep all hardware under warranty, and use RAID 10 for the disks. This goes along with investing in high-quality hardware: battery-backed cache, ECC (error-correcting code) RAM, and dual network cards.
  3. To back up PostgreSQL databases, nightly backups with the pg_dump tool can be performed, using custom or directory mode depending on the size of the database. These backups are encrypted and sent off-site to the cloud for additional security and disaster recovery. In addition, AWS Storage Gateway enables a backup system in each data center that uses the same S3 bucket, making all PostgreSQL dumps available across the infrastructure.

Scaling Instagram Infrastructure

  1. The backend stack of Instagram is distributed to different data centers with the servers categorized into storage and computing. Storage servers store global data with replication, while computing servers are stateless and process user requests. Its backend (utilizing both C++ and Python) consists of the web tier that runs Django with Python, which receives user requests and accesses various back-end storage or services.
  2. Instagram uses a continuous integration (CI) process for development, meaning engineers collaborate easily by working on a single master branch with feature flags, allowing for easier upgrades, refactoring, and performance data analysis. Instagram has a monitoring and alerting system called "Extend" that helps discover problems quickly and sometimes requires reverting to previous versions. Instagram continuously rolls out the master repository whenever a diff is checked in, resulting in around 40-60 rollouts a day. A typical commit goes out within an hour of landing on the master branch.
  3. Instagram uses Cassandra to store user feeds, activities, and more. It has no master server, and all replicas have the same copy of data with eventual consistency. Also, cProfile is a tool used to deep-dive into the codebase to understand the impact of specific functions on code performance.

The Evolution of Reddit.com's Architecture

  1. Reddit faced an issue where cached listings were referring to items that didn't exist in Postgres, causing pages to crash, which they solved by building a tool to clean up these listings and remove any bad data.
  2. Reddit comments are stored in a threaded manner. Since this can be complicated to render, they store the parent relationships of the whole tree in one place for efficiency. However, comment processing, especially for large threads, can slow down the site. They developed a system to allow such threads to get dedicated processing, but this created other bugs.
  3. They use Autoscaler, which watches utilization metrics reported by load balancers and automatically increases or decreases the number of servers requested from AWS, allowing the site to scale according to demand. But later, in 2016, they had problems during a migration from EC2 classic into VPC, which involved the Autoscaler. They have since built a next-gen Autoscaler using lessons from these past mistakes.

PostgreSQL at Pandora

  1. Clustr, from Pandora, is a non-ACID, high-availability database system that compromises consistency for availability, designed for sharded databases.
  2. When data is written or read, Clustr sends the request to each database in the group (two or more) and then returns the data to the application. If there is a discrepancy in the data from each database, a read-reconciliation process is applied. Pandora has at least 1,650 Postgres clusters, containing at least 335 terabytes of data, with the advertising-supported radio application alone growing by two terabytes per month.
  3. Clustr has some scaling issues, particularly with Write-Ahead Logging (WAL) files and the vacuum process.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. They use custom PostgreSQL data types to optimize data storage. For instance, they have a custom data type that allows any country to be represented in a single byte, which can save significant storage space over billions of rows.
  2. They use another one called istore for modeling sparse integer arrays, which offers many of the benefits of columnar storage but in a row-oriented format. istore supports GIN (Generalized Inverted Index) indexing, allowing efficient querying of membership in an object.
  3. Postgres's autovacuum feature can be used to manage garbage collection, but the default settings have to be adapted for large-scale environments. The Postgres source code is written for humans to read, which assists in developing extensions.

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

  1. Autovacuum should never be turned off completely, but in some cases manual vacuuming of specific tables may be necessary in addition to autovacuum.
  2. Partial indexes can be useful for specific queries, especially in cases where only a small subset of the data is actively used. Also, monitoring index usage with pg_stat_user_indexes can help identify unused indexes that can be safely dropped (see the sketch after this list).
  3. Upgrading a larger database may require logical-replication-based upgrades or tools like pg_upgrade, depending on the complexity of the schema and the data size. Additionally, as a database grows, traditional full backups may become impractical, and alternatives like file system snapshots or SAN-based snapshots should be considered.
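
A short sketch of the partial-index and index-usage ideas referenced in point 2; the `orders` table and its columns are hypothetical:

```sql
-- Hypothetical table where only a small fraction of rows are ever 'pending'.
CREATE TABLE orders (
    order_id    bigint PRIMARY KEY,
    customer_id bigint NOT NULL,
    status      text   NOT NULL
);

-- Partial index: only the pending rows are indexed, so the index stays small
-- and writes to non-pending rows never have to maintain it.
CREATE INDEX orders_pending_idx
    ON orders (customer_id)
    WHERE status = 'pending';

-- Finding indexes that are never used (candidates to drop safely):
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
```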
kevinm126 commented 1 year ago

1. Scaling Instagram Infrastructure

  1. Instagram uses multiple data centers spread out over different parts of the world in order to improve scalability and reliability.
  2. Instead of a branch-management model for source control, Instagram uses a single-master approach, which requires continuous integration to keep the one master branch up and working.
  3. In order to reduce the number of CPU instructions Instagram carries out, they rewrote some of their extensively used and stable functions in C/C++, which made those code paths less costly.

2. The Evolution of Reddit.com's Architecture

  1. One of the problems Reddit had with their listings was that the order of a listing was most often invalidated when someone voted on it. Their solution was to create a denormalized index of links that is updated in response to voting, but it is no longer a cache as it originally was.
  2. When a thread on Reddit receives a high volume of comments, the site can become very slow due to the tree-like nature of Reddit comment threads. To solve this, such threads can be manually picked out to get their own dedicated queue called the "Fastlane".
  3. r2, the oldest Reddit component, built in 2008, is now being split up into a number of backend services as opposed to the original blob.

3. Postgres at Pandora

  1. Pandora originally used pg_dump for its backups but switched over to pg_basebackup because pg_dump had a number of issues, even though it had a close to 100% success rate.
  2. Pandora uses replication to enable a high-availability solution and has a process by which databases can fail over to their replicas.
  3. One of the drawbacks of using Clustr is that five extra metadata fields are added to each row, adding overhead.

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

  1. They use the MapReduce paradigm to move data from the backend servers to the analytics shards, and then again from the analytics shards to the clients.
  2. They create custom data types in order to save about 8 bytes per row, which adds up over millions of rows in the long run.
  3. In order to solve their problems with autovacuum, Adjust changed the trigger from the default of 50 rows plus 20% of the table being dead to 150k rows plus 0% of the table (sketch below).
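
The per-table override for that autovacuum change might look roughly like the following; the talk doesn't show exact commands, and the `events` table name is made up:

```sql
-- Default autovacuum trigger: 50 dead rows plus 20% of the table.
-- For a very large table, 20% is far too many dead rows before a vacuum runs,
-- so the change described above corresponds to a per-table override like this:
ALTER TABLE events SET (
    autovacuum_vacuum_threshold    = 150000,
    autovacuum_vacuum_scale_factor = 0
);
-- The same settings can also be applied globally in postgresql.conf.
```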

5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

  1. Small changes in optimization can have a substantial effect on performance when working with large amounts of data.
  2. pg_dump is a built-in PostgreSQL tool that facilitates backing up Postgres databases.
  3. PostgreSQL is scalable and has progressed rapidly since the early 2000s to get to that point.

6. Breaking Postgres at Scale

  1. Small databases are easy to work with and don't present the processing challenges and hard problems that large databases do. Often the best course of action with a small database is the most basic possible option, which still works for the database at hand.
  2. For databases bigger than what fits in memory, a rule of thumb is to set effective_cache_size to be bigger than your largest index (see the sketch after this list).
  3. Once you are working with an extremely large database, incremental backup becomes very important, but shared_buffers should not be increased drastically, as it will not significantly increase performance.
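
A sketch of the effective_cache_size rule of thumb referenced in point 2; the value shown is made up and should be sized to the actual machine:

```sql
-- effective_cache_size is only a planner hint about how much data the OS and
-- Postgres can cache; it does not allocate memory.
ALTER SYSTEM SET effective_cache_size = '48GB';
SELECT pg_reload_conf();   -- this setting does not require a restart

-- Checking the largest indexes, to sanity-check the rule of thumb:
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 5;
```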

7. Citus: Postgres at any Scale

  1. Citus can be used as a document/key-value store with parallel queries.
  2. Citus allows distributed tables to be co-located with other distributed tables, which allows foreign keys between distributed tables on their distribution columns.
  3. One of the benefits of using Citus for multi-tenant applications is that there is no hard limit on CPU, memory, or storage.

8. Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. One of Microsoft's use cases for Citus is processing insights on customer experience, which are laid out and represented in dashboards.
  2. They use JSON for staging tables and hstore for dynamic columns in reporting tables.
  3. Citus is a great distributed SQL execution engine, seamlessly integrated into Postgres.

9. Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. All the growth and scaling done at the company was done while running on Postgres.
  2. It is possible to over-optimize.
  3. Postgres is extremely scalable.
  4. It's important to find product-market fit before scaling your platform.

10. Mastering Chaos - a guide to microservices at netflix

  1. The microservice architectural style is an approach to developing an application as a suite of small services
  2. Microservices are an abstraction
  3. Microservices allow for horizontal scaling.

11. Why Google Stores Billions of Lines of Code in a Single Repository

  1. The size of the content of Google's monolithic repository was 86 terabytes at the time of filming.
  2. There are 45 thousand commits per workday to Google's repository.
  3. Changes to any directory must go through a code review before being committed; this helps keep the repository sane.