Closed mikeizbicki closed 7 months ago
1. Why Google Stores Billions of Lines of Code in a Single Repository
2. Mastering Chaos - a guide to microservices at Netflix
3. Scaling Instagram Infrastructure
4. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
5. The Evolution of Reddit.com's Architecture
6. Kubernetes
7. Data modeling, the secret sauce of building & managing a large scale data warehouse
8. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
1. The Evolution of Reddit.com's Architecture: 1 point
In 2012, their queues started getting really backed up during peak traffic, which delayed votes and scores.
2. Postgres Open Silicon Valley 2017: 1 point
Their pg_dump was slow to generate and even slower to restore.
3. PostgreSQL at 20TB and Beyond: 1 point
They use istore, which is like hstore but for integers.
4. Citus Con: 1 point
A typical Windows measure could have data points from 200M devices per day.
5. Citus Con 2022: 1 point
Their long-term solution was to use Citus to scale their Postgres database.
6. Why Google Stores Billions of Lines of Code in a Single Repository: 1 point
The Google workflow works like this: sync the user's workspace to the repo, then code is written, reviewed, and committed.
7. Kubernetes: The Documentary: 1 point
Docker got immediate buzz and the vision was about building tools for mass innovation.
8. Mastering Chaos: 1 point
Scaling Instagram Infrastructure
The Evolution of Reddit.com’s Architecture
PostgreSQL at Pandora
PostgreSQL at 20 TB and Beyond
Large Databases, Lots of Servers, on Premises, in the Cloud
Breaking Postgres at Scale
Citus: Postgres at any Scale
Data modeling, the secret sauce of building & managing a large scale data warehouse
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
Postgres at Pandora
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
Citus: Postgres at any Scale (Citus)
Data modeling, the secret sauce of building & managing a large scale data warehouse
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
Mastering Chaos - a guide to microservices at Netflix
Why Google Stores Billions of Lines of Code in a Single Repository
Kubernetes Documentary (both parts)
Scaling Instagram Infrastructure:
The Evolution of Reddit.com's Architecture
Mastering Chaos - A Netflix Guide to Microservices
Why Google Stores Billions of Lines of Code in a Single Repository
PostgreSQL at Pandora
Breaking PostgreSQL at Scale
Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale - Chris Travers - FOSSASIA 2018
(11)
1. Scaling Instagram Infrastructure
2. The Evolution of Reddit.com’s Architecture
3. Postgres at Pandora
4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
5. Large Databases, Lots of servers, on Premises, in the Cloud - Get Them All!
6. Breaking Postgres at Scale
7. Citus: Postgres at any scale
8. Data Modeling, the secret sauce of building & managing a large scale data warehouse
9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
10. Mastering Chaos - a guide to microservices at Netflix
11. Why Google Stores Billions of Lines of Code in a Single Repository
(4)
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
Postgres at Pandora
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
When using pg_dump, we need to make sure that there is no error in other parts of your database, such as the automation system, etc.
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
Use pg_dump to back up your database. And when doing an update, just do it; the longer we wait, the more complicated it gets. At larger sizes pg_dump is not efficient enough; one of the choices is pgBackRest. VACUUM takes a really long time, but do wait for it to complete. And DO NOT turn off autovacuum.
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
Citus parallelizes commands like COPY. We can consider using Citus when the database is bigger than 100GB. They use hash-partitioning: calculating a hash for the incoming data, then distributing it to an appropriate shard.
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
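The hash-partitioning idea above can be sketched in a few lines of Python. The shard count and hash function here are illustrative assumptions, not what Citus actually uses internally:

```python
import hashlib

NUM_SHARDS = 32  # assumed shard count, purely for illustration

def shard_for(customer_id: int) -> int:
    """Map a distribution-column value to a shard by hashing it."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Rows with the same distribution key always land on the same shard,
# so queries filtered on that key can be routed to a single node.
assert shard_for(42) == shard_for(42)
print(shard_for(42), shard_for(43))
```

The key property is determinism: the same input always routes to the same shard, which is what makes single-shard query routing possible.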
The Evolution of Reddit.com's Architecture
Reddit's architecture centers on a monolithic application called "r2," a large Python codebase that has been the core of Reddit since around 2008.
Reddit's front-end engineers have been building modern front-end applications using Node.js. These applications act as API clients and communicate with the APIs provided by Reddit's API gateway or r2 itself.
Reddit uses a combination of technologies for data storage and caching. The core data model of Reddit, called "thing," is stored in PostgreSQL with memcache in front of it. Additionally, Reddit heavily utilizes Cassandra, a distributed database, for storing new features and ensuring high availability.
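The "PostgreSQL with memcache in front of it" setup is the classic cache-aside pattern. A minimal sketch, using a plain dict to stand in for memcached and a hypothetical loader function in place of a real query against the "thing" tables:

```python
cache = {}  # stands in for memcached in this sketch

def load_thing_from_postgres(thing_id):
    # Hypothetical stand-in for a SELECT against the "thing" model.
    return {"id": thing_id, "type": "link"}

def get_thing(thing_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    if thing_id in cache:
        return cache[thing_id]                   # cache hit: no database round-trip
    thing = load_thing_from_postgres(thing_id)   # cache miss: read from Postgres
    cache[thing_id] = thing                      # populate the cache for next time
    return thing

print(get_thing(1))  # first call misses the cache and loads from "Postgres"
print(get_thing(1))  # second call is served from the cache
```

The design choice here is that the cache is populated lazily on reads, so hot "things" stay cached while cold ones never take up cache space.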
Scaling Instagram Infrastructure
Instagram initially ran on AWS but later moved its service entirely into Facebook's data centers to take advantage of Facebook's scaling services. However, when Facebook conducted disaster drills called "storms," Instagram was not yet able to operate across multiple data centers.
Instagram focused on optimizing CPU demand to make each server more efficient. They collected data on CPU instruction usage by different endpoints and implemented tools to monitor and analyze this data.
Instagram identified that code itself was a significant contributor to memory usage. They took steps to reduce the amount of code in memory by optimizing and removing dead code.
PostgreSQL at Pandora
Pandora initially used Oracle as their database system but switched to PostgreSQL due to financial constraints. They wanted to go open source and scale out, which was not compatible with the costly Oracle solution.
To manage their PostgreSQL instances and clusters effectively, Pandora implemented a set of predefined DBMS classes. Each class was defined by its PostgreSQL configuration and the authentication file.
Pandora uses replication for high availability. They have local replicas and remote disaster recovery replicas. The standard procedure is to failover to the local replica, and only in cases where both the primary and local replica are affected, they switch to the disaster recovery replica.
Large Databases, Lots of Servers, on Premises, in the Cloud
The speaker mentions that Postgres has been able to scale out since 2008-2009, and that they have successfully scaled it using external tools like Slony for replication and load balancing. They highlight that Postgres can handle large databases and lots of servers, supporting web-scale applications.
The speaker discusses the infrastructure of Leboncoin, mentioning that they have 70 database servers distributed across two data centers and the cloud. They indicate that they have a 3-terabyte live database, which is one of the largest databases at Leboncoin, along with other larger databases.
The speaker emphasizes the importance of backups and high availability for database systems. They mention using pg_dump for nightly backups, encrypting the dumps and storing them off-site in the cloud using AWS Storage Gateway.
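The dump-encrypt-ship flow described above might look something like the pipeline composed below. The database name, GPG recipient, and S3 URI are made-up placeholders, and the talk does not specify exact flags, so this only builds a plausible command string rather than running anything:

```python
def backup_command(dbname: str, gpg_recipient: str, s3_uri: str) -> str:
    """Compose a dump -> compress -> encrypt -> upload pipeline (placeholders throughout)."""
    return (
        f"pg_dump {dbname}"
        f" | gzip"
        f" | gpg --encrypt --recipient {gpg_recipient}"
        f" | aws s3 cp - {s3_uri}"
    )

print(backup_command("appdb", "backups@example.com",
                     "s3://example-backups/appdb.sql.gz.gpg"))
```

Streaming the dump through the pipeline avoids ever writing an unencrypted copy to local disk, which is the usual motivation for this shape of backup job.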
Citus: Postgres at any Scale
Citus is an open-source extension to PostgreSQL that transforms it into a distributed database. It allows you to create distributed tables where data is spread across multiple Postgres servers, enabling you to handle tables of any size. Users have successfully managed tables of up to a petabyte in size with Citus.
Citus introduces the concept of reference tables, which are replicated to every server in the Citus cluster. Although writes to reference tables can be slower due to the need to update all servers, they offer benefits such as the ability to create foreign keys and perform joins with distributed tables.
Citus provides several superpowers to PostgreSQL, including the ability to route queries to the appropriate node, high-performance parallelism for commands like COPY, SELECT, INSERT, UPDATE, and DELETE, scaling out stored procedure calls, and support for distributed transactions.
Data modeling, the secret sauce of building & managing a large scale data warehouse
The data schema used in the data warehouse follows a dimensional modeling approach. This approach organizes data into dimensions and facts, representing the relationships between various data elements.
To optimize query performance, the data warehouse utilizes partial covering indexes. These indexes are created on the reporting tables and are designed to handle the high dimensionality of the data.
The data warehouse leverages extended data types provided by the Citus extension for Postgres. Examples of these extended data types mentioned include HyperLogLog and T-Digest.
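HyperLogLog can be illustrated with a toy version of the algorithm: hash each value, use a few bits of the hash to pick a register, and track the longest run of zero bits seen in the remaining bits. This is a simplified sketch of the idea only, not the production HLL implementation that ships with the Citus extension:

```python
import hashlib

M = 1024  # number of registers (2**10); more registers = lower error
ALPHA = 0.7213 / (1 + 1.079 / M)  # standard bias-correction constant

registers = [0] * M

def add(value: str) -> None:
    """Fold one value into the sketch."""
    h = int(hashlib.sha256(value.encode()).hexdigest(), 16)
    idx = h & (M - 1)   # low 10 bits select a register
    rest = h >> 10      # remaining bits feed the zero-run count
    rank = 1
    while rank < 64 and rest & 1 == 0:  # position of lowest set bit
        rank += 1
        rest >>= 1
    registers[idx] = max(registers[idx], rank)

def estimate() -> float:
    """Harmonic-mean estimate of the number of distinct values seen."""
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(50_000):
    add(f"user-{i}")

print(round(estimate()))  # close to 50,000, typically within a few percent
```

The point of the data type is that the sketch uses a fixed, tiny amount of memory (here, 1024 small counters) no matter how many distinct values flow through it, and sketches from different shards can be merged by taking register-wise maxima.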
Lessons learned scaling our SaaS on Postgres to 8+ billion events
Sharding tables by customer ID and distributing the data across multiple database nodes is key for maintaining high performance at scale. The Citus Postgres extension makes this process easy and is recommended for scaling applications with analytics on Postgres.
Clearing out old stale data that is infrequently accessed from hot storage can optimize margins and improve report performance. By periodically removing historical event data that is no longer frequently accessed, ConvertFlow was able to reduce storage needs and work with less data in hot storage.
As ConvertFlow scaled to billions of events, they encountered the limitation of the 2.1 billion max value for Postgres' integer column, which was used as the default ID primary key. To overcome this, they worked with Citus to migrate their analytical tables' primary keys to the big integer column type, ensuring continued scalability for their database.
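The ceiling ConvertFlow hit comes from PostgreSQL's 4-byte signed integer type, and the arithmetic is easy to check. The constants below are the documented Postgres limits for integer and bigint:

```python
INT4_MAX = 2**31 - 1   # Postgres "integer": 4 bytes, signed
INT8_MAX = 2**63 - 1   # Postgres "bigint": 8 bytes, signed

print(INT4_MAX)  # 2147483647: the ~2.1 billion ceiling
print(INT8_MAX)  # 9223372036854775807

# 8+ billion events cannot fit in an integer primary key,
# but fit comfortably once the column is migrated to bigint.
assert 8_000_000_000 > INT4_MAX
assert 8_000_000_000 < INT8_MAX
```

This is why sequential integer primary keys on high-volume event tables are usually declared bigint (or bigserial) from the start: the migration is much cheaper before the table holds billions of rows.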
Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022
Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! — Flavio Gurgel
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
PostgreSQL at Pandora
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
1. Scaling Instagram Infrastructure
2. The Evolution of Reddit.com's Architecture
3. Postgres at Pandora
4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
6. Breaking Postgres at Scale
7. Citus: Postgres at any Scale
8. Data modeling, the secret sauce of building & managing a large scale data warehouse
9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
10. Mastering Chaos - a guide to microservices at Netflix
11. Why Google Stores Billions of Lines of Code in a Single Repository
Due date: Sunday, 14 May @ midnight
Background: There are many videos linked below. Each video describes how a major company uses the technology we've covered in class, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to get the gist of them all.
Instructions: The assignment is out of 8 points. To get a point: watch a video and write 3 facts that you learned from the video. Submit the assignment by replying to this post with your facts for all of the videos in a single reply.
Videos:
About postgres:
Scaling Instagram Infrastructure
https://www.youtube.com/watch?v=hnpzNAPiC0E
The Evolution of Reddit.com's Architecture
https://www.youtube.com/watch?v=nUcO7n4hek4
Postgres at Pandora
https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
https://www.youtube.com/watch?v=BgcJnurVFag
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
https://www.youtube.com/watch?v=4GB7EDxGr_c
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
https://www.youtube.com/watch?v=eZhSUXxfEu0
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
https://www.youtube.com/watch?v=kd-F0M2_F3I
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13
About miscellaneous devops stuff:
Mastering Chaos - a guide to microservices at Netflix
https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s
Why Google Stores Billions of Lines of Code in a Single Repository
https://www.youtube.com/watch?v=W71BTkUbdqE
(If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)
Kubernetes (abbreviated k8s) is like docker-compose on steroids. The documentary covers the history of docker and k8s and some of the technical differences between the two. The k8s documentary has 2 parts, and you must watch both parts for it to count as a single video.
https://www.youtube.com/watch?v=BE77h7dmoQU
https://www.youtube.com/watch?v=318elIq37PE