mikeizbicki opened 7 months ago
Scaling Instagram Infrastructure
Postgres at Pandora
Why Google Stores Billions of Lines of Code in a Single Repository
The Evolution of Reddit.com’s Architecture
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
Citus: Postgres at Any Scale
Data modeling, the secret sauce of building and managing a large scale data warehouse
Lessons learned scaling our SaaS on Postgres to 8+ billion events
Why Google Stores Billions of Lines of Code in a Single Repository (1)
1) A monolithic repository has many advantages, but the main reason it works, and why it's worth investing in, is that it supports a collaborative culture.
2) The diamond dependency problem: it's hard to build A when A depends on both B and C, and B and C depend on different versions of D (D.1 and D.2). With many separate repos this is painful because they all have to be updated at the same time.
3) For a monolithic repo to work at such a large scale you also need to invest in code health; one example is API visibility, where the default is private, which encourages more consideration and more "hygienic" code.
The Evolution of Reddit.com's Architecture (1)
1) Permissions are really useful: in a Postgres crash they encountered, the underlying problem was that writes were going to a secondary database, which stricter permissions would have prevented.
2) Sanity checks are really important, observability is key, and multiple layers of safeguards plus simple-to-understand code are key to preventing problems.
3) Timers in code are good: they give us cross-sections and percentiles like p99, which provide a lot of information for tracing the causes of weird cases.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale - Chris Travers - FOSSASIA 2018 (1)
1) Unless you have multiple TB of data, you're better off optimizing a single instance; distributing only pays off once you're dealing with truly heavy velocity and big data.
2) By default, autovacuum doesn't kick in until 50 rows plus 20% of the table are "dead", and on tables with a lot of rows it doesn't keep up, so per-table tuning helps (see the sketch below).
3) istore is an integer key/value type that lets you model time series and the like, and supports arithmetic operations on them.
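A minimal SQL sketch of the autovacuum point above, assuming a hypothetical large table named `events` and, for the istore line, that the istore extension is installed (the operator semantics shown are my reading of the extension, not taken from the talk):

```sql
-- Lower the autovacuum threshold on a huge table: the default
-- (50 dead rows + 20% of the table) lags far behind at this scale.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- 1% of the table instead of 20%
    autovacuum_vacuum_threshold    = 1000
);

-- istore sketch: two integer key/value "time series" added together,
-- summing the values of matching keys.
-- CREATE EXTENSION istore;
SELECT '1=>10, 2=>20'::istore + '2=>5, 3=>7'::istore;
```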
Mastering Chaos - A Netflix Guide to Microservices (1)
1) Autoscaling is fundamental: nodes can be replaced easily, so the loss of a node isn't a problem, and it also gives computing efficiency.
2) A stateful service is a database/cache where the loss of a node is a notable event and replacing it may take hours. Netflix deals with this with EVCache, which keeps multiple copies of the data.
3) Some solutions to excessive load include workload partitioning, request-level caching, and secure token fallback: embedding a token in the device itself that hopefully carries enough information about the customer to let them keep accessing the platform.
Why Google Stores Billions of Lines of Code in a Single Repository
Data modeling, the secret sauce of building & managing a large scale data warehouse
Why Google Stores Billions of Lines of Code in a Single Repository
The Evolution of Reddit.com's Architecture
Why Google Stores Billions of Lines of Code in a Single Repository
Scaling Instagram Infrastructure
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
1. Scaling Instagram Infrastructure:
2. The Evolution of Reddit.com's Architecture
3. Postgres at Pandora:
4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
The Evolution of Reddit.com’s Architecture:
Data modeling, the secret sauce (Citus Con)
Lessons learned scaling our SaaS on Postgres (Citus Con)
Why Google Stores Billions of Lines of Code in a Single Repository
Mastering Chaos - A Netflix Guide to Microservices
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
Scaling Instagram Infrastructure
PostgreSQL at Pandora
v1: Scaling Instagram Infrastructure
v2: The Evolution of Reddit.com's Architecture
v3: Postgres at Pandora
v4: PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
v5: Data modeling, the secret sauce of building & managing a large scale data warehouse
v6: Mastering Chaos - a Netflix guide to microservices
v7: Why Google Stores Billions of Lines of Code in a Single Repository
v8: Kubernetes: The Documentary
Scaling Instagram Infrastructure
Postgres at Pandora
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
Citus: Postgres at any Scale
Why Google Stores Billions of Lines of Code in a Single Repository
Lessons learned scaling our SaaS on Postgres to 8+ billion events
Mastering Chaos - a guide to microservices at netflix
The Evolution of Reddit.com's Architecture
Mastering Chaos - a guide to microservices at netflix
Scaling Instagram Infrastructure
Scaling Instagram Infrastructure
https://www.youtube.com/watch?v=hnpzNAPiC0E
The Evolution of Reddit.com's Architecture https://www.youtube.com/watch?v=nUcO7n4hek4
Scaling Instagram Infrastructure
The Evolution of Reddit.com’s Architecture
Why Google Stores Billions of Lines of Code in a Single Repository
Lessons Learned Scaling our SaaS on Postgres to 8+ Billion Events
Citus: Postgres at any Scale
Data modeling, the secret sauce of building & managing a large scale data warehouse
1. Scaling Instagram Infrastructure
2. The Evolution of Reddit.com's Architecture
3. Postgres at Pandora
4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
Scaling Instagram Infrastructure
perf_event_open in C and cProfile.Profile() in Python
Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022
The evolution of reddit.com’s architecture:
1 – R2’s oldest data model stores data in postgres
2 – referring to items in postgres that didn’t exist caused a lot of errors in ~2011 – the cause of that was R2 trying to remove dead databases
3 – autoscaler changes the number of servers used based on demand – it does this because each post is tracked by a daemon
PostgreSQL at 20TB and Beyond
1 – MapReduce is used to make things faster
2 – Restarting Postgres doesn't disrupt the backend servers because of how the servers are run
3 – The map/reduce procedure used by the materializer aggregates data and transfers it between servers
Data modeling, the secret sauce of building & managing a large scale data warehouse
1 – The data being discussed is collected from an audience of 1.5 billion devices and goes through a processing pipeline to make it more usable for the team
2 – Identifying a device id uses an inner and an outer query, which the presenter described as "bread and butter" for their team; they pre-compute the inner query
3 – They use JSON for staging tables and hstore for dynamic columns; the hstore type is smaller, so there is less I/O when fetching the data (see the sketch below)
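A hedged sketch of the staging/reporting split described in point 3, with hypothetical table and column names:

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Staging: keep the raw payload flexible as jsonb.
CREATE TABLE staging_events (
    device_id  bigint,
    event_time timestamptz,
    payload    jsonb
);

-- Reporting: dynamic columns packed into hstore, which the talk notes is
-- smaller on disk than the raw JSON, so fetches do less I/O.
CREATE TABLE report_daily (
    device_id  bigint,
    report_day date,
    metrics    hstore
);

-- Project a couple of JSON keys into the hstore reporting table.
INSERT INTO report_daily (device_id, report_day, metrics)
SELECT device_id,
       event_time::date,
       hstore(ARRAY['os_build', 'app_version'],
              ARRAY[payload->>'os_build', payload->>'app_version'])
FROM staging_events
WHERE event_time::date = date '2024-01-01';
```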
Lessons learned scaling our SaaS on Postgres to 8+ billion events
1 – At the beginning of the company they ran on Postgres because that was the database Rails and Heroku used; scaling up in the early stage was easy because it mostly meant creating indexes
2 – The speaker advised developing a marketable product that you know customers will buy before worrying about hypothetical issues of scale
3 – Postgres is cheap and easy to deploy and scale; it works well for early-stage companies and has been proven usable even for very large websites
Scaling Instagram Infrastructure
1 – There is a difference between storage and computing servers; the storage servers use Postgres
2 – Cassandra is used to store user activities and is important for scalability
3 – Database replication occurs across regions, but computing resources are contained within one region
Mastering Chaos – a Netflix guide to microservices
1 – A problem they faced early on was a monolithic codebase and a monolithic database, which caused everything to crash whenever something went down
2 – When everything is deeply interconnected it is very difficult to make any changes, which is why this is considered a bad way to build services today
3 – A microservice architecture builds an application as a collection of small services, each with its own task; it is a response to monolithic applications
Citus: Postgres at any scale
1 – Citus adds distributed-database capabilities to PostgreSQL; it distributes tables across a cluster
2 – Unique constraints on distributed tables must include the distribution column; this is a limitation of distributed tables (see the sketch below)
3 – Some features are missing in this model, such as replicating triggers and table inheritance
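A minimal Citus sketch of points 1 and 2, assuming the citus extension is installed and worker nodes are already registered; table and column names are hypothetical:

```sql
CREATE TABLE page_events (
    tenant_id  bigint NOT NULL,
    event_id   bigint NOT NULL,
    created_at timestamptz DEFAULT now(),
    payload    jsonb,
    -- On a distributed table, unique constraints (including the primary key)
    -- must contain the distribution column, hence (tenant_id, event_id).
    PRIMARY KEY (tenant_id, event_id)
);

-- Shard the table across the cluster on tenant_id.
SELECT create_distributed_table('page_events', 'tenant_id');
```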
Breaking Postgres at scale
1 – How you use Postgres changes operationally as a database grows, but Postgres can generally handle any size you need
2 – For small databases (around 10 GB), PostgreSQL will run quickly even if you are always doing sequential scans
3 – Don't create indexes at random; they cost space and slow down inserts. Create indexes in response to specific problems (see the sketch below)
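A sketch of point 3 ("index in response to a specific problem"), with a hypothetical table and query:

```sql
-- First, confirm the slow access pattern actually exists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 42 AND status = 'open';

-- If the plan shows an expensive sequential scan for this pattern, add one
-- index that matches it; CONCURRENTLY avoids blocking writes while it builds.
CREATE INDEX CONCURRENTLY IF NOT EXISTS orders_open_by_customer_idx
    ON orders (customer_id)
    WHERE status = 'open';
```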
The Evolution of Reddit.com's Architecture
Citus: Postgres at any Scale
2/2 points
Scaling Instagram Infrastructure
1 - A sizable portion of people working at Facebook/Instagram have not been there for a long time: 30% of the engineers joined in the last 6 months, and there are interns/bootcampers etc.
2 - When launching new features, Instagram uses gates to control access and gradually releases in the following order: engineers, dogfood, employees, some demographics, world.
3 - Instagram uses no branches in order to enjoy the benefits of continuous integration, collaborate easily, and they can still easily bisect problems and revert when needed.
The Evolution of Reddit.com's Architecture
1 - Reddit had issues with vote queues filling up too quickly and slowing processing down, so they partitioned votes into different queues so there would be less fighting/waiting on the same lock. Ultimately, they split the queues up further so that vote processing for the domain listings would not hold everything else up.
2 - For comment trees, some threads get a very large number of comments, so Reddit created a fastlane for these special cases. However, the fastlane filled up quickly, used too much memory, and caused child comments to be processed before parent comments that were not in the fastlane. Reddit used queue quotas to solve this problem.
3 - Reddit also had difficulties migrating to new servers because their autoscaler terminated servers mid-transition after they were restarted, so they realized there need to be more sanity checks around commands that make major changes.
Scaling Instagram Infrastructure:
The Evolution of Reddit.com’s Architecture:
PostgreSQL at Pandora
PostgreSQL at 20TB and Beyond
Large Databases, Lots of Servers, on Premises, in the Cloud
Breaking PostgreSQL at Scale
Citus PostgreSQL at any Scale
Data Modeling, the Secret Sauce of Building & Managing a Large Scale Data Warehouse
The Evolution of Reddit.com's Architecture
Why Google Stores Billions of Lines of Code in a Single Repository
Infrastructure Transition and Scaling: Initially hosted on Amazon Web Services (AWS), Instagram later moved to Facebook data centers, expanding its infrastructure from one to multiple data centers to enhance scalability and fault tolerance.
Database and Data Center Strategy:
Performance Optimization Tools:
Software Architecture and Technologies:
Load Balancing and Management:
Database Management and Scalability:
The rate of commits being done to the repo is growing exponentially, and a majority of the commits are done automatically.
Google's workflow: sync a workspace, write code, have it reviewed by both people and automation, then commit. Every part of the tree has owners, and to commit to it you need an owner's approval.
A major benefit is that they avoid tedious, time-consuming merges: the codebase is unified, so there is little confusion about which version of a file is current.
In their experience, roughly 20 TB is about as much as a single PostgreSQL instance comfortably holds; to grow beyond that, they use multiple servers so the database of the entire system can keep getting bigger.
Their pipeline aggregates the data, shuffles it between servers, and then runs a map/reduce step that produces the output.
At some point they were running out of 32-bit integers for the advertisement trackers; fixing it took about two months and came down to changing the trackers' data type (a sketch of that kind of migration is below).
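The kind of column-widening migration that point implies, sketched with hypothetical names; the talk's actual procedure is not shown here, and a plain ALTER TYPE rewrites the table under a heavy lock, which is part of why such fixes take so long on big tables:

```sql
-- Widen an exhausted 32-bit key to 64 bits.
ALTER TABLE impressions
    ALTER COLUMN tracker_id TYPE bigint;

-- New surrogate keys can simply start life as 64-bit.
CREATE TABLE trackers (
    tracker_id bigserial PRIMARY KEY,
    name       text NOT NULL
);
```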
The largest PostgreSQL databases in the community are petabytes in size, and PostgreSQL can handle that; however, what works for a 10 GB database is very different from what works at a petabyte.
Basic backups for a 10 GB database can be as simple as running pg_dump from a cron job every 6 hours, or shipping dumps to S3.
For monitoring, processing the logs with pgBadger or using pg_stat_statements gives real query-performance numbers; New Relic, Datadog, and other web-app monitoring services work as well (see the pg_stat_statements sketch below).
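A small pg_stat_statements example for the monitoring point above, assuming the module is loaded via shared_preload_libraries (column names are the PostgreSQL 13+ ones; older versions use total_time/mean_time):

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by cumulative execution time.
SELECT query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```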
Talks about what scaling looks like at a startup, i.e., short-term and long-term goals and how you can scale accordingly.
Used Citus to break the database up across multiple nodes, which shards the data and results in faster queries.
Reindexing the tables shrank their on-disk size, since PostgreSQL's indexes have improved across releases over the years; reindex concurrently, or queries against the table will be blocked (see the sketch below).
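A sketch of the concurrent reindex mentioned above (REINDEX ... CONCURRENTLY needs PostgreSQL 12+; names are hypothetical):

```sql
-- Rebuild all indexes on a table without blocking reads and writes.
REINDEX TABLE CONCURRENTLY events;

-- Or rebuild a single bloated index.
REINDEX INDEX CONCURRENTLY events_created_at_idx;
```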
Switched from Oracle because they wanted an open-source and much more affordable option.
They monitor current activity and look through the PostgreSQL logs, including the error logs and long-query logs.
Created an in-house system called Clustr; it is not ACID, is intended for a sharded setup, and ties each group of two or more databases together through the Clustr application.
Recovering from transaction ID wraparound took around two weeks, which is far too long (a monitoring query for this is sketched below).
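A common query for keeping an eye on transaction-ID age, so wraparound never gets close to the roughly two-billion-transaction hard limit:

```sql
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_toward_wraparound
FROM pg_database
ORDER BY xid_age DESC;
```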
Scaling Instagram Infrastructure
Why Google Stores Billions of Lines of Code in a Single Repository
Kubernetes: The Documentary
The Evolution of Reddit.com's Architecture
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
Postgres at Pandora
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
Breaking Postgres at Scale (PostgreSQL Experts, Inc.)
Mastering Chaos - a guide to microservices at netflix
Why Google Stores Billions of Lines of Code in a Single Repository
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow)
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
(1)Data modeling, the secret sauce of building & managing a large scale data warehouse
(2)Lessons learned scaling our SaaS on Postgres to 8+ billion events
(3)PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
(4)Mastering Chaos - a guide to microservices at Netflix
(5)Why Google Stores Billions of Lines of Code in a Single Repository
(6)Postgres at Pandora
(7)The Evolution of Reddit.com's Architecture
(8)Scaling Instagram Infrastructure
(1) Scaling Instagram Infrastructure
(2) Why Google Stores Billions of Lines of Code in a Single Repository
(3) Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022
(4) Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022
(5) Mastering Chaos - A Netflix Guide to Microservices
(6) The Evolution of Reddit.com's Architecture
(7) PostgreSQL at Pandora
(8) PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018
Postgres at Pandora
Pandora monitors the frequency of different activities that are damaging to the company's service, like long-duration transactions, PostgreSQL and syslog errors, and blocked processes. The company also captures historical data, both short and long term, to know what metrics are normal for every Postgres instance. Pandora made their own internal application, Clustr, to maintain an architecture of data updates that affect several databases (interestingly, the Clustr system is not ACID compliant).
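A hedged sketch of the long-duration-transaction check described above; the five-minute threshold is illustrative, not Pandora's actual setting:

```sql
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_duration,
       left(query, 80)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_duration DESC;
```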
Mastering Chaos - a guide to microservices at netflix
The CAP theorem says that a service whose databases sit in different network partitions must prefer either consistency or availability: if you cannot perform an update on one database, either stop the update and return an error, or update the replicas that are available. Netflix chose availability and relies on "eventual consistency." Netflix purposefully induces "chaos" into their stateless services: nodes are deleted and the system is checked to confirm it is still running normally. Netflix uses a global cloud management/delivery platform called Spinnaker, which lets them integrate automated components into their deployment path easily; as the company learns more about how to improve its service, it can add new automated checks to the deployment pipeline.
Mastering Chaos - a guide to microservices at netflix
Why Google Stores Billions of Lines of Code in a Single Repository
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
Citus: Postgres at any Scale
Lessons learned scaling our SaaS on Postgres to 8+ billion events
Postgres at Pandora
Data modeling, the secret sauce of building & managing a large scale data warehouse
Data modeling, the secret sauce of building & managing a large scale data warehouse
Mastering Chaos - a guide to microservices at Netflix
Why Google Stores Billions of Lines of Code in a Single Repository
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
a. The monolithic repository structure allows for easier management of dependencies, reducing technical debt by ensuring library updates are quickly and uniformly applied across the entire codebase.
b. Google handles thousands of commits per week, heavily leveraging automated systems to manage the volume and ensure consistent code quality.
c. The use of a single repository supports a collaborative work environment by simplifying dependency issues and reducing conflicts, which facilitates smoother development cycles.
a. Focus on developing the right product for the target market first, before addressing scalability, to ensure that the product adequately meets market demands.
b. Transition from a single-node PostgreSQL setup to a more scalable clustered architecture with multiple worker nodes and coordinator nodes, and incorporate database sharding with Citus as event volumes grow, to efficiently handle larger data volumes and improve system performance.
c. Implement data management strategies like clearing old data and rolling up frequently accessed reports to reduce storage costs and enhance data retrieval performance as data volume increases.
a. Utilizing standby databases enhances scalability by improving data availability and enabling load balancing across servers.
b. Employing pg_dump and tools like Barman ensures data integrity and optimizes storage through alternating backup patterns.
c. Combining on-premises servers with cloud services increases flexibility and ensures data consistency with high availability strategies like streaming and logical replication.
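A minimal logical-replication sketch for point c, with hypothetical names and connection details; a real setup also needs wal_level = logical and matching pg_hba.conf entries:

```sql
-- On the publisher:
CREATE PUBLICATION adtech_pub FOR TABLE impressions, clicks;

-- On the subscriber:
CREATE SUBSCRIPTION adtech_sub
    CONNECTION 'host=primary.example.com dbname=adtech user=replicator'
    PUBLICATION adtech_pub;
```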
a. Citus, as an extension of PostgreSQL, plays a crucial role in enabling Windows to manage and analyze data from approximately 1.5 billion devices efficiently. It enhances PostgreSQL's capability to handle massive scale data by distributing the database workload across multiple nodes, which facilitates the creation of data-rich decision dashboards that refresh every few hours.
b. Windows leverages PostgreSQL’s advanced data types like JSON and hstore. JSON is used in staging tables for its flexibility in handling semi-structured data, and hstore is utilized for dynamic columns in reporting tables. These data types are pivotal for managing the diverse and voluminous data from billions of devices, allowing for efficient querying and data structure flexibility.
c. The system focuses on "measures," which are time series data specific to use cases executed on Windows devices. To handle the complexity and high volume of this data, Windows uses Citus to ensure scalability and performance.
a. Netflix highlights the dangers of cascading failures within a microservices architecture, where the failure of a single service can potentially bring down the entire system. An example of this is the over-reliance on a single cache cluster for both online and offline processes, which poses significant risks if the cache layer fails. Implementing proper fallbacks and diversifying dependencies are critical to preventing such cascading issues.
b. Inspired by the concept of the autonomic nervous system, Netflix emphasizes the importance of creating an environment where best practices are automatic and intuitive. This is achieved through a cycle of continuous learning and automation, ensuring that the system can adapt and maintain operational efficiency without manual intervention in routine processes.
c. To handle scalability challenges, such as handling up to a million requests per second, Netflix employs strategies like workload partitioning and caching. Additionally, following a significant outage in their US-East AWS server, Netflix implemented a strategy to split their load across multiple regional servers to distribute traffic more evenly and enhance service reliability. This approach not only helps in managing high traffic volumes but also in improving the system's overall resilience to failures.
a. Pandora transitioned from SLONY replication, which only replicated tables and sequences, to more robust replication strategies to enhance data availability and consistency.
b. Pandora's detailed monitoring system tracks a variety of metrics, including error frequencies, queue durations, and blocked processes.
c. Initially using Oracle, Pandora switched to PostgreSQL for its cost-effectiveness, scalability, and open-source benefits. They have developed in-house solutions, such as a custom database management system named "Cluster".
a. Instagram's scaling strategy encompasses three key dimensions—scaling out by adding more servers and data centers, scaling up by enhancing the efficiency of each server to handle more tasks, and scaling the development team to enable rapid deployment of features without compromising system stability.
b. The infrastructure utilizes Django for stateless computing with data stored in systems like Cassandra and Memcache, ensuring efficient data synchronization across datacenters.
c. Instagram continuously optimizes its software for performance, including memory optimization and CPU usage reduction through targeted code refinements and advanced monitoring tools.
a. Features like Fastlane queues to manage highly active threads and ensure they don't slow down the rest of the site. This helps in handling real-time data more efficiently by reducing delays caused by popular comments or threads.
b. To maintain stability and performance on a high-traffic platform like Reddit, implementing thorough sanity checks, observability, and multiple layers of safeguards is crucial. These practices are essential for identifying and mitigating issues before they affect server performance and user experience.
c. Reddit has encountered significant challenges with locks in their architecture, particularly impacting the processing of queues, such as the "vote" feature. This has led to efforts towards adopting lockless data models to improve efficiency and reduce bottlenecks.
The Evolution of Reddit.com's Architecture
Data modeling, the secret sauce of building & managing a large scale data warehouse
Scaling Instagram Infrastructure:
Postgres at Pandora:
Evolution of Reddit's Infrastructure
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
Why Google Stores Billions of Lines of Code in a Single Repository
Mastering Chaos
Large Databases Lots of Servers on Premises in the Cloud
Breaking Postgres at Scale
Postgres at Pandora
Citus: Postgres at any scale
Postgres at Pandora
Scaling Instagram Infrastructure
Why Google Stores Billions of Lines of Code in a Single Repository
Evolution of Reddit.com’s Architecture
Scaling Instagram Infrastructure
The Evolution of Reddit.com's Architecture
PostgreSQL at 20Tb and beyond
PostgreSQL at Scale.
Mastering Chaos - A Netflix Guide to Microservices
Why Google Stores Billions of Lines of Code in a Single Repository
Scaling instagram infrastructure:
Breaking Postgres at scale:
Scaling Instagram Infrastructure:
The Evolution of Reddit.com's Architecture:
Scaling Instagram Infrastructure:
The Evolution of Reddit.com’s Architecture:
Why Google Stores Billions of Lines of Code in a Single Repository:
Mastering Chaos - A Netflix Guide to Microservices
PostgreSQL at Pandora:
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale:
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!:
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
Due date: Before you take the final exam
Background: There are many videos linked below. Each video describes how a major company uses postgres, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to learn something meaningful from each of them.
Instructions:
For each video below that you choose to watch, write 3 facts that you learned from the video. Make a single reply to this post that contains all of the facts.
The assignment is worth 2 points. If you watch:
So you can get up to a 4/2 on this assignment.
Videos:
About postgres:
Scaling Instagram Infrastructure
https://www.youtube.com/watch?v=hnpzNAPiC0E
The Evolution of Reddit.com's Architecture
https://www.youtube.com/watch?v=nUcO7n4hek4
Postgres at Pandora
https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
https://www.youtube.com/watch?v=BgcJnurVFag
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
https://www.youtube.com/watch?v=4GB7EDxGr_c
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
https://www.youtube.com/watch?v=eZhSUXxfEu0
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
https://www.youtube.com/watch?v=kd-F0M2_F3I
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13
About general software engineering:
Mastering Chaos - a guide to microservices at netflix
https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s
Why Google Stores Billions of Lines of Code in a Single Repository
https://www.youtube.com/watch?v=W71BTkUbdqE
(If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)
The kubernetes documentary. Recall that Google developed kubernetes as a more powerful version of docker-compose. (Note that this documentary has 2 parts, and you must watch both parts to count as a single video.)
https://www.youtube.com/watch?v=BE77h7dmoQU
https://www.youtube.com/watch?v=318elIq37PE