State of the Industry Assignment + Extra Credit

mikeizbicki / cmc-csci143

big data course materials

40 stars 76 forks source link

State of the Industry Assignment + Extra Credit #512

Open mikeizbicki opened 7 months ago

mikeizbicki commented 7 months ago

Due date: Before you take the final exam

Background: There are many videos linked below. Each video describes how a major company uses postgres, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to learn something meaningful from each of them.

Instructions:

For each video below that you choose to watch, write 3 facts that you learned from the video. Make a single reply to this post that contains all of the facts.
The assignment is worth 2 points. If you watch:
- 1 video, you get 1 point
- 2 videos, 2 points
- 4 videos, 3 points
- 8 videos, 4 points
So you can get up to a 4/2 on this assignment.

Videos:

About postgres:

Scaling Instagram Infrastructure

https://www.youtube.com/watch?v=hnpzNAPiC0E
The Evolution of Reddit.com's Architecture

https://www.youtube.com/watch?v=nUcO7n4hek4
Postgres at Pandora

https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

https://www.youtube.com/watch?v=BgcJnurVFag
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

https://www.youtube.com/watch?v=4GB7EDxGr_c
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

https://www.youtube.com/watch?v=eZhSUXxfEu0
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

https://www.youtube.com/watch?v=kd-F0M2_F3I
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

About general software engineering:

Mastering Chaos - a guide to microservices at netflix

https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s
Why Google Stores Billions of Lines of Code in a Single Repository

https://www.youtube.com/watch?v=W71BTkUbdqE

(If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)
The kubernetes documentary. Recall that google's developed kubernetes as a more powerful version of docker-compose. (Note that this documentary has 2 parts, and you must watch both parts to count as a single video.)

https://www.youtube.com/watch?v=BE77h7dmoQU

https://www.youtube.com/watch?v=318elIq37PE

rachelHoman commented 7 months ago

Why Google Stores Billions of Lines of Code in a Single Repository
1. There are thousands of commits a week and there are more commits by automated systems than by humans.
2. All users work in the style of truck-based development combined with a centralized repo and there isn't a significant use of branching for development
3. Changes to base libraries are instantly propagated through the dependency chain to the final product to avoid technical debt build up.
Lessons learned scaling our SaaS on Postgres to 8+ billion events
1. The speaker suggested creating the right product before worrying about scale
2. Moving their Postgres database from a single node to a clustered node database that has a coordinator node and multiple worker nodes helped scale their project for 1 billion events
3. They decided to clear old stale data that was in hot storage so that users were getting up to date statistics and if they had less data then they had lower costs and more profit

epaisano commented 7 months ago

Scaling Instagram Infrastructure

There are two different types of services—storage and computing. Storage servers store global data that used to be consistent across multiple data centers, whereas computing servers process requests by users, are temporary, and can be constructed by the global data.
Memcache is very important in scalability and without it the databases would be failing. It is a high-performance key-value store in memory. It provides millions of reads and writes per second.
Each process has two parts of memory—the shared part and the private part. The code, in fact, is a big part of where the memory is used up, so code should be optimized.

Postgres at Pandora

Pg_dump is slow to generate, and it is even slower to restore. Further, it can block other processes, so it’s not great. This is why Pandora moved to pg_basebackup which has its own problems, nonetheless. Some of the difficulties surround the connection to the database as it may be terminated under the stress for the data.
Pandora originally had SLONY replication, meaning that it only replicated tables and sequences.
Autovacuum runs for more than a week per table, which can be inconvenient due to the locks that it needs.

Why Google Stores Billions of Lines of Code in a Single Repository

The rate of change, in 2015, at which how many changes were committed was increasing rapidly—today it must be even more.
In 2015, there were about 15k commits per workday by humans!
The google workflow is usually as follows: the user links their workspace to the repo, they write their code, they review their code, and, finally, they commit (not before their code is reviewed by humans and other software, though)!

The Evolution of Reddit.com’s Architecture

Locks can slow queues down and inhibit us from getting real time data (this happened to them in mid-2012)
Sometimes, when reddit has a parent comment that is very popular, they have to make their own queue for it called a Fastlane because it tends to slow the rest of the site down otherwise.
Sanity checks are unbelievable important—especially when working on a high-traffic tool like reddit. Especially for when terminating servers!

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

Adjust, like many other large-scale companies, use MapReduce to break things down to speed up the process and keeping them close to real time.
You can create data structures to decrease the number of bytes that data takes up (Adjust does this with country names)
Autovacuum is not triggered nearly enough to keep being efficient, so Adjust changed the requirement at which autovacuum is triggered.

Citus: Postgres at Any Scale

Citus can help make making tables more efficient by allowing the creation of reference tables where you can then just join the table with the information needed.
Citus has better COPY performance because they send the commands to each of the worker nodes in parallel, which speed everything up.
With three different implementations of INSERT…SELECT (Co-located, Re-Partitionaning, and Merge step), Co-located works at about 100M rows per second, Re-partitioning at about 10M rows per second, and Merge step at about 1M rows/second.

Data modeling, the secret sauce of building and managing a large scale data warehouse

To track Wi-Fi connectivity or audio and video quality, they would use a measure, which is time series data on a Windows device.
One of the reasons they use Citus is called dynamic compute to compute the inner aggregations over precomputing the inner aggregations which can be costly.
They are using partial covering index. Some tables can have more than 50 indexes, so it can take up a lot of storage, so instead of a random IO, they use a sequential IO.

Lesson learned scaling our SaaS on Postgres to 8+ billion events

Optimizing from 1 to 1M events was not too hard, as running out of storage wasn’t an issues and queries were already fast. So, it just meant creating primary indexes. They found that instead of concurrently thinking of optimization during product building—first focus on building the product.
The reason they had trouble scaling up to 100M from 1M is because as the events database grew, the amount of data needed to bring into memory for each query was too large for a single node database.
When they wanted to change to the big integer column type for the analytical tables primary keys, they were able to do it withing about 30 minutes by creating a new big integer id column and indexing it and then updating the told ids and switching over.

myngpog commented 7 months ago

Why Google Stores Billions of Lines of Code in a Single Repository (1) 1) There are many advantages of a monolithic repository but the main reason why it works and why it's worth investing in is that it is great for collaborative culture. 2) The Diamond Dependency problem is when it is difficult to build A since A depends on both B and C and B and C both depend on D or D.1 and D.2, this is a problem if there's a lot of repos to update at the same time. 3) In order for monolithic repos to work at such a large scale, you also need to consider code health, one such example is API visibility, where we set it to private and this encourages more consideration for more "hygienic code"

The Evolution of Reddit.com's Architecture (1) 1) Permissions are really useful considering a Postgres crash they encountered was that they were able to write to the secondary database. 2) Sanity checks are really important as well as observability is key and multiple layers of safeguards + simple-to-understand code are key to preventing problems. 3) Timers in code are good, they also give us cross-sections, as well as p99, which gives us a lot of information that helps us trace the problems/causes of weird cases.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018 (1) 1) Unless we have multiple TB of data, we're better off optimizing a single instance; this approach is better for dealing with heavy velocity and big data 2) By default, autovacuum doesn't kick in until we have 50 rows plus 20% of the table being "dead" but if we have a lot of rows, it doesn't keep up 3) Istore is a way to model integer arrays allowing us to model time series and stuff, allowing us to have arithmetic operations

Mastering Chaos - A Netflix Guide to Microservices (1) 1) Autoscale is fundamental as it can replace nodes easily and a loss of a node isn't of a mode isn't a problem for us; it also gives us computing efficiency. 2) A stateful service is a database/cache where the loss of a node is a notable event and may take hours to replace that node. We can deal with this with an EV cache where there multiple copies. 3) Some solutions to excessive load include workload partitioning, request level caching, and secure token fallback which is embedding a token into a device itself and it'll hopefully have enough information on the customer, letting them access the platform.

oliver-ricken commented 7 months ago

Why Google Stores Billions of Lines of Code in a Single Repository

Back when this talk was presented, on an average workday ⅔ of commits made to the repository are automated, and the other ⅓ of commits are made by humans.
Branching is rarely used as a version control strategy for development at Google; if a branch is created, it is usually because a new feature is being released.
Google has developed many source systems, such as Piper and CitC, which allow the monolithic repository structure to function properly. Piper is Google’s version of GitHub, and CitC is the system in which developers view and edit files across the codebase.

Data modeling, the secret sauce of building & managing a large scale data warehouse

Min Wei talked a bit about what Citus Data actually is in this talk, and why Microsoft acquired it. Essentially, from what I gathered Citus is an open-source Postgres extension that distributes data and queries in a way that improves performance and scalability.
In the use case that Min Wei presented, they use a partial covering index. Some “commonly queried tables” can have up to 50 indices on them.
In the data schema for a measure table, which is time series diagnostic data from Windows devices, the rows in the table are separated into top level and secondary dimensions. I’m a bit curious as to how this hierarchical structure is actually implemented within Postgres.

ben-smith23 commented 7 months ago

Why Google Stores Billions of Lines of Code in a Single Repository

One advantage of a monolithic code base is to reduce technical debt and manage deoendicies easier, as library updates are propagated through the entire code base.
A single repository require significant amounts of tooling in order to place an emphasis on code health and scalability, which is very expensive but worth the benefits of a single repository.
The code base is a tree, with certain people in charge of subdirectories. It is their responsibility to determine whether something is commited or not.

The Evolution of Reddit.com's Architecture

The speaker emphasizes the importance of timing your code, as it allows one to get a cross section of the code and root out problematic areas within it.
Reddit has something called a Fastlane which allows Reddit to manually mark a thread and give it dedicated processing. A bug was found in it in 2016 due to comments being stuck in the non-fastlane queue.
Peer-reviewed checklists and sanity checks are crucial, especially with “destructive actions.”

rajabatra commented 7 months ago

Why Google Stores Billions of Lines of Code in a Single Repository

The first thing I learned is how much data is stored in their monolithic repository. 1 billion files, 2 billion lines of code, and 86 terabytes of data.
I learned about the google workflow. All their code is synced, then an engineer writes code. It is then review by humans and automated tests, and finally able to be committed.
Finally, I learned about all the tools that have been built to mantain the code. For example, checking to see dead code and unused dependencies.

Scaling Instagram Infrastructure

I learned some of the issues instagram faced. this includes how a rm -rf operation on a container was wiping out a couple of the servers. As well, it is good to have multiple datacenters as they can be unreliable at times
I learned how instagram has so many frameworks and databases in their stack. They split it up into computing and storage.
I learned how instagram minimizes their cpu usage and server usage. they do this by monitoring computer instructions and see how much a new feature increases the performance.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

There is a 5 minute window between when a request is made and when it shows up on the dashboard.
There are one to two hundred thousand requests coming in a second and even more during peak times. There are also over 400tb of data to analyze
The general architecture is they take request from internet, write to backend, materialize to analytics shards, and show on dashboard. The companyt uses mapreduce to aggregate data from many servers

Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

Windows diagnostic data is from an audience of 1.5 billion devices.
A typical windows measure could have data points from 200m devices per day and they get 2 million distinct queries per day
Use citus because it works with PosthresSql, and built in custom data types, and is both scalable and reliable

meghnapamula commented 7 months ago

1. Scaling Instagram Infrastructure:

There is a concept called ScaleUp, which is the idea of using as few CPU instructions as possible to achieve a goal. This consists of writing good code and using as few servers as possible. Each server would be able to serve more users when this method is implemented.
There are two types of memories: shared and private memory. Code is a big part of this memory, so it is important to reduce the amount of code to free up memory space. Instagram removed dead code and optimized their code to achieve this.
There are risks to everyone working on Master: if people misconfigure, updates could be prematurely released. But according to the speaker, they haven't really seen any downsides.

2. The Evolution of Reddit.com's Architecture

r2 is the original Reddit, and it is monolithic in that each server has the same code. However, each server may be running different parts of the code, but they all have the same general code.
There have been queue issues in the past, although it is usually processed quickly. Reddit had issues with this in 2012 with their "vote" feature. They found through adding timers that the locks were the feature that were causing the problems.
Locks are not great in general, and should only be used if you really have to use them. Reddit is trying different data models so they can go "lockless" (as of the time this video was released).

3. Postgres at Pandora:

Pandora decided to transition to Postgres from Oracle because Postgres is free and open source, and they wanted to increase their scale. Postgres offered all the benefits they were looking for.
In terms of monitoring current activity, Pandora was not doing much but they have since increased their emphasis. Now, they are doing more including monitoring current activity, errors, long duration queries and transactions, blocked processes, and more.
Pandora made an in-house developed product called Clustr. It is a non-ACID database, so it is made for Pandora's specific needs. It is intended for shared databases.

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

PostgreSQL is VERY big: it has over 100k requests per second (at minimum). It also has over 400 TB of data to analyze, and a very high data velocity.
There are some challenges regarding data modeling. This includes the large volume of data, the fact that most data is new, the fact that there are a lot of btrees, and more.
They were experiencing an issue in which 50 rows and 20% of table were dead rows, by default. This is an issue related to AutoVaccuum, and they tried to reduce that 20% to 8%. One approach was to change it to 150k rows + 0% -- however, this also caused issues. AutoVaccuum required significant fine-tuning in order to work well.

5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

One difference between common stack and a growing stack lies in scaling out. A common stack has one master database, but a growing stack has two standby databases. Scaling out is very important.
pgdumps are very important. They allow for testing or pgrestore, and this tells you whether or not your database is corrupt.
Pairs of servers are related in the sense that each server is backed up by one Barman. These are in alternative patterns, so their backups will change and alternate. having multiple Barman helps with disk space.

6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

In the beginning (for smaller amounts of data), pgdumps are the best things to start out with for backups.. They are a simple and efficient backup strategy. They also work quickly. It is better to use pgdump than spend tons of time coming up with an elaborate backup strategy.
A best practice is to be intentional with work_mem and base it on actual temporary files being created in the logs. You can set it to two or three times the largest file.
You should keep shared buffers to 16-33GB. Going above this will increase checkpoint activity without much actual performance benefit.

7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

Citus is an open source extension for PostgreSQL that turns it into a distributed database. Some of the abilities it provides are: distributed tables with co-location, reference tables, and query routing, among others.
Citus is useful in instances where there is a database that is bigger than 100GB or will become bigger, when you are dealing with muli-tenet applications, and for real-time analytics dashboards.
Microsoft Windows is one of the biggest users of Citus. It uses Citus to make ship/no-ship decisions. Citus allows for split-second analytics on billions of JSON events.

8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

A primary use case of Citus is providing customer experience insights. It allows for data rich decision dashboards and ad hoc queries.
A measure refers to time series data on a specific use case executed on a device running Windows 10 or 11. It has dimension columns and metric columns.
The data Windows is dealing with is very large as a typical Windows measure could easily have data points from 200M devices per day.

lbielicki commented 7 months ago

The Evolution of Reddit.com’s Architecture:

Many core data models (called “thing”- stores accounts, links, subreddits) which uses Postgres
Thing is r2’s oldest data model
starting to move Thing into its own service (from r2).. Upside: no longer tangled in legacy code

Data modeling, the secret sauce (Citus Con)

Windows collects data from 1.5 billion devices
use hyperloglog to aggregate across millions of devices (hyperloglog is a custom data type built into Citus) for example, in finding the denominator of failure rate
some common tables have 50+ indexes (very high dimensional)

Lessons learned scaling our SaaS on Postgres (Citus Con)

Postgres easy to deploy and scale, but needed to expand past single node databases
Citus Postgres extension to distribute Postgres (sharding)
cluster nodes: coordinator node and multiple worker nodes, 2-4x as fast

Why Google Stores Billions of Lines of Code in a Single Repository

1billion+ files, 86 terabytes, 45 thousand commits per workday
CitC cloud-based system allows users to see their local changes overlaid on top of the Piper repository (where the single repo is stored)
trunk-based development; Piper users work at “head”, and all changes immediately update the view of other users (no branching).. Creates “one source of truth”

Mastering Chaos - A Netflix Guide to Microservices

Cap theorem: in the presence of a network partition, you must choose between consistency and availability (Netflix: chose availability, “eventual consistency’ through cassandra).
“Stateless service”- not a cache or database, frequently accessed metadata, loss of a node is no issue, survives instance failure after applying Chaos Monkey… vs “stateful service”- databases and caches, custom apps w/ large amounts of data, loss of a node is notable
Conway’s law- any piece of software reflects the organizational structure that produced it (relevant to the idea of integrating new software/breaking away things towards the goal of long term architecture)

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

Mapreduce from backend servers to analytics shards.. And from shards to client dashboards (usually under 5 minutes)
use istore for integers, useful for modelling timeseries/ supports GIN indexing
even in 400TB analytics environment, PostgreSQL still capable of scaling up

Scaling Instagram Infrastructure

Three priorities for scaling: scale out (more hardware, servers) scale up (take full advantage of servers) scale dev team.
use postgres to store user/media/”friendship” data, typically stored with a master copy and several replicas to account for high rates of reading/writing
Scaling up: use as few CPU instructions/reduce CPU demand, use as few servers as possible.. For example, multiple url generation to allow for differences in devices can be simplified to one function call

PostgreSQL at Pandora

Migrated to PostgreSQL from oracle, primarily due to finances, open source, and high-scale
developed in-house distributed database called Clustr; not ACID compliant, but works well with sharded databases (compromise consistency for availability)
Scaling issues: WAL backlog (150 WAL files per minute); wraparound, vacuum: autovacuum runs for more than a week per table

giffiecode commented 7 months ago

v1: Scaling Instagram Infrastructure

there are three dimensions of scaling. scale out means adding more servers and data centers. scale up means make each added servers count. scale dev team means enabling more developers to move fast without breaking things.
instagram tech stack centers around django, which is python code that's stateless and handle tasks. django is the compute. storage consists of cassandra, memecache, and other storing systems in the database. instagram needs to store data in different memcache and makes the data sync across datacenters
instagram would monitor cpu activities through linux, analyze c profile and optimize. after iterations, they realize c code is faster

v2: The Evolution of Reddit.com's Architecture

reddit's techstack consists of the core - R2, api, listing, search, thing, rec, cdn. r2 has a load balancer that separate systems
loadbalancer separate the functions and minimize the negative impact of one queue on the overall user experience. for example, when vote queue is long, users who are only retrieving data through the front-end don't need to wait
vote queue has slowed down multiple times throughout the development of reddit. they first parallel it and then introduce the load balancer. reddit tries to log each action to make debug fast

v3: Postgres at Pandora

Pandora's architecture revolves around a core nexus, leveraging music metadata and radio functionalities. Their data warehouse, supported by an in-house database like Clustr, forms the backbone for managing vast volumes of user data efficiently, enabling seamless scalability.
The talk highlighted Pandora's robust monitoring system, focusing on detecting and addressing issues promptly. This includes tracking errors, frequency of occurrences, identifying and resolving long-duration queues and transactions, monitoring for blocked processes, and staying vigilant about any errors logged in syslog files.
Pandora emphasizes monitoring changes and the rate of change within their systems. They employ sophisticated alerting mechanisms to detect any deviations from normal operations promptly. This proactive approach allows them to respond swiftly to evolving conditions and maintain service reliability.

v4: PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

Citus offers a solution for scaling PostgreSQL databases seamlessly. By serving as an extension, Citus transforms PostgreSQL into a distributed database, capable of handling datasets ranging from 100GB to larger sizes suitable for multi-tenant applications and real-time analytics dashboards.
Citus enables the creation of a data warehouse environment, particularly suited for relatively static queries. This setup allows for efficient querying of vast datasets, enhancing analytical capabilities without compromising performance.
The architecture involves a coordinator node that orchestrates communication with worker nodes. Through commands like "create distributed_table," administrators can seamlessly connect different shards, ensuring data distribution across the cluster for optimal performance and scalability.

v5: Data modeling, the secret sauce of building & managing a large scale data warehouse

Following its acquisition by Microsoft, Citus serves various use cases, including handling Windows diagnostic data from an enormous user base of 1.5 billion devices
1. These use cases often involve managing data-rich environments for decision-making dashboards and accommodating ad hoc queries efficiently.
2. Citus manages time-series data related to specific Windows 10 use cases. Leveraging PostgreSQL's rich data types, particularly hstore and JSON, administrators can optimize data storage and retrieval. They typically utilize JSON for staging tables to handle flexible data structures and hstore for dynamic columns in reporting tables, ensuring efficient querying and analysis.

v6: navitaging chaos: netflix

Netflix operates under the understanding of the CAP theorem, which states that in the event of a network partition, a system must choose between consistency and availability. Netflix opts for eventual consistency, particularly leveraging Cassandra for its distributed database needs.
Following a significant error on Christmas Eve in the us-east-1 AWS region, where all services were affected due to a single point of failure, Netflix diversified its infrastructure across multiple AWS locations to enhance fault tolerance and resilience.
Auto scaling groups play a vital role in Netflix's infrastructure. They enable dynamic scaling of resources based on demand fluctuations, ensuring optimal resource utilization and performance across their services.

v7: why google store billions of lines of code in a single repository

Google's repository hosts an enormous amount of data: 1 billion files, 35 million commits, and over 2 billion lines of code, with a staggering 45,000 commits per workday and a daily peak of 800,000 queries per second, including configuration files.
Google opts for a monolithic repository for various reasons, including unified versioning, extensive code sharing and reuse, facilitating atomic changes, enabling collaboration across teams, enhancing code readability, and eliminating ambiguity regarding file versions.
Google follows a trunk-based development model with a single repository. Users typically work at the head, ensuring a single, consistent version, with branching being rare and primarily used for releases.

v8: kubernetes documentary

The DevOps movement evolved from virtualization to cloud computing, with AWS becoming a dominant force in 2013, notably introducing services like S3, revolutionizing infrastructure management.
Kubernetes functions like a post office, managing the logistics of application deployment by abstracting away underlying infrastructure details. Containers are likened to envelopes, encapsulating applications and their dependencies for streamlined deployment and execution.
Embracing open-source principles allows for transparency and collaboration, enabling anyone to view and modify code. This fosters the creation of vibrant communities, driving innovation and collective problem-solving.

danzhechen commented 7 months ago

Scaling Instagram Infrastructure

Instagram's ability to scale efficiently involved three primary strategies: scaling out by increasing the number of servers to match user growth, scaling up by improving server efficiency to handle more operations, and scaling the development team to ensure that new features could be deployed quickly without compromising the system's stability.
The transition from AWS to Facebook's data centers was pivotal. This move facilitated better disaster recovery preparedness and allowed Instagram to leverage Facebook's advanced monitoring and scaling tools.
Instagram significantly improved performance through software optimizations, such as memory optimization and CPU usage reduction. These improvements were achieved by refining code efficiency, optimizing data storage and retrieval processes, and employing tools to pinpoint and rectify high-cost functions in their codebase.

Postgres at Pandora

Initially, Pandora used Oracle but switched to PostgreSQL due to cost issues and the desire for scalability and open-source flexibility. The transition involved creating predefined DBMS classes to manage numerous PostgreSQL instances effectively.
Pandora faced significant challenges in managing data consistency, backup processes, and replication strategies. Innovations include implementing monitoring tools for database activity, optimizing backup processes with tools like PG dump and PG base backup, and managing replication to enhance data availability and disaster recovery.
To address limitations in high availability and scalability, Pandora developed an in-house solution called "Cluster" (without an 'e'). This system allows for more flexible database management, including seamless schema updates and PostgreSQL version upgrades without downtime.

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

The presentation highlights the importance of careful planning in infrastructure management, especially when dealing with large databases and multiple servers. It stresses the need for scaling databases both vertically and horizontally, utilizing cloud services alongside on-premises servers to ensure flexibility and scalability.
High availability is emphasized through strategies like streaming replication and logical replication to ensure data consistency and system reliability. The presenter discusses the implementation of multiple standby databases and replication slots to manage potential data lags and system failures.
Effective backup strategies are crucial for data security and recovery. The talk covers the use of tools like Barman for managing physical backups and PG Dump for logical backups, ensuring data can be restored accurately and efficiently.

Citus: Postgres at any Scale

Citus extends PostgreSQL to manage distributed databases, effectively allowing PostgreSQL to scale horizontally across multiple servers. This extension helps overcome hardware limitations on storage and memory, enabling databases to grow without physical constraints.
The video highlights Citus’s capability to distribute tables across various servers, which helps in managing very large datasets efficiently. Citus supports distributed transactions, ensuring that operations either completely succeed or fail across all involved servers, thus maintaining data integrity.
Citus is particularly effective in handling real-time analytics and multi-tenant SaaS applications. It provides the flexibility to accommodate growing data demands by distributing loads across several nodes.

Why Google Stores Billions of Lines of Code in a Single Repository

Google's approach consolidates all its code into one massive repository, providing a single source of truth. This structure simplifies access and modification of the code, ensuring that all developers work with the most current and correct version of any file.
The monolithic repository facilitates extensive code reuse and sharing across different teams at Google. It significantly simplifies dependency management, allowing changes to any library to be propagated instantly through the entire codebase. This setup prevents the "diamond dependency problem" where different projects might depend on multiple versions of the same package, which can lead to compatibility issues.
Google has developed sophisticated tools to handle the scale of its repository. Tools like Piper, Critique, and Tricorder integrate with the code repository to support code review, static analysis, and automated testing.

Lessons learned scaling our SaaS on Postgres to 8+ billion events

ConvertFlow initially chose PostgreSQL due to its compatibility with Ruby on Rails and ease of setup on Heroku. This choice supported their early growth, handling up to a million events without significant issues. This stage emphasized the importance of selecting technology that aligns with the team's familiarity and the project's immediate needs.
As event volumes grew, the original single-node PostgreSQL setup began to struggle, particularly when the service scaled to around 100 million events. The team initially experimented with Firebase to alleviate load issues but eventually transitioned to Citus, a PostgreSQL extension that allows efficient database sharding. This shift was critical in managing larger data volumes effectively, improving query performance by distributing the workload across multiple nodes.
Upon reaching billions of events, further optimizations were necessary. ConvertFlow implemented strategies such as data rollups for frequently accessed reports and periodic clearing of old data from active storage to maintain performance and manage costs.

Mastering Chaos - a guide to microservices at netflix

One of the biggest challenges in a microservices architecture is managing dependencies and interactions between services. Netflix experienced issues where failures in one service could cascade and impact others, affecting overall system stability. To manage this, Netflix developed Hystrix, a library designed to handle the complexity of inter-service communication by implementing patterns like circuit breakers, fallbacks, and isolation techniques.
To ensure the robustness of their system, Netflix uses a technique called "fault injection testing" which involves deliberately introducing faults into the system to test resilience and confirm that fallback mechanisms function correctly in production.
Netflix emphasizes the importance of identifying and defining "critical" microservices that are essential for basic user functionality, such as browsing and streaming. By focusing testing and reliability strategies on these critical services, they can ensure that the core functions of their platform remain operational even when peripheral services are compromised.

The Evolution of Reddit.com's Architecture

Reddit began with a monolithic architecture centered around a large Python application known as 'r2'. The platform started transitioning towards a more modern architecture by introducing microservices. This shift was primarily motivated by the need to improve scalability and manageability, with new services written in Python to maintain compatibility with the existing codebase.
The front-end engineers at Reddit, frustrated with the outdated aspects of 'r2', developed new, modern front-end applications using Node.js. These applications interact with Reddit's back-end through APIs, reflecting a shift towards separating the user interface from the data processing layers.
Reddit's architecture for handling listings (ordered lists of links) and comments involves complex caching and data normalization strategies to manage the high volume of user interactions efficiently. For listings, Reddit caches IDs and sort data to avoid expensive database queries. For comments, the platform uses a threaded model, where the parent-child relationships are stored to facilitate easier rendering of comment threads.

Eshaan-Lumba commented 7 months ago

Mastering Chaos - a guide to microservices at netflix

Fault Injection Testing (FIT) allows Netflix to test their service after they think that it is working. Through it, they can have functional tests through the whole system, testing each microservice individually but also within the context of any adjacent microservices that use it. Additionally, it allows them to test their scalability requirements using a percentage of the real traffic the system would face.
Eventual Consistency is the notion of preferring Availability over Consistency as per the CAP Theorem. Netflix uses Cassandra for this, where a write is usually done to a single node, which then propagates the information to other nodes later on. Information is considered written when a certain fraction of the nodes have received the information.
Putting all your eggs in one basket is always a bad idea. For example, Netflix used to only use the US-East-1 AWS servers, and when it went down, all of Netflix also went down. Hence, they split their load across multiple regional servers so that when one fails, the remaining servers can handle the failed server's load for the meantime.

Scaling Instagram Infrastructure

Instagram uses one main database that they write to from Django, and that database writes to the other "read" replica databases eventually. When Django then needs to perform a read, it reads directly from the "read" replica databases in order to prioritize availability.
In order to scale memcache across regions, Instagram uses cache invalidation. After the Postgres replication is performed, a demon is run on all the Postgres replicas that invalidates the cache in its region. After this, each cache is aligned with its read replica database.
To scale their memory to a 20+% capacity increase, Instagram reduced the amount of code stored by running in optimized mode (-O) and removing dead code. The also moved configuration data into a shared memory and disabled garbage collection in Python.

abizermamnoon commented 7 months ago

Scaling Instagram Infrastructure

https://www.youtube.com/watch?v=hnpzNAPiC0E

Initially, when comments are made, they are inserted into the Postgres server and updated to a local memcache.
If users accessing the same media are served from different data centers, there is a problem as the memcache in one data center isn't updated with new comments from the other center. This results in users seeing stale comments.
To address this, instead of updating memcache from Django, they use Postgres class replication to update memcache on each Postgres replica, ensuring consistency across regions. This approach forces users to read directly from Postgres, avoiding stale cache issues.

The Evolution of Reddit.com's Architecture https://www.youtube.com/watch?v=nUcO7n4hek4

Python application at the heart of Reddit stores accounts, links, subreddits, and comments in postgres with memcache
Listings are ordered a set of links. Instead of directly querying the database, listings are put in memcache with asynchronous job queues handling updates. Queue partitioning was implemented to address slowdowns caused by lock contentions
Reddit has faced challenges from transitioning away from postgres because of replication failures and incorrect data in cached listings

nati-azmera commented 7 months ago

Scaling Instaram Infrastructure

Scaling involves “scaling out” which is adding more servers, “scaling up” which is making sure that these servers are used efficiently and “scaling dev team” which involves expanding the engineering team to be more productive.
Services could be of 2 types: storage or computing. Storage servers store global data and need to be consistent across data centers, while computing servers process requests on a user-demand basis.
There is a need for Instagram to move their data centers closer to where users are as that would help them reduce latency.

The Evolution of Reddit.com’s Architecture

Adding timers helped Reddit identify why the processing of requests was taking a very long in 2012 and ultimately helped them figure out that the locks were the sources of these issues.
Looking at the p99s also helps us identify where the problem cases might be, so having a strategy to get info out of them is very important.
Specialized solutions, such as creating a separate queue (Fastlane), are necessary to manage high-traffic areas effectively and prevent slowdowns in the overall system.

Why Google Stores Billions of Lines of Code in a Single Repository

In 2015, Google’s giant monolithic repo contained over 1 billion files, over 35 million total commits and sees about 45 thousand commits on an average workday, with two-thirds of these daily commits coming from automated systems!
This strategy allows for easy code reuse by engineers, collaboration across teams, and allows engineers to make backwards changes easily.
However, there are also some very important costs associated with this model including by introducing codebase complexity, unnecessary dependencies, and inefficiencies in the form of maintaining abandoned projects that are in the repo.

Lessons Learned Scaling our SaaS on Postgres to 8+ Billion Events

At the very early stages of a startup, scaling shouldn’t be the priority and startup founders should focus on identifying the right customers and markets- “Premature optimization is the root of all evil.”
The speaker stresses the need for startup founders to expect their stack and vendors to change as their company grows and gets more users.
However, he believes that Postgres is a great database for startups to initially start with and scale.

GusAlbach commented 7 months ago

Citus: Postgres at any Scale

While not directly programming related, I wasn't aware of the google cloud credits available to startups and those pursuing education.
I learned that postgres was a single-node database. This likely was already covered somewhere in the reading, but this seems to suggest that single-node databases have only one VM.
I learned that the small integer column has an upper bound of 2.1 billion values.

Data modeling, the secret sauce of building & managing a large scale data warehouse

A measure is time series data bout a specific use case executed on a windows 10 or windows 11 device.
Windows uses the partial covering index due to the reason of incredibly expensive storage on the cloud.
I may have missed it in the first vidoe, but from this video I learned that Citus works as an addition to postgres, which allows for multi-node usage rather than the typicla single-node.

tylerting commented 7 months ago

1. Scaling Instagram Infrastructure

Instagram employs three scaling strategies: scaling out by adding more servers for server efficiency and productivity
Memcache enhances Instagram's infrastructure by swiftly delivering frequently accessed data from memory, reducing database load and latency.
To reduce latency, Instagram aims to locate data centers closer to its user base.

2. The Evolution of Reddit.com's Architecture

Reddit's architecture initially revolved around a monolithic application called 'r2', which served as the backbone of the platform. This monolithic structure deployed the same code across all servers, although different servers handle different parts of code.
Reddit intially faced challenges with processing requests in 2012 and eventually found out the locks as the source of slowdowns.
Analyzing percentile metrics helps Reddit pinpoint problem areas in their system, enabling effective troubleshooting

3. Postgres at Pandora

Pandora migrated from Oracle to PostgreSQL primarily for cost reasons and to take advantage of the open-source nature of PostgreSQL.
They developed an in-house distributed database called Clustr, which sacrifices ACID compliance for better scalability and performance.
Pandora faced scaling issues such as WAL (Write-Ahead Logging) backlog and vacuum inefficiencies, which required careful management and optimization.

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

PostgreSQL is utilized at a massive scale, handling over 100k requests per second and managing more than 400 TB of data for analytics purposes, with a significant data velocity.
Challenges in data modeling arise due to the large volume of data, frequent influx of new data, and the prevalence of btree structures and this necessitates careful optimization for efficient performance.
AutoVacuum requires fine-tuning to address issues such as dead rows in tables.

5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

Implementing high availability strategies such as streaming replication and logical replication ensures data consistency and system reliability in large-scale PostgreSQL deployments.
Effective backup strategies using tools like Barman are crucial for data security and recovery in large database environments.
Utilizing cloud services complemented with on-premises servers ensures flexibility and scalability in managing large databases and multiple servers.

6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

Proper configuration of PostgreSQL settings is essential for scaling from small to large datasets, requiring adjustments in memory allocation, caching, and parallelism. Tuning PostgreSQL for high-traffic environments involves optimizing settings related to connection pooling and query execution.
Regular monitoring and performance testing help identify bottlenecks and inefficiencies.

7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

Citus extends PostgreSQL into a distributed database, allowing for handling datasets from 100GB to larger sizes.
Citus supports distributed transactions to ensure data integrity across all involved servers, maintaining consistency in a distributed environment.
Citus enables efficient querying of vast datasets, enhancing analytical capabilities without compromising performance.

8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

He emphasized prioritizing product development over scaling in a startup's early stages and recommended choosing well-supported technologies like Postgres.
ConvertFlow initially faced performance issues and high costs, leading them to temporarily use Firebase before migrating to Citus for scalable database management.
The speaker highlighted the importance of adapting technology stacks as businesses grow, showcasing their migration from Firebase to Citus for scalable database management, which enabled them to efficiently handle billions of events.

echen4628 commented 7 months ago

Scaling Instagram Infrastructure

You can use postgres replication to sync up data between servers deployed on different data centers. Each postgres server can also be used to invalidate memcache, but this design will increase load on the posgres servers.
Memcache lease is a way to reduce load on the actual postgres servers. Clients uses lease-get to request data from the memcache, and they can either get permission to go to the database to get the data itself or wait or use stale data.
There are functions in c and python that help you profile specific parts of the code (for example: perf_event_open in c and cProfile.Profile() in python

Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

Cubing is a type of pre-computation. It can be used to compute the inner parts of a two-level aggregation. If one of the things you are grouping by has high cardinality (any values it can take up), this pre-computation can become much more expensive (the curse of dimensionality). I thought this was pretty interesting, but I’m not sure if i got the details right.
hyperloglog is a data structure that can be used to estimate the number of unique items. It is useful if you just want an estimate and care about computational efficiency
H store column allows you to add json data into postgres where the value for each json item could be different. This allows you to make larger “schema”-like changes to the tables without actually changing the schema. One of the downside of using H store is that the column is full text so for indexing you need to use something like GIN and also it’s harder to query for specific properties on the columns (got this info from the video and then using ChatGPT).

nliyanage commented 7 months ago

The evolution of reddit.com’s architecture: 1 – R2’s oldest data model stores data in postgres
2 – referring to items in postgres that didn’t exist caused a lot of errors in ~2011 – the cause of that was R2 trying to remove dead databases 3 – autoscaler changes the number of servers used based on demand – it does this because each post is tracked by a daemon

Postgresql at 20tb and beyond 1 – mapreduce is used to make things faster 2 – restarting postgres doesn’t disrupt the backend servers because of how the servers are run 3 – the mapreduce procedure used by materializer aggregates data and transfers between servers

Data modeling, the secret sauce of building & managing a large scale data warehouse 1 – the data that is being discussed is collected from an audience of 1.5 billion devices and it goes through a processing pipeline for it to be more usable for the team 2 – there is an inner and an outer query when the team tries to identify device id – this is a query the presenter described as “bread and butter” for their team – they pre-compute the inner query 3 – they use json for staging tables and then hstore for dynamic columns – the hstore type is smaller, so there is less io for fetching the data

Lessons learned scaling our SaaS on Postgres to 8+ billion events 1 – at the beginning of their company, they ran on postgres due to that being the database that rails and Heroku used. Scaling up within the early stage was easy because it meant creating indexes 2 – the speaker advised developing a marketable product that you know customers would buy prior to thinking about hypothetical issues of scale 3 – postgres is cheap and easy to deploy and also scale – it is easy for early stage companies and has been proven to be usable even for very large websites

Scaling Instagram infrastructure 1 – there is a difference between storage and computing servers. Storage servers are stored using postgres 2 – Cassandra is used to store user activities and is important for scalability 3 – database replication occurs across regions but computing resources are contained within one region

Mastering chaos – a Netflix guide to microservices 1 – a problem that they faced early on is that there was a monolithic codebase and also a monolithic database – this caused crashes every time something went down 2 – things being deeply interconnected means that there is a lot of difficulty to make any changes – this is why this is a bad way to build services today 3 – a microservice is a style of building an application that is a collection of small services – each service has its own task. This is a response to monolithic applications

Citus: postgresql at any scale 1 – citus is a service that creates distributed database abilities within postgresql. Citus distributes tables across clusters 2 – unique constraints on tables must include the distribution column (within distributed tables) – this is a limitation od distributed tables 3 – there are some missing features in this model, such as replicating triggers, and table inheritance

Breaking postgresql at scale 1 – how you use postgres changes operationally as a database grows, but postgres can generally handle any size that you need 2 – for small databases (like 10 gigabytes), postgresql will run quickly, even if you are always doing sequential scans 3 – don’t just randomly create indexes – they take up space/time. It is better to create indexes in response to specific problems, don’t just add them randomly because they will do things like slow down insert time

agemeda commented 7 months ago

The Evolution of Reddit.com's Architecture

Reddit uses SQL queries and primary keys just like we are learning now to index their subreddits and listings.
Using print statements helped reddit determine what was causing voting issues. It was domain listings that had to be split up and the queries would be processed separately to allow better voting results.
Comment trees can be optimized and organized using offline job processing, and put them into ordered batches.

Citus: Postgres at any Scale

Postgres is able to plug queries to another extension if it has certain features, in this case create_distributed_table would go through the Citus planner instead.
Complex queries can be fully sent to worker nodes if they filter by distribution column and the distributed tables are co-located.
Citus is beneficial as it does not have limits on CPU, memory, storage, I/O capacity and has the full range of PostgreSQL capabilities and extensions at hand.

baron-zeng commented 7 months ago

2/2 points

Scaling Instagram Infrastructure

1 - A sizable portion of people working at Facebook/Instagram have not been there for a long time: 30% of the engineers joined in the last 6 months, and there are interns/bootcampers etc.

2 - When launching new features, Instagram uses gates to control access and gradually releases in the following order: engineers, dogfood, employees, some demographics, world.

3 - Instagram uses no branches in order to enjoy the benefits of continuous integration, collaborate easily, and they can still easily bisect problems and revert when needed.

The Evolution of Reddit.com's Architecture

1 - Reddit had issues with vote queues filling up too quickly and slowing processes down, so they partition votes into different queues so that there would be less fighting/waiting at the same lock. Ultimately, they split up the queues altogether so that the votes would not be aiming for domain listing

2 - For comment trees, there are threads that experience a lot of comments, so Reddit created a fastlane for these special cases. However, the fastlane filled up quickly and used too much memory and also caused skipping of child comments past parent comments that were not in the fastlane. Reddit used queue quotas to solve this problem

3 - Reddit faced difficulties in migrating to new servers because their autoscaler terminated transition to new servers after they were restarted, so they realized that there need to be more sanity checks when doing commands that make major changes.

gibsonfriedman commented 6 months ago

Scaling Instagram Infrastructure:

I thought it was interesting that the entirety of Instagram is able to run on only 7 data centers worldwide, I would’ve guessed that they used many more before watching this presentation.
There’s a difference between storage and computing services in data centers, storage is used for global data across many data centers whereas computing is temporary and can be reconstructed from global data.
Memcache is a key-value store that can provide millions of read/writes per second but isn’t able to provide consistency across different networks.

The Evolution of Reddit.com’s Architecture:

Reddit uses asynchronous queues to handle expensive actions on their site such as commenting on a post to allow them to process in the background without slowing their main processes.
Reddit at its core is just a list of links and its original purpose can be boiled down to a simple SQL query but as this got too expensive they started using a cache and mutating it.
Reddit is able to manually mark larger threads to be designated to their own queues called the “fastlane queue” so that larger threads don’t take up too much of their processing power at once.

PostgreSQL at Pandora

Pandora decided to use PostgreSQL because it was free and they wanted to be open source.
They use a nexus environment that contains subjective data combined with objective data on the music along with user data to create classes within their Postgres service.
The utilize replication in their production database in both a local and remote storage to protect from outages.

PostgreSQL at 20TB and Beyond

They utilize two map reduces to aggregate data from other servers to transfer to their server in about 5 minutes usually which can be 100-200k actions a minute.
At a large scale it's hard to change the data types because of issues with rewriting tables and minimizing downtime emphasizing the importance of getting the data types right initially.
Incremental map reduces can help to reduce further and aggregate more data on demand.

Large Databases, Lots of Servers, on Premises, in the Cloud

RAID 10 servers should be used always though they are more expensive since they provide better storage services than RAID 5 or 8 servers.
Your backup data centers should be located further away from your main data center to ensure that your data will be safe regardless of what happens to the primary data center.
Pg_dump backups are important to use because they have many testing options to make sure that your restore is safe and able to be accesses.

Breaking PostgreSQL at Scale

Databases under 10gb can be queried any way even if its a sequential scan and they will still run fine.
To find the memory a PostgreSQL database needs you can check if it fits into the memory, if not you can fit the top 1-3 largest indexes and more memory doesn't always help performance.
If there's too much data for pg_dump to be used, PITR can be used for backups which can restore to a point in time.

Citus PostgreSQL at any Scale

Citus is best used at 100gb and beyond and for either multi-tenant applications such as SAAS apps or real-time analytics dashboards such as IoT.
Citus has clusters that consist of multiple PostgreSQL servers that appear as only one server with each server being able to handle 100-200 transactions a second.
Citus adds a feature to Postgres called create_distributed_table which distributes a table across a cluster of databases.

Data Modeling, the Secret Sauce of Building & Managing a Large Scale Data Warehouse

Windows collects data from over 1.5 billion devices in their databases.
Windows has about 10 million queries and 2 million distinct queries per day causing a technical challenge in their databases.
A windows measure could have data points from 200 million devices a day and only fails occasionally.

justinchiao commented 6 months ago

The Evolution of Reddit.com's Architecture

More hardware does not equal more performance. The code also needs to written to take advantage of the increased hardware.
Permissions should be used appropriately to prevent accidental operations.
Databases shouldn't be expected to have a general solution for every case. Exceptions such as fast lane demonstrate how special cases should be handled when necessary.

Why Google Stores Billions of Lines of Code in a Single Repository

Monolithic repos allow Google developers to work off of the same version of everything.
Monolithic model demands and allows for maintenance of code health. Tools were made to find dead code and unused dependencies can be removed.
Custom repo management tools are required for Google's scale. These tools are the biggest con of monolithic model. A huge investment is required for this model to work.

KentaWood commented 6 months ago

"Mastering Chaos - A Netflix Guide to Microservices" talk

Designing for Resilience: Emphasizing the importance of designing microservices for resilience to prevent cascading failures, ensuring that issues in one service don’t compromise the entire system.
Decentralized Data Management: Highlighting the need for decentralized data management to enhance scalability and fault tolerance across various microservices.
Continuous Delivery and Automation: Leveraging continuous delivery and automation to efficiently manage, deploy, and scale microservices, allowing for frequent updates with minimal downtime.

Scaling Instagram's Infrastructure

Infrastructure Transition and Scaling: Initially hosted on Amazon Web Services (AWS), Instagram later moved to Facebook data centers, expanding its infrastructure from one to multiple data centers to enhance scalability and fault tolerance.
Database and Data Center Strategy:
- Utilizes Django for backend interactions with a Postgres database.
- Data writes are directed to a primary data center, while two additional data centers serve as replicas handling read operations to distribute load and increase read efficiency.
Performance Optimization Tools:
- Deployed tools like COLLECT for experimentation to assess CPU usage impacts of new features and ANALYZE with Python Cprofile to optimize high-impact functions, aiming to reduce CPU consumption and server requirements.

Overview of Reddit's Backend Service

Software Architecture and Technologies:
- Reddit’s backend is primarily developed in Python, with a focus on splitting from its legacy codebase. It uses Thrift or HTTP for client interactions and integrates Content Delivery Networks (CDNs) to efficiently route requests.
Load Balancing and Management:
- Implements Elastic Load Balancers (ELB) to manage and distribute user requests across its infrastructure, isolating application areas to prevent slowdowns in one section from affecting the performance of others.
Database Management and Scalability:
- Manages data using SQL and memcache for caching mechanisms. Encounters scalability challenges, particularly with database locks in high-traffic scenarios, which can delay vote processing and content updates on popular posts.

Why Google Stores Billions of Lines of Code in a Single Repository

The rate of commits being done to the repo is growing exponentially, and a majority of the commits are done automatically.
Google's workflow is to connect users to the workspace, write code, and review both done through people and automation, then committed. Each commit has an owner, and if you're trying to commit, you need permission.
A major positive about this is that they don't have to do tedious and time-consuming merging as all the code base is unified and doesn't have much confusion about the version of a file.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of PostgreSQL)

20 TB is the maximum amount of DB storage for PostgreSQL DB, and to expand further, we use multiple servers that can track and make the DB size of an entire system bigger.
They aggregate their data and then shuffle that data and then map reduce the data, which then outputs.
At some point, they were running out of 32-bit integers to track the advertisement data trackers; it took them two months; they solved this by changing the data type of the trackers.

Breaking PostgreSQL at Scale

The largest community in PostgreSQL is petabytes in size big, and PostgreSQL can handle it; however, what works for a 10 GB DB is different from a petabyte DB.
Basic backups for a 10GB database are pg_dumps and cron job it every 6 hours or S3 dumps.
Monitoring by process logs through pg_badger or pg_statement provides real times of query performances or specific new relic, data dogs, and other web app monitoring applications.

Lessons learned scaling our SaaS on PostgreSQL to 8+ billion events (ConvertFlow, another adtech company)

Talks about how scaling in a startup looks like, i.e., short-term and long-term goals and how you can scale accordingly.
Used Citus to break up the database into multiple nodes and results in sharper and faster queries.
Reindexing the tables made the storage of the tables decrease as over the years PostgreSQL has given many updates, and also update the tables concurrently, or else halts in the DB queries happen.

Postgres at Pandora

Changed from Oracle as they wanted to use open-source and a much more affordable option.
Implementing monitoring current activities, and looking at the current logs of PostgreSQL and looking through the error logs and long query logs.
Created an in-house DB called cluster and it was not ACID intended for sharded database and each group has two or more DB's and tied together by the cluster application.
Transaction ID wrap around took around two weeks, takes way too long

henrylong612 commented 6 months ago

Scaling Instagram Infrastructure

Facebook conducts drills called "storms" to ensure that Instagram's services can withstand regional failures of data centers.
A "rm -rf" command nearly wiped out some crucial parts of Instagram's data services.
By adding the number of likes to a table, Instagram was able to reduce the query time down from 100 milliseconds to 10 microseconds.
Instagram does not uses branches when doing continuous integration.
Instagram does load testing in a gated production environment.

Why Google Stores Billions of Lines of Code in a Single Repository

Google's single repository, which contains all of its code, has about 1 billion files, 2 billion lines of code, and has 35 million historic commits.
The number of changes committed to the repository has exhibited exponential growth since 2004, when Gmail was launched. Most of the growth of total commits per week has been driven by automated commits and changes via programs like Rosie.
Just like Instagram, Google does not make significant use of branching for development.

Kubernetes: The Documentary

For a long time, Amazon was dominant in the cloud services department with their service AWS.
Docker used containers to allow scalability to be more accessible to hobbyist computer scientists, including coders with both development and operations background.
Google made the decision to make Kubernetes, which can be likened to a post-office, open source.
The release of Kubernetes was announced at Dockercon on June 10, 2014. Many other container orchestration projects were also released on the same day.
Initially, other companies like Facebook and Netflix chose Docker Swarm and Mesos over Kubernetes. As time went on, larger products like PokemonGo used Kubernetes.

The Evolution of Reddit.com's Architecture

Reddit is the 4th largest website in the U.S. serving 320 million users every month.
Like Instagram uses caches to bypass the need for expensive queries to count likes, Reddit uses caches to bypass the need for expensive queries to count votes.
Like we do, Reddit used a bunch of timers and print statements to understand what was causing large delays in vote processing.
Throughout its existence, Reddit has dealt with a lot of bugs and setbacks in regards to its data maintenance.

adamzterenyi commented 6 months ago

Scaling Instagram Infrstructure

Instagram had 130 million new users sign up in the first half of 2017.
The company created a denormalized table for its like, which sped up index lookups for like counts by 10x.
Instagram ships code 40-60 times a day!

The Evolution of Reddit.com's Architecture

In 2018, Reddit had 70 million searches each day.
In mid 2012, vote queues stacked for hours and hardware upgrades and scaling did not fix the issue. Instead, it was lock competition/contention that caused this; to fix it, they fully split processing after looking at 99th percentile outliers.
After "fastlaning" a thread in 2016, the queue filled up with messages and filled the message broker's memory. This issue was caused by a missing parent. Now they use queue quotas to prevent resource hogging.

Postgres at Pandora

Pandora switched to Postgres due to significant advantages open source coding provided and the heavy-handedness of Oracle.
The company employed people who would analyze a track and generate meta data for it. Crazy how far we've come.
For Pandora, disadvantage of using clustering meant that joins, subselects, and more wouldn't be possible, such that they had a "flat-file database".

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

Adjust tracked over 2 billion data points in 2017
They do this with only standard Postgres and some extensions.
With data on this scale, it is important to hand-hold new rollouts and features. Small errors and gains will be much larger.
The company is confident in scaling Postgres indefinitely.

Breaking Postgres at Scale (PostgreSQL Experts, Inc.)

It's hard to do much wrong with a ten gigabyte database, even with sequential scans.
When you're above 1TB, many more discontinuities happen and memory must be used sparingly depending on how much you can afford hardware-wise.
Additionally, consider "moving to fast local storage from slower SAN-based" options

Mastering Chaos - a guide to microservices at netflix

A stateless service is different from a cache or database and allows for frequent access of metadata and allows for compute efficiency (thanks to usage of on demand capacity). When there is a ddos attack, you can absorb the attack while figuring out defenses.
ChaosMonkey has really helped Netflix to absorb node losses, especially with the use of stateless services. "Redundancy is fundamental"
Avoid storing your business logic and state in one application. It is important to split up your data, especially since losing nodes is a much larger, more consequential event.

Why Google Stores Billions of Lines of Code in a Single Repository

Storing Google's code in a single repository removes the difficulty of implementing changes in one codebase to another, but increases the difficulty of merging code.
Tens of thousands of commits occur each day, and the rate of change for the number of commits is growing to today. Twice the number of commits are submitted by automated systems than humans.
Google uses trunk-based development, the second pillar of a monolithic development model. This means developers are using the most recent code and avoid using development branches. This is so cool to me, as it means everyone is operating off the same codebase.

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow)

"Premature optimization is the root of all evil". It is important to solve the right problem when resources are so thinly spread as a startup.
Initially used Ruby on Rails, Heroku, Postgres to set up to optimize, but didn't really need to devote time to this, especially since Postgres can scale easily.
As you scale as a company, be ready to change your stack and migrate to new vendors.

tylerheadley commented 6 months ago

Scaling Instagram Infrastructure

Instagram works to improve in 3 dimensions of scalability:
- Scaling out: ability to add more hardware
- Scaling up: more efficient code
- Scaling the dev team: ability to grow the team without breaking things
Python code can be sped up by "Cythonizing" it (C in python) for functionality that is stable and used extensively.
Instagram uses certain metrics in production to monitor how a new feature performs under load, such as the ratio of 500s HTTP status codes to 200s codes.

The Evolution of Reddit.com's Architecture

Software engineers at large companies such as Reddit still use basic print statements for debugging code.
Reddit scales its usage of AWS services in automatically in accordance with fluctuation in site traffic.
You should always be very cautious when terminating servers or migrating services.

sjanefullerton commented 6 months ago

(1)Data modeling, the secret sauce of building & managing a large scale data warehouse

Decision dashboards are widely used within engineering teams within Windows. There are 100s of dashboards that are refreshed every few hours.
The Windows database has a set of internal jobs that are building table for consumption. This is for efficient query support.
Device-centric aggregation, also called one device / one vote, is how Window's engineering team sees the noisy devices (i.e. a faulty device with too much noise which may dampen the signal about the actual set of quality issues).

(2)Lessons learned scaling our SaaS on Postgres to 8+ billion events

ConvertFlow is a no-code platform for e-commerce marketing teams to create personalized and launch visitor conversion campaigns on their websites without waiting on developers.
To optimize queries, in the case of this video, includes creating the primary indexes that are needed to quickly query reports for specific customers and their individual ConvertFlow campaigns.
To migrate a database off of Citus Cloud to Azure, one can stream cahnges to a replica database on Azure and then switching over to the Azure connection stream in a scheduled maintenance window. This included scaling with Citus on Azure.

(3)PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

Fossasia Analytics Shards: at the time of the video they had 16 shards (which is approaching 60/64 terabytes).
There are 3 major infrastructure pieces this team was comfortable with and could scale: Postgres, Kafka, H a proxy.
Under a heavy load of a data model, it is very painful to change. This can possibly be helped using a few seconds of table walks in the whole process (such has Fossasia discussed in the video).

(4)Mastering Chaos - a guide to microservices at Netflix

A disaster scenario, when you have one service that fails -- with improper defenses against that one service failing, it can cascade and take town your entire service for your members (known as cascading failure)
An anti-pattern example can be where a subscriber service leans on EVC too much. That is, online & offline calls going to the same cluster / the same ev cache cluster so batch processes doing recommendations, looking up subscriber info, and the real-time call path. The anti pattern here is the fallback when the entire EVC layer goes down, was still a fallback to the service and the database.
The autonomic nervous system is functions our body just takes care of automatically; you can set up a similar environment to make best practices subconscious using a cycle of continuous learning and automation (as Netflix did).

(5)Why Google Stores Billions of Lines of Code in a Single Repository

The google repository is laid out in a tree structure. Every directory in the tree has a set of owners who can decide whether modifications to the files in their directory are allowed to be committed or not.
Google (in 2015) had the following source systems: Piper and CitC. Piper is custom system that hosts the monolithic repository. CitC is a file system that supports normal code browsing, UNIX tools, etc. without needing to explicitly clone or sync any state locally.
Google uses trunk-based development. A combination of trunk-based development with a centralized repository defines the monolithic model of source code management. So for Google, this type of development helps avoid the merges that often happen with long live branches.

(6)Postgres at Pandora

Pandora uses replication for a high availability solution. This means Pandora employs a method where data is duplicated across multiple servers or databases.
Clustr architecture (a non acid database) is a high availability system. This means it compromises consistency for availability.
PostgreSQL provides utilities like PG_dump and PG_restore for backing up and restoring databases. However, PG_dump and PG_restore are impractical to use, in most cases. That is, if you have any sort of large size issues.

(7)The Evolution of Reddit.com's Architecture

Listings is the foundation of Reddit. It's an ordered list of links.
Locks are bad for throughput and if you have to use them, you should partition on the right thing
Tree structure is sensitive to ordering. There was a statement on how inconsistencies trigger automatic recompute.

(8)Scaling Instagram Infrastructure

Scale up, in this video's context, means to use as few CPU instructions as possible to achieve something (writing good code to reduce the CPU demand on the infrastructure) and use as few servers as possible.
The memory layout is ran with n parallel processes where m is greater than the number of CPU cores on the system.
InfoQ ships the code 40-60 rollouts per day (they continuously roll out their master repo whenever a diff is checked in).

amyyu116 commented 6 months ago

(1) Scaling Instagram Infrastructure

One thing I learned in this video is the general web structure of Instagram’s backend; there were some familiar terms like PostgreSQL and Django but there were some unfamiliar ones like Cassandra, Celery, and RabbitMQ
Denormalized data, even though it seemed inconvenient in practice in our assignments, is still useful in some scenarios for Instagram
Scaling up to the level of such a big platform has a lot of different challenges; a lot of optimizations need to happen to reduce CPU load

(2) Why Google Stores Billions of Lines of Code in a Single Repository

Due to the sheer size of the repository, there are owners of designated areas of the repository who must approve changes
Google practices “trunk-based development,” where there is little use of branches and changes are made to the repository in a single-file order
Advantages to this is that there is no confusion about which repository people should be working on; “single source of truth”, also helps avoid the “diamond dependency problem” where a library can be difficult to build due to changes in dependencies

(3) Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

Difficulties with scalability happened around 100 million events; Firebase was used to alleviate traffic with faster querying
Citus (Postgres extension) enabled significant scaling of Postgres database, allowed tables to be “sharded” to lower expenses and speed up queries; also divided server into a cluster of nodes
Data rollups can save a lot on querying operations, helping them scale up 10x to billions of events, also started clearing up very old irrelevant data to save up on costs

(4) Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

Windows works off “measures”, which is time series data about a specific use case executed on Windows devices
Some tables are high in dimension, up to 50+ indexes, they also collect data from up to 1.5 billion devices
Also uses Citus for similar reasons (scalability and works with Postgres well)

(5) Mastering Chaos - A Netflix Guide to Microservices

Uses Cassandra (again!) because it has a lot of flexibility and it worked for how Netflix wanted to handle network partitions and will provide eventual consistency
Scalability issues included excessive overload, saw up to 800k to a million requests per second, implemented workload partitioning and workload caching as a solution
Following a huge crash of their US-East AWS server, they implemented a split of their load/traffic into multiple regional servers

(6) The Evolution of Reddit.com's Architecture

Main block of Python code (monolith) is known as r2 and they also employ Cassandra
Voting system was handled with caches and queues; particularly popular posts get their own queue to manage site load/traffic
They work with “Thing” types (lol) that is represented by a pair of tables in psql

(7) PostgreSQL at Pandora

Pandora chose Postgres for its scalability and economic efficiency
Experienced scaling issues with WAL files and streaming replication
Cons: autovacuum ran for more than a week per table; pg_dump and pg_restore are impractical

(8) PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018

Mapreduces are useful to speed up processes
Autovacuum does not keep up well with tables that are largely scaled up and requires fine tuning/”hand holding”
PostgreSQL is able to support the 400TB environment they’ve built, although there were challenges in data maintenance and a lot of map reducing involved to optimize processes

westondcrewe commented 6 months ago

Postgres at Pandora

Pandora monitors the frequency of different activities that are damaging to the company’s service, like long duration transactions, PostgreSQL and syslog errors, and blocks The company also captures historical data, both short and long term, to know what metrics are normal for every postgres instance Pandora made their own internal application, Clustr, to maintain an architecture of data updates that affect several databases (interestingly, the Clustr system is not ACID compliant)

Mastering Chaos - a guide to microservices at netflix

CAP Theorem says that a service with different databases contained in different networks must either place preference on either consistency or availability. If you cannot perform an update on one database, either stop the update process and return an error, or update the ones that are available. Netflix chose availability and relied on “eventual consistency” Netflix purposefully induces “chaos” into their stateless services, in which nodes are deleted and the system is checked to still be running as normal. Netflix uses a global cloud management/delivery service called Spinnaker, which allows them to integrate automated components into their deployment path easily. As the company learns more about how to improve their own service, it can add new automated systems to their deployment.

pangsark commented 6 months ago

Mastering Chaos - a guide to microservices at netflix

Cap Theorem is choosing between consistency or availability in a network partition which helps system designers understand the trade-offs involved in designing a robust system.
Cassandra is a great service to embrace eventual consistency where they don't expect every single write to be read back immediately from any one of the sources the data was written to. Thus, a client might write to one node which writes to other nodes.
Conway's Law is where any piece of software reflects the organizational structure that produced it

Why Google Stores Billions of Lines of Code in a Single Repository

Although a single repository for this size of code base (1 billion files, 2 billion lines of code) might be expensive and hard to keep the code base healthy, it is worth the benefits of not being confused about the version of things and not having outdated dependencies within the files
Although there were 15K commits by humans, there were more commits by automated systems where now only a third of the commits are made by humans
Branching is rare within version control in Google which is similar to Instagram and their development.

Scaling Instagram Infrastructure

There are three always to scaling within Instagram: scaling out which grows the server or hardware for user growth, scaling up which makes the servers more efficient, and scaling the development team to make sure there are new features while maintaining the stability of Instagram.
Instagram uses PostgreSQL for its primary data storage, with data replicated across "read" replicas to enhance availability. To manage memcache across regions and ensure data consistency, Instagram employs cache invalidation techniques following PostgreSQL replication.
Memcache plays a critical role in scaling by acting as a high-performance, in-memory key-value store, handling millions of reads and writes per second. This capability is crucial to prevent database failures due to overload.

The Evolution of Reddit.com's Architecture

Reddit began with a monolithic architecture using a large Python application known as 'r2'. To improve scalability and manageability, Reddit has been transitioning towards a microservices architecture.
Reddit incorporates timers in their code to diagnose delays in request processing. They specifically look at p99 metrics (99th percentile), which help them identify and trace the root causes of unusual delays or issues within their system.
To efficiently manage high volumes of data interactions, Reddit uses complex caching strategies and asynchronous job queues.

Citus: Postgres at any Scale

Citus is an open-source extension for PostgreSQL that transforms it into a distributed database system. This allows PostgreSQL to scale horizontally by distributing data and queries across multiple servers, thus enhancing performance and handling larger datasets efficiently.
Citus enhances the COPY command performance by executing these commands in parallel across worker nodes. This parallelism significantly speeds up data movement operations, making it highly efficient for large-scale data processes.
Citus supports distributed transactions, which ensures that all database transactions either completely succeed or fail across all involved nodes.

Lesson learned scaling our SaaS on Postgres to 8+ billion events

The transition of their PostgreSQL database from a single node to a clustered node setup, which includes a coordinator node and multiple worker nodes, was instrumental in scaling the project to handle 1 billion events.
The initial choice of PostgreSQL, paired with Ruby on Rails and hosted on Heroku, supported the platform's early growth stages without major issues up to a million events.
To further improve scalability and manage larger datasets, the team adopted Citus, a PostgreSQL extension that allows for efficient database sharding.

Postgres at Pandora

Pandora originally utilized Oracle but switched to PostgreSQL due to its open-source nature and cost-effectiveness.
Pandora has significantly ramped up its monitoring of database activities. This includes tracking errors, long duration queries and transactions, and blocked processes.
To address specific needs within their infrastructure, Pandora developed an in-house distributed database called Clustr. This system is not ACID compliant, prioritizing scalability and performance over strict consistency, tailored for handling shared databases effectively.

Data modeling, the secret sauce of building & managing a large scale data warehouse

Citus, as an extension of PostgreSQL, is crucial in managing large scale data such as Windows who gets inputs from around 1.5 billion devices.
Citus enables Windows to provide data-rich decision dashboards that refresh every few hours
Windows employs PostgreSQL's rich data types, such as JSON for staging tables and hstore for dynamic columns in reporting tables. These data types facilitate flexible data structures and efficient querying, which are essential for managing the diverse data collected from billions of devices.

JTan242 commented 6 months ago

Data modeling, the secret sauce of building & managing a large scale data warehouse

Citus is a SQL execution engine that works well with real-time queries over large datasets
Citus has great scalability as it scales PostgreSQL across multiple machines, (1920 cores, 15TB DRAM, 960TB Premium SSD)
Citus is also well design to handle tables high in dimensions, with up to 50 or more indexes and also collects data from up to 1.5 billion devices

Mastering Chaos - a guide to microservices at Netflix

A microservice architecture is a modern approach to software development, where a single application is built as a collection of small services
Companies like Netflix employ fault injection testing to introduce faults to their system, in order to ensure resilience and that their fallback mechanisms function correctly
The company emphasizes the importance of identifying and defining critical microservices for basic user functionality, focusing testing and reliability strategies on these services to maintain operational core functions even during service failures.
Why Google Stores Billions of Lines of Code in a Single Repository
Google's monolithic repository contains over a billion files, with over 2 billion lines of code, with a data size of roughly 86 terabytes
There are about 45 thousand commits per workday, with roughly 2/3 of them being automated and the other 1/3 is by human developers
The rate of change to the repository has been increasing exponentially, which makes the graph of the rate of change barely shows data from 2004 on the graph

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

Uses RAID 10 for disks over RAID 5,6 as they provide the best storage, explains how data is expensive
Combining cloud services with on-premise servers gives better flexibility and scalability when managing large databases and servers
Backup strategies utilize tools like Barman for physical backups and PG Dump for logical backups

AvidThinkerArsum commented 6 months ago

Why Google Stores Billions of Lines of Code in a Single Repository

a. The monolithic repository structure allows for easier management of dependencies, reducing technical debt by ensuring library updates are quickly and uniformly applied across the entire codebase.

b. Google handles thousands of commits per week, heavily leveraging automated systems to manage the volume and ensure consistent code quality.

c. The use of a single repository supports a collaborative work environment by simplifying dependency issues and reducing conflicts, which facilitates smoother development cycles.

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

a. Focus on developing the right product for the target market first, before addressing scalability is more important, to ensure that the product adequately meets market demands.

b. Transition from a single-node PostgreSQL setup to a more scalable clustered architecture with multiple worker nodes and coordinator nodes, and incorporate database sharding with Citus as event volumes grow, to efficiently handle larger data volumes and improve system performance.

c. Implement data management strategies like clearing old data and rolling up frequently accessed reports to reduce storage costs and enhance data retrieval performance as data volume increases.

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

a. Utilizing standby databases enhances scalability by improving data availability and enabling load balancing across servers.

b. Employing pg_dump and tools like Barman ensures data integrity and optimizes storage through alternating backup patterns.

c. Combining on-premises servers with cloud services increases flexibility and ensures data consistency with high availability strategies like streaming and logical replication.

Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

a. Citus, as an extension of PostgreSQL, plays a crucial role in enabling Windows to manage and analyze data from approximately 1.5 billion devices efficiently. It enhances PostgreSQL's capability to handle massive scale data by distributing the database workload across multiple nodes, which facilitates the creation of data-rich decision dashboards that refresh every few hours.

b. Windows leverages PostgreSQL’s advanced data types like JSON and hstore. JSON is used in staging tables for its flexibility in handling semi-structured data, and hstore is utilized for dynamic columns in reporting tables. These data types are pivotal for managing the diverse and voluminous data from billions of devices, allowing for efficient querying and data structure flexibility.

c. The system focuses on "measures," which are time series data specific to use cases executed on Windows devices. To handle the complexity and high volume of this data, Windows uses Citus to ensure scalability and performance.

Mastering Chaos - a guide to microservices at netflix

a. Netflix highlights the dangers of cascading failures within a microservices architecture, where the failure of a single service can potentially bring down the entire system. An example of this is the over-reliance on a single cache cluster for both online and offline processes, which poses significant risks if the cache layer fails. Implementing proper fallbacks and diversifying dependencies are critical to preventing such cascading issues.

b. Inspired by the concept of the autonomic nervous system, Netflix emphasizes the importance of creating an environment where best practices are automatic and intuitive. This is achieved through a cycle of continuous learning and automation, ensuring that the system can adapt and maintain operational efficiency without manual intervention in routine processes.

c. To handle scalability challenges, such as handling up to a million requests per second, Netflix employs strategies like workload partitioning and caching. Additionally, following a significant outage in their US-East AWS server, Netflix implemented a strategy to split their load across multiple regional servers to distribute traffic more evenly and enhance service reliability. This approach not only helps in managing high traffic volumes but also in improving the system's overall resilience to failures.

Postgres at Pandora

a. Pandora transitioned from SLONY replication, which only replicated tables and sequences, to more robust replication strategies to enhance data availability and consistency.

b. Pandora's detailed monitoring system tracks a variety of metrics, including error frequencies, queue durations, and blocked processes.

c. Initially using Oracle, Pandora switched to PostgreSQL for its cost-effectiveness, scalability, and open-source benefits. They have developed in-house solutions, such as a custom database management system named "Cluster".

Scaling Instagram Infrastructure

a. Instagram's scaling strategy encompasses three key dimensions—scaling out by adding more servers and data centers, scaling up by enhancing the efficiency of each server to handle more tasks, and scaling the development team to enable rapid deployment of features without compromising system stability.

b. The infrastructure utilizes Django for stateless computing with data stored in systems like Cassandra and Memcache, ensuring efficient data synchronization across datacenters.

c. Instagram continuously optimizes its software for performance, including memory optimization and CPU usage reduction through targeted code refinements and advanced monitoring tools.

The Evolution of Reddit.com's Architecture

a. Features like Fastlane queues to manage highly active threads and ensure they don't slow down the rest of the site. This helps in handling real-time data more efficiently by reducing delays caused by popular comments or threads.

b. To maintain stability and performance on a high-traffic platform like Reddit, implementing thorough sanity checks, observability, and multiple layers of safeguards is crucial. These practices are essential for identifying and mitigating issues before they affect server performance and user experience.

c. Reddit has encountered significant challenges with locks in their architecture, particularly impacting the processing of queues, such as the "vote" feature. This has led to efforts towards adopting lockless data models to improve efficiency and reduce bottlenecks.

ains-arch commented 6 months ago

Data modeling, the secret sauce of building & managing a large scale data warehouse

Microsoft stores time series data about Windows telemetry in a structure called “measure” for different specific use cases. This is technically difficult to optimize because of the high cardinality of the “device id” attribute and the requirement of two-level aggregation, and high concurrency.
One of the ways they handle this in “cubing” to pre-compute the inner query, but it’s difficult because of the dimensionality.
They have improved their approach for handling this by using Citus, which is a distributed SQL execution engine integrated with Postgres that rewrites queries, to do dynamic compute and device-centric queries.

Lessons learned scaling our SaaS on Postgres to 8+ billion events
- Convertflow handles billions of events in the form of views/clicks/shopping on over 20,000 websites. As they’ve grown to this size, they’ve bootstrapped their Postgres database to match demand.
- In the early days, their database was fast because there wasn’t much in it and they could solve other kinds of problems while waiting for Heroku to notify them they needed to buy more storage. To handle the growth from 1 million to 100 million events on a tight deadline and budget, they started writing events to both Postgres and Firebase.
- However, they blew through their Google Cloud credits in the growth to 1 billion, and needed to move to Azure and Citus. They also started clearing out old data for historical reports out of hot storage. They also said to reindex concurrently in order to avoid locking tables which is something I learned in the final project!

mmendiratta27 commented 6 months ago

The Evolution of Reddit.com's Architecture

I didn’t realize that architecture could change so fast (different presentations last year). I knew tech was fast-moving, but didn’t realize big companies update entire organizational structures that quickly.
Locks negatively affect throughput, which is a metric for process speed. Reddit is trying to eliminate all locks.
Sanity checks using timers or other tools are not trivial and are even used in companies as large as Reddit.

Data modeling, the secret sauce of building & managing a large scale data warehouse

One of the key challenges of using device centered aggregation is the rich (detailed/expensive) dimension and metric types they are using in their queries
A hyperloglog data structure can help efficiently calculate across 100M devices in aggregate
They use JSON files for staging tables and hstore, a postgres data type, as dynamic columns

luisgomez214 commented 6 months ago

Scaling Instagram Infrastructure:

I learned the difference between storage and computing. Storage needs to be consistent across all data centers and computing is driven by user traffic.
I learned what memcache is. It is a high-performance key value stored in memory and provides millions of reads and writes per second. It helps improve the Instagram interface.
I learned what scale-up is. Instagram uses as few CPU instructions and servers as possible, this helps more users be able to use the app.

Postgres at Pandora:

Pandora switched from Oracle to PostgreSQL to leverage open-source benefits, lower costs, and enhance scalability.
They faced complexities in managing data consistency, backup and restoration speed, as well as replication issues, which led to the development of an in-house non-ACID database called Clustr.
Pandora has placed significant emphasis on improving monitoring capabilities, such as tracking errors, long-duration queries, and blocked processes, to ensure their PostgreSQL environment remains efficient and reliable.

Evolution of Reddit's Infrastructure

The evolution of Reddit.com’s infrastructure underscored the critical importance of real-time data accessibility. In mid-2012, the site encountered significant slowdowns due to locks hindering swift queue processing.
Reddit's approach to managing high-traffic areas like popular comment threads evolved with the introduction of Fastlanes. These specialized queues were implemented to prevent site-wide slowdowns caused by exceptionally popular content, as observed in 2016 when comments queued outside the Fastlane caused issues.
Ensuring system stability and performance in a high-traffic environment like Reddit demands meticulous attention to detail. Peer-reviewed checklists and sanity checks serve as vital safeguards, preventing mishaps during server terminations and other critical operations.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

In big AdTech companies like Adjust, they use MapReduce to speed up data processing while staying up-to-date. They also make data smaller by organizing it better, like shortening country names.
PostgreSQL handles a ton of data in AdTech, with over 100k requests per second and 400 TB of data. But sometimes, tools like AutoVacuum need a little tweaking to keep things running smoothly.
In AdTech, PostgreSQL is a champ at handling huge amounts of data. They use tricks like MapReduce to quickly handle data transfers, and tools like istore to crunch numbers fast.

Why Google Stores Billions of Lines of Code in a Single Repository

Google's single code repository stores tons of code and sees thousands of updates each week, with most of them automated. They prefer a straightforward approach to development, avoiding complex branching. This helps keep their codebase unified and up-to-date across the board.
Google's big code repository is all about teamwork and keeping things simple. They automate most of their updates and don't rely much on branching. This way, everyone works on the same page, making it easier to manage and improve the code.
Google's massive code storage is about teamwork and efficiency. They automate a lot of their updates and keep things straightforward by avoiding complicated branching. This keeps their code organized and up-to-date without too much hassle.

Mastering Chaos

Netflix employs fault injection testing to ensure their services remain resilient. By deliberately introducing faults, they test if fallback mechanisms work properly in real-world scenarios.
Google's single repository strategy, storing over 1 billion files and 86 terabytes of data, emphasizes collaboration and simplicity in code management. They prioritize a unified codebase over complex branching strategies.
In managing microservices, Netflix identifies critical services like browsing and streaming, focusing their testing efforts on ensuring the reliability of these core functions even during service failures.

Large Databases Lots of Servers on Premises in the Cloud

By having two standby databases alongside a master database, organizations can bolster data availability and distribute workloads more effectively across servers. This setup not only improves system performance but also minimizes the risk of downtime by providing backup resources in case of server failures or maintenance activities. 2.Backup strategies are crucial for data security and recovery. Tools like Barman and pg_dump ensure data integrity and optimize storage through alternating backup patterns. 3.Combining on-premises servers with cloud services offers flexible and scalable infrastructure. Strategies like streaming replication ensure consistent data replication and high availability for quick failover.

Breaking Postgres at Scale

Starting with smaller datasets, pgdumps emerge as a top choice for PostgreSQL backups due to their simplicity, efficiency, and quick operation. Opting for pgdump saves time that would otherwise be spent devising complex backup strategies.
To optimize PostgreSQL performance, it's recommended to fine-tune the work_mem parameter based on the actual usage of temporary files logged. Setting it to two or three times the size of the largest file ensures efficient memory utilization and enhances system performance.
Maintaining shared buffers within the range of 16-33GB is advisable to strike a balance in memory usage and minimize checkpoint activity. Going beyond this range can lead to increased checkpointing without significant performance benefits, highlighting the importance of optimal buffer sizing.

abraryaser02 commented 6 months ago

Postgres at Pandora

Mostly good switch. Postgres is more scalable and cheaper than Oracle, as well as more flexible.
Issues with vacuuming and data consistency, but they ramped up monitoring by a lot.
Created Clustr as a way to combat lots of issues regarding joins, however not ACID compliant.

Citus: Postgres at any scale

Citus is an open-source extension for PostgreSQL, making it a distributed system that allows operations over clusters. This allows PostgreSQL to scale horizontally by distributing data and queries across multiple servers, enhancing performance and efficiently handling larger datasets as well as real-life data.
Improved COPY and other commands by executing them in parallel across worker nodes, something we saw in class.
Allows support for distributed transactions while maintaining data integrity.

elissayz commented 6 months ago

Postgres at Pandora

Stephen Rees implemented monitoring activities, seeing errors occurring and filtering them if not important to really work efficiently.
Scaling issues caused the processing of WAL files to take over 24 hours and had to do some massive cleanup and intervention. Saved by replication slots.
Autovacuum runs for more than a week per table, really inconvenient from data fragmentation.

Scaling Instagram Infrastructure

Instagram uses postgres servers to store user, media, and friendship data.
Instagram priorities as it came to scaling: scale out with more hardware, and more servers, scale up by taking full advantage of all those servers.
Instagram runs parallel processes in the memory, shared, and private memory. Since code is a big part of memory, reducing the amount of code they manage to get to an optimized mode.

Why Google Stores Billions of Lines of Code in a Single Repository

Google’s repository hosts an enormous amount of data, which has kept them focused on scalability.
Change of about 35 million commits between the creation of Google and 2015 (date of the video). On an average workday, 30 thousand commits by automated systems, plus 15 thousand by humans.
Some of the advantages of a monolithic repository include extensive code sharing and reuse, atomic changes, large scale refactoring, flexible team boundaries and code ownership.

Evolution of Reddit.com’s Architecture

Was another giant monolith, ran the same code on every one of the servers, may be running different parts but still are all running the same code.
By creating separate queues reddit can manage high-traffic areas effectively and prevent slowdowns across the whole system.
Incident with comment processing, when major news events happened a lot of people were talking about it, tried to fastlane it but this filled up. Site actions were all frozen, and they had to change the queueing method.

vitorvavolizza commented 6 months ago

scaling instagram infrastructure

instagram maintains redundant data centers for high availability and fault tolerance, minimizing downtime.
instagram uses django for logic and posgresql for data storage, ensuring scalable and efficient infrstructure.
instagram uses cprofile in python to optimize high-cost functions, reducing cpu usage and improving performance.

the evolution of reddit.com's architecture

reddit uses load balancing to manage traffic spikes, distributing requests across multiple servers for consistent perfomance.
reddit uses memcached to store frequently accessed data, reducing database load and speding up data retrieval.
reddit partitions queues and uses asynchronous processing to address lock contention, ensuring smooth operation under high-load.

postgres at pandora

pandora uses data sharding to distribute datasets across multiple servers, enhancing horizontal scaling and handling larger data loads.
pandora developed tools to track database metrics like query times, error rates, and transaction durations for proactive maintenance.
pandora moved from slony to advanced replication strategies for better data consistency and availability in their distributed databse.

postgresql at 20tb and beyond: analytics at a massive scale

adjust partitions data to distribute it evenly across servers, reducing query times.
adjust uses incremental aggregation to process data in smaller chunks, enhancing performance and reducing server load.
adjust fine-tunes b-tree indexes for efficient data retrieval and updating.

large databases lots of servers on premises in the cloud

combining on-premises servers and cloud services offers flexible, scalable infrastructure for varying workloads.
streaming replication ensures consistent data replication across servers, providing high availabilty and quick failover.
using barman allows efficient scheduling and management of backups, ensuring data integrity and quick recovery.

breaking postgres at scale

setting shared buffers to 16-33gb balances memory usage and minimizes checkpoint activiy.
postgresql’s efficient sequential scans make it suitable for lightweight aplications without extensive indexing.
point-in-time recovery (pitr) allows precise recovery to a specific time, providing a robust backup solution for large databases.

citus postgres at any scale

citus allows reference tables to join across distributed tables, enhancing query performance.
executing commands in parallel across worker nodes boosts data loading and processing speeds.
citus supports distributed transactions, maintaining atomicity and consistency across servers.

data modeling the secret sauce of building & managing a large scale data warehouse

windows uses partial covering indexes to optimize storage space and improve query performance.
windows uses citus to manage large volumes of time series data, ensuring scalability and quick access.
windows uses hstore for dynamic columns, managing varying data attributes without changing the table schema.

abraryaser02 commented 6 months ago

Scaling Instagram Infrastructure

Backend runs on PostgreSQL database storing relations like user, media, friendship. etc. API calls are made on Django reading on replicas and writing on both the master and replicas.
Apache Cassandra to store user feeds, activities, etc. all running on replicas.
Postgres Daemon replicates memcache, preventing stale comments being read.

The Evolution of Reddit.com's Architecture

Frontend is mostly Node while backend in Python (r2 )with load balancers & apache cassandra to take care of new updates.
Listings (list of links) for user posts and votes updates the actual link itself, an interesting way to sort and updating data since it deviates from caching.
Thing objects stored in Postgres clusters, replicas for read-only operations. Comments are stored in trees and Comment Tree Queues are used for retrieval.

PostgreSQL at 20Tb and beyond

More MapReducing but at an even larger scale!
More than 20 backend servers, each containing 20TB of data.
Analytics Shards (product of MapReduce) each being 2TB.

PostgreSQL at Scale.

Databases of small sizes, use pg_dump and backup consistently.
For bigger databases, use PITAR. Introduce load balancing.
Consider autovacuuming + manual vacuuming for tables with very high update rates. Let it run. Consider partial indexes since indexes get pretty huge at this point.

Mastering Chaos - A Netflix Guide to Microservices

Think of Microservices as organs in a human body.
The concept of critical Microservices, create fit recipes to make sure the basic functionalities work even if none of the dependencies are working.
Netflix developed 3 multi-region strategies to avoid depending on only one (US East-1).

Why Google Stores Billions of Lines of Code in a Single Repository

40,000 commits a day, must be much more now in 2024.
Rosie takes care of large patches and automates commits.
Maintaining a single repo avoids the diamond dependency problem.

irajmoradi commented 6 months ago

Scaling instagram infrastructure:

Scaling up means primarily using as few cpu as possible and as few servers as possible to do those instructions.
A big part of optimization is rewriting functions that are used a lot in cython or c++, which helped reduce the cpu instructions for those functions.
They decided to do source control with no branches in order to make things as efficient as possible in terms of development.

Breaking Postgres at scale:

Attempt to have the entire database in memory, but if thats not possible put the top 3 largest indexes into memory.
For larger databases, although vacuums will take a while if you kill the processes you will get transaction id wraparound issues.
At larger database sizes use pg_stat__user_indexes, this allows you to figure out what indexes are being used and which aren't being used. If you are using long strings consider using hash instead of btree index.

vibhuk10 commented 6 months ago

Scaling Instagram Infrastructure:

Instagram focuses on 3 dimensions of stability, including adding more hardware, improving efficiency of code, and create a bigger development team.
We can use techniques like "Cythonizing" to improve performance and optimize Python code
Instagram uses metrics in production to understand how new features are performing under a load.

The Evoltuion of Reddit.com's Architecture:

In order to efficiently handle the operations on the website, Reddit uses things like asynchronous queues and cache mechanisms.
We can use specialized processing methods like "fastlaning" in order to solve issues with things like vote queues.
Reddit can manage all the traffic that goes through their site very effectively as they automatically scale their AWS services so they never have issues with not having enough.

eoinoconnell04 commented 6 months ago

Scaling Instagram Infrastructure:
- She describes three ways that they have scaled. First is “scaling out” which is simply building more infrastructure, second is “scaling up” which is making existing infrastructure more efficient, and scaling the development team is necessary to implement these changes.
- The way that instagram uses Postgres is by having a single master copy of a database that is written to, and then this database will have replicas in different data centers around the world. These replicas can then be simultaneously read from without interacting with the master copy.
- As processes are simultaneously being run on a server, each process is accessing both shared memory and private memory. The private memory is unique to that specific process, while each process has access to the shared memory for things that are common between processes, such as configuration. To be able to increase the number of processes that can be run on the same amount of hardware, Instagram was able to minimize the amount of code that was copied from shared memory to private memory. Some of the things that were causing this wasted data were subtle behind the scenes issues, such as with python garbage collection, and once a company like Instagram finds an issue like this, people working on python start working to fix it.
The Evolution of Reddit.com’s Architecture:
- Reddit was dealing with issues of polls becoming very slow during peak hours, and they tried to implement an obvious fix, which was scaling out and getting more hardware to deal with the traffic. This actually made the problem worse, and this ended up being because of the locks that would cause blocking between different processors trying to change the vote counts simultaneously. This was interesting as this was a real world example of the locking and blocking issues that can arise that we learned about in class.
- One way that Reddit is trying to solve their locking issue is by using “lockless cached queries.” We learned about the problems that locks solve, so I am intrigued to learn how reddit is able to concurrently write without issues, but without locks. They said that it is still in development, but I will research the principles behind this.
- Another thing that I found interesting about Reddit’s stack, is that at the core they have R2, which is the stack for the original Reddit. Over time as Reddit has scaled, they have extracted certain functions of R2 into their own standalone sections, allowing them to restart and approach the problems in new ways as opposed to trying to change the existing architecture. I found this process interesting, and I wonder how many other companies have been developed in a similar way.
Why Google Stores Billions of Lines of Code in a Single Repository:
- I was extremely surprised to learn that Google stores all their code in a single repository, and that this is a very unusual approach for a large software company. The size of this repository is enormous, and I wonder how storage of a large repository like this works.
- Trunk-based development: While developing, branches are almost never used, instead everybody works on the head and then their commit’s are immediately usable. Branches are instead used for different releases, but not for the development process. This eliminates the need for merges which can be very difficult with such a large code base.
- One big issue that she brought up with this model, is that it makes it really easy to share and reuse code, which can be good but makes it often overdone. This is because sometimes developers will make more dependencies than necessary, which increases the likelihood of having issues down the line.
Mastering Chaos - A Netflix Guide to Microservices
- It was interesting to hear about how the original netflix code structure was problematic, because their main code was all monolithic. This is because I had just watched the other video about why google keeps all their code in a single repository, and monolithic model, but it seemed like this was not ideal for netflix.
- He provided a really good description of microservices, something that I knew nothing about. He explained it as the different parts are encapsulated and separated, which provides a level of abstraction. Structuring the code in microservices like this makes finding issues a lot easier, and has better future proofing.
- I also learned about EVCache reads. This was how Netflix found a balance between efficiency and redundancy. When writing, they simultaneously write to multiple chances in different accessibility zones, but then only read from the local cache. They found that this model scales extremely well.
PostgreSQL at Pandora:
- I was surprised to learn that when Stephen started at Pandora they weren’t really monitoring postgres at all. That was, they would check that it was running, and not running really slowly, but they were not checking the logs and errors that were occurring at all. This just really surprised me as I would think that examining errors would be necessary to have a functioning service.
- They started continuously storing both short term and long term historical data about their postgres service. For the short term, which was stored for around a week, they would keep track of behavior to establish normal behavior so that they can detect any abnormal behavior. The long term data is stored indefinitely, especially containing size information, along with other information.
- CLUSTR architecture: (not to be confused with postgres cluster with “e’). This is an architecture that does not have a guarantee of being ACID. An application writes to CLUSTR, and then writes to multiple postgres databases. For a select, CLUSTR requests data from every database in the group. Not ACID because sometimes there are issues writing to every database for connectivity or other reasons, and then databases can disagree about some selects.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale:
- Chris works at Adjust, a company that handles “advertising attribution”. I have throughout about the companies who store data for determining algorithms for who to to advertise what to would have huge amounts of data, you always hear about how much data they store on you in the news, but I haven’t heard of this. What adjust does it ensure that when people interact with ads, the transactions are handled appropriately to make sure that advertisers get paid accordingly.
- I learned that map-reduce is key to Adjust being able to have near-real time analytics of all the data that they handle. They map-reduce from the servers to analytics charts, and then have another map-reduce for the clients getting their data to the dashboard. This was really interesting to learn about because we used map-reduce when dealing with our twitter data earlier this year.
- It was also really interesting to learn about how they optimized their autovacuum after I learned a lot about it for our final exam. Instead of primarily deciding when to start an autovacuum based on a percentage of the tables being dead tuples, he switched it to be fully determined by a set number of dead tuples. However, with how large their data set is, it would be impossible to immediately just tune autovacuum to this new configuration, he had to find a way to gradually shift the servers to the new configuration without adding too much load and once and breaking everything.
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!:
- Flavio explained how the company he works for, Leboncoin, has scaled out using postgres. That is, having more postgres servers that are all replicated to store more data and have faster loading times using load balancing Over time, postgres developed and had better tools for scaling out.
- He explained the necessity of having the best possible hardware, because the data is more expensive and more valuable. Any data corruption is extremely costly, so he emphasized why it is worth it to spend on server hardware, with redundancy everywhere that is possible.
- He also discussed having geographic diversity in where your datacenters, to minimize the chance of multiple servers having issues at the same time. A standby and a master should never be in the same place. For example, if both data centers were next to the same river, then flooding could impede both data centers simultaneously. Thus, having separation is key.
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
- He broke up the stages of grown into multiple categories, the first being from 0 to 1 million events. He said that “premature optimization is the root of all evil.” I have never heard this concept before, but it makes sense. Time and money spent optimizing is not going towards getting traction building revenue, and as you grow everything will likely significantly change anyway. Thus building a good model and choosing direction is a higher priority early on. “Create the right product before worrying about scale.”
- As they grew, from 1 million to 100 million events, then to a billion, they had to change their stack drastically, which he said should be expected. Eventually, they started using Citus, instead of google firebase, which is a postgres service that used clustering to shard tables make their queries way faster, and for a much more affordable rate.
- One issue that he discussed is when the started approaching the maximum value that a Small int column id can have, and how they had to change the whole table. They were able to have a 30 minute downtime, where they altered the table. They did it by creating a new column with the new big integer type, indexing it, and then setting it equal to the current id value, then dropping the old primary key and building the new primary key with the already existing index. This is just an example of a little inconsequential decision that is made early on that can become a huge task when working on a larger scale down the line.