mikeizbicki / cmc-csci143

big data course materials

Final assignment for non-graduating students #352

Closed mikeizbicki closed 7 months ago

mikeizbicki commented 1 year ago

Due date: Sunday, 14 May @ midnight

Background: There are many videos linked below. Each video describes how a major company uses the technology we've covered in class, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to get the gist of them all.

Instructions: The assignment is out of 8 points. To get a point: watch a video and write 3 facts that you learned from the video. Submit the assignment by replying to this post with your facts for all of the videos in a single reply.

NOTE: I realize that many of you have a lot going on right now, and so I won't be offended if you decide to "punt" this assignment. The point-value is intentionally small so that it will have a minimal impact on your grade if you're not able to complete it. That said, I think this is one of the more interesting assignments in this class and so I'd recommend you find time to watch the videos. You can get partial credit on this assignment for watching only some of the videos, and you can get extra credit for watching more than 8.

Videos:

About postgres:

  1. Scaling Instagram Infrastructure

    https://www.youtube.com/watch?v=hnpzNAPiC0E

  2. The Evolution of Reddit.com's Architecture

    https://www.youtube.com/watch?v=nUcO7n4hek4

  3. Postgres at Pandora

    https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38

  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    https://www.youtube.com/watch?v=BgcJnurVFag

  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    https://www.youtube.com/watch?v=4GB7EDxGr_c

  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    https://www.youtube.com/watch?v=eZhSUXxfEu0

  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    https://www.youtube.com/watch?v=kd-F0M2_F3I

  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6

  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

    https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

About miscellaneous devops stuff:

  1. Mastering Chaos - a guide to microservices at netflix

    https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s

  2. Why Google Stores Billions of Lines of Code in a Single Repository

    https://www.youtube.com/watch?v=W71BTkUbdqE

    (If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)

  3. Kubernetes (abbreviated k8s) is like docker-compose on steroids. The documentary covers the history of docker and k8s and some of the technical differences between the two. The k8s documentary has 2 parts, and you must watch both parts for it to count as a single video.

    https://www.youtube.com/watch?v=BE77h7dmoQU

    https://www.youtube.com/watch?v=318elIq37PE

Aser-Abdelfatah commented 1 year ago

1. Why Google Stores Billions of Lines of Code in a Single Repository

2. Mastering Chaos - a guide to microservices at netflix

3. Scaling Instagram Infrastructure

4. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

5. The Evolution of Reddit.com's Architecture

6. Kubernetes

7. Data modeling, the secret sauce of building & managing a large scale data warehouse

8. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

emjuliet commented 1 year ago

1. The Evolution of Reddit.com's Architecture: 1 point

KaranGoel1 commented 1 year ago
  1. Scaling Instagram Infrastructure
  2. Reddit’s infrastructure
  3. PostgreSQL at Pandora
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
  6. Breaking PostgreSQL at Scale
  7. Citus: Postgres at any Scale
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
  10. Microservices at Netflix
  11. Why Google Stores Billions of Lines of Code in a Single Repository
ifreer23 commented 1 year ago
  1. Scaling Instagram Infrastructure

    • In 2017 when this was filmed, Instagram had 400 million daily users, 4 billion likes per day, and 100 million photos/videos uploaded per day, and it had scaled 4x over the prior three years.
    • Guo explains how adding indexes to a denormalized table sped up counting likes by orders of magnitude compared to the previous query (see the sketch after this list).
    • When implementing continuous profile collection, there is a tradeoff between the cost of collecting the data and the visibility Instagram gets into their code base. It greatly improves their productivity when debugging performance issues.
  2. The Evolution of Reddit.com’s Architecture

    • In 2017 when this was filmed, Reddit had 320 million users and, every day, 1 million posts, 5 million comments, 75 million votes, and 70 million searches.
    • r2 is a big Python “blob” that is the oldest single component of Reddit, started in 2008. It is currently being split up into various backend services.
    • Having many rows in the data table per “Thing” object allows Reddit to make changes to the site without having to alter a table in production.
  3. PostgreSQL at Pandora

    • Pandora originally used Oracle, but Oracle wanted too much money, so one Friday they closed as an Oracle shop and the following Monday they opened as a Postgres shop. Wanting to go open source was another motivating factor for them too.
    • The Music Meta Data class of Pandora is a massively redundant set of read-only databases.
    • Over the hundreds of pg_dump runs they have done, they've had a 99% success rate. The failures are typically caused by running out of space.
  4. PostgreSQL at 20 TB and Beyond

    • Adjust is a market leader in mobile advertisement attribution, acting as a referee in pay-per-install advertising.
    • In 2017, Adjust handled 100,000 requests per second, tracked over 2 trillion data points, and had over 400 TB of data being analyzed.
    • The Materializer is essentially an incremental map-reduce job: it aggregates new events, copies the aggregations to the shards, runs every few minutes, and uses only new data.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud

    • If someone comes to you and says Postgres doesn't scale out, they're lying!
    • Leboncoin has 28.1 million unique visits, 27 million classified ads online, and 800,000 new ads every day.
    • One of their hardware servers, made by Hewlett Packard, has 3 TB of RAM.
  6. Breaking Postgres at Scale

    • PostgreSQL can handle databases of any size
    • The largest community PostgreSQL database Christophe Pettus has worked on was multiple petabytes.
    • On a small database in PostgreSQL (about 10 GB), it is hard to go wrong, as nearly everything will run fast, even using sequential scans for everything.
  7. Citus: Postgres at any Scale

    • Citus provides distributed database superpowers to PostgreSQL through distributed tables with co-location, reference tables, query routing, and more.
    • Marco Slot stated that it is best to use Citus when managing 100GB and beyond.
    • Also, Citus is useful for multi-tenant applications (any SaaS) or real-time analytics dashboards.
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse

    • Microsoft uses Citus to provide insights into customer experience, like Windows diagnostic data from an audience of 1.5 billion devices.
    • A measure is time-series data about a specific use case executed on a Windows 10 or Windows 11 device.
    • Some technical challenges with device-centric aggregation are the large number of concurrent queries (10 million queries, 2 million distinct queries per day) and the rich dimension and metric data types.
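
The denormalized like-counter idea from the Instagram talk (referenced in item 1 above) can be sketched in plain SQL. The talks don't share real schemas, so the table and column names below are hypothetical:

```sql
-- Hypothetical raw likes table: counting likes here means scanning many rows.
CREATE TABLE likes (
    media_id bigint NOT NULL,
    user_id  bigint NOT NULL,
    liked_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (media_id, user_id)
);

-- Denormalized counter: one row per media object, kept up to date on write,
-- so reads become a single indexed lookup instead of a count over many rows.
CREATE TABLE media_like_counts (
    media_id   bigint PRIMARY KEY,
    like_count bigint NOT NULL DEFAULT 0
);

-- On each new like, bump the counter (a sketch; a real system also handles unlikes).
INSERT INTO media_like_counts (media_id, like_count)
VALUES (1234, 1)
ON CONFLICT (media_id) DO UPDATE
SET like_count = media_like_counts.like_count + 1;

-- Reading the count is now a single primary-key lookup:
SELECT like_count FROM media_like_counts WHERE media_id = 1234;
```

The write path touches two tables per like, but the hot read path becomes a single-row lookup instead of a count over millions of rows.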
sophiahuangg commented 1 year ago

PostgreSQL at Pandora

  1. Out of all the pg_dump commands they've run, they've had about a 99% success rate, which is extremely impressive.
  2. They've had scaling issues with vacuums, which can run for more than a week per table.
  3. A ton of data: they have more than 1,650 Postgres clusters and 335 TB of data.

Scaling Instagram Infrastructure

  1. 400 million users visit Instagram each day, and the top account has 110+ million followers (2017).
  2. Instagram had scaled 4x compared to three years earlier.
  3. Unit testing is important for checks and balances; it determines whether they roll something out.

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. They chose Citus because it has a good SQL engine, integrated into Postgres.
  2. The speaker works in the Windows data engineering group, collecting data from 1.5 billion devices to build dashboards and learn about customer experience.
  3. They use a "measure" data schema that is time-series data.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. As of January 2015, the Google repo had 1 billion files, 2 billion lines of code, 45k commits per workday, and a history of 35 million commits.
  2. The advantage of having one repository is "one source of truth": you never need to wonder where the authority comes from, and there are lots of useful libraries that don't need to be reinvented.
  3. A single repository may make it too easy to add dependencies; some teams forget to think about the implications, which leads to problems that end up breaking things.

The Evolution of Reddit.com's Architecture

  1. Reddit needed to improve its migration process and add more peer reviews. They can't use the same autoscaler technology for stateful services as for stateless services.
  2. The architecture seems super interesting. Their diagram is still a work in progress (as of 5 years ago), and at the middle is r2, which has been Reddit since 2008.
  3. Frontend engineers at Reddit built out more modern frontend apps in Node, which act as API clients themselves.

Breaking PostgreSQL at Scale

  1. More memory does not help write performance. The "rule" is to try to fit the largest 1-3 indexes in memory.
  2. Not all indexes will be good/helpful. Adding 7 indexes to a table will slow insert times by about a factor of 15.
  3. You can partition your data (for example, time-series data) to divide the table into more manageable chunks. The data needs a strong partitioning key: a relatively invariant value that almost never changes but is used frequently in queries (sketch below).
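
A minimal sketch of the time-series partitioning idea in the last point, using Postgres declarative range partitioning; the `events` table and its columns are made up for illustration:

```sql
-- Hypothetical time-series table partitioned by month; the partition key
-- (created_at) is set once at insert time and essentially never changes.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (event_id, created_at)   -- the PK must include the partition key
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2023_04 PARTITION OF events
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');
CREATE TABLE events_2023_05 PARTITION OF events
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');

-- Queries that filter on created_at only touch the relevant partitions.
SELECT count(*) FROM events
WHERE created_at >= '2023-05-01' AND created_at < '2023-05-08';
```

Expiring a month of old data then becomes dropping one partition rather than running a slow bulk DELETE.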

Mastering Chaos - A Netflix Guide to Microservices

  1. A stateless service is not a cache or database: you're not storing data, and instances can auto-scale like mitosis.
  2. Netflix has a production-ready checklist with some sort of automation behind every task, such as staging deployments, so you don't push out bad code.
  3. Interesting because they mention their issues arose from the codebase being monolithic, yet this approach seems to work well for Google.

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

  1. All growth was done while running on Postgres.
  2. Scaling at an early stage with Heroku was super easy; it was mostly a matter of creating the right primary indexes.
  3. In the beginning, worry about creating the right product and finding product-market fit before devoting time to optimization and scaling.
kanaluM commented 1 year ago

Scaling Instagram Infrastructure

The Evolution of Reddit.com's Architecture

Postgres at Pandora

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

Citus: Postgres at any Scale (Citus)

Data modeling, the secret sauce of building & managing a large scale data warehouse

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

Mastering Chaos - a guide to microservices at netflix

Why Google Stores Billions of Lines of Code in a Single Repository

Kubernetes Documentary (both parts)

laurenleadbetter commented 1 year ago

Scaling Instagram Infrastructure:

The Evolution of Reddit.com's Architecture

Mastering Chaos - A Netflix Guide to Microservices

Why Google Stores Billions of Lines of Code in a Single Repository

PostgreSQL at Pandora

Breaking PostgreSQL at Scale

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018

kushpetal commented 1 year ago

(11)

1. Scaling Instagram Infrastructure

2. The Evolution of Reddit.com’s Architecture

3. Postgres at Pandora

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

5. Large Databases, Lots of servers, on Premises, in the Cloud - Get Them All!

6. Breaking Postgres at Scale

7. Citus: Postgres at any scale

8. Data Modeling, the secret sauce of building & managing a large scale data warehouse

9. Lessons learned scaling our SaaS on Postgres to 8+ billion events

10. Mastering Chaos - a guide to mastering microservices at Netflix

11. Why Google Stores Billions of Lines of Code in a Single Repository

JSanders24 commented 1 year ago
  1. Scaling Instagram Infrastructure
  2. The Evolution of Reddit.com's Architecture
  3. Postgres at Pandora
  4. PostgreSQL at 20TB and Beyond
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
  6. Breaking PostgreSQL at Scale
  7. Citus: Postgres at any Scale
  8. Data Modeling, the secret sauce of building and managing a large scale data warehouse
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
  10. Mastering Chaos
  11. Why Google stores billions of lines of code in one repository
  12. Kubernetes
BITEEE0308 commented 1 year ago

(4)

  1. Scaling Instagram Infrastructure
  2. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
  3. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
  4. The Evolution of Reddit.com's Architecture
pmukneam commented 1 year ago
  1. Scaling Instagram Infrastructure

    • Instagram uses Postgres to store user, media, and friendship data. They use Cassandra (NoSQL) to store user feeds, activities, etc. On the computing side, they use the Django framework, RabbitMQ with Celery, and memcached.
    • They use a denormalized table to reduce the runtime of queries such as counting the number of likes on a post. It still took too much time, so they then used memcache leases to address the issue.
    • Interestingly, they develop on a single master branch rather than multiple branches.
  2. The Evolution of Reddit.com's Architecture

    • Reddit is mostly written in Python, using common libraries. They also use Node.js for front-end development. Additionally, they use Cassandra/Postgres for storage.
    • There was a problem with the queue system: adding more processors made the queue for the lock longer and longer, so with the same number of processors they 'partition' the queue instead.
    • They use a denormalized listing of a comment tree to store the parent comment and child comments in one place, so they don't have to look up the whole tree, only the specific parts they want to display.
  3. Postgres at Pandora

    • They have an alert system to monitor 'out of norm' behaviors in their tables.
    • The original replication was really slow and done sequentially. Now they are using streaming replication. Note that this can only be done between servers running the same version of Postgres.
    • They implemented their own system, called Clustr, to investigate their problems; this architecture is used for sharded databases. They use it when migrating/updating their database without Postgres downtime. However, some features, such as JOIN operations, won't be available, and there is additional space overhead.
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    • Apparently, there are not many people who have both a deep understanding of databases and practical skills, which poses challenges for the industry.
    • It's a really challenging process to widen integer columns (e.g., from 32-bit to 64-bit) in your database, especially if you are dealing with big data.
    • Small errors or things you overlooked can become significant when dealing with big data. For example, wasting 10 percent of your space might not sound like much, but if it's 10 percent of 400 TB, that's 40 TB, which is not good.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    • They automate their servers in the cloud via netbox, AWS CloudFormation, Colins, and Puppet (post-install automation).
    • Good hardware is essential for a database. Pay for good hardware to avoid data corruption and other problems.
    • They use streaming replication. When using pg_dump, you need to make sure there are no errors in other parts of your system, such as the automation system.
  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    • Postgres can handle any size of database. With a small database of around 10 GB, the whole database should just fit in memory. You can use pg_dump to back up your database. And when doing an upgrade, just do it; the longer you wait, the more complicated it gets.
    • When the database gets bigger, beyond comparing the size of the database to memory, try to fit the top 3 largest indexes in memory. Note that more memory doesn't improve write speed. At this point, pg_dump is not efficient enough; one option is pgBackRest.
    • When the database is 1 TB and above, you can't fit all the data in memory anymore, so just get as much memory as possible. VACUUM takes a really long time, but do wait for it to complete, and DO NOT turn off autovacuum.
  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    • Citus is a Postgres extension that helps scale a Postgres database. Citus parallelizes many regular Postgres commands, such as COPY. Consider using Citus when the database is bigger than 100 GB. It uses hash partitioning: it calculates a hash for incoming data and distributes the data to the appropriate shard.
    • Note that if a query cannot be sped up with parallelism, Citus won't help much and will just do what Postgres normally does.
    • To migrate to Citus, just add tenant ID columns to the large tables and use reference tables for the other tables, then use the tenant ID to filter queries (see the sketch after this list).
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    • Windows Data Mesh uses VeniceDB, which is built on Postgres.
    • Windows needs to handle over 10 million queries per day, which poses a technical challenge.
    • They also use partial covering indexes; they normally accumulate 50+ indexes per table.
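
A rough sketch of that tenant-ID migration path (referenced in item 7 above), assuming the citus extension is installed; the `events` and `plans` tables and their columns are hypothetical:

```sql
-- Every large table gets a tenant_id column and is distributed on it,
-- while small shared tables become reference tables.
CREATE TABLE events (
    tenant_id  bigint NOT NULL,
    event_id   bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb,
    PRIMARY KEY (tenant_id, event_id)   -- must include the distribution column
);

CREATE TABLE plans (
    plan_id bigint PRIMARY KEY,
    name    text NOT NULL
);

-- Citus functions, available once CREATE EXTENSION citus; has been run:
SELECT create_distributed_table('events', 'tenant_id');  -- hash-distributed on tenant_id
SELECT create_reference_table('plans');                   -- replicated to every node

-- Queries that filter on tenant_id are routed to a single shard:
SELECT count(*) FROM events WHERE tenant_id = 42;
```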
verynicocool commented 1 year ago

The Evolution of Reddit.com's Architecture

  1. Reddit's architecture includes a monolithic application called "r2," which has been the core of Reddit since around 2008. R2 is a large Python codebase and acts as the original monolithic application that powers Reddit.

  2. Reddit's front-end engineers have been building modern front-end applications using Node.js. These applications act as API clients and communicate with the APIs provided by Reddit's API gateway or r2 itself.

  3. Reddit uses a combination of technologies for data storage and caching. The core data model of Reddit, called "thing," is stored in PostgreSQL with memcache in front of it. Additionally, Reddit heavily utilizes Cassandra, a distributed database, for storing new features and ensuring high availability.

Scaling Instagram Infrastructure

  1. Instagram initially started running on AWS but later moved its service completely inside Facebook's data center to take advantage of scaling services. However, when Facebook conducted drills called storms, Instagram was not able to operate in multiple data centers.

  2. Instagram focused on optimizing CPU demand to make each server more efficient. They collected data on CPU instruction usage by different endpoints and implemented tools to monitor and analyze this data.

  3. Instagram identified that code itself was a significant contributor to memory usage. They took steps to reduce the amount of code in memory by optimizing and removing dead code.

PostgreSQL at Pandora

  1. Pandora initially used Oracle as their database system but switched to PostgreSQL due to financial constraints. They wanted to go open source and scale out, which was not compatible with the costly Oracle solution.

  2. To manage their PostgreSQL instances and clusters effectively, Pandora implemented a set of predefined DBMS classes. Each class was defined by its PostgreSQL configuration and the authentication file.

  3. Pandora uses replication for high availability. They have local replicas and remote disaster recovery replicas. The standard procedure is to failover to the local replica, and only in cases where both the primary and local replica are affected, they switch to the disaster recovery replica.

Large Databases, Lots of Servers, on Premises, in the Cloud

  1. The speaker mentions that Postgres has been able to scale out since 2008-2009, and they have successfully scaled Postgres using external tools like Slony for replication and load balancing. They highlight that Postgres can handle large databases and lots of servers, supporting web-scale applications.

  2. The speaker discusses the infrastructure of Leboncoin, mentioning that they have 70 database servers distributed across two data centers and the cloud. They indicate that they have a 3-terabyte live database, which is one of the largest databases at Leboncoin, along with other larger databases.

  3. The speaker emphasizes the importance of backups and high availability for database systems. They mention using pg_dump for nightly backups, encrypting the dumps and storing them off-site in the cloud using AWS Storage Gateway.

Citus PostgreSQL at any Scale

  1. Citus is an open-source extension to PostgreSQL that transforms it into a distributed database. It allows you to create distributed tables where data is spread across multiple Postgres servers, enabling you to handle tables of any size. Users have successfully managed tables of up to a petabyte in size with Citus.

  2. Citus introduces the concept of reference tables, which are replicated to every server in the Citus cluster. Although writes to reference tables can be slower due to the need to update all servers, they offer benefits such as the ability to create foreign keys and perform joins with distributed tables.

  3. Citus provides several superpowers to PostgreSQL, including the ability to route queries to the appropriate node, high-performance parallelism for commands like COPY, SELECT, INSERT, UPDATE, and DELETE, scaling out stored procedure calls, and support for distributed transactions.

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. The data schema used in the data warehouse follows a dimensional modeling approach. This approach organizes data into dimensions and facts, representing the relationships between various data elements.

  2. To optimize query performance, the data warehouse utilizes partial covering indexes. These indexes are created on the reporting tables and are designed to handle the high dimensionality of the data.

  3. The data warehouse leverages extended data types provided by the Citus extension for Postgres. Examples of these extended data types mentioned include HyperLogLog and T-Digest.

Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. Sharding tables by customer ID and distributing the data across multiple database nodes is key for maintaining high performance at scale. The Citus Postgres extension makes this process easy and is recommended for scaling applications with analytics on Postgres.

  2. Clearing out old stale data that is infrequently accessed from hot storage can optimize margins and improve report performance. By periodically removing historical event data that is no longer frequently accessed, ConvertFlow was able to reduce storage needs and work with less data in hot storage.

  3. As ConvertFlow scaled to billions of events, they encountered the limitation of the 2.1 billion max value for Postgres' integer column, which was used as the default ID primary key. To overcome this, they worked with Citus to migrate their analytical tables' primary keys to the big integer column type, ensuring continued scalability for their database.
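
A small sketch of the integer-overflow fix described in point 3; the `events` table here is hypothetical, and on a multi-billion-row production table this rewrite would have to be planned carefully rather than run as a single blocking statement:

```sql
-- A serial (32-bit integer) primary key tops out at 2,147,483,647 values.
CREATE TABLE events (
    id      serial PRIMARY KEY,   -- int4; will eventually overflow
    payload jsonb
);

-- Widening the column and its sequence removes the 2.1 billion limit.
ALTER TABLE events ALTER COLUMN id TYPE bigint;
ALTER SEQUENCE events_id_seq AS bigint;
```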

SybelFrancois commented 1 year ago

Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

  1. The Windows Data Engineering group uses Citus, a scalable PostgreSQL extension that handles high-cardinality data and can run complex queries.
  2. The group prefers dynamic computation because they run large-scale queries that require high computational and storage resources; pre-computing or materializing the query results would use excessive compute time and storage.
  3. The use of special data structures, such as HyperLogLog, can efficiently aggregate data across hundreds of millions of devices. In addition, for data storage and organization, JSON and JSONB are efficient for staging tables due to their flexibility.

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

  1. Adopting a rollup strategy for the most commonly accessed reports in a dashboard can be beneficial. Incrementing lifetime statistics for campaign-use, conversion, and conversion-rate reports lets you fetch a pre-calculated report to visualize the totals for a customer's campaigns on a single page without making dozens of separate queries against the events table (see the sketch after this list).
  2. Clearing out data from hot storage optimizes margins and improves the speed of reports, since less data sits in hot storage. Relatedly, mitigating index bloat by reindexing tables (concurrently) can reduce storage impact, make maintenance operations easier, improve margins, and avoid locking up production tables.
  3. Horizontal scaling is a strategy for handling big data; the Hyperscale (Citus) offering in Azure can add more worker nodes to a Postgres database as it grows, enabling it to continue processing billions of events.
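
A minimal sketch of the rollup idea referenced in point 1, with hypothetical table and column names; the real ConvertFlow rollups are surely more elaborate:

```sql
-- Pre-aggregated counters per campaign, incremented as events arrive,
-- so the dashboard never has to recompute totals from the raw events table.
CREATE TABLE campaign_rollups (
    campaign_id bigint PRIMARY KEY,
    views       bigint NOT NULL DEFAULT 0,
    conversions bigint NOT NULL DEFAULT 0
);

-- Run per event (or periodically in batches):
INSERT INTO campaign_rollups (campaign_id, views, conversions)
VALUES (7, 1, 0)
ON CONFLICT (campaign_id) DO UPDATE
SET views       = campaign_rollups.views + EXCLUDED.views,
    conversions = campaign_rollups.conversions + EXCLUDED.conversions;

-- The dashboard reads one pre-calculated row per campaign:
SELECT campaign_id, views, conversions,
       CASE WHEN views > 0 THEN conversions::numeric / views ELSE 0 END AS conversion_rate
FROM campaign_rollups
WHERE campaign_id = 7;
```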

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! — Flavio Gurgel

  1. Postgres is a highly scalable database system, which they have used to scale out from 2009 to the present, accommodating a growing number of servers, users, and data. It can handle web-scale operations and can be expanded to accommodate even more servers if necessary.
  2. To ensure the reliability and integrity of a large database, they use servers with 3 TB of RAM, keep all hardware under warranty, and use RAID 10 for the disks. This goes along with investing in high-quality hardware: battery-backed cache, ECC (error-correcting code) RAM, and dual network cards.
  3. To back up PostgreSQL databases, nightly backups with the pg_dump tool can be performed, using custom or directory mode depending on the size of the database. These backups are encrypted and sent off-site to the cloud for additional security and disaster recovery. In addition, AWS Storage Gateway enables a backup system in each data center that uses the same S3 bucket, making all PostgreSQL dumps available across the infrastructure.

Scaling Instagram Infrastructure

  1. The backend stack of Instagram is distributed to different data centers with the servers categorized into storage and computing. Storage servers store global data with replication, while computing servers are stateless and process user requests. Its backend (utilizing both C++ and Python) consists of the web tier that runs Django with Python, which receives user requests and accesses various back-end storage or services.
  2. Instagram uses a continuous integration (CI) process for development, meaning engineers collaborate easily by working on a single master branch with feature flags, allowing for easier upgrades, refactoring, and performance data analysis. Instagram has a monitoring and alerting system called "Extend" that helps discover problems quickly and sometimes requires reverting to previous versions. Instagram continuously rolls out the master repository whenever a diff is checked in, resulting in around 40-60 rollouts a day. A typical commit goes out within an hour of landing on the master branch.
  3. Instagram uses Cassandra to store user feeds, activities, and more. It has no master server, and all replicas have the same copy of data with eventual consistency. Also, cProfile is a tool used to deep-dive into the codebase to understand the impact of specific functions on code performance.

The Evolution of Reddit.com's Architecture

  1. Reddit faced an issue where cached listings were referring to items that didn't exist in Postgres, causing pages to crash, which they solved by building a tool to clean up these listings and remove any bad data.
  2. Reddit comments are stored in a threaded manner. Since this can be complicated to render, they store the parent relationships of the whole tree in one place for efficiency. However, comment processing, especially for large threads, can slow down the site. They developed a system to allow such threads to get dedicated processing, but this created other bugs.
  3. They use Autoscaler, which watches utilization metrics reported by load balancers and automatically increases or decreases the number of servers requested from AWS, allowing the site to scale according to demand. But later, in 2016, they had problems during a migration from EC2 classic into VPC, which involved the Autoscaler. They have since built a next-gen Autoscaler using lessons from these past mistakes.

PostgreSQL at Pandora

  1. Clustr, from Pandora, is a non-ACID, high-availability database system that compromises consistency for availability, designed for sharded databases.
  2. When data is written or read, Clustr sends the request to each database in the group (two or more) and then returns the data to the application. If there is a discrepancy in the data from each database, a read-reconciliation process is applied. Pandora has at least 1,650 Postgres clusters, containing at least 335 terabytes of data, with the advertising-supported radio application alone growing by two terabytes per month.
  3. Clustr has some scaling issues, particularly with Write-Ahead Logging (WAL) files and the vacuum process.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. They use custom PostgreSQL data types to optimize data storage. For instance, they have a custom data type that allows any country to be represented in a single byte, which can save significant storage space over billions of rows.
  2. They use another one called istore for modeling sparse integer arrays, which offers many of the benefits of columnar storage but in a row-oriented format. istore supports GIN (Generalized Inverted Index) indexing, allowing efficient querying of membership in an object.
  3. Postgres's autovacuum feature can be used to manage garbage collection, but the default settings have to be adapted for large-scale environments. The Postgres source code is written for humans to read, which assists in developing extensions.

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

  1. Autovacuum should never be turned off completely, but in some cases manual vacuuming of specific tables may be necessary in addition to autovacuum.
  2. Partial indexes can be useful for specific queries, especially in cases where only a small subset of the data is actively used. Also, monitoring index usage with pg_stat_user_indexes can help identify unused indexes that can be safely dropped (see the sketch after this list).
  3. Upgrading a larger database may require logical-replication-based upgrades or tools like pg_upgrade, depending on the complexity of the schema and the data size. Additionally, as a database grows, traditional full backups may become impractical, and alternatives like file system snapshots or SAN-based snapshots should be considered.
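
A short sketch of the partial-index and index-usage ideas referenced in point 2; the `orders` table and its columns are hypothetical:

```sql
-- Hypothetical table where only a small fraction of rows are ever 'pending'.
CREATE TABLE orders (
    order_id    bigint PRIMARY KEY,
    customer_id bigint NOT NULL,
    status      text   NOT NULL
);

-- Partial index: only the pending rows are indexed, so the index stays small
-- and writes to non-pending rows never have to maintain it.
CREATE INDEX orders_pending_idx
    ON orders (customer_id)
    WHERE status = 'pending';

-- Finding indexes that are never used (candidates to drop safely):
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
```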
kevinm126 commented 1 year ago

1. Scaling Instagram Infrastructure

  1. Instagram uses multiple data centers spread out over different parts of the world in order to improve scalability and reliability.
  2. Instead of a branch-management model for source control, Instagram uses a single-master approach, which requires continuous integration to keep the one master branch up and working.
  3. In order to reduce the number of CPU instructions Instagram carries out, they rewrote some of their extensively used and stable functions in C/C++, which made those code paths less costly.

2. The Evolution of Reddit.com's Architecture

  1. One of the problems Reddit had with their listings was that the order of a listing was most often invalidated when someone voted on it. Their solution was to create a denormalized index of links that is updated in response to voting, but it is no longer a cache as it originally was.
  2. When a thread on Reddit receives a high volume of comments, the site can become very slow due to the tree-like nature of Reddit comment threads. To solve this, such threads can be manually picked out to get their own dedicated queue called the "Fastlane".
  3. r2, the oldest Reddit component, built in 2008, is now being split up into a number of backend services as opposed to the original blob.

3. Postgres at Pandora

  1. Pandora originally used pg_dump for its backups but switched over to pg_basebackup because pg_dump had a number of issues, even though it had a close to 100% success rate.
  2. Pandora uses replication to enable a high-availability solution and has a process by which databases can fail over to their replicas.
  3. One of the drawbacks of using Clustr is that five extra metadata fields are added to each row, adding overhead.

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

  1. They use the MapReduce paradigm to move data from the backend servers to the analytics shards, and then again from the analytics shards to the clients.
  2. They create custom data types in order to save about 8 bytes per row, which adds up over millions of rows in the long run.
  3. In order to solve their problems with autovacuum, Adjust changed the trigger from the default of 50 rows plus 20% of the table being dead to 150k rows plus 0% of the table (sketch below).
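
The per-table override for that autovacuum change might look roughly like the following; the talk doesn't show exact commands, and the `events` table name is made up:

```sql
-- Default autovacuum trigger: 50 dead rows plus 20% of the table.
-- For a very large table, 20% is far too many dead rows before a vacuum runs,
-- so the change described above corresponds to a per-table override like this:
ALTER TABLE events SET (
    autovacuum_vacuum_threshold    = 150000,
    autovacuum_vacuum_scale_factor = 0
);
-- The same settings can also be applied globally in postgresql.conf.
```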

5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

  1. Small changes in optimization can have a substantial effect on performance when working with large amounts of data.
  2. pg_dump is a built-in PostgreSQL tool that facilitates backing up Postgres databases.
  3. PostgreSQL is scalable and has progressed rapidly since the early 2000s to get to that point.

6. Breaking Postgres at Scale

  1. Small databases are easy to work with and don't present the processing challenges and hard problems that large databases do. Often the best course of action with a small database is the most basic possible option, which still works for the database at hand.
  2. For databases bigger than what fits in memory, a rule of thumb is to set effective_cache_size to be bigger than your largest index (see the sketch after this list).
  3. Once you are working with an extremely large database, incremental backup becomes very important, but shared_buffers should not be increased drastically, as it will not significantly increase performance.
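
A sketch of the effective_cache_size rule of thumb referenced in point 2; the value shown is made up and should be sized to the actual machine:

```sql
-- effective_cache_size is only a planner hint about how much data the OS and
-- Postgres can cache; it does not allocate memory.
ALTER SYSTEM SET effective_cache_size = '48GB';
SELECT pg_reload_conf();   -- this setting does not require a restart

-- Checking the largest indexes, to sanity-check the rule of thumb:
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 5;
```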

7. Citus: Postgres at any Scale

  1. Citus can be used as a document/key-value store with parallel queries.
  2. Citus allows distributed tables to be co-located with other distributed tables, which allows foreign keys between distributed tables on their distribution columns.
  3. One of the benefits of using Citus for multi-tenant applications is that there is no hard limit on CPU, memory, or storage.

8. Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. One of Microsoft's use cases for Citus is processing insights on customer experience, which are laid out and represented in dashboards.
  2. They use JSON for staging tables and hstore for dynamic columns in reporting tables.
  3. Citus is a great distributed SQL execution engine, seamlessly integrated into Postgres.

9. Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. All the growth and scaling done at the company was done while running on Postgres.
  2. It is possible to over-optimize.
  3. Postgres is extremely scalable.
  4. It's important to find product-market fit before scaling your platform.

10. Mastering Chaos - a guide to microservices at netflix

  1. The microservice architectural style is an approach to developing an application as a suite of small services
  2. Microservices are an abstraction
  3. Microservices allow for horizontal scaling.

11. Why Google Stores Billions of Lines of Code in a Single Repository

  1. The size of the content of Google's monolithic repository was 86 terabytes at the time of filming.
  2. There are 45 thousand commits per workday to Google's repository.
  3. Changes to any directory must go through a code review before being committed; this helps keep the repository sane.