State of the industry video assignment

mikeizbicki / cmc-csci143

big data course materials

41 stars 76 forks source link

State of the industry video assignment #170

Closed mikeizbicki closed 2 years ago

mikeizbicki commented 2 years ago

Due date: Monday, 2 May

Background: There are 9 videos linked below. Each video describes how a major company uses postgres, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to get the gist of them all.

Instructions: Watch each video and write 3 facts that you learned from the video. Submit the assignment by replying to this post with your facts for all of the videos in a single reply. The assignment is worth 1 point per video.

NOTE: I realize that many of you have a lot going on right now, and so I won't be offended if you decide to "punt" this assignment. The point-value is intentionally small so that it will have a minimal impact on your grade if you're not able to complete it. That said, I think this is one of the more interesting assignments in this class and so I'd recommend you find time to watch the videos.

Videos:

Scaling Instagram Infrastructure

https://www.youtube.com/watch?v=hnpzNAPiC0E
The Evolution of Reddit.com's Architecture

https://www.youtube.com/watch?v=nUcO7n4hek4
Postgres at Pandora

https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

https://www.youtube.com/watch?v=BgcJnurVFag
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

https://www.youtube.com/watch?v=4GB7EDxGr_c
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

https://www.youtube.com/watch?v=eZhSUXxfEu0
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

https://www.youtube.com/watch?v=kd-F0M2_F3I
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

mikeizbicki commented 2 years ago

Chuck asked good questions in class about monolithic architectures vs microservices and mono repos vs polyrepos. The two videos below address these questions. If you would like to watch these videos, you may substitute them for any of the videos above. (So the total number of videos you must watch is still 9, you now just have more choices.)

Mastering Chaos - a guide to microservices at netflix

https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s
Why Google Stores Billions of Lines of Code in a Single Repository

https://www.youtube.com/watch?v=W71BTkUbdqE

(If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)

Also, I'm just now realizing I never mentioned the kubernetes documentary in class. (Recall that k8s is like docker-compose on steroids.) The documentary covers the history of docker and k8s and some of the technical differences between the two. So you may also substitute the k8s documentary for one of the videos above if you'd like.

Note that the k8s documentary has 2 parts, and you must watch both parts to count as a single video

https://www.youtube.com/watch?v=BE77h7dmoQU

https://www.youtube.com/watch?v=318elIq37PE

Tonnpo commented 2 years ago

Is there a partial credit for this assignment or we need to complete 9 videos?

Thanks very much!

mikeizbicki commented 2 years ago

@Tonnpo Yes, there's partial credit. You'll get 1pt/video.

jzlilili commented 2 years ago

Instagram
- fixes the chache inconsistency problem by runnign a daemon on the postgres replicas instead of updating memcache
- scales up by minimizing the number of CPU instructions and the number of servers used to execute those instructions
- removed garbage collection to minimize memory usage
Reddit
- a lock problem caused vote queue pileups, which was solved by partitioning the queue on subreddit id
- Things are Reddit's oldest data types and are represented by two tables: a Thing table and a data table
- parent-child relations of comment trees are stored in a denormalized listing and deffered to offline job processing
Pandora
- Hadoop cores outnumber Pandora employees about 32 to 1 (as of 2017)
- Pandora automatically stores historical data to better identify abnormal activity
- uses CLUSTR with sharded databases, trading consistency for availability
20TB and Beyond
- byte counting saves a small amount of space per row but pays off when working with large amounts of data
- the default settings for autovacuum were changed because it was accumulating too much data between vacuums
- data is aggregated using an incremental map reduce between servers (seven total stages)
Leboncoin
- databases are usually the hardest part of the stack to scale out
- good hardware is required for reliable software (always raid 10 for disks, ecc ram is the most important thing when it comes to quality)
- it's important to test pg_dumps for possible corruption and to measure restoration time
Breaking Postgres at Scale
- pitr backups take a file system level copy of the database and save the log segments generated from the file system copy
- replication lag can cause reads after writes to be unable to pick up on the information that was just written
- indexes take up disk space and insert time, and add to planning time, so indexes should be added in response to query patterns, not because they might be useful
Citus
- uses a coordinator node to manage multiple worker nodes
- Citus can query data from shards in worker nodes in parallel
- has three different implementations of insert..select: co-located, repartitioning, and merge step, which can each handle around 100, 10, and 1 million rows per second, respectively
Data Modeling
- hundreds of measures are used to track specific data about latency, bluetooth connectivity, wifi network connectivity, sudio and video quality, etc.
- device-centric aggregation, or "one device, one vote", is used to reduce noise when gathering data
- uses JSON for staging tables, for its flexibility and ability to handle many different data types, and Hstore dynamic columns when reporting tables, for its smaller size
ConvertFlow
- it's best to focus on developing a product before scaling, focusing on customers and not overengineering the stack
- clearing out old data from hot storage optimizes margins and speeds up operations
- when reindexing tables, make sure to reindex concurrently to avoid locking up the table unnecessarily

leafsphere commented 2 years ago

Instagram
- The caching consistency problem was resolved through the Postgres replication mechanism to invalidate the memcache in its own local region, letting users from different regions get the latest comments.
- The high cost of URL generation was reduced by Cythonizing or using C/C++ for functions that were stable and extensively used.
- In scaling up, in order to reduce the memory requirement to run more processes, code reduction was done by running in an optimized mode, removing dead code, and moving some private memory into the shared area where only one copy is needed to reduce the total memory needed by moving configuration data into shared memory and disabling garbage collection for python, leading to a capacity increase of at least 20%.
Reddit
- Listing (of links) is the foundation of Reddit. The way it's done is by selecting the links and caching the list of Link IDs in numcache, then looking up the links by their primary key. The cache is invalidated on new submissions and votes.
- Running the select query is still expensive even if you invalidate it, so in a process like voting where you already have all the info you need to update the cache listing, you can mutate in place without rerunning the query and store the ID, as well as any related sort information, fetch the current cache listing and perform a read-mutate-write operation.
- When you have a bunch of votes on hot posts at the same time, you run into lock contention. To fix this, around mid 2012, vote queues were partitioned, putting user votes into separate queues to end up with the same number of total processors divided into different partitions and far fewer vying for the same lock at the same time.
Pandora
- Replication is used for a high availability solution; all production databases are replicated, and there is both a local replica and remote replica (Disaster Recovery).
- Originally, all replication was SLONY based, but it was a very intrusive replication. system. It adds its own triggers, which made it possible to break referential integrity, and it had a three-fold increase in writes on the production database. The fragile system tended to fall over a lot, but it was able to replicate between different Postgres versions.
- Eventually, Pandora started using STREAMING Replication with WAL shipping, which was better than SLONY in most ways. It was not as intrusive as SLONY, but you could only replicate between same Postgres versions.
PostgreSQL at 20TB and Beyond
- Saving in Redis, reading from Redis, and storing in Postgres means that restarting Postgres doesn't interrupt the backend. You typically get 20-24 TB from each backend, and the data is shipped to archive servers when it gets close to that.
- To aggregate so much data, Adjust does an incremental MapReduce from one set of servers to another set of servers. The shards themselves do the second stage aggregation.
- Autovacuum may not be able to catch up on huge tables, so the eventual resolution to the problem was creating a batch script to roll out changes to servers gradually to avoid overloading the system.
Large Databases, Lots of Servers, on Premises, in the Cloud
- In Paris, the position of data centers are mostly by the Seine River; if the river floods, several data centers would be flooded at the same time. The paths between the data centers should be doubled and ideally multiplied.
- pg_dump is great for long term storage. It can also test the restore as well as time the restoration, which can be sped up by running parallel processes.
- leboncoin uses Xyman to moniter alerts, and it's connected to pagerduty. Some metrics are sent to Datadog for developers and other users to see graphs in real time, but for instructure, cybertec pgwatch2 is used to read database statistics. The front end uses Grafana.
Breaking PostgreSQL at Scale
- By enabling temp file logging, you can how big the temporary files created by queries are and set work_mem to 2-3x the size of the largest temporary file. If it requires something big like 16 GB, then you should fix the query… or start thinking about more memory.
- If doing full PITR backups is taking too long, you should start doing incremental backups (pgBackRest does them out of the box). Generally you shouldn't increase shared_buffers past 32 GB, since it doesn't benefit actual performance too much.
- Setting maintenance_work_mem too high will cause autovacuuming to have a hard time finishing; if most indexes are larger than 2 GB, you'll generally have better performance with 256-512 MB.
Citus PostgreSQL at any Scale
- Citus started as a startup in Turkey and is now offered as a managed service on Azure as a part of Microsoft. It's an open source extension for PostgreSQL that turns it into a distributed database.
- By distributing data across many PostgreSQL servers, which means you basically can have tables of any size and aren't restricted by memory, and limits on hardware. There are also reference tables but they must be small as it has to be replicated to every server in your Citus cluster and are often slow to write because they need to go to all servers. However, you will have the ability to join them with the distributed table on any column.
- In addition to select and copy, Citus supports handling many concurrent transactions that hit one worker node, since you're only using a single process on a single worker for that transaction whereas you can't handle many parallel updates across the entire cluster concurrently.
Data modeling, the secret sauce of building & managing a large scale data warehouse
- A measure is a time series data about executing specific scenarios on a Windows 10/11 device; i.e. tracking start menu launch latencies, audio and video qualities. The measure data has a dimension column (DeviceID, build revisions, etc.) and a metric column (count/Int, value/Float, <key, value> types, and more complex types like histograms).
- PostgreSQL supports Hstore (key-value storage) and JSON. JSON/JSONB is often used for staging tables as it can handle a lot of complex data types; during the cooking time, either they flatten the JSON types or take subsets of the JSON columns and put into Hstore types to serve as dynamic columns. It seems like Hstore types are smaller, which means less I/O to affect the data in the buffer pool.
- Partial covering indexes are also used for reporting tables of high dimensions. IOPS is very expensive to buy in a cloud setting, so using a partial covering indexes can be used to manage large-scale data. However, you still have to be careful with how many total indexes you create because they do add up.
Lessons learned scaling our SaaS on Postgres to 8+ billion events
- In the process of scaling from 1M to 100M events, customers started complaining about time balance when viewing reports. Simply scaling up was not working since the amount of data needed to bring into memory for each query was far too large for a single node database. To solve this, they decided to apply for Firebase and ended up writing and carrying events to Firebase to temporarily solve the problem.
- But they were rapidly running out of credits, so they dropped Firebase and migrated to use Postgres extension Citus, which made it easy to shard analytics tables. As a result, queries ended up being 2x-4x faster. Thus, you should expect parts of your stack and vendors to change with the scale of your data.
- Scaling analytical workloads benefits from sharding tables by customer tenant ID column and upgrading a single node database to a cluster of nodes consisting of a coordinator node and multiple worker loads (distributing the workload across multiple database nodes), all of which Citus makes easy to do.

Luew2 commented 2 years ago

Instagram

The caching consistency problem was resolved by using the Postgres replication mechanism to invalidate the memcache in its own local region.
Garbage collection was scrapped in order to more efficiently use their memory
When instagram scaled up, they removed a lot of dead code through code optimization with a focus on lowering cpu operations in total. They also moved memory into a shared space to limit redundancy.

Originally reddit was only designed for a small amount of users, leading to them having to restructue their architecture for improved performance and reducing total latency
Reddit stores their comment trees in a denormalized listing with parent/child relationships, showing subsets at a time. They defer to offline job processing to do this.
Reddit splits up (queries) processing by creating multiple messages that deal with many parts of the job individually

Pandora

Always monitor syslog according to pandora, they mapped a lot of their hardware issues back thanks to syslog.
Pandora swapped from SLONY replication to STREAMING replication due to the 'fragility' of SLONY
Pandora decided to compromise consistency for availability. using an architecture intended for sharded databases, each shard keeps two or more databases in sync with the CLUSTR application.

PostgreSQL

Postgres uses a lot of their own custom data types, (mapping a value to things such as country name, language, etc) saving them billions of bytes over a large 20TB+ data set
Byte counting is extremely important on major data sets
Postgres utilizes 'Istores', similar to Hstor but for integers, useful for time series modeling.

Large databases

Pg_dumps are handled nightly from all their database servers, they encrypt it ans store it on an AWS storage gateway
Their servers have 3tb of ram, redundant power supplies, RAID10 disks
They use a mix of SQL and NOSQL

Breaking PostgresSQL at scale

Load balancing: It is important to know that replication lag exists, so in designing your app be careful to avoid syncronous read/write situations.
Keep shared_buffers to 16-32gb, it does not improve performance
Keep maintenence_work_mem around 256MB

Citus: Postgres at any scale

Distributed tables are limited by their unique constraints needing to include the distribution columns, missing triggers and table inheritance, and many foreign keys/join limitations.
Build pipelines using parallel operations
Citus is an opensource extension to postgres and is part of microsoft, it was started in turkey.

Data modeling, the secret sauce....:

Use JSON for staging tables, and hstore dynamic columns in reporting tables
Citus is a great distributed SQL execution engine
Citus is very scalable

Lessons learned scaling our SaaS on Postgres....:

Early on, they decided to scale using firebase, but they quickly ran out of credits and ran into more issues, eventually they moved to postgres extention Citus.
A lesson learned was that with scale you must expect parts of you tech stack to change
Clear out old data from hot storage in order to optimize margins

kingeddy11 commented 2 years ago

Instagram:

Does not use memcache to provide global consistency like many other successful tech stacks but instead uses the postgres replication mechanism to solve cache inconsistency problems
In order to reduce the memory requirement, Instagram figured out they needed to reduce the amount of code in memory and run in optimized mode
When working with branches and managing source control, engineers must be mindful of the codebase they’re working with through context switching, code syncs/merge overheads, and major upgrades

Reddit:

Reddit uses load balancers in its R2 monolithic architecture to split requests up to various pools of servers to isolate different type of request paths. So if a comments page is running slowly, its frontpath is still functioning normally
Listings are the foundation of Reddit which is an ordered list of links. It is simply put selecting links out of a database with a sort
Split up processing rather than devoting a single vote to one processor in order to speed up the process and avoid converging towards the same domain listing

Pandora

Prior to Rees coming to Pandora, Pandora had not implemented any sort of scalable monitoring of their Postgres databases such as going through Postgres errors, long duration transactions, blocked processes, etc
Capturing historical data allows an individual to go back to previous versions of databases and alerts the individual on what sorts changes were made to the databases
Running pg_dumps can block other processes and are slow to generate and even slower to restore so you must be cautious when running it to back up data

20TB and Beyond

A materializer serves as an incremental map reduce that aggregates new events, copies the aggregations to the shards, runs every few minutes, and uses new data only
Istore built by Adjust models integer arrays that can be useful for time series and is supported by gin indexes to speed up the query
Autovacuum doesn’t kick in until you have 50 rows plus 20% of table being dead

Large Databases, Lots of Servers, on Premises, in the Cloud

The most important thing for real availability is hardware and hardware is very expensive. Don’t buy cheap parts because data is important and better to buy high quality parts to minimize chance of corrupt databases
You need multiple standbys during load balancing because if one standby or master fails then one server is not capable of handling your charge
On top of pg_dumps, you should also physically back up your data based on the number of transactions in your WAL log

Breaking Postgres at Scale:

Set a couple of specialized parameters for all-in-memory databases, but keep it simple and don’t change too many parameters
Try to fit at least the top 1-3 largest indexes and ideally effective_cache_size > largest index
Don’t increase autovacuum_workers if you have a large number of database tables

Citus:

Data in distributed tables is hash-partitioned across worker nodes based on the distribution column values
Distributed tables can be colocated with other distributed tables which enables foreign keys and efficient joins between distributed tables with distribution columns
Complex queries can be fully delegated to one worker if they filter by distribution column and distributed tables are colocated

Data Modeling, the secret sauce of building & managing a large scale data warehouse

A measure is time series about a specific use case such as tracking start menus and launch latencies; usually 100 measures to track these instances
Two-level aggregation has high cardinality and is very expensive; takes a large amount of computing time and uses a lot of storage resources
To do an average calc 32 bytes 200M 28 days it requires over 200 GB of data. Ideally each query, must execute in 5 seconds or less

Lesson learned scaling our SaaS on Postgres to 8+ billion

After initial launching, premature optimization is the root of all evil and make sure to do your due diligence by finding the right market fit for your product
At an early stage, focus on creating the right product rather than worrying about scaling in hypothetical scenarios; when scaling issues arise in real situations then focus on them
Expect parts of your stack and vendors to change with scale; you are going to make some wrong decisions, but try to learn from these mistakes and figure out how to find a long term solution

Tcintra commented 2 years ago

Scaling Instagram Infrastructure

1) To scale out means to build an infrastructure that allows us to use more hardware/servers when we need them, to scale up means to make each of these servers count, and to scale a dev to means to enable a fast-growing dev team to move fast without breaking things 2) In order to be ready for disasters and power outages, companies like Facebook conducts regular drills to make sure their services can serve users seamlessly even with the loss of a region, power outages, and human errors 3) Moving data centers closer to where users are will reduce the latency between users’ interactions on instagram

The Evolution of Reddit.com's Architecture

1) The load balancers take in requests from the users and splits them up into various pools of application servers in order to isolate different request paths 2) By partitioning vote queues based on the subreddit of the link being voted on, fewer processors were vying for the same locks concurrently which reduced the amount of time it took to process votes on reddit. 3) Tree structures can be tricky to manage and require extra maintenance/clean up because they are sensitive to ordering

Postgres at Pandora

1) Implementing an error monitoring system for postresql can help identify problems in faster and earlier, such as long duration transactions that exceed a threshold or blocked processes. 2) Pandora uses replication for a high availability solution for events like disaster recovery 3) When making major updates to the database such as upgrading to a new version of postrgresql or deploying a database schema change, users must be switched over to a read-only replica; in order to avoid this problem, pandora utilized CLUSTR

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

1) Requests that come from the internet are directed to the next available back-end by HAProxy. 2) Hiring for these intensive database jobs is very difficult because they can’t hire junior developers and the environment is demanding with little room for error 3) Its very painful to change your data model at a large scale (multiple terabytes)

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

1) Sometimes data can be subject to natural disasters, like if the Seine river floods, then many Parisian data centers would be compromised. 2) When scaling your startup, your RDMS is usually the most complicated part of your stack to manage. 3) It is important to continuously test pg_dumps, but also to maintain physical backups for large databases.

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

1) Replication lag can mess with the acidity of the database such that some reads won't pick up information from writes written in an earlier transaction. 2) Indexes take up disk space and slow query planning, so they should only be created to speed up queries we know are frequently executed, they are not just good to have. 3) Increasing maintenance_work_mem arbitrarily can slow down or break autovacuuming, so we must be careful when tuning this parameter.

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

1) To build pipelines, use parallel operations. 2) Citus is an open source extension for Postgres that transforms it into a distributed database. 3) Citus was originally an open source project that was started in Turkey. Since it was acquired by Microsoft, it us now offered as a service on Azure.

Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

1) Citus has a great distributed SQL execution engine. 2) Citus has built-in custom data types like HyperLogLog. 3) You should use JSON for staging tables.

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

1) Create the right product before worrying about scale 2) Tech stack decisions can change, and you will learn what they should be over time 3) When facing slow database problems, there are probably solutions within your database that you just need to look and find, like Citus for Postgres.

yurynamgung commented 2 years ago

Instragram:

Storage servers store global data that need to be consistent across data centers but computing servers are stateless, processing requests by user traffic and temporarily storing data on an as-needed basis.
Memcache is a high performance key-value store in memory that allows for millions of reads/writes per second and reduces the read load on databases (postgresql)
To scale up means to use as few CPU instructions as possible and to use as few servers as possible

Reddit:

Splitting up queries into subqueries and partitioning them out into separate partitions can speed up processing time because it prevents them from vying for the same resources.
In order to speed up the process of looking up nested comments, reddit stored the parent relationships of the comments in a denormalized listing so that they could figure out ahead of time which subset of comments they needed to show and then only looked up those comments in a different table.
Sometimes certain threads can get very busy and hog resources for the rest of the website; reddit dealt with this problem by manually marking those threads and sending them off to their own queue called fastlane.

Pandora:

By identifying normal database activity through by monitoring and logging database activity, we can more quickly identify when abnormal activity is occurring
Pandora stores historical statistics about their database and its activities (i.e. pg_stat views, table sizes and their indexes, etc) so that they can quickly identify the source of errors as they occurred.
Trigger based replication systems add their own triggers to your systems and disables your own triggers, which can make it a very intrusive replication system

PostgreSQL at 20TB and Beyond:

The materializer aggregates new events/data from many back-end servers and copies the aggregations to analytics shards
The Postgres database system has grown much faster in popularity than the DBA community has grown to fill the needs there
Being able to converse about database systems is very important to getting hired to work on these intensive data-related projects, perhaps even more so than being able to contribute technically right away

Large Databases, Lots of Servers, on Premises, in the Cloud:

You can pg_dump into cloud based platforms, as Gurgel explains they do with an AWS instance.
Basebackups frequency is determined by WAL/day: daily for > 1TB, twice a week for > 100GB, twice a month for < 100GB.
If PITR backups take too long, then you should think about doing incremental backups to spread out work.

Breaking Postgres at Scale:

While it might seem that increasing shared_buffers would always boost performance, if your hardware can handle it, this is not the case. We should keep shared_buffers below the 32GB for optimal performance.
To avoid maintenance_work_mem breaking or slowing down the auto vacuuming process, we should keep it below 256MB in a production database.
If your query creates temporary files that exceed 16GB, you probably should rethink the logic for your query.

Citus: Postgres at any Scale:

Citus was an opensource project that started in Turkey. It is now offered under the Azure platform and is thus part of the Microsoft suite.
Reference tables must be replicated across every server in our Cities cluster, so they must be very small to avoid cluttering your cluster.
We can use a coordinator node to manage the many worker nodes in our distributed database.

Data modeling, the secret sauce of building & managing a large scale data warehouse:

Citus has rich built in custom data type and is very scalable and reliable
To do an aggregation calculation over 200Million devices, such as Microsoft does, without further optimizing, it would require an equally large scale of memory which is very expensive
In postgresql you can add custom data types

Lessons learned scaling our SaaS on Postgres to 8+ billion events:

Postgres is easy to deploy and scale, and you can stick with it both at the startup stage and in the long term for massive companies
Citus data allows for sharding tables which helps scale databases to a much larger size
Make sure to index tables concurrently if they are sharded across many nodes

cristywei commented 2 years ago

Instagram

in order to reduce latency on the home page, they use asynchronous i/o to separately retrieve the stories, feed, and suggested users to be shown on the home page.
C++ is not as debugging-friendly as Python, so Instagram wants to focus on improving Python runtimes without having to tradeoff ease of programming/debugging with faster runtimes.
They implemented the webmaster approach for source control in order to reconcile the downsides of branch-based source control.

Things change quickly, so instead of rerunning queries, the cached results are modified; this makes their cache more like a denormalized list of links.
They learned it was important to utilize permissions and build tooling in order to prevent data from being written to the wrong databases and to be able to quickly make their data consistent again.
Inconsistencies in their comment tree database, like missing parent comments, will trigger an automatic recomputation, but max queue lengths need to be implemented to prevent a queue from consuming all resources.

Pandora

Prior to Postgres, Pandora used Oracle, which was proprietary and extraordinarily expensive, especially for a company like Pandora.
Using their proprietary Clustr database, which is not ACID compliant and compromises consistency for availability.
Their Clustr database allows group members to be offline concurrently, whereas previously changes generally had to be pushed at night to avoid being blocked or blocking by any users.

PostgreSQL at 20TB and Beyond

They don't hire junior database folks because they want people who are deeply grounded in both theory and practice, and therefore require their hirees to come into the job already having the relevant background knowledge needed to quickly grow into their positions.
It is very hard to change your data model at the big data scale, so they want to avoid rewriting tables, minimize disk usage, and maximize alignment to pages, meaning they have to find other ways to update tables, such as developing their own data types and IStore.
Autovacuuming is utilized to help free up space in their tables, but the configurations of autovacuuming had to be changed in order to adapt to the very large tables as well as to prevent overloading of their systems.

Breaking PostgreSQL at Scale

With small < 10gb databases, even using the slowest scans and joins, things will still run "fast" in Postgres (unless the joins are fully N^2) and you can use the standard default configuration.
PITR backups can serve as a faster/better replacement for pg_dump once a database is too large to fit in the standard database, but is done at the cost of doing a large copy, meaning that it will be inefficient at a certain size.
With larger tables > 1tb that are updated very often, you should consider doing automated manual vacuums that should be configured to be able to keep up with the update speed of the tables, as well as letting vacuums run (even if they take a while).

Citus: Postgres At Any Scale

Citus can give better COPY performance than Postgres, with higher performance possible with more hardware and worker nodes, which is more pronounced on larger datasets and many indexes.
Citus can be used to speed up operations at several levels; the speed-ups at every level add up and create overall faster querying and operations.
Citus is especially powerful for large > 100GB datasets, multi-tenant applications (SaaS), and real-time analytics dashboards

Data modeling, the secret sauce of building

Cubing is used to pre-compute/materialize the query, which uses a lot of computing time and resources.
JSON can handle quite complex data types and is used for staging the tables; however the JSON data is either flattened or a specific column is chosen to be put into hstore data types to serve as dynamic columns.
Custom hstore data types are developed to store metric types.

Lessons learned scaling our SaaS on Postgres

"Premature optimization is root of all evil": the most important thing at the beginning of a start-up is figuring out your product/market fit and working on a product that customers actually want.
Postgres is a great tool for startups: easy to setup and deploy with a large community.
It's generally better to address issues as they arise due to real business needs rather than optimizing/expand based on hypotheticals.

Why Google Stores Billions of Lines of Code in a Single Repository

On an average workday, there are 15,000 human commits, 30,000 automated commits, and billions of file read requests, meaning that it supports a very high volume of daily traffic.
Advantages of a monolithic repository include extensive code sharing and reuse, meaning that during new projects parts of your project have already been written and can be reused in order to focus on the implementation of the other parts of your project.
Disadvantages include the costs associated with tooling to maintain such a large database and new dependencies are very easy to create due to the ability to easily reuse code; abandoned dependencies take up space and time wasted on updating them.

dabalus commented 2 years ago

Scaling Instagram Infra

Instagram uses the PostgreSQL system to store media, user, and friendship types of data
Instagram uses Cassandra to store sample data such as user feeds, activities, and many more
Instagram minimizes its memory storage by removing garbage

The revolutions of Reddit.com Architecture:

The CDN sends requests to distinct stacks depending on the domain, path cookies, and many more
Thing is the oldest data model in R2 and it is designed to allow extensions within a safely net
Reddit started out with a small number of users and overtime increased their architectures to accommodate its growing number of users

Postgres at Pandora:

Postgres is a free open source
Postgres is used in DBMS agnostic because of its reliability
Pandora initial used oracle before the Postgres was introduced

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of Postgres)

ISTORE unlike HStores allow for modeling sparse integer arrays and gives the same benefits as using Columnar storage and can be used in modeling time series data. And allows for arithmetics of operations. For example, when one Istores is joined with another Istores it automatically assumes that messing keys are zeros and add the two keys together until you get the sum of the two keys
When aggregating a large set of data, adjust does incremental map Reduce, and maps the first phase of the aggregate on the backend, reduces and the second phase the aggregation on the shards, and finally, does further aggregation if possible on demand.
IStore supports GIN indexing among others

Large Databases, Lots of Servers, on Premises, in the cloud - Get them all (AdTech use of Postgres:

When running a database in shelf hardware it is important to know that you can easily lose your database while working with a bad memory, and this can corrupt your database.
Pg_dumps are important because of their size and ability to be kept for a longer period of time. One can a test the restore without the need to test the PG restore right the SPG restarts and helps to see corruptions in the database
Lebocoin is the most important website for the French people after Google, Amazon, Facebook, etc

Breaking Postgres at Scale:

Is hard to go wrong with a small database on PostgreSQL even in joins unless they are fully N^2 or cross joins.
Autovacuums_works should be increased only if you have a large number of database tables (500+) because each work can only work on one table at a time.
10GB data can be backed using Pg_dump

Citus: Postgres at any Scale:

Citus have the ability to share table across a cluster of PostgreSQL servers and Transparently routes queries across the cluster, allowing the user to horizontally scale your database without losing PostgreSQL functionality.
A citus consists of multiple PostgreSQL servers with a citus extensions
Distributed tables can be co-located, with other distributed tables

Data Modeling:

It is very possible to scale citus
HyperLogLog is a unique data structure that can be used to calculate across millions of devices to aggregate scalable problems.
JSON or JSONB can be used for staging tables because of its flexibility

Lessons learned scaling our SaaS on Postgres tp 8+billion events...

Create the right product before worrying about scale
Postgres is easy to deploy and scale
Expect part of your stack and vendors to change with scale

katiewu71 commented 2 years ago

Scaling Instagram Infrastructure
- Storage servers need to be consistent across data centers, whereas computing servers are usually stateless and hold temporary data. PSQL and Cassandra are used for storage, whereas Django, Celery, and RabbitMQ are used for computing.
- To reduce the memory requirement and run more processes, Instagram reduced code and moved some of the private memory to shared memory. They reduced code by running in optimized mode and removing dead code. Configurations were moved into shared memory and garbage collection was disabled.
- Checks and balances include code reviews and unit testing, further tests after the code is accepted and committed, and production canaries.
The Evolution of Reddit.com's Architecture
- r2 is the original monolithic application that is the oldest component of Reddit. It was started in 2008 and written in python, and includes data models like Things and Listings. However, new backend services, which are also written in Python, are beginning to split off from r2.
- Vote queues would fill up at peak traffic hours, and votes could wait in queue for up to hours. Attempts to scale by adding more consumer processes actually exacerbated the problem, which was because of the cached query mutation locks. This was eventually fixed by using partitioning, which put votes into different queues.
- Commends are stored as comment trees, which are sensitive to ordering. For megathreads, which tend to hog resources, Fastlane was introduced, which provides a separate queue for the comment thread so that it could be processed more quickly.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
- Data is aggregated with incremental Map Reduce — Map and first phase aggregation is done on backends, and Reduce and second phase aggregation is done on shards.
- Since Adjust writes once and then updates once to the PSQL database, old data needs to be cleaned up. However, the autovacuum is not able to keep up with millions of rows. Thus, performance suffers and the tables bloat.
- To solve the auto vacuum problem, they changed the autovacuum trigger requirement to 150k rows and 0% of table being “dead”. This needed to be rolled out gradually to prevent overloading systems.
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!
- There are various elements that ensure availability in PostgreSQL. Good hardware is necessary to prevent the loss or corruption of memory. Things to consider with hardware include warranty, power, RAID 10, battery, etc.
- pg_dump is important because it is great for long-term storage and testable. Testing alerts you if there is corruption or bugs in the database.
- Physical backups are necessary. Basebackups provide daily snapshots (this can be tailored to the size of the database - i.e. only twice a week for databases between 100GB and 1TB). This is used in conjunction with PITR for restoration.
Breaking Postgres at Scale
- PostgreSQL can handle databases of any sizes, including those that are multiple petabytes large. However, with small databases, essentially everything with run quickly. Large databases must be handled differently.
- With large databases, they might not fit in memory, and queries may start performing poorly, in addition to pg_dump taking too long. A good rule of thumb is to try to fit at least the top 1-3 largest indexes in memory, which effective cache size being larger than the largest index. Note that memory does not help write performance.
- PITR backups and pgBackRest can be used for database backups instead of pg_dump for large databases. PITR takes an entire filesystem copy, and more frequent copies will yield a faster restore.
Citus: Postgres at any Scale
- Citus provides provides distributed database capabilities to PostgreSQL users, which enables compact representation, high query throughput, fast bulk loading, quick queries, etc.
- Citus is best used for 100GB or more of data, for multi-tenant applications and real-time analytics dashboards.
- A Citus cluster contains several PostgreSQL servers with the Citus extension. Citus distributes tables across the database cluster and hash-partitions the worker nodes based on distribution column values.
Lessons learned scaling our SaaS on Postgres to 8+ billion events
- ConvertFlow started with Rails, Heroku, and PostgreSQL, three popular technologies. At early stages, queries were fast, it was easy to stay within storage limits, and optimizing meant creating indexes.
- As the company and databases grew, Citus was used to help shard and distribute tables across multiple nodes.
- ConvertFlow was also able to reduce data storage by clearing out old data from hot storage and reduce index bloat by reindexing.
Mastering Chaos - A Netflix Guide to Microservices
- Microservices are an approach to developing an application as a suite of small services. They enable modularity, scalability, and elasticity.
- One service failing can lead to a cascading failure. Netflix resolved this with Hystrix, which provides a fallback to enable a customer to continue using Netflix rather than getting an error. Fault Injection Testing (FIT) was used to test the effectiveness of Hystrix.
- Conway’s Law tells us that organizations that design systems must produce designs that are copies of the communication structures of these organizations.
Why Google Stores Billions of Lines of Code in a Single Repository
- Google chose to focus on scalability within their large repository rather than splitting it into several repositories. As of January 2015, there were 1 billion files, 9 million source files, and 2 billion lines of code
- Google has a custom system named Piper, which hosts the monolithic repository. It is replicated across 10 data centers worldwide. CitC is used to access Piper, which enables developers to see local changes overlaid on the Piper repository and navigate the codebase.
- Some advantages of having a monolithic repository include one source of truth, code sharing and reuse, and simplified dependency management.

ohorban commented 2 years ago

Instagram:

disabling python garbage collector led to significant increase in performance
even though data is replicated through different data centers, memcache does not, so two users who sit next to each other can see different feeds even if they are sitting right next to each other. To fix this problem they ran Postges replication before hashing in memcache and then each data centers ran hashing separately.
it was interesting to learn how insert of using branches to develop new features, Instagram develops everything on the master branch and controls the access with gates.

Reddit:

When vote queues started getting overloaded scaling out made the problem worse. It was the locks that caused the problems. To fix that they partitioned the voting queues but it stopped helping with some time. I didn’t really how they fixed it in the end.
They are working on lockless cache querying which sounds very interesting.
Fast lane processing that flagged highly used threads was a good idea at first but it soon broke because all the memory filled up so the thread has to be restarted and all the memory was lost.. They now instead use Queues Quotas.

Pandora:

Monitor errors at the Postgres log file, long queries, blocked processes, errors at syslog
pd_dump is slow, can block other processes, >99% success rate. Pg_basebackup connection may be terminated 98% success rate.
They have a historical database for every table, index and so on. The process is fully automated

Postgres at 20TB and beyond:

Postgres is spreading faster but there is not a lot of people with enough knowledge and experience to work with Postgres.
Custom 1-byte Enums sounds like a small optimization but it pays out in the long run with a lot of data
Autovacuum in Postgres can cause problems when used on default setting on big data. It will sleep for a long time first and then try to vacuum way too much data at once.

Large Databases, Lots of Serves … :

If someone tells you that Postgres doesn’t scale out, they are lying!
Data is more expensive than hardware
It was interesting to learn about geographic distribution. If you have a main database and a replica stand by data base they should be in separate data centers at least 4km apart.

Breaking Postgres at scale:

Each autovaccum worker can only work on one table at a time, so don’t increase autovauum workers unless your schema has a lot of tables
Create partial indexes my adding flags to the table.
Using non sequential keys like UUIDs instead of primary keys can actually improve performance

Citus: Postgres at any Scale:

Turns Postgres into distributive database
Citus is good for >100Gb databases and multi-tenant applications
It picks the shard by computing the hash of the distribution column

Data modeling, the secret sauce of building & managing a large scale data warehouse:

They collect diagnostic data from 1.5 B devices to build dashboards that show insights about customer experience.
Postgres has many rich datatype so Citus uses JasonB for staging tables and then during reporting stage they flatten the json types
Citus is highly scalable and reliable for large scale data

Lessons learned scaling our SaaS on Postgres to 8+ billion events:

Create the right product before working about scale.
Expect parts of your stack and your vendors to change with scale
Analytics is easier to scale when tables are sharded and distributed across multiple nodes

Tonnpo commented 2 years ago

Scaling Instagram Infrastructure:

One interesting fact to me is that Instagram was taken over by Facebook after releasing for two years
This is in Q&A session. For each rollout update, it takes about 10 minutes to update to about over 20,000 servers
Upon scaling the Dev Team, they intentionally use MySQL for shipping features because it is the simplest to get things done.

Mastering Chaos - A Netflix Guide to Microservices:

Microservices is an approach to develop a single app as a suit of small services, each running in its own process and communicating with lightweight mechanisms, often HTTP resource API
To prevent cascading failure, a static fallback response can serve as a solution.
With large and complex architecture, small decrease in availability could lead to a problem. For example, as the speaker mentioned, suppose we have 10 apps, each has 99.99%, then our availability becomes (99.99)^10 = 99.9 which is a huge difference, like 8 or 9 hours in one year.

Why Google Stores Billions of Lines of Code in a Single Repository:

As of 2015, Google had 2 billion lines of code, 1 billion files, 35 million commits, and 86 terabyte of content
To follow the monolithic model of source management, it requires investment heavily on tools to support development.
One advantage of a single code base is that everybody is looking at the same source files which make it much easier to manage/update dependencies, refactor, and utilize new features.

dustin-lind commented 2 years ago

1. Scaling Instagram Infrastructure a. Instagram uses PostgreSQL for user, media, and friendship data. PostgreSQL allowed them to scale out into as many datacenters as the company wanted because databases can be easily replicated between datacenters. b. To optimize CPU usage, Instagram would identify functions that are used extensively and are stable (i.e., not frequently updated) and try to convert them to C (or some version of it) because C is faster. c. Instagram ships code whenever there is a difference between the main branch. There are 40-60 rollouts per data, which makes unit tests extremely important. 2. The Evolution of Reddit.com’s Architecture a. R2 is the original Reddit code monolith b. The vote queue pileup that the speaker talked about was due to lock contention. As concurrent updates on the vote count occurred, locks followed, and this slowed down the processing speed. The initial solution was to partition the vote queues. c. An autoscaler watches utilization metrics and increases/decreases desired capacity accordingly. This helps Reddit save money because they can request less resources from AWS during off-peak/peak times. 3. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres) a. A materializer aggregates data from many servers and transfers it to many servers. It functions like map reduce with the added complication that it is a many server to many server transformation. b. AdTech is known for creating a lot of custom datatypes because when they are working with such large databases byte counting is something that pays off in the long-run with less storage costs. c. Autovaccum occurs by default when there are 50 rows plus 20% of the tables consisting of “dead” tuples. When you are working with a huge dataset, it makes sense to change these parameters because 20% of a 20TB table is still a lot of dead tuples 4. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus) a. To do device centric aggregation, the team would need to query over 200GB of data b. The team uses indexes (both full and partial indexes) extensively to speed up their queries. Some commonly queried tables have 50+ indexes c. One database cluster utilizes is 1920 cores, 15TB of DRAM, and 960TB of premium SSD 5. Mastering Chaos - a guide to microservices at Netflix a. The CAP Theorem states that in the presence of a network partition, you must choose between consistency and availability. b. One challenge of a microservices system is the problem of “crossing the chasm” (i.e., when services need to depend on each other). This could mean network latency, congestion, or failure. c. Netflix uses the Fault Injection Testing (FIT) framework. It’s almost like a vaccine against possible network failures. Netflix will insert possible faults in their microservices and see how the rest of the network reacts. 6. Why Google Stores Billions of Lines of Code in a Single Repository a. Although everyone at the company works from head branch of the Google codebase, Google has an incredible number of testing layers to make sure that the code being merged to the head branch is well-written and free of bugs b. At the time of talk, the Google repository modified just as much code per week as the entire Linux kernel (i.e., millions of lines) c. The speaker cites her struggles working with distributed repos at a gaming company before Google. Each game was being built in its own repository, so it was very hard to merged changes across all the games 7. K8s documentary a. Docker was revolutionary in that it allowed for a portable software development process, and it made it easy to write code that could scale up. b. One reason that Kubernetes became the industry leader despite other competitors was because they decided to opensource the software, so they had a huge workforce of people contributing to the project. c. The people interviewed in the documentary express how they are disappointed that Kubernetes has been pitted against Docker. Kubernetes would not have happened if Docker did not become so popular. Kubernetes is essentially a system for operating docker containers at a large scale. 8. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company) a. The phrase “premature optimization is the root of all evil” is very true for an early-stage SaaS company. The focus of the technology team should not be on figuring out which systems to use to best scale up the technology, but simply on creating the right product and gaining traction. b. PostgreSQL is a useful database system for early-stage companies. It’s easy to set up, it’s easy to deploy in today’s cloud infrastructure, and its open source so there is a huge community of people supporting the project. Because it powers companies that are working at massive scale, you can know that it will work well in the long-term c. Query performance on large databases can be improved through sharding which is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. Citus is an extension for PostgreSQL that allows users to do database sharding. 9. Breaking PostgreSQL at scale a. For monitoring postgres databases, at a minimum you should be processing logs using pgbadger. Pg_stat_statements and pganalyze are valuable tools for operations managers. You should be aiming for basic health reports, query response times, and memory usage. b. Don’t create indexes on every variable. Indexes take up a lot of disk space, they increase insert time, and tend to add to planning times. Add indexes based on query patterns gathered from monitoring reports c. Never disable autovaccum in postgres. The speaker mentions that he gets dozens of calls from clients who ran into problems because they disabled autovaccum. You must accept that vacuuming takes a while and should be completed in full.

afroCoderHanane commented 2 years ago

Scaling Instagram Infrastructure

https://www.youtube.com/watch?v=hnpzNAPiC0E

Invalidating mem_cache forces the count likec to be searched in the postgres database because of the counter that was deleted from the cache and runs in 100ms but using a denormalized indexed database it reduces to 10s us. This still has a higher load issue but using the memcache lease operation on multiple django requests, they scaled out for better capacity, reliability and readiness for regional failure.

Scale up means use as few CPU instructions as possible. Instagram created continuous profiling to collect data to gain codebase visibility to generate a call graph that helps improve productivity. With C profile, they reduce code in memory and use Async to proceed more request at once
Instagram handled automatic cache, defined relations without worrying about implementation, facilitated self service by production engineer and scaling by infrastructure engineers using TAO. TAO is a database + right to cache, which still uses mySQL and allows simplified models.

The Evolution of Reddit.com's Architecture

https://www.youtube.com/watch?v=nUcO7n4hek4

CDN is used at Reddit to make a lot of decision logic to figure out which path to direct users query to. It allow to have one reddit.com that is directed to multiple stack
r2 schema the code is monolithic Load balancers route requests to a distinct pool of identical servers, therefore if a comment is going slow for a particular page it doesn’t affect other users. Things are data models that are based in Postgres and memcache in front of them Cassandra is able to stay up with one node going down
Scaling out Reddit made it slower due to lock contention. So to fix the issue, Reddit used partitioning to decrease the lock, but it was still slow in 2012 which was due to outlier – domain listing – so they use split up processing. Lesson learned: have a timer in your code.

Postgres at Pandora

https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38

The reason why pandora switch to postgres was because: It was free, they wanted to be open source , have community support, Scale out and commodity hardware.
Originally all the replication of Pandora was slony based . It is an intrusive replication system that add his own trigger and disable your own so if you run a pg_dump on a slony replica and you restore that pg dump , your triggers are disabled so you refragmentation integration may be broken.
If you have a postgres version that has replication you no longer need to copy the WAL files over because they will be kept on the master system until no longer needed by the client

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

https://www.youtube.com/watch?v=BgcJnurVFag

PostgreSQL Materializer aggregates data from many servers and transfers it to many servers. It is similar to a map reduce with an add on of many servers to many transformations.
Really painful to change data model
Postgresql write their custom data type

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

https://www.youtube.com/watch?v=4GB7EDxGr_c

Postgres scale out because at leboncoin they have one single database with 19 standby servers that will load balance over it
High Availability proxy – HAProxy are read-only endpoints for applications. HAProxy is faster than pg pool
You have to test pg_dumps because if you use pg-dumps and you don’t test the restore you don’t have dumps / backups.

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

https://www.youtube.com/watch?v=eZhSUXxfEu0

For large dataset you can’t back up with pd_dump, you have to use pit R
Don’t bump up shared_buffers to 16-32 GB, because according to Pettus he has never overseen any significant improvements.
Hash indexes are good for string – those with most of the entropy later in the string– such as URL

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

https://www.youtube.com/watch?v=kd-F0M2_F3I

Citus is a postgres extension that turn PostgreSQL into a distributed database
Multi-tenant applications on Citus model the data to enable the application to make router queries
Citrus is designed for app that are 100GB and beyond

Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6

A measure is a time-series data collected about use case executed on windows 10 or 11
Device-centric aggregation is used to reduce noise during data collection
Some commonly queried table have 50+ indexes to improve performance

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

Create the right product before worrying about scale :ConvertFlow
Citus allow creation of cluster nodes database consisting of a coordinating node with multiple worker node
Reindex the table greatly reduce the storage impact, makes maintenance easier and improving margin so it is important to reindex as postgres evolve

RuiluGao commented 2 years ago

Scaling Instagram Infrastructure
- Scale-out is the ability to use more servers to match the user growth by increasing the number of servers as well as the number of datacenters.
- To scale-up means to make each of the server counts, and to scale the development team means to have a fast-growing engineering team to continue to move fast without breaking things
- Storage vs. Computing: while storage needs to be consistent across data centers, computing is driven by user traffic, with a needed basis. Instagram uses PostgreSQL to store data like media, users, friendships, etc., and Cassandra for user feeds, activities, etc.
The Evolution of Reddit.com's Architecture
- r2: the original monolithic application, the oldest single component of Reddit, which was started in 2008, and written in Python. R2 is a giant blog, At the front is the load balancer which takes the request that the user has and splits it into various pools of application servers so that Reddit could isolate different kinds of request paths e.g. a comments page is going slow today because of something going on, it doesn't affect the front page for other people.
- CDN: CDN can be used as being able to do a lot of decision logic outside at the edge, and figure out which stack is going to end that request, based on the domain that is coming in, the path on the site, any of the cookies that the user has, including, perhaps, experiment bucketing.
- Reddit uses Cassandra very heavily as well. It has been in the stack for seven years now, it is used for a lot of the new features, ever since it came on board, and it has been very nice for its ability to stay up with one node going down.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale
- Big data: the scale of the data that Adjust deals with every day is about 100k-300k requests per second tracking over 2 trillion data points, with over 500 TB of data to analyze and it was all done using Postgres SQL.
- Materializer: being part of the architecture; it aggregates new events, copies the aggregations to the shards, and runs every few minutes. It aggregates data from many servers and transfers it to many servers. It functions like MapReduce with the added complication that it is many servers to many server transformations.
- Postgres SQL is the big data platform that Adjust uses and Adjust team expects to continue using Postgress while expanding in scale as the data demand continues to grow.
Breaking PostgreSQL as Scale
- Small database: For a small database of about 10G, it’s hard to go wrong at PostgreSQL and nearly everything will run fast including pathological joins
- Mid-size database: To back up a mid-size database, pgdumps no longer works, we should use (Point In Time Recovery) PITR backups which take a file system level copy of the whole database at an arbitrary period and save files generated after the file system copy started, therefore, no need to copy frequently.
- Dont’ pregenerate indexes if you will not need them. Indexes are not free, they take up disk space, a significant amount of insert time as well as planning time. Add indexes that are deemed useful.
Citus: Postgres at any Scale
- Citus: is an open-source extension for PostgresSQL and turns it into a distributive database (where the data is distributed across many Postgres servers); therefore no more limits on hardware, and memory; having tables of any size.
- When to use Citus: expect data to grow 100GB and beyond, multi-tenant workloads, real-time analytics, parallel queries, and a data warehouse with a relatively static query set.
- Citus cluster: a coordinator node and Postgres servers with the Citus extension and an application connected to the coordinator node and this may be subject to modification
Data modeling, the secret sauce of building & managing a large scale data warehouse
- Windows Data Mesh: a processing pipeline that sessionaize Diagnostic event data and turn them into session-based measures, using census data Data as a reference and loading it to VeniceDB for data enrichment.
- VeniceDB: Citus-based Postgres cluster, with three stages: Reporting table for consumption, efficiency, Stats table, and Dimension table, the latter two are for customers for data quality assessment.
- Measure Data Schema: a measure is time-series data about a specific use case execution e.g. Wifi, Bluetooth, etc. with hundreds of measures to track them
Lessons learned scaling our SaaS on Postgres to 8+ billion events
- Early-stage: At the early stage of the product, it is more important to create the right product than to worry about the scale-up and also choose a database that is easy to initialize, deploy and eventually scale up, preferably open source for later potential migration.
- Scaling up isn’t as easy as it seems in math: scaling from 1million to 100 million isn’t just easily scaling up by 100x.
- Old, outdated, unnecessary data should be cleared from hot storage so that there are fewer costs on storing data, therefore, bringing more profits.
Mastering Chaos - a guide to microservices at Netflix
- Microservice: an approach to develop a single application as a suite of small services, it enables the rapid, frequent, and reliable delivery of large, complex applications.
- One challenge for microservice is to deal with the intra-service requests, categorizing as crossing the chasm, running into problems like network latency, congestions, and hardware failure as well as logical or scaling failures. With one failure, it can cascade the whole system.
- Fault Injection Testing(FIT): Netflix creates FIT to test for errors, like taking a dead virus into the body to develop antibodies. it uses synthetic transactions and can be overridden at the device or account level and further taken to monitor under the arbitrary percentage of life traffic.
Why Google Stores Billions of Lines of Code in a Single Repository
- Google uses one giant single shared repository shared by the whole company and has decided to keep one repo by investing in scalability to keep up with the growth, it is probably the largest single repository in use in the world
- The size of the repository, the rate of change of the repository as well as its usage have increased significantly with >1 billion files, >2 billion lines of code with 45K commits per workday.
- Driving force behind the growth case is the automated use case, way above human commits as of today: configuration and many supporting data files are generated from the automation process.

DestrosCMC commented 2 years ago

Scaling Instagram Infrastructure
- Instagram is built on top of Django
- Because of how highly Instagram values cross-region operation and consistency, they use Postgres replication for Postgres inserts and cache invalidation into Memcache.
- It was interesting that Instagram made multiple urls for each photo based on the device
The Evolution of Reddit.com's Architecture
- I was surprised at how python dependent Reddit's services were. It makes sense engineers wanted to adopt node.js.
- Reddit reads from memcache and the "cache" is now a denormalized index of links.
- For comment threads, it is expensive to figure out the tree metadata in-request so they precompute it and store it.
Postgres at Pandora
- Pandora used to run on Oracle, but Oracle was too expensive and everything cost money. So, they migrated to postgres.
- It was interesting to see that pg_dump has a >99% success rate and pg_basebackup has a 98% success rate.
- They made a proprietary distributed database in house that is CAP not ACID.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
- 20TB per backend server gets stored, and 20+ servers.
- Materializer is similar to mapreduce but is different because it is a many server to many server transformation.
- Space saving and small improvements are worth it when working with terrabytes of data.
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
- Barman strategy was interesting to learn about in case of a disaster
- Datadog was actually used by the devs to see real-time graphs and not the users.
- No MongoDB because hard to migrate all their data. However, does use NoSQL for elastic search.
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
- Just use pg_dumps for backups, only takes 90 seconds for 5GB on their laptop. (For small databases)
- More memory does not help with write performance
- Larger databases time for PITR backups. So pgBackRest is used by the presenter and mentioned that WAL-E is old warhorse. Don't roll your own unless you have really specialized needs.
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
- Use Citus at 100GB +. Does real-time analytics dashboards.
- If they want the distributed tables to have a unique constraint, it must include the distribution column.
- Citus gives minimal overhead for queries, still single server, and boasted as easier to integrate than manual sharding and NoSQL.
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
- Venice DB is used for Windows device data. Rich dimension and metrics data types.
- Petrabyte level queries look like just sql query but under the hood is more complex.
- Fails pretty frequently so calculate average failure rate.
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
- Make sure to choose a good tech stack when you make a product, something that doesn't cost an arm and a leg and open source if you need to migrate later.
- Heroku single node postgres database wasn't working and they didn't understand this because they didn't know what was under the hood.
- Citus helped them because it sharded the tables very easily.

joeybodoia commented 2 years ago

Scaling Instagram Infrastructure ⁃ As a work around for the consistency and scaling issue that occurred when memcache was cross region, Instagram used a lease-get and fill/wait/usestale/hit call and response with the database. In this system, the memcache tells the calling application that it has permission to go to the database to fill the cache and subsequent calls do not schedule duplicate DB lookups for the same cache miss. ⁃ With code analysis tools, Instagram gets a lot of visibility into the code base: what paths through the code are being taken and what are the bottlenecks. This was used to alter their url generation usage and additionally rewrite some key python functions in Cython (rewritten in C to be used within the python codebase). ⁃ In addition to scaling out (number of servers) and scaling up (capacity of server), Instagram also created a framework (Tao) to abstract certain DB/caching functions to simplify models for devs and shorten ramp up time and feature development time.
The Evolution of Reddit.com's Architecture
- For expensive procedures such as voting or submitting a link, reddit defers these procedures to an asynchronous job queue (via RabbitMQ), allowing processors to handle them later.
- For listings, which are an ordered list of links, Reddit caches a list of IDs, paired with sort information, corresponding to the links in a given listing in memcache, allowing for quicker lookup. When listings are updated, such as when a new vote takes place, the cache listing is updated, and it uses the sort information that corresponds to the id of the link to decide whether that link needs to move up in the listing. Being able to mutate the cache listings allows for Reddit to not have to do expensive select queries
- Reddit ran into an issue with vote queue pileups during peak traffic hours. This resulted in reddit votes taking hours to process. It’s interesting that adding more processors did not help this issue. It turned out this issue was related to the lock that was put on the listing that is trying to be updated with new information on the votes. Partitioning the vote queues was able to solve this issue.
Postgres at Pandora
- Pandora initially used Oracle but moved to Postgres because of costs and also a desire for going open-source.
- Pandora uses a separate database to store historical data. This includes data on size of the entire cluster instance (PGDATA), and statistics from every database in that cluster.
- Pandora created a proprietary database called CLUSTR, which does not have ACID guarantees. The motivation for this was high availability, but the compromise is consistency of data.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
- Adjust uses Redis to save all requests that are processed by backend servers. They then have a process that reads in from Redis and stores it in Postgres. Using Redis in this way allows for restarting Postgres without interrupting the backend.
- Through their Materializer, Adjust is able to perform a many-to-many MapReduce procedure from their backend servers to analytic shards, allowing for increased performance.
- Adjust creates their own custom 1-byte enums, which allows for representing things such as country names in a single byte, which helps with alignment issues.
- Adjust changed their autovacumming settings such that autovacuum kicks in when 150k rows are dead plus 0% of table rather than the default of 50 rows + 20% of table.
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
- Leboncoin uses pg_dump for backups. All of their pg_dumps are encrypted and sent to the cloud because they want their backups to be off-site ⁃ Restoring a postgres database from a pg_dump (testing the dump) can reveal corruptions in the database, like violating foreign key constraints or bad stored procedures. Testing pg_dumps also allows you to measure the time to restore a pg_dump, this insight can help you make it faster by adding more parallel workers. ⁃ When there are minor version upgrades, the dependent site can stay up through the update with the postgresql server set on standby and only a few seconds outage. Using logical replication there is very little downtime when upgrading a Postgres server for a major version upgrade as well, taking only a matter of seconds compared to 3 hours before the use of logical replication.
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
- How you handle Postgres changes as db gets larger: what works for 1GB db doesnt for 10TB db
- pg_dump should be used to backup dbs that are relatively small (<10GB), but do not work well for larger dbs. Point in time recovery backups (PITR) are better for larger dbs. PITR backups allows you to not have to copy the entire database nearly as often.
- For databases that you can't fit entirely in memory, aim to fit at least the top 1-3 largest indexes.
Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought) ⁃ Citus allows you to make your Postgres database distributed across many servers. In many cases, despite the additional network routing to the working nodes, many commands are quicker (like many inserts, instead leveraging the copy operation) because of parallelization. ⁃ Queries using “Unique” must within the distributed column because the server only knows if it is unique within the working machine. Similarly, there are complications for aggregation queries and the coordinator node sometimes needs to combine results returned by worker nodes. In sharing multiple tables on the same distributed column then you also still have the ability to create foreign keys and use joins as normal very efficiently (still possible otherwise but slower as a low is sent across the network). ⁃ The coordinator node can delegate queries when it sees that the where clause is on a distributed column.
Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus) ⁃ As part of Citus, there are some additional aggregate functions (todigest_percentile) that the coordinator node recognizes and uses in all of its merge coordination under the hood (in a distributed map reduce way). ⁃ In order to work around the count-distinct issues that occur in a distributed database, they use the hyperloglog algorithm for the query to create the denominator. ⁃ To reduce table size and reduce I/O time, much of the input data is in json format during staging but only parts of the original json object are selected and saved in hstore (key, value) columns.
Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company) ⁃ In the earliest stages of the startups history, scale was not as important as the product, but down the road too much data was being pulled into memory for a single node database and causing major latency issues cross-customer when a particular customer was creating a large report. ⁃ In order to make it through the scale up to 1 million uses the company temporarily used Firebase (on cloud credits) to patch the problem, but ultimately sharding on website_id was the permanent fix. ⁃ Reindexing their tables concurrently (with latest Postgres indices) and clearing out data from hot storage (older than 1 year) they were able to reduce cost when reaching the multibillion event level. They also hit the max value limit for the primary key and had to migrate from a smallint to a bigint by adding a new indexed column, backfilling it and then making it the new primary key.

samcbogen commented 2 years ago

Scaling Instagram Infrastructure

There is a issue with cache inconsistency, and it is solved via a daemon on the postgres replication
URL generation cost is lowered via C or C++
Memory usage is limited by removing garbage collection for Python, increasing its capacity by over 20%

The Evolution of Reddit.com's Architecture

Listing is the basis of how Reddit works. Reddit selects links and caches the list of link id's through numcache.
If some threads are really busy, these threads will take up much of Reddit's resources, so Reddit manually marks the "busy" threads and moves them to a different queue (which they created) called "fastlane"
It can be really hard to manage the tree structures, so extra maintenance is required to ensure that ordering is correct since they are order-sensitive.

Postgres at Pandora

By storing their historical statistics regarding the database's activities, Pandora can more easily find the sources of errors when they occur, making them easier to fix.
Pandora uses CLUSTR in shared databases, which sacrifices a bit of consistency for more availability because CLUSTR allows Pandora to more easily update and upgrade the database.
Postgres replaced Oracle at Pandora once it was introduced, and Postgres was significantly cheaper than the overly expensive Oracle.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

Although Postgres is spreading through the community, there are not enough people with the knowledge and experience with Postgres, so its growth is somewhat stunted as a result.
Adjust uses their Materializer to complete MapReduce procedures from their server to analytic shards, and the shards perform the second stage aggregation. This increases performance.
Autovacuum cannot necessarily catch up on larger tables, so a batch script was created to introduce changes to these servers in order to keep their system from overloading.

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
- Hardware is incredibly expensive, and it can directly correlate with availability. It is better to buy more expensive, higher-quality hardware than cheaper hardware because the risk of corrupting databases is very real, and fixing this can be WAY more expensive than just buying high-quality hardware.
- pg_dumps must be continuously tested, and the data should be manually backed up.
- Many large databases do not fit within memory, and this will start to make your queries perform significantly worse.
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

Replication lag exists and is very real. This makes load balancing extremely important, and avoiding synchronous read and write situations is an absolute must for software developers.
If you have many (a large number) of tables in a database, do not increase autovacuum_workers on it!
Indexes should only be used if you are speeding up queries. Otherwise they can slow query planning and take up precious space/memory.

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

Distributed tables use hash partitioning across multiple worker nodes which are based on the distribution column values.
Parallel operations are one of the best methods to build pipelines.
Citus originated as a startup in Turkey, and since Microsoft acquired it, it is offered as a service on Azure (which is owned by Microsoft).

Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

A measure is a set of time-series data regarding specific execution scenarios including tracking start menu launch latencies, audio/video quality, etc.
You should use JSON for staging tables since it has a high flexibility.
To report tables with higher dimensions, you can use partial covering indexes as a substitute for the very expensive IOPS.

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

ConvertFlow customers began to complain about time balance when viewing their reports during the time that ConvertFlow was scaling up to 100 million events, and firebase served as a temporary solution to this problem.
ConvertFlow moved to Citus because scaling with firebase was unsuccessful and ConvertFlow ran out of credits with firebase. One main fix was sharding on website_id.
ConvertFlow seems to fix issues as they come up rather than working on possibilities, and it seems to work for them.

alex-muehleisen commented 2 years ago

Scaling Instagram Infrastructure
- 'Scaling' can be seen in many different ways at firms, but it always consists of continuous effort, alertness, and holding everyone at the company responsible.
- Storage needs to be consistent across data centers
- Being ready to fail and having back-up plans in case of failure is crucial to running a successful business.
The Evolution of Reddit.com's Architecture
- Stack components change all the time, and even major components remain a work in progress at most tech companies.
- Comments on Reddit are threaded, meaning that replies are nested. This makes comment trees very expensive.
- Reddit sets 'Queue Quotas' (maximum queue lengths) to ensure that no one can consume all of their resources.
Postgres at Pandora
- Monitoring activity in Postgres is incredibly important. This includes monitoring current activity, the Postgres log file, Query Durations, and errors.
- Accessing Historical Data can offer you the chance to make updated growth predictions on databases and tables, as well as reconfigure any database architecture if necessary.
- All production databases are replicated. This is generally considered good, common practice.
PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of Postgres)
- A Materializer aggregates data from numerous servers and transfers it. It functions similar to MapReduce we learned in class, but with the complication of transferring data across multiple servers.
- Postgres' autovacuum functionality is incredibly important and can be 'tuned' to activate once a certain percentage of the table's tuples are dead. This helps to minimize disk usage and avoid rewriting tables.
- Postgres is huge and even 400TB worth of an analytics environment is not enough to outgrow Postgres.
Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)
- Bad hardware is bad for your data.
- Version upgrades are dangerous for businesses; their detriment can be limited by capping maintenance breaks and limiting service disruptions to only specific sections of the website.
- NoSQL and MySQL engines can be used in conjunction with the Postgres engine.
Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)
- What works for a 1GB database doesn't work for a 10TB database.
- Postgres has high availability, through direct streaming, basic WAL archiving, or manual failover.
- In regard to memory, the effective cache size should be greater than the largest index, ideally.

abooli commented 2 years ago

Instagram:
- utilizing postgresql to store the metadata about a post
- removing garbage collection helps with more effective use of memory
- scaling up in a sense that means to use as few CPU instructions as possible
Reddit
- the listings were computed with a simple SQL query (although it looks way cooler than that)
- the alerts indicating replication in 2011 were ultimately solved by finding the phantom data and removing them.
- the thing object was eventually decided to be pulled out from r2, which manages the complexity of the reddit infrastructure
Pandora
- Pandora originally used Oracle, but later decided to switch to postgres due to cost and open-source
- the Pandora database consists of three classes: the Nexus (generates music meta data), the Meta Data, and the radio (the stations and other info)
- pg_dump was a very slow replication to use for a database as big as Pandora’s (out-of-space issue is the 1% fail)
PostgreSQL
- Adjust uses ~24 backend serves, and around 20TB per backend server gets stored.
- the materializer aggregates new events/data from back-end servers
- custom 1-byte enums helps Adjust with managing complex data
Large databases
- a mix of sql and nosql was incorporated in the system
- postgresql could safely handle scale-ups if enough attentions were paid
- the WAL log is important for managing data back-ups and recovery
Breaking PostgreSQL at scale
- pgBackRest could be used to replace pg_dump
- work_mem should be kept at around 256 MB based on the logs
- the shared_buffer size should be set to 16-32 gigabytes
Citus
- citus supports fast bulk loading and parallelized COPY
- citus comes in handy when the amount of data is beyond 100GB
- distributed tables can be co-located
Data Modeling
- a measure is time series data about a specific use case executed
- JSON is a good data type to store data for due to its flexibility
- citus provides built-in data types.
ConvertFlow
- creating indexes is important for scaling up and optimizing query speed
- it is important to expect parts of your stack and vendors to change with scale
- periodically clearing old stale data from hot storage is important for maintaining disk space

anyu-yu commented 2 years ago

Instagram

The caching consistency problem was resolved by using the postgres daemon
Metadata that is used: whether a specific feature is used in a request
Instagram also scales the dev team according to product manager needs

Reddit.com

Major components of the architecture of Reddit include r2 and S3.
Instead of running search for reddit listing, Reddit uses a cache of IDs.
One of Reddit's issues was related to finding data that was no longer there due to internal system of getting rid of old data.

Postgres was a great alternative to Oracle, which was really expensive.

axelahdritz commented 2 years ago

1) Scaling Instagram Infrastructure

Instagram stores its user, media, and friendship data in Postgres. Write requests are done to a master data center while read requests are handled by replicas of the data in other data centers
Instagram fixed the memcache inconsistency problem across regions by writing and then replicating the postgreSQL first across data centers and then invalidating the caches of both, so that the user looking for latest information is redirected to retrieving the info from the postgreSQL server… thus getting up to date information!
One way that Instagram decreased the huge loads on the PostgreSQL servers during high-activity times of day was with memcache lease which runs lease-gets to the memcache that redirects some user requests to the database, and others are given stale values in the cache, or “waits” if the users refresh/retries the query in which case their request also goes to the database.

2) The Evolution of Reddit.com's Architecture

Rather than querying directly to PostgreSQL and maintaining a cache in memcache for listings, Reddit stores both the ID and the sorting information in the cache, mutating it in place when votes occur et cetera, which requires a lock. This transforms the cache almost into a denormalized/persistent index of links which avoids expensive queries to PostgreSQL which is stored in Cassandra.
The locks on the caches created really large vote queues at times. Reddit solved this by partitioning the vote queues by sub-reddit so that processers weren’t vying for the same locks, and later on, they further split up the queries by type (to solve the cross-partition domain issue).
Reddit solved the data-loss problems on their read-replicas (when the larger primary write PostgreSQL server was down) by creating better permissions and fixing their indexing so that data wasn’t accidentally written to replicas and thus lost from the primary.

3) Postgres at Pandora

Pandora’s database backups from the WAL stream sometimes fail when the database is under load.
Pandora stopped using Slony replication for various reasons relating to their release engineering and other errors (which I didn’t quite understand) and switched to using streaming replication which opened up a whole other world of problems (for instance extremely large WAL logs which sometimes meant the databases were out of sync, and it was also difficult to test new PostgreSQL versions) and as a whole created clusters which were not in read-write mode continually. To solve this, they created their own distributed database called CLUSTR which has the downside of not being ACID guaranteed (which isn’t much of a problem for them apparently) but has high availability. This manages PostgreSQL databases differently.
Pandora has issues with the transactionID wraparound problem on the AUTOVACUUM (seems to be a common and pervasive issue even for large companies).

4) PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

AdTech uses many-to-many map reduce procedure (their materializer) to allocate incoming data from the backend to the shards on which data-analytics are performed
Serious problem of PostgreSQL growing in popularity without as many people that are knowledgeable in both theory and practice to fill roles
AdTech uses many custom data types which reduces the byte usage in a micro-management way which racks up when there are terabytes and terabytes of data

5) Why Google Stores Billions of Lines of Code in a Single Repository

The single repository system Google uses allows for single changes rather than duplicating changes across many repositories… allowing for scalability.
Google’s repository is massive (1 billion files), with over 45 thousand commits every day, most of which is driven by automated source control system that commits 30 thousand of the daily commits… this is what is truly growing.
Another advantage of Google using a single, monolithic repository is in the process of codebase modernization. Rather than having all of the different teams creating their own tools constantly, there can be cross-platform teams that centrally perform these updates when old API’s need to be changed et cetera, keeping all coders up to date with the same tools and all using the same dependencies.

6) Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

To get high availability on small databases (10 GB), just use a primary write database and a secondary read database. Don’t mess around with the config files too much, and turn on a lot of the logging information which are disabled by default (it will fit).
For larger data needs (but not huge data), switch from using pg_dump to point in time recovery (PITR) in which the filesystem is copied and the WAL is archived at particular points in time (allowing for backups to happen to secondary instances and one can restore to particular points in time if need be).
For huge data, think about sharding the databases by geography, enterprises, and such.

7) Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

Citus offers many features to help scale up PostgreSQL, such as the ability to have a single table be distributed across multiple servers such that a table can be of any size. These are paired with reference tables which are like small, normalized representations on each server helping to eliminate duplicates.
Citus has far better copy performance over regular postgres as it deals out rows to each of the worker postgres nodes in parallel. This parallelization also extends to select and other kinds of operations on distributed tables.
Microsoft Windows runs using citus to keep track of performance/diagnostic analytics.

8) Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

Lots of queries they run use many high dimensionality columns like DeviceID in order to get device-centric aggregation which is computationally expensive to aggregate/operate on, which is why the dynamic compute offered by Citus is so good!
Citus is used because it has rich data types (like HyperLogLog to solve runtime issues on certain calculations), distributed info while allowing the same SQL queries to be used, and is totally scalable to any size.
Instead of just using JSON, internal postgres hstore data type is also used by Microsoft Windows because it is smaller and customizable.

9) Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

“Premature optimization is the root of all evil” is a great takeaway from the talk, and good for start-ups to think about as they are getting off the ground. At the early stages, it is better to think about the product and the market.
ConvertFlow moved onto Citus rather easily as they had started with PostgreSQL, and sharded their single node database to a cluster of nodes. Adjust the stack based on data! Analytics especially good on Citus.
Periodically put old data from over a year in cold storage rather than keeping it all in active “hot” storage which sped things up dramatically as they moved to more and more events. Also they began to reindex distributed indexes which had become bloated.

Bhargavaa1 commented 2 years ago

Scaling Instagram Infrastructure

Instagram handles 100 million photo/video uploads and 4+ billion likes a day.
Storage needs to be consistent across data centers (and withstand disasters/storms) and computing is typically stateless/temporary with the main factor being user traffic.
PostgreSQL is used for user, friendship, and media data while Cassandra is used to store user feeds and activities4.

The Evolution of Reddit.com's Architecture

Reddit is the 4th most popular website in the United States and receives 320 million monthly active users.
The CDN send requests to distinct stacks depending on the path, cookies, and domain.
The "Thing" data model uses PostgreSQL and stores the core data of Reddit (links, subreddits, comments, and accounts)

PostgreSQL at Pandora

Pandora started out using Oracle but switched to PostgreSQL as Oracle was an expensive product and consulting service.
The six DBMS classes Pandora used were Nexus (generates metadata about music), Music MetaData, Radio (stations, ratings), Data Warehouse, and Clustr.
Pandora first used PG_DUMP for backup but moved to PG_BASEBACKUP as PG_DUMP is considered to be slow and had problems with blocking.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

Adjust ensures that advertisers are fair and no fraud is being committed in pay-per-install advertising activities.
Adjust receives over 100k requests per second and has 400+ TB of data to analyze.
The materializer at Adjust aggregates data from numerous servers and distributes to a lower number of servers through a function similar to mapreduce.

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

Databases tend to be the most difficult part of the stack to scale out
Leboncoin has two data centers, one cloud provider, two thousand virtual machines, and a tech team of more than two hundred people
Leboncoin uses AWS CloudFormation to automate cloud deployment and puppet to automate tasks after installation.4.

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company) Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

PostgreSQL can handle databases of any size and even worked on data that was multiple petabytes large.
Christophe Pettus advises us to get into the habit of planning our upgrade strategy as falling behind on major versions for even small databases such as 10GB might be a problem.
Queries tend to decrease in performance and pg_dump backups take too long to restore but the database contains 100GB of data

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

Citus provides PostgreSQL with many benefits of distributed databases including query routing, distributed transactions, and parallelized COPY
Citus should be utilized on data that is 100GB and beyond or applications that are complex (Saas, time series, loT)
Data in distributed tables have certain limitations such as unique constraints and limitations for primary/foreign keys and joins

Data modeling, the secret sauce of building & managing a large scale data warehouse

Min Wei defines a measure to be time series data about a certain use case executed on a specific device with two levels of dimensions.
Microsoft uses JSON for staging tables (beginning of the data mesh where data types are quite complicated) and hstore for reporting tables. Another interesting aspect of Citus is the use of indexes and their ability to convert IO transactions from random to sequential.
Citus is quite beneficial because it has built-in data types that are common for data, scalable, and serves a “distributed SQL engine.”

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

ConvertFlow allows marketing teams to develop websites and apps without using code. This new style of code is called “low-code” and makes code more accessible to others.
The transition from 0 to one million events had fast queries, loose storage limits, and indexes to speed up operations
The transition from one to one hundred million events was quite difficult as scaling did not function and caused numerous reports to time out. The Microsoft team used Citus instead of Google Firebase to approach this problem as Firebase tends to be more expensive!