mikeizbicki / cmc-csci143

big data course materials

State of the industry video assignment #170

Closed mikeizbicki closed 2 years ago

mikeizbicki commented 2 years ago

Due date: Monday, 2 May

Background: There are 9 videos linked below. Each video describes how a major company uses postgres, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to get the gist of them all.

Instructions: Watch each video and write 3 facts that you learned from the video. Submit the assignment by replying to this post with your facts for all of the videos in a single reply. The assignment is worth 1 point per video.

NOTE: I realize that many of you have a lot going on right now, and so I won't be offended if you decide to "punt" this assignment. The point-value is intentionally small so that it will have a minimal impact on your grade if you're not able to complete it. That said, I think this is one of the more interesting assignments in this class and so I'd recommend you find time to watch the videos.

Videos:

  1. Scaling Instagram Infrastructure

    https://www.youtube.com/watch?v=hnpzNAPiC0E

  2. The Evolution of Reddit.com's Architecture

    https://www.youtube.com/watch?v=nUcO7n4hek4

  3. Postgres at Pandora

    https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38

  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    https://www.youtube.com/watch?v=BgcJnurVFag

  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    https://www.youtube.com/watch?v=4GB7EDxGr_c

  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    https://www.youtube.com/watch?v=eZhSUXxfEu0

  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    https://www.youtube.com/watch?v=kd-F0M2_F3I

  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6

  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

    https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

mikeizbicki commented 2 years ago

Chuck asked good questions in class about monolithic architectures vs microservices and mono repos vs polyrepos. The two videos below address these questions. If you would like to watch these videos, you may substitute them for any of the videos above. (So the total number of videos you must watch is still 9, you now just have more choices.)

  1. Mastering Chaos - a guide to microservices at netflix

    https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s

  2. Why Google Stores Billions of Lines of Code in a Single Repository

    https://www.youtube.com/watch?v=W71BTkUbdqE

    (If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)

Also, I'm just now realizing I never mentioned the kubernetes documentary in class. (Recall that k8s is like docker-compose on steroids.) The documentary covers the history of docker and k8s and some of the technical differences between the two. So you may also substitute the k8s documentary for one of the videos above if you'd like.

  1. Note that the k8s documentary has 2 parts, and you must watch both parts to count as a single video

    https://www.youtube.com/watch?v=BE77h7dmoQU

    https://www.youtube.com/watch?v=318elIq37PE

Tonnpo commented 2 years ago

Is there partial credit for this assignment, or do we need to complete all 9 videos?

Thanks very much!

mikeizbicki commented 2 years ago

@Tonnpo Yes, there's partial credit. You'll get 1pt/video.

jzlilili commented 2 years ago
  1. Instagram

    • fixes the cache inconsistency problem by running a daemon on the postgres replicas instead of updating memcache directly
    • scales up by minimizing the number of CPU instructions and the number of servers used to execute those instructions
    • disabled garbage collection to minimize memory usage
  2. Reddit

    • a lock-contention problem caused vote queue pileups, which was solved by partitioning the queue on subreddit id
    • Things are Reddit's oldest data type and are represented by two tables: a Thing table and a data table
    • parent-child relations of comment trees are stored in a denormalized listing and deferred to offline job processing
  3. Pandora

    • Hadoop cores outnumber Pandora employees about 32 to 1 (as of 2017)
    • Pandora automatically stores historical data to better identify abnormal activity
    • uses CLUSTR with sharded databases, trading consistency for availability
  4. 20TB and Beyond

    • byte counting saves a small amount of space per row but pays off when working with large amounts of data
    • the default settings for autovacuum were changed because it was accumulating too much data between vacuums
    • data is aggregated using an incremental map reduce between servers (seven total stages)
  5. Leboncoin

    • databases are usually the hardest part of the stack to scale out
    • good hardware is required for reliable software (always raid 10 for disks, ecc ram is the most important thing when it comes to quality)
    • it's important to test pg_dumps for possible corruption and to measure restoration time
  6. Breaking Postgres at Scale

    • PITR backups take a file-system-level copy of the database and save the WAL segments generated after the file system copy starts
    • replication lag can cause reads after writes to be unable to pick up on the information that was just written
    • indexes take up disk space and insert time, and add to planning time, so indexes should be added in response to query patterns, not because they might be useful
  7. Citus

    • uses a coordinator node to manage multiple worker nodes
    • Citus can query data from shards in worker nodes in parallel
    • has three different implementations of insert..select: co-located, repartitioning, and merge step, which can each handle around 100, 10, and 1 million rows per second, respectively
  8. Data Modeling

    • hundreds of measures are used to track specific data about latency, bluetooth connectivity, wifi network connectivity, audio and video quality, etc.
    • device-centric aggregation, or "one device, one vote", is used to reduce noise when gathering data
    • uses JSON for staging tables, for its flexibility and ability to handle many different data types, and Hstore dynamic columns for reporting tables, for their smaller size
  9. ConvertFlow

    • it's best to focus on developing a product before scaling, focusing on customers and not overengineering the stack
    • clearing out old data from hot storage optimizes margins and speeds up operations
    • when reindexing tables, make sure to reindex concurrently to avoid locking up the table unnecessarily
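
A minimal sketch of the last two points above (hypothetical `events` table and index names), showing a batched delete of cold data followed by a non-blocking index rebuild:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
conn.autocommit = True  # REINDEX ... CONCURRENTLY cannot run inside a transaction block

with conn.cursor() as cur:
    # Clear old data from hot storage in small batches so each delete
    # holds locks briefly and generates a bounded amount of WAL.
    while True:
        cur.execute("""
            DELETE FROM events
            WHERE ctid IN (
                SELECT ctid FROM events
                WHERE created_at < now() - interval '2 years'
                LIMIT 10000
            )
        """)
        if cur.rowcount == 0:
            break

    # Rebuild a bloated index without an exclusive lock on the table
    # (REINDEX ... CONCURRENTLY requires PostgreSQL 12+).
    cur.execute("REINDEX INDEX CONCURRENTLY events_created_at_idx")

conn.close()
```
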
leafsphere commented 2 years ago
  1. Instagram
    • The caching consistency problem was resolved through the Postgres replication mechanism to invalidate the memcache in its own local region, letting users from different regions get the latest comments.
    • The high cost of URL generation was reduced by Cythonizing or using C/C++ for functions that were stable and extensively used.
    • In scaling up, to reduce the memory requirement and run more processes, they reduced code (running in optimized mode, removing dead code) and moved some private memory into the shared area, where only one copy is needed: configuration data was moved into shared memory and Python garbage collection was disabled, leading to a capacity increase of at least 20%.
  2. Reddit
    • Listing (of links) is the foundation of Reddit. The way it's done is by selecting the links and caching the list of link IDs in memcache, then looking up the links by their primary key. The cache is invalidated on new submissions and votes.
    • Running the select query is still expensive even if you invalidate it, so in a process like voting, where you already have all the info needed to update the cached listing, you can mutate it in place without rerunning the query: store the ID, as well as any related sort information, fetch the current cached listing, and perform a read-mutate-write operation.
    • When you have a bunch of votes on hot posts at the same time, you run into lock contention. To fix this, around mid 2012, vote queues were partitioned, putting votes into separate queues so the same total number of processors was divided across partitions and far fewer were vying for the same lock at the same time.
  3. Pandora
    • Replication is used for a high availability solution; all production databases are replicated, and there is both a local replica and remote replica (Disaster Recovery).
    • Originally, all replication was Slony-based, but it was a very intrusive replication system. It adds its own triggers, which made it possible to break referential integrity, and it caused a three-fold increase in writes on the production database. The fragile system tended to fall over a lot, but it was able to replicate between different Postgres versions.
    • Eventually, Pandora started using streaming replication with WAL shipping, which was better than Slony in most ways. It was not as intrusive as Slony, but you could only replicate between the same Postgres versions.
  4. PostgreSQL at 20TB and Beyond
    • Saving in Redis, reading from Redis, and storing in Postgres means that restarting Postgres doesn't interrupt the backend. You typically get 20-24 TB from each backend, and the data is shipped to archive servers when it gets close to that.
    • To aggregate so much data, Adjust does an incremental MapReduce from one set of servers to another set of servers. The shards themselves do the second stage aggregation.
    • Autovacuum may not be able to catch up on huge tables, so the eventual resolution to the problem was creating a batch script to roll out changes to servers gradually to avoid overloading the system.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud
    • In Paris, the data centers are mostly located along the Seine River; if the river floods, several data centers would be flooded at the same time. The network paths between the data centers should be doubled and ideally multiplied.
    • pg_dump is great for long term storage. It can also test the restore as well as time the restoration, which can be sped up by running parallel processes.
    • leboncoin uses Xyman to monitor alerts, and it's connected to PagerDuty. Some metrics are sent to Datadog for developers and other users to see graphs in real time, but for infrastructure, Cybertec's pgwatch2 is used to read database statistics. The front end uses Grafana.
  6. Breaking PostgreSQL at Scale
    • By enabling temp file logging, you can see how big the temporary files created by queries are and set work_mem to 2-3x the size of the largest temporary file. If that would require something big like 16 GB, then you should fix the query… or start thinking about more memory.
    • If doing full PITR backups is taking too long, you should start doing incremental backups (pgBackRest does them out of the box). Generally you shouldn't increase shared_buffers past 32 GB, since it doesn't benefit actual performance too much.
    • Setting maintenance_work_mem too high will cause autovacuuming to have a hard time finishing; if most indexes are larger than 2 GB, you'll generally have better performance with 256-512 MB.
  7. Citus PostgreSQL at any Scale
    • Citus started as a startup in Turkey and is now offered as a managed service on Azure as a part of Microsoft. It's an open source extension for PostgreSQL that turns it into a distributed database.
    • Data is distributed across many PostgreSQL servers, which means you can basically have tables of any size and aren't restricted by the memory and hardware limits of a single machine. There are also reference tables, but they must be small because they have to be replicated to every server in your Citus cluster, and they are often slow to write because writes need to go to all servers. However, you gain the ability to join them with a distributed table on any column.
    • In addition to select and copy, Citus supports handling many concurrent transactions that each hit one worker node, since such a transaction only uses a single process on a single worker; by contrast, you can't handle many parallel cluster-wide updates concurrently.
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse
    • A measure is a time series data about executing specific scenarios on a Windows 10/11 device; i.e. tracking start menu launch latencies, audio and video qualities. The measure data has a dimension column (DeviceID, build revisions, etc.) and a metric column (count/Int, value/Float, <key, value> types, and more complex types like histograms).
    • PostgreSQL supports Hstore (key-value storage) and JSON. JSON/JSONB is often used for staging tables as it can handle a lot of complex data types; during the cooking stage, they either flatten the JSON types or take subsets of the JSON columns and put them into Hstore types to serve as dynamic columns. It seems like Hstore types are smaller, which means less I/O when accessing the data in the buffer pool.
    • Partial covering indexes are also used for reporting tables of high dimensions. IOPS is very expensive to buy in a cloud setting, so partial covering indexes can be used to manage large-scale data. However, you still have to be careful with how many total indexes you create because they do add up.
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events
    • In the process of scaling from 1M to 100M events, customers started complaining about slow load times when viewing reports. Simply scaling up was not working, since the amount of data that needed to be brought into memory for each query was far too large for a single-node database. To solve this, they turned to Firebase, writing and mirroring events to Firebase to temporarily solve the problem.
    • But they were rapidly running out of credits, so they dropped Firebase and migrated to the Postgres extension Citus, which made it easy to shard the analytics tables. As a result, queries ended up being 2x-4x faster. Thus, you should expect parts of your stack and vendors to change with the scale of your data.
    • Scaling analytical workloads benefits from sharding tables by a customer tenant ID column and upgrading from a single-node database to a cluster of nodes consisting of a coordinator node and multiple worker nodes (distributing the workload across multiple database nodes), all of which Citus makes easy to do.
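
A minimal sketch of the tenant-id sharding described in the last point above (hypothetical table and column names; assumes the Citus extension is installed and preloaded on the coordinator):

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical coordinator node
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS citus")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            customer_id bigint      NOT NULL,
            event_id    bigint      NOT NULL,
            created_at  timestamptz NOT NULL DEFAULT now(),
            payload     jsonb,
            PRIMARY KEY (customer_id, event_id)  -- keys must include the distribution column
        )
    """)
    # Citus hash-partitions rows across worker shards by customer_id,
    # so single-tenant queries are routed to a single worker.
    cur.execute("SELECT create_distributed_table('events', 'customer_id')")
conn.close()
```
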
Luew2 commented 2 years ago

Instagram

Reddit

Pandora

PostgreSQL

Large databases

Breaking PostgreSQL at scale

Citus: Postgres at any scale

Data modeling, the secret sauce....:

Lessons learned scaling our SaaS on Postgres....:

kingeddy11 commented 2 years ago

Instagram:

Reddit:

Pandora

20TB and Beyond

Large Databases, Lots of Servers, on Premises, in the Cloud

Breaking Postgres at Scale:

Citus:

Data Modeling, the secret sauce of building & managing a large scale data warehouse

Lessons learned scaling our SaaS on Postgres to 8+ billion events

Tcintra commented 2 years ago

Scaling Instagram Infrastructure

1) To scale out means to build an infrastructure that allows us to use more hardware/servers when we need them; to scale up means to make each of those servers count; and to scale the dev team means to enable a fast-growing dev team to move fast without breaking things.
2) In order to be ready for disasters and power outages, companies like Facebook conduct regular drills to make sure their services can serve users seamlessly even with the loss of a region, power outages, and human errors.
3) Moving data centers closer to where users are reduces the latency of users' interactions on Instagram.

The Evolution of Reddit.com's Architecture

1) The load balancers take in requests from users and split them up into various pools of application servers in order to isolate different request paths.
2) By partitioning vote queues based on the subreddit of the link being voted on, fewer processors were vying for the same locks concurrently, which reduced the amount of time it took to process votes on Reddit.
3) Tree structures can be tricky to manage and require extra maintenance/cleanup because they are sensitive to ordering.

Postgres at Pandora

1) Implementing an error monitoring system for PostgreSQL can help identify problems faster and earlier, such as long-duration transactions that exceed a threshold or blocked processes.
2) Pandora uses replication as a high-availability solution for events like disaster recovery.
3) When making major updates to the database, such as upgrading to a new version of PostgreSQL or deploying a database schema change, users must be switched over to a read-only replica; in order to avoid this problem, Pandora utilized CLUSTR.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

1) Requests that come in from the internet are directed to the next available back-end by HAProxy.
2) Hiring for these intensive database jobs is very difficult because they can't hire junior developers and the environment is demanding with little room for error.
3) It's very painful to change your data model at a large scale (multiple terabytes).

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

1) Sometimes data can be subject to natural disasters: if the Seine river floods, many Parisian data centers would be compromised.
2) When scaling your startup, your RDBMS is usually the most complicated part of your stack to manage.
3) It is important to continuously test pg_dumps, but also to maintain physical backups for large databases.

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

1) Replication lag can mess with the ACID properties of the database, such that some reads won't pick up information from writes made in an earlier transaction.
2) Indexes take up disk space and slow query planning, so they should only be created to speed up queries we know are frequently executed; they are not just good to have (see the sketch below).
3) Increasing maintenance_work_mem arbitrarily can slow down or break autovacuuming, so we must be careful when tuning this parameter.
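
For point 2, a minimal sketch of letting observed query patterns drive index decisions (assumes the pg_stat_statements extension is enabled and PostgreSQL 13+ column names; the connection string is hypothetical):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # The queries consuming the most total time are the ones worth indexing for.
    cur.execute("""
        SELECT query, calls, mean_exec_time, total_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
    """)
    for query, calls, mean_ms, total_ms in cur.fetchall():
        print(f"{calls:>8} calls  total {total_ms:>10.0f} ms  avg {mean_ms:>8.1f} ms  {query[:60]}")
conn.close()
```
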

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

1) To build pipelines, use parallel operations.
2) Citus is an open-source extension for Postgres that transforms it into a distributed database.
3) Citus was originally an open-source project that was started in Turkey. Since it was acquired by Microsoft, it is now offered as a service on Azure.

Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

1) Citus has a great distributed SQL execution engine.
2) Citus has built-in custom data types like HyperLogLog.
3) You should use JSON for staging tables.

Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

1) Create the right product before worrying about scale.
2) Tech stack decisions can change, and you will learn what they should be over time.
3) When facing slow database problems, there are probably solutions within your database ecosystem that you just need to look for and find, like Citus for Postgres.

yurynamgung commented 2 years ago

Instagram:

  1. Storage servers store global data that need to be consistent across data centers but computing servers are stateless, processing requests by user traffic and temporarily storing data on an as-needed basis.
  2. Memcache is a high performance key-value store in memory that allows for millions of reads/writes per second and reduces the read load on databases (postgresql)
  3. To scale up means to use as few CPU instructions as possible and to use as few servers as possible

Reddit:

  1. Splitting up queries into subqueries and partitioning them out into separate partitions can speed up processing time because it prevents them from vying for the same resources.
  2. In order to speed up the process of looking up nested comments, reddit stored the parent relationships of the comments in a denormalized listing so that they could figure out ahead of time which subset of comments they needed to show and then only looked up those comments in a different table.
  3. Sometimes certain threads can get very busy and hog resources for the rest of the website; reddit dealt with this problem by manually marking those threads and sending them off to their own queue called fastlane.

Pandora:

  1. By monitoring and logging database activity to establish what normal looks like, we can more quickly identify when abnormal activity is occurring
  2. Pandora stores historical statistics about their database and its activities (i.e. pg_stat views, table sizes and their indexes, etc) so that they can quickly identify the source of errors as they occurred.
  3. Trigger-based replication systems add their own triggers to your systems and disable your own triggers, which can make them a very intrusive replication system

PostgreSQL at 20TB and Beyond:

  1. The materializer aggregates new events/data from many back-end servers and copies the aggregations to analytics shards
  2. The Postgres database system has grown much faster in popularity than the DBA community has grown to fill the needs there
  3. Being able to converse about database systems is very important to getting hired to work on these intensive data-related projects, perhaps even more so than being able to contribute technically right away

Large Databases, Lots of Servers, on Premises, in the Cloud:

  1. You can pg_dump into cloud based platforms, as Gurgel explains they do with an AWS instance.
  2. Base backup frequency is determined by WAL/day: daily for > 1TB, twice a week for > 100GB, twice a month for < 100GB.
  3. If PITR backups take too long, then you should think about doing incremental backups to spread out work.

Breaking Postgres at Scale:

  1. While it might seem that increasing shared_buffers will always boost performance if your hardware can handle it, this is not the case. We should keep shared_buffers below 32GB for optimal performance.
  2. To avoid maintenance_work_mem breaking or slowing down the autovacuuming process, we should keep it below 256MB in a production database.
  3. If your query creates temporary files that exceed 16GB, you probably should rethink the logic for your query.
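
A minimal sketch of adjusting the settings above from Python (the values are illustrative only, not recommendations; assumes a superuser connection, since ALTER SYSTEM requires one and cannot run inside a transaction block):

```python
import psycopg2

conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True  # ALTER SYSTEM is not allowed inside a transaction block

with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET shared_buffers = '16GB'")         # takes effect after a restart
    cur.execute("ALTER SYSTEM SET maintenance_work_mem = '512MB'")  # reload is enough
    cur.execute("ALTER SYSTEM SET work_mem = '64MB'")               # reload is enough
    cur.execute("SELECT pg_reload_conf()")  # apply the reload-able settings

conn.close()
```
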

Citus: Postgres at any Scale:

  1. Citus was an open-source project that started in Turkey. It is now offered under the Azure platform and is thus part of the Microsoft suite.
  2. Reference tables must be replicated across every server in our Citus cluster, so they must be very small to avoid cluttering your cluster.
  3. We can use a coordinator node to manage the many worker nodes in our distributed database.

Data modeling, the secret sauce of building & managing a large scale data warehouse:

  1. Citus has rich built-in custom data types and is very scalable and reliable
  2. To do an aggregation calculation over 200 million devices, as Microsoft does, without further optimization would require an equally large amount of memory, which is very expensive
  3. In PostgreSQL you can add custom data types

Lessons learned scaling our SaaS on Postgres to 8+ billion events:

  1. Postgres is easy to deploy and scale, and you can stick with it both at the startup stage and in the long term for massive companies
  2. Citus data allows for sharding tables which helps scale databases to a much larger size
  3. Make sure to index tables concurrently if they are sharded across many nodes
cristywei commented 2 years ago
  1. Instagram
  2. Reddit
  3. Pandora
  4. PostgreSQL at 20TB and Beyond
  5. Breaking PostgreSQL at Scale
  6. Citus: Postgres At Any Scale
  7. Data modeling, the secret sauce of building
  8. Lessons learned scaling our SaaS on Postgres
  9. Why Google Stores Billions of Lines of Code in a Single Repository
dabalus commented 2 years ago

Scaling Instagram Infra

  1. Instagram uses the PostgreSQL system to store media, user, and friendship types of data
  2. Instagram uses Cassandra to store sample data such as user feeds, activities, and many more
  3. Instagram minimizes its memory usage by disabling garbage collection

The Evolution of Reddit.com's Architecture:

  1. The CDN sends requests to distinct stacks depending on the domain, path, cookies, and more
  2. Thing is the oldest data model in r2, and it is designed to allow extension within a safety net
  3. Reddit started out with a small number of users and over time grew its architecture to accommodate its growing user base

Postgres at Pandora:

  1. Postgres is free and open source
  2. Postgres is used in a DBMS-agnostic way because of its reliability
  3. Pandora initially used Oracle before Postgres was introduced

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of Postgres)

  1. istore, unlike Hstore, allows for modeling sparse integer arrays, gives the same benefits as using columnar storage, and can be used to model time-series data. It also allows for arithmetic operations: for example, when one istore is added to another, it automatically assumes that missing keys are zeros and adds the values of matching keys together to get their sum
  2. When aggregating a large set of data, Adjust does an incremental MapReduce: it maps and does the first phase of the aggregation on the backends, reduces and does the second phase of the aggregation on the shards, and finally does further aggregation on demand if needed
  3. istore supports GIN indexing, among others

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of Postgres):

  1. When running a database on off-the-shelf hardware, it is important to know that you can easily lose your database while working with bad memory, and this can corrupt your database.
  2. pg_dumps are important because of their size and ability to be kept for a longer period of time. Testing the restore of a pg_dump helps reveal corruption in the database.
  3. leboncoin is one of the most important websites for French people, after Google, Amazon, Facebook, etc.

Breaking Postgres at Scale:

  1. It's hard to go wrong with a small database on PostgreSQL, even with joins, unless they are fully N^2 or cross joins.
  2. The number of autovacuum workers should be increased only if you have a large number of database tables (500+), because each worker can only work on one table at a time.
  3. A ~10GB database can be backed up using pg_dump

Citus: Postgres at any Scale:

  1. Citus has the ability to shard tables across a cluster of PostgreSQL servers and transparently routes queries across the cluster, allowing the user to horizontally scale the database without losing PostgreSQL functionality.
  2. A Citus cluster consists of multiple PostgreSQL servers with the Citus extension
  3. Distributed tables can be co-located with other distributed tables

Data Modeling:

  1. It is very possible to scale Citus
  2. HyperLogLog is a data structure that can be used to calculate approximate distinct counts across millions of devices, making large aggregation problems scalable.
  3. JSON or JSONB can be used for staging tables because of their flexibility
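
A minimal sketch of the HyperLogLog idea in point 2 above (assumes the open-source postgresql-hll extension and hypothetical table/column names): per-day sketches stay tiny and can be unioned later for an approximate distinct count, instead of re-scanning the raw rows.

```python
import psycopg2

conn = psycopg2.connect("dbname=telemetry")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hll")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS daily_devices (
            day     date PRIMARY KEY,
            devices hll  NOT NULL
        )
    """)
    # Fold yesterday's raw events into a single HLL value.
    cur.execute("""
        INSERT INTO daily_devices (day, devices)
        SELECT created_at::date, hll_add_agg(hll_hash_bigint(device_id))
        FROM device_events
        WHERE created_at::date = current_date - 1
        GROUP BY 1
        ON CONFLICT (day) DO UPDATE SET devices = EXCLUDED.devices
    """)
    # Approximate distinct devices over the last 30 days by unioning the sketches.
    cur.execute("""
        SELECT round(hll_cardinality(hll_union_agg(devices)))
        FROM daily_devices
        WHERE day >= current_date - 30
    """)
    print(cur.fetchone()[0])
conn.close()
```
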

Lessons learned scaling our SaaS on Postgres to 8+ billion events...

  1. Create the right product before worrying about scale
  2. Postgres is easy to deploy and scale
  3. Expect part of your stack and vendors to change with scale
katiewu71 commented 2 years ago
  1. Scaling Instagram Infrastructure

    • Storage servers need to be consistent across data centers, whereas computing servers are usually stateless and hold temporary data. PSQL and Cassandra are used for storage, whereas Django, Celery, and RabbitMQ are used for computing.
    • To reduce the memory requirement and run more processes, Instagram reduced code and moved some of the private memory to shared memory. They reduced code by running in optimized mode and removing dead code. Configurations were moved into shared memory and garbage collection was disabled.
    • Checks and balances include code reviews and unit testing, further tests after the code is accepted and committed, and production canaries.
  2. The Evolution of Reddit.com's Architecture

    • r2 is the original monolithic application that is the oldest component of Reddit. It was started in 2008 and written in python, and includes data models like Things and Listings. However, new backend services, which are also written in Python, are beginning to split off from r2.
    • Vote queues would fill up at peak traffic hours, and votes could wait in queue for up to hours. Attempts to scale by adding more consumer processes actually exacerbated the problem, which was because of the cached query mutation locks. This was eventually fixed by using partitioning, which put votes into different queues.
    • Comments are stored as comment trees, which are sensitive to ordering. For megathreads, which tend to hog resources, Fastlane was introduced, which provides a separate queue for the comment thread so that it can be processed more quickly.
  3. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

    • Data is aggregated with incremental Map Reduce — Map and first phase aggregation is done on backends, and Reduce and second phase aggregation is done on shards.
    • Since Adjust writes once and then updates once to the PSQL database, old data needs to be cleaned up. However, the autovacuum is not able to keep up with millions of rows. Thus, performance suffers and the tables bloat.
    • To solve the autovacuum problem, they changed the autovacuum trigger requirement to 150k dead rows plus 0% of the table. This needed to be rolled out gradually to prevent overloading systems (see the per-table sketch at the end of this comment).
  4. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

    • There are various elements that ensure availability in PostgreSQL. Good hardware is necessary to prevent the loss or corruption of memory. Things to consider with hardware include warranty, power, RAID 10, battery, etc.
    • pg_dump is important because it is great for long-term storage and testable. Testing alerts you if there is corruption or bugs in the database.
    • Physical backups are necessary. Basebackups provide daily snapshots (this can be tailored to the size of the database - i.e. only twice a week for databases between 100GB and 1TB). This is used in conjunction with PITR for restoration.
  5. Breaking Postgres at Scale

    • PostgreSQL can handle databases of any size, including those that are multiple petabytes large. However, with small databases essentially everything will run quickly; large databases must be handled differently.
    • Large databases might not fit in memory, queries may start performing poorly, and pg_dump may take too long. A good rule of thumb is to try to fit at least the top 1-3 largest indexes in memory, with effective_cache_size being larger than the largest index. Note that memory does not help write performance.
    • PITR backups and pgBackRest can be used for database backups instead of pg_dump for large databases. PITR takes an entire filesystem copy, and more frequent copies will yield a faster restore.
  6. Citus: Postgres at any Scale

    • Citus provides distributed database capabilities to PostgreSQL users, which enables compact representation, high query throughput, fast bulk loading, quick queries, etc.
    • Citus is best used for 100GB or more of data, for multi-tenant applications and real-time analytics dashboards.
    • A Citus cluster contains several PostgreSQL servers with the Citus extension. Citus distributes tables across the database cluster and hash-partitions the worker nodes based on distribution column values.
  7. Lessons learned scaling our SaaS on Postgres to 8+ billion events

    • ConvertFlow started with Rails, Heroku, and PostgreSQL, three popular technologies. At early stages, queries were fast, it was easy to stay within storage limits, and optimizing meant creating indexes.
    • As the company and databases grew, Citus was used to help shard and distribute tables across multiple nodes.
    • ConvertFlow was also able to reduce data storage by clearing out old data from hot storage and reduce index bloat by reindexing.
  8. Mastering Chaos - A Netflix Guide to Microservices

    • Microservices are an approach to developing an application as a suite of small services. They enable modularity, scalability, and elasticity.
    • One service failing can lead to a cascading failure. Netflix resolved this with Hystrix, which provides a fallback to enable a customer to continue using Netflix rather than getting an error. Fault Injection Testing (FIT) was used to test the effectiveness of Hystrix.
    • Conway’s Law tells us that organizations that design systems must produce designs that are copies of the communication structures of these organizations.
  9. Why Google Stores Billions of Lines of Code in a Single Repository

    • Google chose to focus on scalability within their large repository rather than splitting it into several repositories. As of January 2015, there were 1 billion files, 9 million source files, and 2 billion lines of code
    • Google has a custom system named Piper, which hosts the monolithic repository. It is replicated across 10 data centers worldwide. CitC is used to access Piper, which enables developers to see local changes overlaid on the Piper repository and navigate the codebase.
    • Some advantages of having a monolithic repository include one source of truth, code sharing and reuse, and simplified dependency management.
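
Regarding the per-table autovacuum change in item 3 above, a minimal sketch of how such an override looks (hypothetical table name; the thresholds are the ones quoted from the talk, not a general recommendation):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # Vacuum this table after a fixed number of dead rows instead of a
    # percentage of the table, which rarely triggers on a huge table.
    cur.execute("""
        ALTER TABLE events SET (
            autovacuum_vacuum_scale_factor = 0.0,
            autovacuum_vacuum_threshold    = 150000
        )
    """)
conn.close()
```
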
ohorban commented 2 years ago

Instagram:

Reddit:

Pandora:

Postgres at 20TB and beyond:

Large Databases, Lots of Servers … :

Breaking Postgres at scale:

Citus: Postgres at any Scale:

Data modeling, the secret sauce of building & managing a large scale data warehouse:

Lessons learned scaling our SaaS on Postgres to 8+ billion events:

Tonnpo commented 2 years ago

Scaling Instagram Infrastructure:

  1. One interesting fact to me is that Instagram was taken over by Facebook two years after its release
  2. This was in the Q&A session: each rollout update takes about 10 minutes to reach over 20,000 servers
  3. Upon scaling the dev team, they intentionally use MySQL for shipping features because it is the simplest way to get things done.

Mastering Chaos - A Netflix Guide to Microservices:

  1. Microservices are an approach to developing a single app as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API
  2. To prevent cascading failure, a static fallback response can serve as a solution.
  3. With a large and complex architecture, a small decrease in availability can lead to problems. For example, as the speaker mentioned, suppose we have 10 services, each with 99.99% availability; then our overall availability becomes (99.99%)^10 ≈ 99.9%, which is a big difference, roughly 8 or 9 hours of downtime in one year (see the quick check below).
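
A quick sketch checking that arithmetic (only the figures mentioned in the talk are used):

```python
# Ten services in series, each available 99.99% of the time.
per_service = 0.9999
overall = per_service ** 10                # ~0.9990, i.e. ~99.9%
downtime_hours = (1 - overall) * 365 * 24  # ~8.8 hours per year
print(f"overall availability: {overall:.4%}")
print(f"expected downtime:    {downtime_hours:.1f} hours/year")
```
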

Why Google Stores Billions of Lines of Code in a Single Repository:

  1. As of 2015, Google had 2 billion lines of code, 1 billion files, 35 million commits, and 86 terabytes of content
  2. Following the monolithic model of source management requires investing heavily in tools to support development.
  3. One advantage of a single code base is that everybody is looking at the same source files, which makes it much easier to manage/update dependencies, refactor, and utilize new features.
dustin-lind commented 2 years ago

1. Scaling Instagram Infrastructure
   a. Instagram uses PostgreSQL for user, media, and friendship data. PostgreSQL allowed them to scale out into as many datacenters as the company wanted because databases can easily be replicated between datacenters.
   b. To optimize CPU usage, Instagram would identify functions that are used extensively and are stable (i.e., not frequently updated) and try to convert them to C (or some version of it) because C is faster.
   c. Instagram ships code whenever there is a new diff on the main branch. There are 40-60 rollouts per day, which makes unit tests extremely important.
2. The Evolution of Reddit.com's Architecture
   a. r2 is the original Reddit code monolith.
   b. The vote queue pileup that the speaker talked about was due to lock contention. As concurrent updates on the vote count occurred, locks followed, and this slowed down the processing speed. The initial solution was to partition the vote queues (see the sketch after this comment).
   c. An autoscaler watches utilization metrics and increases/decreases desired capacity accordingly. This helps Reddit save money because they can request fewer resources from AWS during off-peak times.
3. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
   a. A materializer aggregates data from many servers and transfers it to many servers. It functions like MapReduce with the added complication that it is a many-server-to-many-server transformation.
   b. AdTech is known for creating a lot of custom datatypes because, when working with such large databases, byte counting pays off in the long run with lower storage costs.
   c. Autovacuum occurs by default when 50 rows plus 20% of the table consist of "dead" tuples. When you are working with a huge dataset, it makes sense to change these parameters because 20% of a 20TB table is still a lot of dead tuples.
4. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
   a. To do device-centric aggregation, the team would need to query over 200GB of data.
   b. The team uses indexes (both full and partial indexes) extensively to speed up their queries. Some commonly queried tables have 50+ indexes.
   c. One database cluster utilizes 1920 cores, 15TB of DRAM, and 960TB of premium SSD.
5. Mastering Chaos - a guide to microservices at Netflix
   a. The CAP theorem states that in the presence of a network partition, you must choose between consistency and availability.
   b. One challenge of a microservices system is the problem of "crossing the chasm" (i.e., when services need to depend on each other). This can mean network latency, congestion, or failure.
   c. Netflix uses the Fault Injection Testing (FIT) framework. It's almost like a vaccine against possible network failures: Netflix inserts possible faults into their microservices and sees how the rest of the network reacts.
6. Why Google Stores Billions of Lines of Code in a Single Repository
   a. Although everyone at the company works from the head branch of the Google codebase, Google has an incredible number of testing layers to make sure that the code merged to the head branch is well-written and free of bugs.
   b. At the time of the talk, the Google repository modified as much code per week as the entire Linux kernel contains (i.e., millions of lines).
   c. The speaker cites her struggles working with distributed repos at a gaming company before Google. Each game was built in its own repository, so it was very hard to merge changes across all the games.
7. K8s documentary
   a. Docker was revolutionary in that it allowed for a portable software development process, and it made it easy to write code that could scale up.
   b. One reason Kubernetes became the industry leader despite other competitors was that they decided to open-source the software, so they had a huge workforce of people contributing to the project.
   c. The people interviewed in the documentary express disappointment that Kubernetes has been pitted against Docker. Kubernetes would not have happened if Docker had not become so popular; Kubernetes is essentially a system for operating Docker containers at a large scale.
8. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
   a. The phrase "premature optimization is the root of all evil" is very true for an early-stage SaaS company. The focus of the technology team should not be on figuring out which systems will best scale up the technology, but simply on creating the right product and gaining traction.
   b. PostgreSQL is a useful database system for early-stage companies. It's easy to set up, easy to deploy in today's cloud infrastructure, and open source, so there is a huge community of people supporting the project. Because it powers companies working at massive scale, you know it will work well in the long term.
   c. Query performance on large databases can be improved through sharding, a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. Citus is an extension for PostgreSQL that allows users to do database sharding.
9. Breaking PostgreSQL at scale
   a. For monitoring postgres databases, at a minimum you should be processing logs using pgbadger. pg_stat_statements and pganalyze are valuable tools for operations managers. You should be aiming for basic health reports, query response times, and memory usage.
   b. Don't create indexes on every column. Indexes take up a lot of disk space, increase insert time, and tend to add to planning time. Add indexes based on query patterns gathered from monitoring reports.
   c. Never disable autovacuum in postgres. The speaker mentions that he gets dozens of calls from clients who ran into problems because they disabled autovacuum. You must accept that vacuuming takes a while and should be completed in full.
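
A minimal sketch of the vote-queue partitioning in 2b (hypothetical queue names and vote payload, using the pika RabbitMQ client): votes are routed to a per-partition queue keyed on the subreddit, so consumers of different partitions stop contending for the same cached-listing locks.

```python
import json
import pika

NUM_PARTITIONS = 16

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
for i in range(NUM_PARTITIONS):
    channel.queue_declare(queue=f"vote_queue_{i}", durable=True)

def enqueue_vote(subreddit_id: int, link_id: int, direction: int) -> None:
    """Route the vote to the partition owned by this subreddit."""
    partition = subreddit_id % NUM_PARTITIONS
    body = json.dumps({"subreddit_id": subreddit_id, "link_id": link_id, "dir": direction})
    channel.basic_publish(exchange="", routing_key=f"vote_queue_{partition}", body=body)

enqueue_vote(subreddit_id=42, link_id=123456, direction=1)
connection.close()
```
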

afroCoderHanane commented 2 years ago

Scaling Instagram Infrastructure

https://www.youtube.com/watch?v=hnpzNAPiC0E

The Evolution of Reddit.com's Architecture

https://www.youtube.com/watch?v=nUcO7n4hek4

Postgres at Pandora

https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

https://www.youtube.com/watch?v=BgcJnurVFag

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

https://www.youtube.com/watch?v=4GB7EDxGr_c

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

https://www.youtube.com/watch?v=eZhSUXxfEu0

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

https://www.youtube.com/watch?v=kd-F0M2_F3I

Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

RuiluGao commented 2 years ago
  1. Scaling Instagram Infrastructure

    • Scale-out is the ability to use more servers to match user growth, by increasing the number of servers as well as the number of datacenters.
    • To scale up means to make each of the servers count, and to scale the development team means to have a fast-growing engineering team continue to move fast without breaking things
    • Storage vs. computing: while storage needs to be consistent across data centers, computing is driven by user traffic, on an as-needed basis. Instagram uses PostgreSQL to store data like media, users, friendships, etc., and Cassandra for user feeds, activities, etc.
  2. The Evolution of Reddit.com's Architecture

    • r2: the original monolithic application, the oldest single component of Reddit, started in 2008 and written in Python. r2 is a giant monolith. At the front is the load balancer, which takes the user's request and splits it across various pools of application servers so that Reddit can isolate different kinds of request paths; e.g. if the comments page is going slow today because of something going on, it doesn't affect the front page for other people.
    • CDN: the CDN can do a lot of decision logic out at the edge, figuring out which stack will handle a request based on the domain coming in, the path on the site, and any cookies the user has, including, perhaps, experiment bucketing.
    • Reddit uses Cassandra very heavily as well. It has been in the stack for seven years now, is used for a lot of the new features ever since it came on board, and has been very nice for its ability to stay up with one node going down.
  3. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

    • Big data: the scale of the data that Adjust deals with every day is about 100k-300k requests per second, tracking over 2 trillion data points, with over 500 TB of data to analyze, and it is all done using PostgreSQL.
    • Materializer: part of the architecture, it aggregates new events, copies the aggregations to the shards, and runs every few minutes. It aggregates data from many servers and transfers it to many servers. It functions like MapReduce with the added complication that it is a many-server-to-many-server transformation (a single-server sketch of the incremental-aggregation idea appears at the end of this comment).
    • PostgreSQL is the big data platform that Adjust uses, and the Adjust team expects to continue using Postgres while expanding in scale as the data demand continues to grow.
  4. Breaking PostgreSQL at Scale

    • Small database: for a small database of about 10GB, it's hard to go wrong with PostgreSQL, and nearly everything will run fast, including pathological joins
    • Mid-size database: to back up a mid-size database, pg_dump no longer works well; we should use PITR (Point In Time Recovery) backups, which take a file-system-level copy of the whole database at an arbitrary point and save the files generated after the file system copy started, so there is no need to copy frequently.
    • Don't pre-generate indexes if you will not need them. Indexes are not free: they take up disk space as well as a significant amount of insert time and planning time. Add only indexes that are deemed useful.
  5. Citus: Postgres at any Scale

    • Citus: an open-source extension for PostgreSQL that turns it into a distributed database (where the data is distributed across many Postgres servers); therefore there are no more single-machine limits on hardware and memory, and you can have tables of any size.
    • When to use Citus: expect data to grow to 100GB and beyond, multi-tenant workloads, real-time analytics, parallel queries, and a data warehouse with a relatively static query set.
    • Citus cluster: a coordinator node plus Postgres servers with the Citus extension, with the application connected to the coordinator node; this may be subject to modification
  6. Data modeling, the secret sauce of building & managing a large scale data warehouse

    • Windows Data Mesh: a processing pipeline that sessionizes diagnostic event data and turns it into session-based measures, using Census data as a reference and loading it into VeniceDB for data enrichment.
    • VeniceDB: a Citus-based Postgres cluster with three kinds of tables: a reporting table for consumption and efficiency, plus a stats table and a dimension table, the latter two being for customers to assess data quality.
    • Measure data schema: a measure is time-series data about a specific use case execution, e.g. WiFi, Bluetooth, etc., with hundreds of measures to track them
  7. Lessons learned scaling our SaaS on Postgres to 8+ billion events

    • Early stage: at the early stage of a product, it is more important to create the right product than to worry about scale-up, and also to choose a database that is easy to initialize, deploy, and eventually scale up, preferably open source for later potential migration.
    • Scaling up isn't as easy as the math suggests: scaling from 1 million to 100 million events isn't just a simple 100x scale-up.
    • Old, outdated, unnecessary data should be cleared from hot storage so that there are fewer costs for storing data, therefore bringing more profit.
  8. Mastering Chaos - a guide to microservices at Netflix

    • Microservice: an approach to developing a single application as a suite of small services; it enables the rapid, frequent, and reliable delivery of large, complex applications.
    • One challenge for microservices is dealing with intra-service requests, categorized as "crossing the chasm", running into problems like network latency, congestion, and hardware failure as well as logical or scaling failures. A single failure can cascade through the whole system.
    • Fault Injection Testing (FIT): Netflix created FIT to test for errors, like injecting a dead virus into the body to develop antibodies. It uses synthetic transactions, can be overridden at the device or account level, and can further be ramped up to an arbitrary percentage of live traffic for monitoring.
  9. Why Google Stores Billions of Lines of Code in a Single Repository

    • Google uses one giant repository shared by the whole company and has decided to keep a single repo by investing in scalability to keep up with growth; it is probably the largest single repository in use in the world
    • The size of the repository, its rate of change, and its usage have increased significantly, with >1 billion files, >2 billion lines of code, and 45K commits per workday.
    • The driving force behind the growth is the automated use case, which is well above human commits today: configuration and many supporting data files are generated by automation.
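
A single-server sketch of the incremental-aggregation idea behind the materializer in item 3 (hypothetical tables; the real system ships aggregates between many backend and shard servers): periodically fold only the newest raw events into a pre-aggregated rollup table instead of re-scanning everything.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS hourly_installs (
            hour   timestamptz NOT NULL,
            app_id bigint      NOT NULL,
            n      bigint      NOT NULL,
            PRIMARY KEY (hour, app_id)
        )
    """)
    # Aggregate only the most recent window of raw events and merge it in.
    cur.execute("""
        INSERT INTO hourly_installs (hour, app_id, n)
        SELECT date_trunc('hour', created_at), app_id, count(*)
        FROM raw_events
        WHERE created_at >= now() - interval '1 hour'
        GROUP BY 1, 2
        ON CONFLICT (hour, app_id) DO UPDATE SET n = EXCLUDED.n
    """)
conn.close()
```
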
DestrosCMC commented 2 years ago
  1. Scaling Instagram Infrastructure

    • Instagram is built on top of Django
    • Because of how highly Instagram values cross-region operation and consistency, they use Postgres replication to drive Postgres inserts and cache invalidation in memcache.
    • It was interesting that Instagram generates multiple URLs for each photo based on the device
  2. The Evolution of Reddit.com's Architecture

    • I was surprised at how python dependent Reddit's services were. It makes sense engineers wanted to adopt node.js.
    • Reddit reads from memcache and the "cache" is now a denormalized index of links.
    • For comment threads, it is expensive to figure out the tree metadata in-request so they precompute it and store it.
  3. Postgres at Pandora

    • Pandora used to run on Oracle, but Oracle was too expensive and everything cost money, so they migrated to Postgres.
    • It was interesting to see that pg_dump has a >99% success rate and pg_basebackup has a 98% success rate.
    • They made a proprietary distributed database in house that is CAP-oriented rather than ACID.
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    • 20TB per backend server gets stored, across 20+ servers.
    • The materializer is similar to MapReduce but is different because it is a many-server-to-many-server transformation.
    • Space savings and small improvements are worth it when working with terabytes of data.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    • The Barman strategy was interesting to learn about in case of a disaster
    • Datadog was actually used by the devs to see real-time graphs, not by the users.
    • No MongoDB, because it would be hard to migrate all their data. However, they do use NoSQL for Elasticsearch.
  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    • Just use pg_dump for backups of small databases; it only takes 90 seconds for 5GB on the presenter's laptop (see the restore-testing sketch at the end of this comment).
    • More memory does not help with write performance
    • Larger databases call for PITR backups instead. pgBackRest is used by the presenter, and WAL-E was mentioned as an old warhorse. Don't roll your own unless you have really specialized needs.
  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    • Use Citus at 100GB+. It does real-time analytics dashboards.
    • If you want distributed tables to have a unique constraint, it must include the distribution column.
    • Citus gives minimal overhead for queries that still hit a single server, and is touted as easier to integrate than manual sharding and NoSQL.
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    • VeniceDB is used for Windows device data, with rich dimension and metric data types.
    • Petabyte-level queries look like just a SQL query, but under the hood they are more complex.
    • Queries fail fairly frequently at this scale, so they calculate an average failure rate.
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

    • Make sure to choose a good tech stack when you make a product: something that doesn't cost an arm and a leg, and open source in case you need to migrate later.
    • Heroku's single-node Postgres database wasn't working, and they didn't understand why because they didn't know what was under the hood.
    • Citus helped them because it sharded the tables very easily.
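
A minimal sketch of the pg_dump habit mentioned in item 6 (hypothetical database names and paths): take a custom-format dump, then actually test the restore into a scratch database and time it, using parallel restore jobs.

```python
import subprocess
import time

# Custom-format dump, suitable for pg_restore and long-term storage.
subprocess.run(
    ["pg_dump", "--format=custom", "--file=/backups/app.dump", "app"],
    check=True,
)

# Restore into a scratch database to prove the dump is usable, and time it.
start = time.monotonic()
subprocess.run(["createdb", "app_restore_test"], check=True)
subprocess.run(
    ["pg_restore", "--jobs=4", "--dbname=app_restore_test", "/backups/app.dump"],
    check=True,
)
print(f"restore took {time.monotonic() - start:.0f} seconds")
```
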
joeybodoia commented 2 years ago
  1. Scaling Instagram Infrastructure (a simplified cache-aside sketch appears at the end of this comment)

    • As a workaround for the consistency and scaling issues that occurred when memcache was cross-region, Instagram used a lease-get and fill/wait/use-stale/hit call-and-response with the database. In this system, memcache tells the calling application that it has permission to go to the database to fill the cache, and subsequent calls do not schedule duplicate DB lookups for the same cache miss.
    • With code analysis tools, Instagram gets a lot of visibility into the code base: what paths through the code are being taken and what the bottlenecks are. This was used to alter their URL generation usage and additionally rewrite some key Python functions in Cython (rewritten in C to be used within the Python codebase).
    • In addition to scaling out (number of servers) and scaling up (capacity of each server), Instagram also created a framework (Tao) to abstract certain DB/caching functions, to simplify models for devs and shorten ramp-up time and feature development time.

  2. The Evolution of Reddit.com's Architecture

    • For expensive procedures such as voting or submitting a link, reddit defers these procedures to an asynchronous job queue (via RabbitMQ), allowing processors to handle them later.
    • For listings, which are an ordered list of links, Reddit caches a list of IDs, paired with sort information, corresponding to the links in a given listing in memcache, allowing for quicker lookup. When listings are updated, such as when a new vote takes place, the cache listing is updated, and it uses the sort information that corresponds to the id of the link to decide whether that link needs to move up in the listing. Being able to mutate the cache listings allows for Reddit to not have to do expensive select queries
    • Reddit ran into an issue with vote queue pileups during peak traffic hours. This resulted in reddit votes taking hours to process. It’s interesting that adding more processors did not help this issue. It turned out this issue was related to the lock that was put on the listing that is trying to be updated with new information on the votes. Partitioning the vote queues was able to solve this issue.
  3. Postgres at Pandora

    • Pandora initially used Oracle but moved to Postgres because of costs and also a desire for going open-source.
    • Pandora uses a separate database to store historical data. This includes data on size of the entire cluster instance (PGDATA), and statistics from every database in that cluster.
    • Pandora created a proprietary database called CLUSTR, which does not have ACID guarantees. The motivation for this was high availability, but the compromise is consistency of data.
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    • Adjust uses Redis to save all requests that are processed by backend servers. They then have a process that reads in from Redis and stores it in Postgres. Using Redis in this way allows for restarting Postgres without interrupting the backend.
    • Through their Materializer, Adjust is able to perform a many-to-many MapReduce procedure from their backend servers to analytic shards, allowing for increased performance.
    • Adjust creates their own custom 1-byte enums, which allows for representing things such as country names in a single byte, which helps with alignment issues.
    • Adjust changed their autovacuum settings so that autovacuum kicks in at a fixed 150k dead rows (plus 0% of the table) rather than the default of 50 rows plus 20% of the table.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    • Leboncoin uses pg_dump for backups. All of their pg_dumps are encrypted and sent to the cloud because they want their backups to be off-site.
    • Restoring a postgres database from a pg_dump (testing the dump) can reveal corruption in the database, like violated foreign key constraints or bad stored procedures. Testing pg_dumps also lets you measure the restore time, which can help you speed it up by adding more parallel workers.
    • For minor version upgrades, the dependent site can stay up through the update with the postgresql server on standby and only a few seconds of outage. With logical replication there is also very little downtime for major version upgrades, taking only a matter of seconds compared to the 3 hours it took before they used logical replication.
  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    • How you handle Postgres changes as the database gets larger: what works for a 1GB database doesn't work for a 10TB one.
    • pg_dump is fine for backing up relatively small databases (<10GB) but does not work well for larger ones. Point-in-time recovery (PITR) backups are better for large databases, since you don't have to copy the entire database nearly as often.
    • For databases that you can't fit entirely in memory, aim to fit at least the top 1-3 largest indexes.
  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    • Citus lets you distribute your Postgres database across many servers. In many cases, despite the extra network hop to the worker nodes, commands end up faster (for example, many inserts can instead leverage the COPY operation) because of parallelization.
    • UNIQUE constraints must include the distribution column, because each worker node can only check uniqueness within its own shards. Similarly, there are complications for aggregation queries, and the coordinator node sometimes needs to combine results returned by the worker nodes. If you shard multiple tables on the same distribution column, you keep the ability to create foreign keys and use joins efficiently as normal (joins are still possible otherwise, but slower, since rows get sent across the network).
    • The coordinator node can route a query to a single worker when it sees that the WHERE clause is on the distribution column.

  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    • Citus adds some extra aggregate functions (such as tdigest_percentile) that the coordinator node recognizes and merges under the hood, in a distributed MapReduce fashion.
    • To work around the count-distinct problems that occur in a distributed database, they use the HyperLogLog algorithm to compute the denominator of the query.
    • To reduce table size and I/O time, much of the input data is in JSON format during staging, but only parts of the original JSON object are selected and saved into hstore (key/value) columns.

  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

    • In the earliest stages of the startup's history, scale mattered less than the product, but down the road a single-node database was pulling too much data into memory, causing major latency for all customers whenever one customer generated a large report.
    • To make it through the scale-up to 1 million users, the company temporarily used Firebase (on cloud credits) to patch the problem, but ultimately sharding on website_id was the permanent fix.
    • By reindexing their tables concurrently (with newer Postgres index improvements) and clearing data older than a year out of hot storage, they were able to reduce costs at the multi-billion-event level. They also hit the maximum value of their integer primary key and had to migrate to a bigint by adding a new indexed column, backfilling it, and then making it the new primary key (a minimal sketch of that migration is below).
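Since that primary-key widening is a common scaling chore, here is a minimal sketch of how an integer-to-bigint swap might look. All table, column, and constraint names are hypothetical, and a real migration would batch the backfill and coordinate the cutover with the application:

```python
import psycopg2

# Hypothetical table/column/constraint names -- not ConvertFlow's actual schema.
conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
cur = conn.cursor()

# 1. Add a wider column alongside the old 32-bit key and backfill it.
#    (In production the UPDATE would be done in small batches.)
cur.execute("ALTER TABLE events ADD COLUMN IF NOT EXISTS id_bigint bigint;")
cur.execute("UPDATE events SET id_bigint = id WHERE id_bigint IS NULL;")
cur.execute("ALTER TABLE events ALTER COLUMN id_bigint SET NOT NULL;")

# 2. Build the replacement unique index without blocking writes.
cur.execute("CREATE UNIQUE INDEX CONCURRENTLY events_id_bigint_idx ON events (id_bigint);")

# 3. Swap the primary key over to the new column (old constraint name assumed).
cur.execute("ALTER TABLE events DROP CONSTRAINT events_pkey;")
cur.execute(
    "ALTER TABLE events "
    "ADD CONSTRAINT events_pkey PRIMARY KEY USING INDEX events_id_bigint_idx;"
)
```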

samcbogen commented 2 years ago
  1. Scaling Instagram Infrastructure
  2. The Evolution of Reddit.com's Architecture
  3. Postgres at Pandora
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    • Hardware quality directly correlates with availability. It is better to buy more expensive, higher-quality hardware, because the risk of corrupting a database is very real and fixing corruption can be WAY more expensive than the hardware itself.
    • pg_dumps must be continuously tested, and the data should be manually backed up (a minimal test-restore sketch appears after this list).
    • Many large databases do not fit within memory, and this will start to make your queries perform significantly worse.
  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)
  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)
  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)
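As a minimal sketch of the "continuously test your pg_dumps" point above: the database names and worker count are made up, and it assumes pg_dump/pg_restore are on the PATH and a scratch database named restore_test already exists.

```python
import subprocess
import time

# Hypothetical dump file and database names.
DUMP_FILE = "app.dump"

# Take a custom-format dump (required for parallel restore).
subprocess.run(["pg_dump", "--format=custom", "--file", DUMP_FILE, "app"], check=True)

# Restore it into the scratch database with parallel workers, and time it.
start = time.time()
subprocess.run(
    ["pg_restore", "--dbname=restore_test", "--jobs=4", "--clean", "--if-exists", DUMP_FILE],
    check=True,
)
print(f"restore took {time.time() - start:.0f}s")
```

Timing the test restore is what tells you whether your recovery window is realistic and whether adding more `--jobs` workers is worth it.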
alex-muehleisen commented 2 years ago
  1. Scaling Instagram Infrastructure

    • 'Scaling' can be seen in many different ways at firms, but it always consists of continuous effort, alertness, and holding everyone at the company responsible.
    • Storage needs to be consistent across data centers
    • Being ready to fail and having back-up plans in case of failure is crucial to running a successful business.
  2. The Evolution of Reddit.com's Architecture

    • Stack components change all the time, and even major components remain a work in progress at most tech companies.
    • Comments on Reddit are threaded, meaning that replies are nested. This makes comment trees very expensive.
    • Reddit sets 'Queue Quotas' (maximum queue lengths) to ensure that no one can consume all of their resources.
  3. Postgres at Pandora

    • Monitoring activity in Postgres is incredibly important. This includes monitoring current activity, the Postgres log file, Query Durations, and errors.
    • Accessing Historical Data can offer you the chance to make updated growth predictions on databases and tables, as well as reconfigure any database architecture if necessary.
    • All production databases are replicated. This is generally considered good, common practice.
  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of Postgres)

    • A Materializer aggregates data from numerous servers and transfers it. It functions similarly to the MapReduce we learned in class, but with the complication of transferring data across multiple servers.
    • Postgres' autovacuum functionality is incredibly important and can be 'tuned' to activate once a certain percentage of a table's tuples are dead (see the tuning sketch after this list). This helps minimize disk usage and avoid rewriting tables.
    • Postgres is huge and even 400TB worth of an analytics environment is not enough to outgrow Postgres.
  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    • Bad hardware is bad for your data.
    • Version upgrades are dangerous for businesses; their detriment can be limited by capping maintenance breaks and limiting service disruptions to only specific sections of the website.
    • NoSQL and MySQL engines can be used in conjunction with the Postgres engine.
  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    • What works for a 1GB database doesn't work for a 10TB database.
    • Postgres supports high availability through streaming replication and basic WAL archiving, though failover is typically a manual process.
    • In terms of memory, the effective cache size should ideally be greater than the largest index.
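To make the autovacuum tuning mentioned above concrete, here is a minimal sketch of per-table autovacuum storage parameters, in the spirit of the "fixed dead-row count instead of a percentage" approach Adjust reportedly used (see the 20TB video notes earlier in this thread). The table name is hypothetical:

```python
import psycopg2

# Hypothetical table name; the numbers mirror the fixed-threshold tuning
# described above (the default is 50 dead rows + 20% of the table).
conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    ALTER TABLE events SET (
        autovacuum_vacuum_threshold    = 150000,
        autovacuum_vacuum_scale_factor = 0.0
    );
""")
```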
abooli commented 2 years ago
  1. Instagram:
    • utilizing postgresql to store the metadata about a post
    • removing garbage collection helps with more effective use of memory
    • scaling up, in this context, means using as few CPU instructions as possible to serve each request
  2. Reddit
    • the listings were computed with a simple SQL query (although it looks way cooler than that)
    • the replication alerts from 2011 were ultimately resolved by finding the phantom data and removing it
    • they eventually decided to pull the Thing object out of r2 to help manage the complexity of the reddit infrastructure
  3. Pandora
    • Pandora originally used Oracle, but later decided to switch to postgres due to cost and open-source
    • the Pandora database consists of three classes: the Nexus (generates music meta data), the Meta Data, and the radio (the stations and other info)
    • pg_dump was a very slow backup method for a database as big as Pandora's (running out of space was the ~1% failure case)
  4. PostgreSQL
    • Adjust uses ~24 backend servers, and around 20TB gets stored per backend server
    • the materializer aggregates new events/data from back-end servers
    • custom 1-byte enums helps Adjust with managing complex data
  5. Large databases
    • a mix of sql and nosql was incorporated in the system
    • postgresql can safely handle scale-ups if enough attention is paid
    • the WAL log is important for managing data back-ups and recovery
  6. Breaking PostgreSQL at scale
    • pgBackRest could be used to replace pg_dump
    • work_mem should be kept at around 256 MB, based on what the logs show
    • shared_buffers should be set to 16-32 gigabytes (a small sketch applying settings like these follows this list)
  7. Citus
    • citus supports fast bulk loading and parallelized COPY
    • citus comes in handy when the amount of data is beyond 100GB
    • distributed tables can be co-located
  8. Data Modeling
    • a measure is time series data about a specific use case being executed
    • JSON is a good data type to store data in, due to its flexibility
    • citus provides built-in data types.
  9. ConvertFlow
    • creating indexes is important for scaling up and optimizing query speed
    • it is important to expect parts of your stack and vendors to change with scale
    • periodically clearing old stale data from hot storage is important for maintaining disk space
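Here is a minimal sketch of applying memory settings in the range mentioned in item 6, assuming superuser access on a hypothetical server. The right values depend entirely on the hardware, and shared_buffers changes only take effect after a restart:

```python
import psycopg2

# Hypothetical connection; ALTER SYSTEM writes postgresql.auto.conf and
# requires superuser privileges.
conn = psycopg2.connect("dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
cur = conn.cursor()

cur.execute("ALTER SYSTEM SET shared_buffers = '16GB';")  # needs a server restart
cur.execute("ALTER SYSTEM SET work_mem = '256MB';")        # per sort/hash-node memory
cur.execute("SELECT pg_reload_conf();")                    # picks up reloadable settings
```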
anyu-yu commented 2 years ago
  1. Instagram
  1. Reddit.com
  1. a
axelahdritz commented 2 years ago

1) Scaling Instagram Infrastructure

2) The Evolution of Reddit.com's Architecture

3) Postgres at Pandora

4) PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

5) Why Google Stores Billions of Lines of Code in a Single Repository

6) Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

7) Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

8) Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

9) Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

Bhargavaa1 commented 2 years ago

Scaling Instagram Infrastructure

  1. Instagram handles 100 million photo/video uploads and 4+ billion likes a day.
  2. Storage needs to be consistent across data centers (and withstand disasters/storms) and computing is typically stateless/temporary with the main factor being user traffic.
  3. PostgreSQL is used for user, friendship, and media data while Cassandra is used to store user feeds and activities.

The Evolution of Reddit.com's Architecture

  1. Reddit is the 4th most popular website in the United States and has 320 million monthly active users.
  2. The CDN sends requests to distinct stacks depending on the path, cookies, and domain.
  3. The "Thing" data model uses PostgreSQL and stores the core data of Reddit (links, subreddits, comments, and accounts)

PostgreSQL at Pandora

  1. Pandora started out using Oracle but switched to PostgreSQL as Oracle was an expensive product and consulting service.
  2. The database classes Pandora used included Nexus (generates metadata about music), Music MetaData, Radio (stations, ratings), Data Warehouse, and Clustr.
  3. Pandora first used pg_dump for backups but moved to pg_basebackup, since pg_dump is considered slow and had problems with blocking.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. Adjust ensures that advertisers are fair and no fraud is being committed in pay-per-install advertising activities.
  2. Adjust receives over 100k requests per second and has 400+ TB of data to analyze.
  3. The materializer at Adjust aggregates data from numerous servers and distributes it to a smaller number of servers through a process similar to MapReduce.

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

  1. Databases tend to be the most difficult part of the stack to scale out
  2. Leboncoin has two data centers, one cloud provider, two thousand virtual machines, and a tech team of more than two hundred people
  3. Leboncoin uses AWS CloudFormation to automate cloud deployment and Puppet to automate tasks after installation.

Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

  1. PostgreSQL can handle databases of any size and even worked on data that was multiple petabytes large.
  2. Christophe Pettus advises us to get into the habit of planning our upgrade strategy as falling behind on major versions for even small databases such as 10GB might be a problem.
  3. Once the database reaches around 100GB of data, queries tend to decrease in performance and pg_dump backups take too long to restore.

Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

  1. Citus provides PostgreSQL with many benefits of distributed databases including query routing, distributed transactions, and parallelized COPY
  2. Citus should be utilized for data that is 100GB and beyond, or for applications that are complex (SaaS, time series, IoT)
  3. Distributed tables have certain limitations, such as restrictions on unique constraints and on primary/foreign keys and joins

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. Min Wei defines a measure to be time series data about a certain use case executed on a specific device with two levels of dimensions.
  2. Microsoft uses JSON for staging tables (the beginning of the data mesh, where data types are quite complicated) and hstore for reporting tables (a small sketch of that staging-to-reporting step follows this list). Another interesting aspect of Citus is its use of indexes and their ability to convert I/O from random to sequential.
  3. Citus is quite beneficial because it has built-in data types that are common for this kind of data, is scalable, and serves as a "distributed SQL engine."
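A small sketch of that staging-to-reporting idea, keeping only a few fields from the raw JSON payload in a compact hstore column; the staging_events and report_events tables, the "raw" jsonb column, and the JSON keys are all hypothetical:

```python
import psycopg2

# Hypothetical staging/reporting tables; "raw" is assumed to be a jsonb column.
conn = psycopg2.connect("dbname=warehouse user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS hstore;")
cur.execute("""
    INSERT INTO report_events (device_id, event_time, props)
    SELECT raw ->> 'deviceId',
           (raw ->> 'timestamp')::timestamptz,
           hstore(ARRAY['os', 'region'],
                  ARRAY[raw ->> 'os', raw ->> 'region'])
    FROM   staging_events;
""")
conn.commit()
```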

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

  1. ConvertFlow allows marketing teams to develop websites and apps without using code. This new style of code is called “low-code” and makes code more accessible to others.
  2. The transition from 0 to one million events had fast queries, loose storage limits, and indexes to speed up operations
  3. The transition from one to one hundred million events was quite difficult, as scaling broke down and numerous reports timed out. The ConvertFlow team used Citus rather than Google Firebase to approach this problem, as Firebase tends to be more expensive!