nehanims opened 4 weeks ago
A messaging system with fault tolerance (redundant nodes with duplicated data plus distributed consensus mechanisms; persistent event logs allow replay of events after a failure), distributed and scalable (add nodes to the cluster to scale), with fast reads and writes (sequential access to an append-only log). It also temporally decouples events/producers from their downstream processing/consumers: if you need further processing on past events for a topic, just add a new consumer and start it from the first event or from any offset within the topic. If the world is a state machine, this is a persistent log of its state transitions. Edit: I guess this is a design pattern! Kafka frees consumers from explicitly handling short-term backpressure (such as brief unexpected load spikes), although if producers consistently produce at a rate where the backlog grows continuously, some backpressure mechanism will still need to be implemented.
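The replay idea can be sketched in plain Python (no Kafka client here; `Topic` and `Consumer` are illustrative stand-ins for a topic partition and a consumer with its committed offset):

```python
class Topic:
    """A minimal append-only log: events are immutable and strictly ordered."""
    def __init__(self):
        self.log = []  # the persistent event log (in-memory for illustration)

    def append(self, event):
        self.log.append(event)
        return len(self.log) - 1  # the new event's offset


class Consumer:
    """Tracks its own offset, so a new consumer can replay the whole topic
    (offset 0) or start from any past offset."""
    def __init__(self, topic, offset=0):
        self.topic = topic
        self.offset = offset

    def poll(self):
        events = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return events


topic = Topic()
for transition in ["created", "paid", "shipped"]:
    topic.append(transition)

replayer = Consumer(topic)               # consumer added later: full replay
print(replayer.poll())                   # ['created', 'paid', 'shipped']
from_offset = Consumer(topic, offset=1)  # or replay from a chosen offset
print(from_offset.poll())                # ['paid', 'shipped']
```

Because the producer only appends and each consumer advances its own offset, the two sides never block each other; that is the temporal decoupling described above.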
The performance difference between sequential and random access in HDDs (Hard Disk Drives) and SSDs (Solid-State Drives) is quite significant, both within each type of storage and when comparing the two.
HDDs:

Sequential Access: Fast for a mechanical device (typically on the order of 100–200 MB/s), since the head reads contiguous sectors without repositioning.

Random Access: Slow; every access pays a seek plus rotational latency of several milliseconds, limiting the drive to a few hundred operations per second.

Comparison: Sequential throughput can be orders of magnitude higher than random throughput on the same drive.

SSDs:

Sequential Access: Typically around 500 MB/s over SATA and several GB/s over NVMe.

Random Access: Fast, since there are no moving parts; latencies are in the tens of microseconds, and drives sustain tens of thousands of IOPS or more.

Comparison: The gap between sequential and random access is far smaller than on HDDs, though sequential is still somewhat faster.
Overall, SSDs outperform HDDs in both sequential and random access scenarios, with the difference being most dramatic in random access.
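You can get a rough feel for the gap with a quick (non-rigorous) sketch; on a warm OS page cache both runs will be fast, so the difference shows up mainly on cold reads from the actual device:

```python
import os
import random
import tempfile
import time

block = 4096
nblocks = 4096  # 16 MiB scratch file

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    f.write(os.urandom(block * nblocks))

with open(path, "rb") as f:
    # Sequential: read blocks in file order.
    t0 = time.perf_counter()
    while f.read(block):
        pass
    seq = time.perf_counter() - t0

    # Random: seek to each block in shuffled order.
    order = list(range(nblocks))
    random.shuffle(order)
    t0 = time.perf_counter()
    for i in order:
        f.seek(i * block)
        f.read(block)
    rnd = time.perf_counter() - t0

os.remove(path)
print(f"sequential: {seq:.4f}s, random: {rnd:.4f}s")
```

This is why append-only logs (as in Kafka) deliberately turn every write into a sequential one.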
Additionally, persistent event-management systems are more fault tolerant and easier to debug. For example, if a consumer fails on a particular message, it can retry that message from its last committed offset when restarted.
Non-blocking reactive endpoints offer several advantages, especially in scenarios where handling a large number of concurrent requests efficiently is crucial. Here’s a breakdown of the benefits and some examples of popular libraries and frameworks:
High Concurrency Handling: A small number of event-loop threads can serve many in-flight requests, because threads are never parked waiting on I/O.

Better Resource Utilization: Fewer threads means less memory spent on stacks and less context-switching overhead.

Scalability: Throughput scales with available I/O capacity rather than with the size of a thread pool.

Improved Responsiveness: Requests do not queue behind blocked threads, so latency stays low under load.

Fault Tolerance and Resilience: Reactive libraries typically ship with timeouts, retries, fallbacks, and backpressure operators built in.
Java/Spring Ecosystem: Project Reactor and Spring WebFlux provide reactive types and non-blocking REST endpoints; RxJava is another popular option.

JavaScript/Node.js: Node.js is non-blocking by design, built around an event loop; Promises and `async`/`await` are the standard style for asynchronous code.

Python: `asyncio` provides `async` and `await` syntax for non-blocking code.

.NET: With `async` and `await`, ASP.NET Core can be used to create non-blocking APIs.

Common use cases:

High-Throughput APIs: Services like social media platforms, real-time analytics dashboards, or any service expected to handle a large number of concurrent API requests.
Streaming Data Applications: Applications processing streams of data, such as video streaming services, live sports updates, or stock market tickers.
Microservices Architectures: Systems composed of multiple microservices, where inter-service communication needs to be efficient and resilient.
IoT Applications: Applications that need to handle data from numerous devices in real-time, such as smart home systems or industrial IoT solutions.
Event-Driven Systems: Systems that react to a series of events, such as user actions in a web application, or events from a messaging system like Kafka.
Cloud-Native Applications: Applications designed to scale elastically in cloud environments, where efficient resource utilization is crucial.
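As a concrete illustration of the concurrency benefit, plain `asyncio` (no web framework) can interleave many handlers on a single thread; `asyncio.sleep` stands in for a non-blocking I/O call such as a DB query:

```python
import asyncio
import time


async def handle_request(i):
    # Simulated non-blocking I/O (a DB query or downstream HTTP call).
    await asyncio.sleep(0.1)
    return f"response {i}"


async def main():
    t0 = time.perf_counter()
    # 100 concurrent "requests" share one thread; the event loop
    # interleaves them while each awaits its I/O.
    results = await asyncio.gather(*(handle_request(i) for i in range(100)))
    return results, time.perf_counter() - t0


results, elapsed = asyncio.run(main())
print(f"{len(results)} requests in {elapsed:.2f}s")  # ~0.1s total, not ~10s
```

A blocking, one-thread-per-request design would need 100 threads (or ~10 seconds serially) to do the same work.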
Live Video Streaming: Many VoIP and video conferencing applications leverage UDP due to its lower overhead and tolerance of packet loss. Real-time communication benefits from UDP's reduced latency compared to TCP.

DNS: DNS (Domain Name System) queries typically use UDP for its fast and lightweight nature. Although DNS can also use TCP for large responses or zone transfers, most queries are handled via UDP.

Market Data Multicast: In low-latency trading, UDP is used for efficient market data delivery to multiple recipients simultaneously.

IoT: UDP is often used for communication between IoT devices, sending small packets of data.
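What these uses share is UDP's connectionless, fire-and-forget datagram model, visible directly in Python's `socket` API (loopback here, so the packet loss these applications tolerate will not actually occur):

```python
import socket

# "Server": bind a UDP socket to an ephemeral localhost port.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
addr = server.getsockname()

# "Client": no handshake, no connection state; just send a datagram.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"sensor-reading:23.5", addr)

data, _ = server.recvfrom(1024)  # one recvfrom returns one whole datagram
print(data.decode())             # sensor-reading:23.5

client.close()
server.close()
```

Compare with TCP, where a three-way handshake, ordering, and retransmission all add latency before the first byte arrives.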
Imperative Programming: Describes a sequence of steps that change the program's state. Languages like C, C++, Java, Python (to an extent), and many others support imperative programming styles.

Declarative Programming: Emphasizes expressing logic and functionality without describing the control flow explicitly. Functional programming is a popular form of declarative programming.

Object-Oriented Programming (OOP): Revolves around the concept of objects, which encapsulate data (attributes) and behavior (methods or functions). Common object-oriented languages include Java, C++, Python, Ruby, and C#.

Aspect-Oriented Programming (AOP): Aims to modularize concerns that cut across multiple parts of a software system. AspectJ is one of the best-known AOP frameworks, extending Java with AOP capabilities.

Functional Programming (FP): Treats computation as the evaluation of mathematical functions and emphasizes immutable data and declarative expressions. Languages like Haskell, Lisp, and Erlang, and features of JavaScript, Python, and Scala, support functional programming.

Reactive Programming: Deals with asynchronous data streams and the propagation of changes. Event-driven applications and streaming data processing applications benefit from reactive programming.

Generic Programming: Aims at creating reusable, flexible, type-independent code by allowing algorithms and data structures to be written without specifying the types they operate on. Generic programming is used extensively in libraries and frameworks for data structures like lists, stacks, and queues, and algorithms like sorting and searching.

Concurrent Programming: Deals with executing multiple tasks or processes simultaneously, improving performance and resource utilization. It is used in multi-threaded servers, parallel processing, concurrent web servers, and high-performance computing.
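A tiny Python contrast between the first two paradigms above: the imperative version spells out control flow and mutates an accumulator, while the declarative/functional version states what to compute:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Imperative: explicit loop, branch, and mutable accumulator.
total = 0
for n in numbers:
    if n % 2 == 0:
        total += n * n

# Declarative/functional: a pure expression, no mutation.
functional_total = sum(n * n for n in numbers if n % 2 == 0)

print(total, functional_total)  # 56 56
```

Both compute the sum of squares of the even numbers; the difference is purely in how the intent is expressed.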
Why NOT to use GraphQL in production: it sounds like a headache to work with in terms of auth, BUT it may be a great tool for exposing the database to the frontend during development, after which only the useful APIs identified by the frontend get built as REST endpoints. So use GraphQL in the dev environment only, to allow quick feature testing without having to build each REST endpoint before it is finalized.
The main difference between ZooKeeper and KRaft mode in Kafka revolves around how Kafka manages metadata and consensus. In ZooKeeper mode, cluster metadata and controller election are delegated to an external ZooKeeper ensemble that must be deployed and operated alongside Kafka. In KRaft mode, Kafka's own controller nodes run a Raft-based quorum and store the metadata in an internal Kafka log, removing the separate ZooKeeper dependency entirely.
In summary, while the consensus mechanism is a significant part of the difference, KRaft mode also represents a shift towards a more integrated and simplified Kafka architecture, aiming to improve ease of use and scalability.
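Both approaches ultimately rest on majority-quorum consensus (ZooKeeper's ZAB protocol, Kafka's Raft). The core commit rule can be sketched in a few lines; this illustrates the quorum arithmetic only, not Kafka's actual implementation:

```python
def has_quorum(acks, cluster_size):
    """A proposal (e.g., a metadata update) commits only once a strict
    majority of nodes have acknowledged it; any two majorities overlap,
    which is what prevents split-brain."""
    return acks > cluster_size // 2


# With 5 controller nodes, 3 acks commit a change; 2 do not.
print(has_quorum(3, 5), has_quorum(2, 5))  # True False
```

The overlap property is also why such clusters are sized with odd node counts: a 5-node quorum tolerates 2 failures, while a 6-node quorum still only tolerates 2.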
Metadata belongs in some kind of relational DB, since you might need to query and manipulate it often, whereas the audio itself won't really change much. Relational databases are great for such operations, but LOB storage and varying-size data aren't efficient in them. S3 is designed to store large files of varying sizes, and it makes streaming easier in concrete ways: byte-range GET requests let clients fetch arbitrary chunks, multipart upload handles large files, and presigned URLs let clients stream directly from S3 without proxying through your servers. So access patterns and the size of the data are the main reasons to split the files and the metadata.
In relational databases, the efficiency of handling varying-sized columns (e.g., `VARCHAR`, `TEXT`, `BLOB`) can significantly impact both storage and performance. Here's how this works:

Fixed-Length Columns (e.g., `CHAR`): Every entry in a fixed-length column takes up the same amount of space, regardless of the actual data length. This can lead to inefficient use of space if the data varies widely in size, as all entries will reserve the maximum length, even if not fully utilized.

Variable-Length Columns (e.g., `VARCHAR`, `TEXT`, `BLOB`): These columns only use as much space as needed for the actual data plus a small overhead for storing the length of the data. This is more space-efficient when dealing with varying-sized data because only the necessary amount of storage is used.

Uniform data: `CHAR` or other fixed-length types can be beneficial for performance due to the predictability in storage and memory allocation.

Varying-sized data: `VARCHAR` or `TEXT` is generally more efficient because it avoids the wasted space associated with fixed-length columns.

Large objects: `BLOB` or `TEXT` types are appropriate. These are typically stored outside the main table space, with the table storing pointers, reducing the impact on row and page size.

By understanding how varying-sized columns affect storage and performance, you can make informed decisions about schema design in relational databases.
Client side caching for frequently requested data?
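One simple answer is a TTL cache on the client: serve repeated requests locally until the entry expires. A sketch (the `TTLCache` name and `fetch` callback are illustrative, not any particular library's API):

```python
import time


class TTLCache:
    """Client-side cache: answer repeated requests locally until the
    entry expires, instead of hitting the server every time."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get(self, key, fetch):
        value, expires = self.store.get(key, (None, 0.0))
        if time.monotonic() < expires:
            return value                  # cache hit: no network call
        value = fetch(key)                # cache miss: fetch and remember
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value


calls = []

def fake_fetch(key):
    calls.append(key)                     # stands in for a real HTTP request
    return f"payload for {key}"


cache = TTLCache(ttl_seconds=60)
cache.get("/users/42", fake_fetch)
cache.get("/users/42", fake_fetch)        # served from cache
print(len(calls))  # 1 -> only one real fetch for two requests
```

The TTL bounds staleness; for data that must never be stale, server-driven invalidation (ETags, cache-control headers) is the safer route.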
Create ADRs:
Adding design decision rationale to your project on GitHub is essential for maintaining clarity and providing context for future contributors. Here's how you can effectively document design decisions:
1. Create a Dedicated Documentation Section:

- Create a `docs/` folder in the root of your repository to store all documentation files.
- Within the `docs/` folder, create a `design-decisions/` directory where you can keep all design decision documents.

2. Use ADRs (Architecture Decision Records):
What Are ADRs? ADRs are short documents that capture an architectural decision, the context in which it was made, and its consequences. They provide a structured format to document design decisions.
Structure of an ADR:
Example ADR:
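ADRs typically follow Michael Nygard's widely used format: Title, Status, Context, Decision, Consequences. A minimal sketch (the decision shown is hypothetical):

```markdown
# ADR 0007: Use Kafka for inter-service messaging

## Status
Accepted

## Context
Services currently call each other synchronously over REST, coupling their
availability and making replay of past events impossible.

## Decision
Introduce Kafka as the messaging backbone; producers publish domain events,
and consumers process them asynchronously.

## Consequences
- Temporal decoupling and event replay from any offset.
- Independent scaling of producers and consumers.
- Added operational overhead of running a Kafka cluster.
```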
3. Link to ADRs in Your README or CONTRIBUTING File:

- In your README, link to the `docs/design-decisions/` directory.
- In your `CONTRIBUTING.md` file, add a section about the design decisions and how contributors should document any new decisions they make.

4. Maintain a Design Decision Log:
Keep an `index.md` or `README.md` within the `design-decisions/` folder that lists all the ADRs with links to each document.

5. Use GitHub Issues and Pull Requests:
6. Keep ADRs Up-to-Date:
By following this structured approach, you’ll have a clear and well-documented rationale for all major design decisions in your project, making it easier for current and future contributors to understand the architecture and contribute effectively.