iris2008 / iris2008.github.io

My Hexo Blog

Technical Notes | Iris' Blog #7

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Technical Notes | Iris' Blog

https://iris2008.github.io/2021/02/23/Technical-Notes/

iris2008 commented 3 years ago

Node Postgres tutorial

last modified July 7, 2020

The Node Postgres tutorial shows how to work with PostgreSQL database in JavaScript with node-postgres.

The node-postgres

The node-postgres is a collection of Node.js modules for interfacing with the PostgreSQL database. It has support for callbacks, promises, async/await, connection pooling, prepared statements, cursors, and streaming results.

In our examples we also use the Ramda library. See Ramda tutorial for more information.

Setting up node-postgres

First, we install node-postgres.

$ node -v
v11.5.0

We use Node version 11.5.0.

$ npm init -y

We initiate a new Node application.

$ npm i pg

We install node-postgres with npm i pg.

$ npm i ramda

In addition, we install Ramda for more convenient work with data.

cars.sql

DROP TABLE IF EXISTS cars;

CREATE TABLE cars(id SERIAL PRIMARY KEY, name VARCHAR(255), price INT);
INSERT INTO cars(name, price) VALUES('Audi', 52642);
INSERT INTO cars(name, price) VALUES('Mercedes', 57127);
INSERT INTO cars(name, price) VALUES('Skoda', 9000);
INSERT INTO cars(name, price) VALUES('Volvo', 29000);
INSERT INTO cars(name, price) VALUES('Bentley', 350000);
INSERT INTO cars(name, price) VALUES('Citroen', 21000);
INSERT INTO cars(name, price) VALUES('Hummer', 41400);
INSERT INTO cars(name, price) VALUES('Volkswagen', 21600);

In some of the examples, we use this cars table.

The node-postgres first example

In the first example, we connect to the PostgreSQL database and return a simple SELECT query result.

first.js

const pg = require('pg');
const R = require('ramda');

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

const client = new pg.Client(cs);
client.connect();

client.query('SELECT 1 + 4').then(res => {

    const result = R.head(R.values(R.head(res.rows)));

    console.log(result);
}).finally(() => client.end());

The example connects to the database and issues a SELECT statement.

const pg = require('pg');
const R = require('ramda');

We include the pg and ramda modules.

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

This is the PostgreSQL connection string. It is used to build a connection to the database.

const client = new pg.Client(cs);
client.connect();

A client is created. We connect to the database with connect().

client.query('SELECT 1 + 4').then(res => {

    const result = R.head(R.values(R.head(res.rows)));

    console.log(result);
}).finally(() => client.end());

We issue a simple SELECT query. We get the result and output it to the console. The res.rows is an array of objects; we use Ramda to get the returned scalar value. In the end, we close the connection with end().

$ node first.js
5

This is the output.

The node-postgres column names

In the following example, we get the column names of a database table.

column_names.js

const pg = require('pg');

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

const client = new pg.Client(cs);

client.connect();

client.query('SELECT * FROM cars').then(res => {

const fields = res.fields.map(field => field.name);

console.log(fields);

}).catch(err => {
    console.log(err.stack);
}).finally(() => { client.end() });

The column names are retrieved with the res.fields attribute. We also use the catch clause to output potential errors.

$ node column_names.js
[ 'id', 'name', 'price' ]

The output shows the three column names of the cars table.

Selecting all rows

In the next example, we select all rows from the database table.

all_rows.js

const pg = require('pg');
const R = require('ramda');

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

const client = new pg.Client(cs);

client.connect();

client.query('SELECT * FROM cars').then(res => {

const data = res.rows;

console.log('all data');
data.forEach(row => {
    console.log(`Id: ${row.id} Name: ${row.name} Price: ${row.price}`);
})

console.log('Sorted prices:');
const prices = R.pluck('price', R.sortBy(R.prop('price'), data));
console.log(prices);

}).finally(() => { client.end() });

The example outputs all rows from the cars table and a sorted list of car prices.

$ node all_rows.js
all data
Id: 1 Name: Audi Price: 52642
Id: 2 Name: Mercedes Price: 57127
Id: 3 Name: Skoda Price: 9000
Id: 4 Name: Volvo Price: 29000
Id: 5 Name: Bentley Price: 350000
Id: 6 Name: Citroen Price: 21000
Id: 7 Name: Hummer Price: 41400
Id: 8 Name: Volkswagen Price: 21600
Sorted prices:
[ 9000, 21000, 21600, 29000, 41400, 52642, 57127, 350000 ]

This is the output.

The node-postgres parameterized query

Parameterized queries use placeholders instead of directly writing the values into the statements. Parameterized queries increase security and performance.

parameterized.js

const pg = require('pg');

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

const client = new pg.Client(cs);

client.connect();

const sql = 'SELECT * FROM cars WHERE price > $1';
const values = [50000];

client.query(sql, values).then(res => {

const data = res.rows;

data.forEach(row => console.log(row));

}).finally(() => { client.end() });

The example uses a parameterized query in a simple SELECT statement.

const sql = 'SELECT * FROM cars WHERE price > $1';

This is the SELECT query. The $1 is a placeholder which is later replaced with a value in a secure way.

const values = [50000];

These are the values to be inserted into the parameterized query.

client.query(sql, values).then(res => {

The values are passed to the query() method as the second parameter.

$ node parameterized.js
{ id: 1, name: 'Audi', price: 52642 }
{ id: 2, name: 'Mercedes', price: 57127 }
{ id: 5, name: 'Bentley', price: 350000 }

This is the output.

The node-postgres with async/await

Node Postgres supports the async/await syntax.

async_await.js

const pg = require('pg');
const R = require('ramda');

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

async function fetchNow() {

const client = new pg.Client(cs);

try {
    await client.connect();

    let result = await client.query('SELECT now()');
    return R.prop('now', R.head(result.rows));
} finally {
    client.end()
}

}

fetchNow().then(now => console.log(now));

The example outputs the result of a SELECT now() query with async/await.

$ node async_await.js
2019-02-17T11:53:01.447Z

This is the output.

The node-postgres rowMode

By default, node-postgres returns data as an array of objects. We can tell node-postgres to return the data as an array of arrays.

row_mode.js

const pg = require('pg');
const R = require('ramda');

const cs = 'postgres://postgres:s$cret@localhost:5432/ydb';

const client = new pg.Client(cs);

client.connect();

const query = { text: 'SELECT * FROM cars', rowMode: 'array' };

client.query(query).then(res => {

const data = res.rows;

console.log('all data');
data.forEach(row => {
    console.log(`Id: ${row[0]} Name: ${row[1]} Price: ${row[2]}`);
})

console.log('Sorted prices:');

const prices = data.map(x => x[2]);

const sorted = R.sort(R.comparator(R.lt), prices);
console.log(sorted);

}).finally(() => { client.end() });

The example shows all rows from the cars table. It enables the array row mode.

const query = { text: 'SELECT * FROM cars', rowMode: 'array' };

We use the configuration object where we set the rowMode to array.

console.log('all data');
data.forEach(row => {
    console.log(`Id: ${row[0]} Name: ${row[1]} Price: ${row[2]}`);
})

Now we loop over an array of arrays.

$ node row_mode.js
all data
Id: 1 Name: Audi Price: 52642
Id: 2 Name: Mercedes Price: 57127
Id: 3 Name: Skoda Price: 9000
Id: 4 Name: Volvo Price: 29000
Id: 5 Name: Bentley Price: 350000
Id: 6 Name: Citroen Price: 21000
Id: 7 Name: Hummer Price: 41400
Id: 8 Name: Volkswagen Price: 21600
Sorted prices:
[ 9000, 21000, 21600, 29000, 41400, 52642, 57127, 350000 ]

This is the output.

The node-postgres pooling example

Connection pooling improves the performance of a database application. It is especially useful for web applications.

pooled.js

const pg = require('pg');

var config = { user: 'postgres', password: 's$cret', database: 'ydb' }

const pool = new pg.Pool(config);

pool.connect()
    .then(client => {
        return client.query('SELECT * FROM cars WHERE id = $1', [1])
            .then(res => {
                client.release();
                console.log(res.rows[0]);
            })
            .catch(e => {
                client.release();
                console.log(e.stack);
            })
    }).finally(() => pool.end());

The example shows how to set up connection pooling. When we are done with a query, we call the client.release() method to return the connection to the pool.

}).finally(() => pool.end());

The pool.end() call drains the pool of all active clients, disconnects them, and shuts down any internal timers in the pool. This is used in scripts such as this example. In web applications, we can call it when the web server shuts down, or not call it at all.

In this tutorial, we have used node-postgres to interact with PostgreSQL in Node.js.

iris2008 commented 3 years ago

Javascript Notes

three dots

The array/object spread operator ... keeps the properties you spread in and overwrites the ones you give a new value.

const adrian = {
  fullName: 'Adrian Oprea',
  occupation: 'Software developer',
  age: 31,
  website: 'https://oprea.rocks'
};
const bill = {
  ...adrian,
  fullName: 'Bill Gates',
  website: 'https://microsoft.com'
};

When used in function parameters, it's called the rest operator.

function sum(...numbers) {
    return numbers.reduce((accumulator, current) => {
        return accumulator += current;
    });
};

sum(1,2) // 3
sum(1,2,3,4,5) // 15
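
The same three dots also spread arrays; a small sketch (variable names made up) that reuses the sum function above:

const firstHalf = [1, 2, 3];
const secondHalf = [4, 5];

// Spread copies the elements of both arrays into a new array
const all = [...firstHalf, ...secondHalf]; // [1, 2, 3, 4, 5]

// Spread can also expand an array into individual arguments
console.log(sum(...all)); // 15
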
iris2008 commented 3 years ago

How Sharding Works

Your application suddenly becomes popular. Traffic and data is starting to grow, and your database gets more overloaded every day. People on the internet tell you to scale your database by sharding, but you don’t really know what it means. You start doing some research, and run into this post. Welcome!

What is sharding?

Sharding is a method of splitting and storing a single logical dataset in multiple databases. By distributing the data among multiple machines, a cluster of database systems can store a larger dataset and handle additional requests. Sharding is necessary if a dataset is too large to be stored in a single database. Moreover, many sharding strategies allow additional machines to be added. Sharding allows a database cluster to scale along with its data and traffic growth.

Sharding is also referred to as horizontal partitioning. The distinction of horizontal vs vertical comes from the traditional tabular view of a database. A database can be split vertically — storing different tables & columns in a separate database, or horizontally — storing rows of the same table in multiple database nodes.

An illustrated example of vertical and horizontal partitioning

Example of vertical partitioning

fetch_user_data(user_id) -> db["USER"].fetch(user_id)
fetch_photo(photo_id) -> db["PHOTO"].fetch(photo_id)

Example of horizontal partitioning

fetch_user_data(user_id) -> user_db[user_id % 2].fetch(user_id)

Vertical partitioning is very domain specific. You draw a logical split within your application data, storing them in different databases. It is almost always implemented at the application level — a piece of code routing reads and writes to a designated database. In contrast, sharding splits a homogeneous type of data into multiple databases. You can see that such an algorithm is easily generalizable. That's why sharding can be implemented at either the application or database level. In many databases, sharding is a first-class concept, and the database knows how to store and retrieve data within a cluster. Almost all modern databases are natively sharded. Cassandra, HBase, HDFS, and MongoDB are popular distributed databases. Notable examples of non-sharded modern databases are Sqlite, Redis (spec in progress), Memcached, and Zookeeper.

There exist various strategies to distribute data into multiple databases. Each strategy has pros and cons depending on various assumptions a strategy makes. It is crucial to understand these assumptions and limitations. Operations may need to search through many databases to find the requested data. These are called cross-partition operations and they tend to be inefficient. Hotspots are another common problem — having an uneven distribution of data and operations. Hotspots largely counteract the benefits of sharding.

Before you start: you may not need to shard!

Sharding adds additional programming and operational complexity to your application. You lose the convenience of accessing the application's data in a single location. Managing multiple servers adds operational challenges. Before you begin, see whether sharding can be avoided or deferred.

Get a more expensive machine. Storage capacity is growing at the speed of Moore's law. From Amazon, you can get a server with 6.4 TB of SSD, 244 GB of RAM and 32 cores. Even in 2013, Stack Overflow ran on a single MS SQL server. (Some may argue that splitting Stack Overflow and Stack Exchange is a form of sharding.)

If your application is bound by read performance, you can add caches or database replicas. They provide additional read capacity without heavily modifying your application.

Vertically partition by functionality. Binary blobs tend to occupy large amounts of space and are isolated within your application. Storing files in S3 can reduce the storage burden. Other functionalities such as full text search, tagging, and analytics are best done by separate databases.

Not everything may need to be sharded. Often times, only a few tables occupy a majority of the disk space. Very little is gained by sharding small tables with hundreds of rows. Focus on the large tables.

Driving Principles

To compare the pros and cons of each sharding strategy, I'll use the following principles.

How the data is read — Databases are used to store and retrieve data. If we don't need to read data at all, we can simply write it to /dev/null. If we only need to batch process the data once in a while, we can append to a single file and periodically scan through them. Data retrieval requirements (or lack thereof) heavily influence the sharding strategy.

How the data is distributed — Once you have a cluster of machines acting together, it is important to ensure that data and work are evenly distributed. Uneven load causes storage and performance hotspots. Some databases redistribute data dynamically, while others expect clients to evenly distribute and access data.

How the data is redistributed — Once sharding is employed, redistributing data is an important problem. Once your database is sharded, it is likely that the data is growing rapidly. Adding an additional node becomes a regular routine. It may require changes in configuration and moving large amounts of data between nodes. It adds both performance and operational burden.

Common Definitions

Many databases have their own terminologies. The following terminologies are used throughout to describe different algorithms.

Shard or Partition Key is a portion of the primary key which determines how data should be distributed. A partition key allows you to retrieve and modify data efficiently by routing operations to the correct database. Entries with the same partition key are stored in the same node. A logical shard is a collection of data sharing the same partition key. A database node, sometimes referred to as a physical shard, contains multiple logical shards.

Case 1 — Algorithmic Sharding

One way to categorize sharding is algorithmic versus dynamic. In algorithmic sharding, the client can determine a given partition's database without any help. In dynamic sharding, a separate locator service tracks the partitions amongst the nodes.

An algorithmically sharded database, with a simple sharding function

Algorithmically sharded databases use a sharding function (partition_key) -> database_id to locate data. A simple sharding function may be "hash(key) % NUM_DB". Reads are performed within a single database as long as a partition key is given. Queries without a partition key require searching every database node. Non-partitioned queries do not scale with respect to the size of the cluster, thus they are discouraged.

Algorithmic sharding distributes data by its sharding function only. It doesn't consider the payload size or space utilization. To uniformly distribute data, each partition should be similarly sized. Fine grained partitions reduce hotspots — a single database will contain many partitions, and the sum of data between databases is statistically likely to be similar. For this reason, algorithmic sharding is suitable for key-value databases with homogeneous values.

Resharding data can be challenging. It requires updating the sharding function and moving data around the cluster. Doing both at the same time while maintaining consistency and availability is hard. A clever choice of sharding function can reduce the amount of transferred data. Consistent Hashing is such an algorithm.

Examples of such systems include Memcached. Memcached is not sharded on its own, but expects client libraries to distribute data within a cluster. Such logic is fairly easy to implement at the application level.
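
To make the idea concrete, here is a minimal JavaScript sketch of algorithmic sharding, assuming a fixed set of database handles and an MD5-based hash; all names (databases, shardFor, fetchUserData) are hypothetical:

const crypto = require('crypto');

// Hypothetical pool of database handles, one per physical shard
const databases = ['db0', 'db1', 'db2', 'db3'].map(name => ({
    name,
    fetch: key => `fetched ${key} from ${name}`,  // stand-in for a real query
}));
const NUM_DB = databases.length;

// The simple sharding function from the text: hash(key) % NUM_DB
function shardFor(partitionKey) {
    const digest = crypto.createHash('md5').update(String(partitionKey)).digest();
    return databases[digest.readUInt32BE(0) % NUM_DB];
}

// All operations for a given partition key land on the same database
function fetchUserData(userId) {
    return shardFor(userId).fetch(userId);
}

console.log(fetchUserData(42));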

Case 2 — Dynamic Sharding

A dynamic sharding scheme using range based partitioning.

In dynamic sharding, an external locator service determines the location of entries. It can be implemented in multiple ways. If the cardinality of partition keys is relatively low, the locator can be assigned per individual key. Otherwise, a single locator can address a range of partition keys.

To read and write data, clients need to consult the locator service first. Operation by primary key becomes fairly trivial. Other queries also become efficient depending on the structure of locators. In the example of range-based partition keys, range queries are efficient because the locator service reduces the number of candidate databases. Queries without a partition key will need to search all databases.

Dynamic sharding is more resilient to nonuniform distribution of data. Locators can be created, split, and reassigned to redistribute data. However, relocation of data and update of locators need to be done in unison. This process has many corner cases with a lot of interesting theoretical, operational, and implementational challenges.

The locator service becomes a single point of contention and failure. Every database operation needs to access it, thus performance and availability are a must. However, locators cannot be cached or replicated simply. Out of date locators will route operations to incorrect databases. Misrouted writes are especially bad — they become undiscoverable after the routing issue is resolved. Since the effect of misrouted traffic is so devastating, many systems opt for a high consistency solution. Consensus algorithms and synchronous replications are used to store this data. Fortunately, locator data tends to be small, so the computational costs associated with such a heavyweight solution tend to be low.

Due to its robustness, dynamic sharding is used in many popular databases. HDFS uses a Name Node to store filesystem metadata. Unfortunately, the name node is a single point of failure in HDFS. Apache HBase splits row keys into ranges. The range server is responsible for storing multiple regions. Region information is stored in Zookeeper to ensure consistency and redundancy. In MongoDB, the ConfigServer stores the sharding information, and mongos performs the query routing. ConfigServer uses synchronous replication to ensure consistency. When a config server loses redundancy, it goes into read-only mode for safety. Normal database operations are unaffected, but shards cannot be created or moved.
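
A corresponding sketch of the dynamic approach, with a hypothetical in-memory table of range-based locators standing in for the external locator service:

// Hypothetical range-based locator table: each entry maps a key range to a shard
const locators = [
    { minKey: 0,     maxKey: 4999,  shard: 'db0' },
    { minKey: 5000,  maxKey: 9999,  shard: 'db1' },
    { minKey: 10000, maxKey: 14999, shard: 'db2' },
];

// Clients consult the locator service before every read or write
function locateShard(partitionKey) {
    const entry = locators.find(l => partitionKey >= l.minKey && partitionKey <= l.maxKey);
    if (!entry) {
        throw new Error(`no locator covers key ${partitionKey}`);
    }
    return entry.shard;
}

console.log(locateShard(7321)); // db1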

Case 3 — Entity Groups

Entity Groups partition all related tables together.

Previous examples are geared towards key-value operations. However, many databases have more expressive querying and manipulation capabilities. Traditional RDBMS features such as joins, indexes and transactions reduce complexity for an application.

The concept of entity groups is very simple. Store related entities in the same partition to provide additional capabilities within a single partition. Specifically: queries within a single physical shard are efficient, and stronger consistency semantics can be achieved within a shard.

This is a popular approach to shard a relational database. In a typical web application data is naturally isolated per user. Partitioning by user gives the scalability of sharding while retaining most of its flexibility. It normally starts off as a simple company-specific solution, where resharding operations are done manually by developers. Mature solutions like Youtube's Vitess and Tumblr's Jetpants can automate most operational tasks.

Queries spanning multiple partitions typically have looser consistency guarantees than a single partition query. They also tend to be inefficient, so such queries should be done sparingly. However, a particular cross-partition query may be required frequently and efficiently. In this case, data needs to be stored in multiple partitions to support efficient reads. For example, chat messages between two users may be stored twice — partitioned by both senders and recipients. All messages sent or received by a given user are stored in a single partition. In general, many-to-many relationships between partitions may need to be duplicated.

Entity groups can be implemented either algorithmically or dynamically. They are usually implemented dynamically since the total size per group can vary greatly. The same caveats for updating locators and moving data around apply here. Instead of individual tables, an entire entity group needs to be moved together.

Other than sharded RDBMS solutions, Google Megastore is an example of such a system. Megastore is publicly exposed via Google App Engine's Datastore API.

Case 4 — Hierarchical keys & Column-Oriented Databases

Column-oriented databases partition their data by row keys.

Column-oriented databases are an extension of key-value stores. They add the expressiveness of entity groups with a hierarchical primary key. A primary key is composed of a pair (row key, column key). Entries with the same partition key are stored together. Range queries on columns limited to a single partition are efficient. That's why a column key is referred to as a range key in DynamoDB.

This model has been popular since the mid 2000s. The restriction given by hierarchical keys allows databases to implement data-agnostic sharding mechanisms and efficient storage engines. Meanwhile, hierarchical keys are expressive enough to represent sophisticated relationships. Column-oriented databases can model a problem such as time series efficiently.

Column-oriented databases can be sharded either algorithmically or dynamically. With small and numerous partitions, they have constraints similar to key-value stores. Otherwise, dynamic sharding is more suitable.

The term column database is losing popularity. Both HBase and Cassandra once marketed themselves as column databases, but not anymore. If I need to categorize these systems today, I would call them hierarchical key-value stores, since this is the most distinctive characteristic between them.

Originally published in 2005, Google BigTable popularized column-oriented databases amongst the public. Apache HBase is a BigTable-like database implemented on top of the Hadoop ecosystem. Apache Cassandra previously described itself as a column database — entries were stored in column families with row and column keys. CQL3, the latest API for Cassandra, presents a flattened data model — (partition key, column key) is simply a composite primary key. Amazon's Dynamo popularized highly available databases. Amazon DynamoDB is a platform-as-a-service offering of Dynamo. DynamoDB uses (hash key, range key) as its primary key.

Understanding the pitfalls

Many caveats are discussed above. However, there are other common issues to watch out for with many strategies.

A logical shard (data sharing the same partition key) must fit in a single node. This is the most important assumption, and is the hardest to change in the future. A logical shard is an atomic unit of storage and cannot span across multiple nodes. If a logical shard outgrows its node, the database cluster is effectively out of space. Having finer partitions mitigates this problem, but it adds complexity to both database and application. The cluster needs to manage additional partitions and the application may issue additional cross-partition operations.

Many web applications shard data by user. This may become problematic over time, as the application accumulates power users with a large amount of data. For example, an email service may have users with terabytes of email. To accommodate this, a single user's data is split into partitions. This migration is usually very challenging as it invalidates many core assumptions on the underlying data model.

Illustration of a hotspot at the end of the partition range even after numerous shard splits.

Even though dynamic sharding is more resilient to unbalanced data, an unexpected workload can reduce its effectiveness. In a range-partitioned sharding scheme, inserting data in partition key order creates hot spots. Only the last range will receive inserts. This partition range will split as it becomes large. However, out of the split ranges, only the latest range will receive additional writes. The write throughput of the cluster is effectively reduced to a single node. MongoDB, HBase, and Google Datastore discourage this.

In the case of dynamic sharding, it is bad to have a large number of locators. Since the locators are frequently accessed, they are normally served directly from RAM. HDFS's Name Node needs at least 150 bytes of memory per file for its metadata, thus storing a large number of files is prohibitive. Many databases allocate a fixed amount of resources per partition range. HBase recommends about 20~200 regions per server.

Concluding Remarks

There are many topics closely related to sharding not covered here. Replication is a crucial concept in distributed databases to ensure durability and availability. Replication can be performed agnostic to sharding or tightly coupled to the sharding strategies.

The details behind data redistribution are important. As previously mentioned, ensuring both the data and locators are in sync while the data is being moved is a hard problem. Many techniques make a tradeoff between consistency, availability, and performance. For example, HBase's region splitting is a complex multi-step process. To make it worse, a brief downtime is required during a region split.

None of this is magic. Everything follows logically once you consider how the data is stored and retrieved. Cross-partition queries are inefficient, and many sharding schemes attempt to minimize the number of cross-partition operations. On the other hand, partitions need to be granular enough to evenly distribute the load amongst nodes. Finding the right balance can be tricky. Of course, the best solution in software engineering is avoiding the problem altogether. As stated before, there are many successful websites operating without sharding. See if you can defer or avoid the problem altogether. Happy databasing!

iris2008 commented 3 years ago

GIT - How to merge branch from another repo

You can't merge a repository into a branch. You can merge a branch from another repository into a branch in your local repository. Assuming that you have two repositories, foo and bar both located in your current directory:

$ ls
foo bar

Change into the foo repository:

$ cd foo

Add the bar repository as a remote and fetch it:

$ git remote add bar ../bar
$ git remote update

Create a new branch baz in the foo repository based on whatever your current branch is:

$ git switch -c baz

Merge branch somebranch from the bar repository into the current branch:

$ git merge --allow-unrelated-histories bar/somebranch
(--allow-unrelated-histories is not required prior to git version 2.9)
iris2008 commented 3 years ago

Best Practice of AWS Lambda

Function code

Separate the Lambda handler from your core logic. This allows you to make a more unit-testable function. In Node.js this may look like:

exports.myHandler = function(event, context, callback) {
    var foo = event.foo;
    var bar = event.bar;
    var result = MyLambdaFunction(foo, bar);

    callback(null, result);
}

function MyLambdaFunction(foo, bar) {
    // MyLambdaFunction logic here
}

Take advantage of execution environment reuse to improve the performance of your function. Initialize SDK clients and database connections outside of the function handler, and cache static assets locally in the /tmp directory. Subsequent invocations processed by the same instance of your function can reuse these resources. This saves cost by reducing function run time.
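
A minimal Node.js sketch of that pattern, assuming a DynamoDB table whose name comes from a hypothetical TABLE_NAME environment variable:

const AWS = require('aws-sdk');

// Created once per execution environment and reused across invocations
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async function(event) {
    // The client above is not re-created on every invoke
    const result = await dynamo.get({
        TableName: process.env.TABLE_NAME,   // hypothetical table name from an environment variable
        Key: { id: event.id },
    }).promise();

    return result.Item;
};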

To avoid potential data leaks across invocations, don’t use the execution environment to store user data, events, or other information with security implications. If your function relies on a mutable state that can’t be stored in memory within the handler, consider creating a separate function or separate versions of a function for each user.

Use a keep-alive directive to maintain persistent connections. Lambda purges idle connections over time. Attempting to reuse an idle connection when invoking a function will result in a connection error. To maintain your persistent connection, use the keep-alive directive associated with your runtime. For an example, see Reusing Connections with Keep-Alive in Node.js.
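
One way to do this in Node.js with AWS SDK v2 is to pass an https agent with keepAlive enabled when constructing the client (a sketch; the SDK also honours the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable):

const https = require('https');
const AWS = require('aws-sdk');

// Reuse TCP connections instead of opening a new one per request
const agent = new https.Agent({ keepAlive: true });

const dynamo = new AWS.DynamoDB.DocumentClient({
    httpOptions: { agent },
});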

Use environment variables to pass operational parameters to your function. For example, if you are writing to an Amazon S3 bucket, instead of hard-coding the bucket name you are writing to, configure the bucket name as an environment variable.
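
A short sketch of that, assuming a hypothetical BUCKET_NAME environment variable configured on the function:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async function(event) {
    // BUCKET_NAME is configured on the function, not hard-coded here
    await s3.putObject({
        Bucket: process.env.BUCKET_NAME,
        Key: `uploads/${event.id}.json`,
        Body: JSON.stringify(event),
    }).promise();
};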

Control the dependencies in your function's deployment package. The AWS Lambda execution environment contains a number of libraries such as the AWS SDK for the Node.js and Python runtimes (a full list can be found here: Lambda runtimes). To enable the latest set of features and security updates, Lambda will periodically update these libraries. These updates may introduce subtle changes to the behavior of your Lambda function. To have full control of the dependencies your function uses, package all of your dependencies with your deployment package.

Minimize your deployment package size to its runtime necessities. This will reduce the amount of time that it takes for your deployment package to be downloaded and unpacked ahead of invocation. For functions authored in Java or .NET Core, avoid uploading the entire AWS SDK library as part of your deployment package. Instead, selectively depend on the modules which pick up components of the SDK you need (e.g. DynamoDB, Amazon S3 SDK modules and Lambda core libraries).

Reduce the time it takes Lambda to unpack deployment packages authored in Java by putting your dependency .jar files in a separate /lib directory. This is faster than putting all your function’s code in a single jar with a large number of .class files. See Deploy Java Lambda functions with .zip or JAR file archives for instructions.

Minimize the complexity of your dependencies. Prefer simpler frameworks that load quickly on execution environment startup. For example, prefer simpler Java dependency injection (IoC) frameworks like Dagger or Guice, over more complex ones like Spring Framework.

Avoid using recursive code in your Lambda function, wherein the function automatically calls itself until some arbitrary criteria is met. This could lead to unintended volume of function invocations and escalated costs. If you do accidentally do so, set the function reserved concurrency to 0 immediately to throttle all invocations to the function, while you update the code.
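
Reserved concurrency can be set from the console, the CLI, or the SDK; a hedged Node.js sketch using AWS SDK v2 (the function name is a placeholder):

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// Setting reserved concurrency to 0 throttles all invocations of the function
lambda.putFunctionConcurrency({
    FunctionName: 'my-runaway-function',        // placeholder name
    ReservedConcurrentExecutions: 0,
}).promise()
    .then(() => console.log('function throttled'))
    .catch(err => console.error(err));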

Function configuration

Performance testing your Lambda function is a crucial part of ensuring you pick the optimum memory size configuration. Any increase in memory size triggers an equivalent increase in CPU available to your function. The memory usage for your function is determined per-invoke and can be viewed in Amazon CloudWatch. On each invoke, a REPORT: entry will be made, as shown below:

REPORT RequestId: 3604209a-e9a3-11e6-939a-754dd98c7be3 Duration: 12.34 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 18 MB

By analyzing the Max Memory Used: field, you can determine if your function needs more memory or if you over-provisioned your function's memory size.

To find the right memory configuration for your functions, we recommend using the open source AWS Lambda Power Tuning project. For more information, see AWS Lambda Power Tuning on GitHub.

To optimize function performance, we also recommend deploying libraries that can leverage Advanced Vector Extensions 2 (AVX2). This allows you to process demanding workloads, including machine learning inferencing, media processing, high performance computing (HPC), scientific simulations, and financial modeling. For more information, see Creating faster AWS Lambda functions with AVX2.

Load test your Lambda function to determine an optimum timeout value. It is important to analyze how long your function runs so that you can better determine any problems with a dependency service that may increase the concurrency of the function beyond what you expect. This is especially important when your Lambda function makes network calls to resources that may not handle Lambda's scaling.

Use most-restrictive permissions when setting IAM policies. Understand the resources and operations your Lambda function needs, and limit the execution role to these permissions. For more information, see AWS Lambda permissions.

Be familiar with Lambda quotas. Payload size, file descriptors and /tmp space are often overlooked when determining runtime resource limits.

Delete Lambda functions that you are no longer using. By doing so, the unused functions won't needlessly count against your deployment package size limit.

If you are using Amazon Simple Queue Service as an event source, make sure the value of the function's expected invocation time does not exceed the Visibility Timeout value on the queue. This applies both to CreateFunction and UpdateFunctionConfiguration.

In the case of CreateFunction, AWS Lambda will fail the function creation process.

In the case of UpdateFunctionConfiguration, it could result in duplicate invocations of the function.

Metrics and alarms

Use AWS Lambda function metrics and CloudWatch alarms instead of creating or updating a metric from within your Lambda function code. It's a much more efficient way to track the health of your Lambda functions, allowing you to catch issues early in the development process. For instance, you can configure an alarm based on the expected duration of your Lambda function invocation in order to address any bottlenecks or latencies attributable to your function code.

Leverage your logging library and AWS Lambda Metrics and Dimensions to catch app errors (e.g. ERR, ERROR, WARNING, etc.)

Working with streams

Test with different batch and record sizes so that the polling frequency of each event source is tuned to how quickly your function is able to complete its task. BatchSize controls the maximum number of records that can be sent to your function with each invoke. A larger batch size can often more efficiently absorb the invoke overhead across a larger set of records, increasing your throughput.

By default, Lambda invokes your function as soon as records are available in the stream. If the batch that Lambda reads from the stream only has one record in it, Lambda sends only one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to five minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires.

Increase Kinesis stream processing throughput by adding shards. A Kinesis stream is composed of one or more shards. Lambda will poll each shard with at most one concurrent invocation. For example, if your stream has 100 active shards, there will be at most 100 Lambda function invocations running concurrently. Increasing the number of shards will directly increase the number of maximum concurrent Lambda function invocations and can increase your Kinesis stream processing throughput. If you are increasing the number of shards in a Kinesis stream, make sure you have picked a good partition key (see Partition Keys) for your data, so that related records end up on the same shards and your data is well distributed.

Use the Amazon CloudWatch IteratorAge metric to determine if your Kinesis stream is being processed. For example, configure a CloudWatch alarm with a maximum setting of 30000 (30 seconds).

iris2008 commented 3 years ago

YAML Syntax


This page provides a basic overview of correct YAML syntax, which is how Ansible playbooks (our configuration management language) are expressed.

We use YAML because it is easier for humans to read and write than other common data formats like XML or JSON. Further, there are libraries available in most programming languages for working with YAML.

You may also wish to read Working with playbooks at the same time to see how this is used in practice.

YAML Basics

For Ansible, nearly every YAML file starts with a list. Each item in the list is a list of key/value pairs, commonly called a “hash” or a “dictionary”. So, we need to know how to write lists and dictionaries in YAML.

There’s another small quirk to YAML. All YAML files (regardless of their association with Ansible or not) can optionally begin with --- and end with .... This is part of the YAML format and indicates the start and end of a document.

All members of a list are lines beginning at the same indentation level starting with a "- " (a dash and a space):

# A list of tasty fruits
- Apple
- Orange
- Strawberry
- Mango

A dictionary is represented as key: value pairs:

# An employee record
martin:
  name: Martin D'vloper
  job: Developer
  skill: Elite

More complicated data structures are possible, such as lists of dictionaries, dictionaries whose values are lists, or a mix of both (for example, a list of employee records).

Dictionaries and lists can also be represented in an abbreviated form:

martin: {name: Martin D'vloper, job: Developer, skill: Elite}
['Apple', 'Orange', 'Strawberry', 'Mango']

These are called “Flow collections”.

Ansible doesn’t really use these too much, but you can also specify a boolean value (true/false) in several forms:

create_key: yes
needs_agent: no
knows_oop: True
likes_emacs: TRUE
uses_cvs: false

Use lowercase ‘true’ or ‘false’ for boolean values in dictionaries if you want to be compatible with default yamllint options.

Values can span multiple lines using | or >. Spanning multiple lines using a “Literal Block Scalar” | will include the newlines and any trailing spaces. Using a “Folded Block Scalar” > will fold newlines to spaces; it’s used to make what would otherwise be a very long line easier to read and edit. In either case the indentation will be ignored. Examples are:

include_newlines: |
    exactly as you see
    will appear these three
    lines of poetry

fold_newlines: >
    this is really a
    single line of text
    despite appearances

While in the above > example all newlines are folded into spaces, there are two ways to enforce a newline to be kept:

fold_some_newlines: >
    a
    b

    c
    d
      e
    f

same_as: "a b\nc d\n e\nf\n"

Let’s combine what we learned so far in an arbitrary YAML example. This really has nothing to do with Ansible, but will give you a feel for the format:


# An employee record
name: Martin D'vloper
job: Developer
skill: Elite
employed: True
foods:
  - Apple
  - Orange
  - Strawberry
  - Mango

Gotchas

While you can put just about anything into an unquoted scalar, there are some exceptions. A colon followed by a space (or newline) ": " is an indicator for a mapping. A space followed by the pound sign " #" starts a comment.

Because of this, the following is going to result in a YAML syntax error:

foo: somebody said I should put a colon here: so I did

windows_drive: c:

…but this will work:

windows_path: c:\windows

You will want to quote hash values using colons followed by a space or the end of the line:

foo: 'somebody said I should put a colon here: so I did'

windows_drive: 'c:'

…and then the colon will be preserved.

Alternatively, you can use double quotes:

foo: "somebody said I should put a colon here: so I did"

windows_drive: "c:"

The difference between single quotes and double quotes is that in double quotes you can use escapes:

foo: "a \t TAB and a \n NEWLINE"

The list of allowed escapes can be found in the YAML Specification under “Escape Sequences” (YAML 1.1) or “Escape Characters” (YAML 1.2).

The following is invalid YAML:

foo: "an escaped \' single quote"

Further, Ansible uses “{{ var }}” for variables. If a value after a colon starts with a “{“, YAML will think it is a dictionary, so you must quote it, like so:

foo: "{{ variable }}"

If your value starts with a quote the entire value must be quoted, not just part of it. Here are some additional examples of how to properly quote things:

foo: "{{ variable }}/additional/string/literal"
foo2: "{{ variable }}\backslashes\are\also\special\characters"
foo3: "even if it's just a string literal it must all be quoted"

Not valid:

foo: "E:\path\"rest\of\path

In addition to ' and " there are a number of characters that are special (or reserved) and cannot be used as the first character of an unquoted scalar: [] {} > | * & ! % # ` @ ,.

You should also be aware of ? : -. In YAML, they are allowed at the beginning of a string if a non-space character follows, but YAML processor implementations differ, so it’s better to use quotes.

In Flow Collections, the rules are a bit more strict:

a scalar in block mapping: this } is [ all , valid

flow mapping: { key: "you { should [ use , quotes here" }

Boolean conversion is helpful, but this can be a problem when you want a literal yes or other boolean values as a string. In these cases just use quotes:

non_boolean: "yes"
other_string: "False"

YAML converts certain strings into floating-point values, such as the string 1.0. If you need to specify a version number (in a requirements.yml file, for example), you will need to quote the value if it looks like a floating-point value:

version: "1.0"

iris2008 commented 3 years ago

How to manage code in git

All production code merged into develop branch and master branch now.

To make our developers' lives easier, I suggest that all developers working on framework changes follow the guideline below going forward:

  1. Please use the master branch to trace production code only. After each production release, the developer who works on the production implementation should create a PR from the develop branch to the master branch and merge it.

  2. Please use the develop branch to trace next-release code only. Developers should create a feature branch from the develop branch for any changes that will be in the next release. Deploy the change to any DIT stage and test it first (sandbox/sandboxftr1/sandboxftr2/sandboxftr3). After testing on a DIT stage, create a PR from the feature branch to the develop branch if the change will be in the next release. Then deploy the develop branch to any SIT stage (DEV/SIT2/SIT3) and have QA test it.

  3. Please use release branches to trace future-release code. If changes will not be in the next release, but in a future release, please create a release branch from the develop branch first, then create a feature branch from the release branch and work on the feature branch. Deploy the change to any DIT stage and test it first (sandbox/sandboxftr1/sandboxftr2/sandboxftr3). After testing on a DIT stage, create a PR from the feature branch to the release branch, deploy the release branch to any SIT stage (DEV/SIT2/SIT3), and have QA test it. When that release becomes the next release, create a PR from the release branch to the develop branch and merge it.

  4. Please use stage branches (sandbox/sandboxftr1/sandboxftr2/sandboxftr3/dev/sit2/sit3/perf/perf2) for deployment only. Never use these branches to trace code changes. These branches can be deleted and be recreated anytime.

  5. Developers should delete their feature branches after the changes are in production and merged into the master branch.

Please let me know if you have any concerns

iris2008 commented 2 years ago

How to revert in GIT

This depends a lot on what you mean by "revert".

Temporarily switch to a different commit

If you want to temporarily go back to a commit, fool around, then come back to where you are, all you have to do is check out the desired commit:

This will detach your HEAD, that is, leave you with no branch checked out:

git checkout 0d1d7fc32

Or if you want to make commits while you're there, go ahead and make a new branch while you're at it:

git checkout -b old-state 0d1d7fc32

To go back to where you were, just check out the branch you were on again. (If you've made changes, as always when switching branches, you'll have to deal with them as appropriate. You could reset to throw them away; you could stash, checkout, stash pop to take them with you; you could commit them to a branch there if you want a branch there.)

Hard delete unpublished commits

If, on the other hand, you want to really get rid of everything you've done since then, there are two possibilities. One, if you haven't published any of these commits, simply reset:

This will destroy any local modifications.

Don't do it if you have uncommitted work you want to keep.

git reset --hard 0d1d7fc32

Alternatively, if there's work to keep:

git stash
git reset --hard 0d1d7fc32
git stash pop

This saves the modifications, then reapplies that patch after resetting.

You could get merge conflicts, if you've modified things which were changed since the commit you reset to.

If you mess up, you've already thrown away your local changes, but you can at least get back to where you were before by resetting again.

Undo published commits with new commits

On the other hand, if you've published the work, you probably don't want to reset the branch, since that's effectively rewriting history. In that case, you could indeed revert the commits. With Git, revert has a very specific meaning: create a commit with the reverse patch to cancel it out. This way you don't rewrite any history.

This will create three separate revert commits:

git revert a867b4af 25eee4ca 0766c053

It also takes ranges. This will revert the last two commits:

git revert HEAD~2..HEAD

Similarly, you can revert a range of commits using commit hashes (non inclusive of first hash):

git revert 0d1d7fc..a867b4a

Reverting a merge commit

git revert -m 1 <merge-commit>

To get just one, you could use rebase -i to squash them afterwards

Or, you could do it manually (be sure to do this at top level of the repo)

get your index and work tree into the desired state, without changing HEAD:

git checkout 0d1d7fc32 .

Then commit. Be sure and write a good message describing what you just did

git commit

The git-revert manpage actually covers a lot of this in its description. Another useful link is this git-scm.com section discussing git-revert.

If you decide you didn't want to revert after all, you can revert the revert (as described here) or reset back to before the revert (see the previous section).

You may also find this answer helpful in this case: How can I move HEAD back to a previous location? (Detached head) & Undo commits

iris2008 commented 2 years ago

AWS Template Mapping

Remember, you can't use !Ref ServiceName in the Mappings section. Otherwise, you will see an error like "Mapping attribute has to be String or List".

iris2008 commented 2 years ago

AWS CLI - How to see step function execution history/event

Get state machine arn

aws stepfunctions list-state-machines

Get execution arn

aws stepfunctions list-executions --state-machine-arn arn:aws:states:ca-central-1:33301:stateMachine:SCCG-GetDocumentFromBox-sit2-SF

list event

aws stepfunctions get-execution-history --execution-arn arn:aws:states:ca-central-1:xxxxx:execution:xxxxxxxx-SF:5e92c132-5ec4-41ff-9830-4c5481bb6d8c

iris2008 commented 2 years ago

How to debug in VSCode for npm test

Modify launch.json as below

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "type": "node",
            "request": "launch",
            "name": "Launch Program",
            "skipFiles": [
                "<node_internals>/**"
            ],
            "program": "${workspaceFolder}\\node_modules\\mocha\\bin\\mocha"
        }
    ]
}

Add a breakpoint in the .test.js file

Add a breakpoint in the real code to be tested

Click Run -> Start Debugging in VS Code (instead of running npm test)

iris2008 commented 2 years ago

AWS Usage Plan

Don't use API keys for authentication or authorization for your APIs. If you have multiple APIs in a usage plan, a user with a valid API key for one API in that usage plan can access all APIs in that usage plan. Instead, use an IAM role, a Lambda authorizer, or an Amazon Cognito user pool.
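
For reference, a minimal sketch of a Lambda TOKEN authorizer in Node.js; the token check and names here are purely illustrative:

// A minimal TOKEN authorizer: API Gateway passes the token and the method ARN
exports.handler = async function(event) {
    const token = event.authorizationToken;

    // Replace this check with real token validation (JWT verification, lookup, etc.)
    const effect = token === 'allow-me' ? 'Allow' : 'Deny';

    return {
        principalId: 'example-user',
        policyDocument: {
            Version: '2012-10-17',
            Statement: [{
                Action: 'execute-api:Invoke',
                Effect: effect,
                Resource: event.methodArn,
            }],
        },
    };
};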

iris2008 commented 1 year ago

Invoke an AWS.Lambda function from another lambda function

We can use the AWS SDK to invoke another lambda function and get the execution response.

When we have multiple lambda functions which are dependent on one another, we may need the execution output of another lambda function in the current lambda function in order to process it.

In such cases, we need to call/invoke the dependent lambda function from the current lambda function.

Let’s say we have two lambda functions

process PDF contents - process_pdf_invoice
update PDF contents in the database - save_invoice_info

We will be using Node.js to implement this scenario.

Function - process_pdf_invoice

Let’s write the code for process_pdf_invoice lambda function.

process_pdf_invoice.js

exports.handler = function(event, context) {
  // Let's return some dummy data
  const invoice = {
    "DueDate": "2013-02-15",
    "Balance": 1990.19,
    "DocNumber": "SAMP001",
    "Status": "Payable",
    "Line": [
      {
        "Description": "Sample Expense",
        "Amount": 500,
        "DetailType": "ExpenseDetail",
        "Customer": "ABC123 (Sample Customer)",
        "Ref": "DEF234 (Sample Construction)",
        "Account": "EFG345 (Fuel)",
        "LineStatus": "Billable"
      }
    ],
    "TotalAmt": 1990.19
  }
  return invoice

}

Function - save_invoice_info

Let's write the code for the lambda function save_invoice_info. To invoke the lambda function we need the AWS SDK (i.e. aws-sdk). By default it is available in AWS Lambda; if not, we need to install it as a layer.

save_invoice_info.js

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

async function getInvoiceInfo(){
  // params to send to lambda
  const params = {
    FunctionName: 'process_pdf_invoice',
    InvocationType: 'RequestResponse',
    LogType: 'None',
    Payload: '{}',
  };
  const response = await lambda.invoke(params).promise();
  if(response.StatusCode !== 200){
    throw new Error('Failed to get response from lambda function')
  }
  return JSON.parse(response.Payload);
}

exports.handler = async function(event, context) {
  // invoke and get info from `process_pdf_invoice`
  const invoice = await getInvoiceInfo();
  console.log('invoice', invoice);
  // now write the code to save data into database
  return {'status': 'saved'}
}

The code lambda.invoke(params).promise() invokes the lambda function and returns the response. If the invocation succeeds, the StatusCode is 200; otherwise it returns a 5XX code. response.Payload gives the response returned by the lambda function.

References

https://docs.aws.amazon.com/lambda/latest/dg/lambda-nodejs.html https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html#invoke-property

iris2008 commented 1 year ago

npm test issue

'nyc' is not recognized as an internal or external command

If npm install -g nyc doesn't resolve the issue, you may want to try npm install --save-dev nyc, which may get the job done.

iris2008 commented 1 year ago

AWS FIFO Queue

The first-in-first-out (FIFO) queue is the type of AWS SQS queue that guarantees order and provides exactly once delivery of messages. That sounds great, but there are some other important features to understand to avoid unexpected queue behaviour.

1) If a message fails to be processed, it may block other messages

When you send a message to a FIFO queue a message group id must be provided. This is a way to group messages, so that messages within that group are always received in order.

Message 1 is at the front of the queue. If it’s received by a consumer, but for whatever reasons fails to process and isn’t deleted, then no other messages with the same message group id can be received.

FIFO queue tip 1

When sending messages to the queue, choose the message group id carefully.

Messages that have the same message group id will be returned in order. While this is intended behaviour for a FIFO queue, remember that only once a message has been removed from the queue will the next message with the same message group id be returned.
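
For illustration, a minimal Node.js sketch of sending to a FIFO queue with a message group id (the queue URL, group id, and deduplication id are placeholders):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

sqs.sendMessage({
    QueueUrl: 'https://sqs.ca-central-1.amazonaws.com/123456789012/my-queue.fifo', // placeholder
    MessageBody: JSON.stringify({ orderId: 42 }),
    MessageGroupId: 'customer-42',          // messages in this group are delivered in order
    MessageDeduplicationId: 'order-42-v1',  // required unless content-based deduplication is enabled
}).promise()
    .then(res => console.log('sent', res.MessageId))
    .catch(err => console.error(err));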

Messages can be removed from the queue in these ways:

Deleted using the SQS delete message API
Deleted automatically once the message retention period has expired
Moved automatically to a dead-letter queue after the configured maximum receives

FIFO queue tip 2

When creating your FIFO queue, configure the visibility timeout based on the time it takes to process each queue message.

If you have long-running queue message processing, configure the visibility timeout to be greater than the maximum duration of this processing. The maximum value you can choose is 12 hours.

If 12 hours isn’t enough, consider creating a dead-letter queue and setting the maximum receives of your queue to 1. That way, your queue message will be processed at most once.

2) If you don’t set the visibility timeout correctly, your message may be re-processed

In fact 1 we saw that when we do multiple receive messages calls on an SQS FIFO queue only the first one returns a result. That was the case because all the messages had the same message group id, and SQS was maintaining message order.

What if we want to be able to receive the same message again to retry processing which may have failed? That's where the visibility timeout comes in. It configures how long after a message is received by one consumer it can be received again by another. The default visibility timeout is 30 seconds.

an initial receive message request returns the message
another receive message request within the visibility timeout returns no messages
after waiting for the visibility timeout to expire, another receive message request returns the message again
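
A small Node.js sketch of that receive/process/delete cycle (the queue URL is a placeholder; the explicit VisibilityTimeout simply overrides the queue default for this receive):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

const QueueUrl = 'https://sqs.ca-central-1.amazonaws.com/123456789012/my-queue.fifo'; // placeholder

async function processOne() {
    const res = await sqs.receiveMessage({
        QueueUrl,
        MaxNumberOfMessages: 1,
        VisibilityTimeout: 60,      // hide the message from other consumers for 60 seconds
        WaitTimeSeconds: 10,        // long polling
    }).promise();

    if (!res.Messages || res.Messages.length === 0) return;

    const msg = res.Messages[0];
    // ... process msg.Body here ...

    // Delete before the visibility timeout expires, or the message becomes receivable again
    await sqs.deleteMessage({ QueueUrl, ReceiptHandle: msg.ReceiptHandle }).promise();
}

processOne().catch(console.error);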

Long-running queue message processing

Now imagine a scenario where the processing of a queue message takes a long time, even hours. For example, you could be developing a system for an online media site, where each queue message is a video that needs transcoding into many different formats.

In this case, if you leave the visibility timeout as the default, then a new consumer will start processing your queue message every 30 seconds. That could use a lot of compute resources unnecessarily, and have other undesirable effects.

3) You can have a maximum of 20,000 inflight messages

A FIFO SQS queue has an important limit compared to standard SQS queues. The number of inflight messages is limited to 20,000.

A message is considered to be in flight after it is received from a queue by a consumer, but not yet deleted from the queue

That is AWS's definition of an in-flight message. It means that if you've got consumers that are currently processing 20,000 messages, then the next receive message request you make won't return anything. And that's the case even if you have messages with a different message group id from those already in flight.

The other less obvious implication of this limit is that even if you have only one message in flight, any other messages with the same message group id count towards the inflight limit.

FIFO queue tip 3

A FIFO queue has a maximum inflight message limit of 20,000. This could cause an issue if:

- you have many consumers – eventually your consumers won't be able to receive any more messages
- you have many messages with the same message group id – in this case you may be blocked from receiving messages with a different message group id

Consider the implications of the inflight message limit when designing your FIFO queue. You might want to:

- keep the number of messages with the same message group id low
- implement a dead-letter queue so that messages that fail processing are quickly moved out of the main queue

Final thoughts

You’ll understand now that there’s a bit more to think about with the SQS FIFO queue than first appears. Bearing the above in mind, you should be able to design a solution to meet your requirements.

SQS simulator

To try out some of the concepts introduced in this article right away, try this SQS simulator I've made available for you. You'll see how a FIFO queue behaves when sending, receiving, and deleting messages.

iris2008 commented 1 year ago

AWS SQS Issue

When you see that SQS cannot trigger a Lambda, you can try removing the block below, deploying, and then adding it back and deploying again:

  eventSourceMappingConfig: [{
    eventSourceArn: props.DocQueueARN.ssmparameter.value, 
    batchSize: 1,
    enabled: true,
  }],
iris2008 commented 1 year ago

Example code for RDS instance in AWS generated by ChatGPT

parameter_group_name = "mydb-param-group"
parameter_group_family = "postgres9.6"

description = "Custom parameter group for mydb"

parameters = {
    "max_connections": "100",
    "shared_buffers": "32MB",
    "effective_cache_size": "256MB",
    "maintenance_work_mem": "64MB",
    "checkpoint_completion_target": "0.7",
    "wal_buffers": "16MB",
    "default_statistics_target": "100",
    "random_page_cost": "4",
    "effective_io_concurrency": "200",
    "work_mem": "4MB",
    "min_wal_size": "1GB",
    "max_wal_size": "2GB",
    "max_worker_processes": "8",
    "max_parallel_workers_per_gather": "2",
    "max_parallel_workers": "4"
}
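These values could be applied as an RDS parameter group in CDK, for example with the low-level CfnDBParameterGroup construct from @aws-cdk/aws-rds (a hedged sketch; the construct id and description are placeholders):

const rds = require('@aws-cdk/aws-rds');

// Low-level (CloudFormation) parameter group carrying the values above.
const paramGroup = new rds.CfnDBParameterGroup(this, 'MyDbParamGroup', {
  family: 'postgres9.6',
  description: 'Custom parameter group for mydb',
  parameters: {
    max_connections: '100',
    shared_buffers: '32MB',
    effective_cache_size: '256MB',
    // ...remaining parameters from the block above
  },
});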

const rds = require('@aws-cdk/aws-rds');
const ec2 = require('@aws-cdk/aws-ec2');

// DatabaseCluster creates the cluster and its instances via the `instances`
// property. It supports Aurora engines (Aurora PostgreSQL here); for a
// standalone PostgreSQL instance, use DatabaseInstance instead.
// `myVpc` is assumed to be defined elsewhere in the stack.
const myCluster = new rds.DatabaseCluster(this, 'MyCluster', {
    engine: rds.DatabaseClusterEngine.AURORA_POSTGRESQL,
    masterUser: {
        username: 'admin'
    },
    instances: 1,
    instanceProps: {
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.BURSTABLE2, ec2.InstanceSize.SMALL),
        vpc: myVpc,
        vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    },
});

In this example, we are setting several parameters to customize the behavior of the PostgreSQL instance, such as:

max_connections: The maximum number of client connections allowed.
shared_buffers: The amount of memory used for shared memory buffers.
effective_cache_size: The amount of memory the system considers to be available for disk caching.
maintenance_work_mem: The amount of memory used for maintenance operations.
checkpoint_completion_target: The fraction of the maximum time between automatic WAL file checkpoints.
wal_buffers: The amount of memory used for WAL buffers.
default_statistics_target: The default target for statistics collection.
random_page_cost: The cost of a non-sequentially fetched disk page.
effective_io_concurrency: The number of concurrent disk I/O operations.
work_mem: The amount of memory used for sort and hash operations.
min_wal_size: The minimum size to shrink the WAL to.
max_wal_size: The maximum size to allow the WAL to grow to.
max_worker_processes: The maximum number of worker processes.
max_parallel_workers_per_gather: The maximum number of parallel workers per gather.
max_parallel_workers: The maximum number of parallel workers that can be active.
It's worth noting that the values in this example are just examples and may not be appropriate for all use cases. You should consult the PostgreSQL documentation and tune these parameters to suit your specific requirements.
iris2008 commented 1 year ago

Example of two RDS instance share the same subnet

Sure, here's an example of how you might launch multiple RDS instances of PostgreSQL in the same subnet using the AWS CLI:

First, you would need to create a subnet group that contains the subnet you want to launch the instances in. You can use the create-db-subnet-group command to create a subnet group:

$ aws rds create-db-subnet-group --db-subnet-group-name mydb-subnet-group --db-subnet-group-description "My DB Subnet Group" --subnet-ids subnet-01234567890abcdef0 subnet-01234567890abcdef1

This command creates a subnet group named "mydb-subnet-group" that contains two subnets, "subnet-01234567890abcdef0" and "subnet-01234567890abcdef1".

Then, you can use the create-db-instance command to launch multiple RDS instances in the same subnet group:

$ aws rds create-db-instance --db-name mydb --db-instance-identifier mydb1 --db-instance-class db.t2.micro --engine postgres --master-username admin --master-user-password mypassword --allocated-storage 20 --db-subnet-group-name mydb-subnet-group

$ aws rds create-db-instance --db-name mydb --db-instance-identifier mydb2 --db-instance-class db.t2.micro --engine postgres --master-username admin --master-user-password mypassword --allocated-storage 20 --db-subnet-group-name mydb-subnet-group

These commands create two RDS instances named "mydb1" and "mydb2" running the "postgres" engine, with instance class "db.t2.micro" and 20 GB of storage each, in the "mydb-subnet-group" subnet group.

You can also use the AWS Management Console or the RDS API to launch multiple RDS instances in the same subnet; the process is quite similar.

It's worth noting that, for high availability and failover, it is recommended to spread the RDS instances across subnets in different Availability Zones within the same VPC. With Multi-AZ enabled, if one Availability Zone becomes unavailable, an RDS instance in that zone can fail over to its standby in another subnet within the same VPC.

It's also important to ensure that the subnets you are launching the RDS instances in have the correct routing, security groups, and other necessary configuration to allow the RDS instances to reach the Internet and other Amazon services where required.

iris2008 commented 1 year ago

AWS API and Lambda integration issue - 500: API Configuration Error

If you have the wrong Lambda function name in "x-amazon-apigateway-integration/uri", you will see this error:

Wed Apr 19 17:52:47 UTC 2023 : Execution failed due to configuration error: Invalid permissions on Lambda function
Wed Apr 19 17:52:47 UTC 2023 : Gateway response type: API_CONFIGURATION_ERROR with status code: 500

Pay attention to upper/lower case; a case mismatch in the function name is a frequent cause of this error.

iris2008 commented 1 year ago

Difference between npm install and npm ci

In short, the main differences between using npm install and npm ci are:

- The project must have an existing package-lock.json or npm-shrinkwrap.json.
- If dependencies in the package lock do not match those in package.json, npm ci will exit with an error, instead of updating the package lock.
- npm ci can only install entire projects at a time: individual dependencies cannot be added with this command.
- If a node_modules is already present, it will be automatically removed before npm ci begins its install.
- It will never write to package.json or any of the package-locks: installs are essentially frozen.

Essentially, npm install reads package.json to create a list of dependencies and uses package-lock.json to inform which versions of these dependencies to install. If a dependency is not in package-lock.json, it will be added by npm install.

npm ci (also known as Clean Install) is meant to be used in automated environments — such as test platforms, continuous integration, and deployment — or, any situation where you want to make sure you're doing a clean install of your dependencies.

It installs dependencies directly from package-lock.json and uses package.json only to validate that there are no mismatched versions. If any dependencies are missing or have incompatible versions, it will throw an error.

Use npm install to add new dependencies and to update dependencies on a project. Usually, you would use it during development after pulling changes that update the list of dependencies, but it may be a good idea to use npm ci in this case.

Use npm ci if you need a deterministic, repeatable build. For example during continuous integration, automated jobs, etc. and when installing dependencies for the first time, instead of npm install.

npm install

- Installs a package and all its dependencies. Dependencies are driven by npm-shrinkwrap.json and package-lock.json (in that order).
- Without arguments: installs dependencies of a local module.
- Can install global packages.
- Will install any missing dependencies in node_modules.
- It may write to package.json or package-lock.json. When used with an argument (npm i packagename) it may write to package.json to add or update the dependency. When used without arguments (npm i) it may write to package-lock.json to lock down the versions of some dependencies if they are not already in this file.

npm ci

- Requires at least npm v5.7.1.
- Requires package-lock.json or npm-shrinkwrap.json to be present.
- Throws an error if dependencies from these two files don't match package.json.
- Removes node_modules and installs all dependencies at once.
- It never writes to package.json or package-lock.json.

Algorithm

While npm ci generates the entire dependency tree from package-lock.json or npm-shrinkwrap.json, npm install updates the contents of node_modules using the following algorithm (source):

- load the existing node_modules tree from disk
- clone the tree
- fetch the package.json and assorted metadata and add it to the clone
- walk the clone and add any missing dependencies
  - dependencies will be added as close to the top as is possible without breaking any other modules
- compare the original tree with the cloned tree and make a list of actions to take to convert one to the other
- execute all of the actions, deepest first
  - kinds of actions are install, update, remove and move

iris2008 commented 1 year ago

How to fix npm vulnerability issue found in package-lock.json file

Final solution:

  1. Run npm update --save to update packages to the latest version

  2. Use npm ls or “https://compulim.github.io/lock-walker/” to find the nested dependency, say “json-schema”

  3. Run npm install json-schema --save to update package.json and package-lock.json

  4. Update package.json and add the following override for "json-schema"

    "overrides": { "json-schema": "$json-schema" }

  5. Repeat 2 - 4 for the next vulnerability

iris2008 commented 1 year ago

How JWT token validate

RS256 (asymmetric, or public-key, cryptography) involves two keys: a public key and a private key. The private key is used to generate the signature, whereas the public key is used to validate it. In this case the private key is only in the possession of the authentication server that generated the JWT, and we no longer need to distribute the private key. On the resource server we can validate the token using the public key. The keys are not interchangeable: one can only be used to generate the signature and the other can only be used to validate it.

JSON Web Key Set (JWKS)

One question that arises is how we can get the public key. The JSON Web Key Set (JWKS) is a set of keys that contains the public keys used to verify any JSON Web Token (JWT) issued by the authorization server. Most authorization servers expose a discovery endpoint, like https://YOUR_DOMAIN/.well-known/openid-configuration. You can use this endpoint to configure your application or API to automatically locate the JSON Web Key Set endpoint (jwks_uri), which contains the public key used to sign the JWT.
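A minimal sketch of this validation flow in Node.js, assuming the jsonwebtoken and jwks-rsa packages and a hypothetical issuer domain:

const jwt = require('jsonwebtoken');
const jwksClient = require('jwks-rsa');

// Hypothetical JWKS endpoint; replace with your authorization server's jwks_uri.
const client = jwksClient({
  jwksUri: 'https://YOUR_DOMAIN/.well-known/jwks.json'
});

// Look up the public key matching the `kid` in the token header.
function getKey(header, callback) {
  client.getSigningKey(header.kid, (err, key) => {
    if (err) return callback(err);
    callback(null, key.getPublicKey());
  });
}

// `token` is the incoming JWT, e.g. taken from the Authorization header.
// The signature is checked with the public key; the private key never
// leaves the authorization server.
jwt.verify(token, getKey, { algorithms: ['RS256'] }, (err, decoded) => {
  if (err) {
    console.error('Token validation failed', err);
  } else {
    console.log('Token payload', decoded);
  }
});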

iris2008 commented 1 year ago

How CSR works in CA

The end product (the signed certificate by CA): Does it contain server's private key or public key?

The certificate is a public document. It therefore can only contain the public key. If it contained the private key, then that key wouldn't be private any more.

When initiating a CSR, why does a server need to sign the CSR with its private key? Is that correct?

Yes, it is generally correct. This concept is called Proof of Possession (PoPo) and it is used to prove to the CA that you (or the server in this case) have the private key corresponding to the public key which will be signed by the CA (or at least had it at the time just before the CA signed your certificate). If the CA didn't insist on PoPo, then you could repudiate any future signed message as follows:

1. You have your public key signed by the CA to create your certificate. At the time, you sign your request with your private key as you should. Everything is good.
2. I come along and copy your public key from your certificate. I now present that to the CA as a CSR but without PoPo.
3. The CA signs it and sends me a certificate, which now contains my name and your public key.
4. At some point, you send a digitally signed (with your private key) message to a third party, say your bank, asking them to donate $1000 to Stack Overflow.
5. You later decide that the $1000 would be better spent on a vacation, so you dispute the signed message to your bank. The bank says: "But you digitally signed the message to authenticate it!!"
6. As you know the CA signs certificates without PoPo, you simply have to say that I must have sent the message instead, using your private key which I've now destroyed in an attempt to hide the evidence.
7. The bank cannot prove that (6) isn't true as they didn't check I had possession of the private key corresponding to the public key in my request, and therefore your statement of "it wasn't me" cannot be rejected - the bank has to reimburse you.

If the bank insisted on PoPo when I submitted your public key to the CA, my request would have failed and you could not repudiate your message later. But once a CA signs a request without PoPo - all bets are off for non-repudiation.

Finally, does the CA generate a certificate from the CSR, and how does it derive the server's public key from the CSR?

There is no derivation to do - your server's public key is in the request in a construct called a CertificateRequestInfo.

This CertificateRequestInfo contains your (or server's) name and the public key. It can also contain other elements such as requested extensions. The CA takes whatever information it requires from this CertificateRequestInfo (only the public key is mandatory) and uses the info to generate a construct called a tbsCertificate (the 'tbs' stands for To Be Signed). This construct contains your name, your public key and whatever extensions the CA deems fit. It then signs this tbsCertificate to create your certificate.

iris2008 commented 1 year ago

AWS Lambda subnet configuration

Connecting Lambda functions to your VPC

A Lambda function always runs inside a VPC owned by the Lambda service. By default, a Lambda function isn't connected to VPCs in your account. When you connect a function to a VPC in your account, the function can't access the internet unless your VPC provides access.

Lambda accesses resources in your VPC using a Hyperplane ENI. Hyperplane ENIs provide NAT capabilities from the Lambda VPC to your account VPC using VPC-to-VPC NAT (V2N). V2N provides connectivity from the Lambda VPC to your account VPC, but not in the other direction.

When you create a Lambda function (or update its VPC settings), Lambda allocates a Hyperplane ENI for each subnet in your function's VPC configuration. Multiple Lambda functions can share a network interface, if the functions share the same subnet and security group.

To connect to another AWS service, you can use VPC endpoints for private communications between your VPC and supported AWS services. An alternative approach is to use a NAT gateway to route outbound traffic to another AWS service.

To give your function access to the internet, route outbound traffic to a NAT gateway in a public subnet. The NAT gateway has a public IP address and can connect to the internet through the VPC's internet gateway.

For information about how to configure Lambda VPC networking, see Connecting outbound networking to resources in a VPC and Connecting inbound interface VPC endpoints for Lambda.

Shared subnets

VPC sharing allows multiple AWS accounts to create their application resources, such as Amazon EC2 instances and Lambda functions, in shared, centrally-managed virtual private clouds (VPCs). In this model, the account that owns the VPC (owner) shares one or more subnets with other accounts (participants) that belong to the same AWS Organization.

To access private resources, connect your function to a private shared subnet in your VPC. The subnet owner must share a subnet with you before you can connect a function to it. The subnet owner can also unshare the subnet at a later time, thereby removing connectivity. For details on how to share, unshare, and manage VPC resources in shared subnets, see How to share your VPC with other accounts in the Amazon VPC guide.

AWS::Lambda::Function VpcConfig

The VPC security groups and subnets that are attached to a Lambda function. When you connect a function to a VPC, Lambda creates an elastic network interface for each combination of security group and subnet in the function's VPC configuration. The function can only access resources and the internet through that VPC. For more information, see VPC Settings.
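As a minimal sketch of this configuration with CDK v2 (aws-cdk-lib), where the VPC, security group, and asset path are assumed to exist, connecting a function to private subnets might look like this; CloudFormation renders these props as the AWS::Lambda::Function VpcConfig:

const lambda = require('aws-cdk-lib/aws-lambda');
const ec2 = require('aws-cdk-lib/aws-ec2');

// Hypothetical function connected to private subnets of an existing VPC.
// Lambda creates Hyperplane ENIs in the chosen subnets/security group.
const fn = new lambda.Function(this, 'VpcConnectedFn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  vpc: myVpc,                                                  // assumed to exist
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  securityGroups: [mySecurityGroup],                           // assumed to exist
});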

iris2008 commented 10 months ago

Error 407 related to Proxy authentication

I just don't understand why we still got a "407" error even though the Company Root cert and Proxy Root cert were already provided in the request below.

The root CA trust will eliminate possible SSL errors. A 407 points to the proxy credentials instead, typically the AD account used to run the service or code. If you cannot supply valid credentials, bypassing proxy authentication is the way to go.
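If you do need to authenticate to the proxy from Node.js code, a minimal sketch (assuming the https-proxy-agent package, v7-style named export, and hypothetical proxy host, credentials, and target URL) would be:

const https = require('https');
const { HttpsProxyAgent } = require('https-proxy-agent');

// Hypothetical proxy URL with credentials; missing or wrong user:pass here
// is what typically produces the 407 Proxy Authentication Required error.
const agent = new HttpsProxyAgent('http://user:password@proxy.example.com:8080');

https.get('https://api.example.com/health', { agent }, (res) => {
  console.log('status:', res.statusCode);
}).on('error', (err) => console.error(err));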