sassoftware / viya4-iac-aws

This project contains Terraform configuration files to provision infrastructure components required to deploy SAS Viya platform products on Amazon AWS.
Apache License 2.0

Issues with external PostgreSQL on AWS RDS #270

Open. miaeyg opened this issue 7 months ago

miaeyg commented 7 months ago

I have an issue and a question about using an external PostgreSQL on AWS (RDS) instead of the default SAS internal PostgreSQL (Crunchy):

  1. I think there is an error in the description of the "storage_size" parameter described here: https://github.com/sassoftware/viya4-iac-aws/blob/main/docs/CONFIG-VARS.md#postgresql-server The description says: "Max storage allowed for the PostgreSQL server in MB", but looking at https://registry.terraform.io/modules/terraform-aws-modules/rds/aws/latest, the parameter "allocated_storage" (which is fed by the "storage_size" input variable) is measured in GB, not MB: "The allocated storage in gigabytes". In addition, the parameter's default value of 50 seems low compared to the Crunchy default value of 128 GB.

  2. I installed SAS with an external PostgreSQL and used the default value for "ssl_enforcement_enabled", which is "true", and now my SAS services do not start. In the log of sas-logon-app (as an example) I see the Java SSL exception below. My question is: how do I fix this exception?

java.lang.IllegalStateException: org.postgresql.util.PSQLException: SSL error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at com.sas.commons.db.postgres.PostgresDbInitializer.doCreate(PostgresDbInitializer.java:199)
    at com.sas.commons.db.postgres.PostgresDbInitializer.createSchema(PostgresDbInitializer.java:158)

I downloaded the AWS CA for my region in PEM format from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html#UsingWithRDS.SSL.RegionCertificates and saved it in a local directory. Then, when I invoked the DaC project, I added this option to the invocation of the viya4-deployment Ansible project: --volume /home/sasinst/deployments/aws-ca:/config/v4_cfg_tls_trusted_ca_certs viya4-deployment, and added this line to ansible-vars.yaml: V4_CFG_TLS_TRUSTED_CA_CERTS: "/home/sasinst/deployments/aws-creds". It did not help.

dhoucgitter commented 7 months ago

Hi @miaeyg. The steps that you've followed to download the AWS CA file in PEM format are what you need to fix the error. Since you are using the DAC docker container and specifying the volume mapping as --volume /home/sasinst/deployments/aws-ca:/config/v4_cfg_tls_trusted_ca_certs, the docker-entrypoint.sh script should set the correct V4_CFG_TLS_TRUSTED_CA_CERTS environment variable value for you when the playbook runs inside the container. Try commenting out the V4_CFG_TLS_TRUSTED_CA_CERTS setting that you have in ansible-vars.yaml to see if that helps. The value you have set for that variable reflects what you might use if you were running DAC using the locally installed Ansible binary since it points to a location on your local workstation.

miaeyg commented 7 months ago

Hi @dhoucgitter, I tried your suggestion as well and got the same result (Java SSL exceptions in all SAS services).

dhoucgitter commented 7 months ago

Hi @miaeyg, if you haven't already, did you run DAC with the "viya,uninstall" tags to remove your previous deployment and then reinstall with "viya,install"? Also, scan for the string "Running: ansible-playbook" in the initial lines of the docker run output to inspect the value being set for V4_CFG_TLS_TRUSTED_CA_CERTS.

miaeyg commented 7 months ago

Hi @dhoucgitter, each time I run a test I always uninstall and reinstall. I can tell you that, looking at the deployment folders created by DaC, I saw these files created:

site-config/vdm/security/cacerts/il-central-1-bundle.pem
site-config/vdm/generators/customer-provided-ca-certificates.yaml

The contents of the second file are:

apiVersion: builtin
kind: ConfigMapGenerator
metadata:
  name: sas-customer-provided-ca-certificates
behavior: merge
files:
  - cacerts1.pem=site-config/vdm/security/cacerts/il-central-1-bundle.pem

The second file is also included in the "kustomization.yaml" like this:

generators:
## vdm defined generators (pre)
- site-config/vdm/generators/sas-license.yaml
- site-config/vdm/generators/sas-shared-config.yaml
- site-config/vdm/generators/sas-consul-config-secret.yaml
- site-config/vdm/generators/ingress-input.yaml
- site-config/vdm/generators/openldap-bootstrap-config.yaml
- site-config/vdm/generators/postgres-default-secrets.yaml
- site-config/vdm/generators/customer-provided-ca-certificates.yaml
- site-config/vdm/generators/customer-provided-merge-sas-certframe-configmap.yaml

Note the customer-provided-ca-certificates.yaml line in that list.

So it looks like the AWS CA PEM file was captured in the DaC docker container and included in the correct places.

miaeyg commented 7 months ago

In addition to my previous reply I want to suggest that the generated file "site-config/vdm/generators/customer-provided-ca-certificates.yaml" contains a syntax error "cacerts1.pem=" in the last line:


apiVersion: builtin
kind: ConfigMapGenerator
metadata:
  name: sas-customer-provided-ca-certificates
behavior: merge
files:
  - cacerts1.pem=site-config/vdm/security/cacerts/il-central-1-bundle.pem

The source of this possible syntax error is here https://github.com/sassoftware/viya4-deployment/blob/4ef910891e6d6e6dca0bccbb7fd013c93a589ae0/roles/vdm/templates/generators/customer-provided-ca-certificates.yaml#L8

By the way, I tried to fix this by removing the "cacerts1.pem=" text and it did not help.

dhoucgitter commented 7 months ago

In addition to my previous reply I want to suggest that the generated file "site-config/vdm/generators/customer-provided-ca-certificates.yaml" contains a syntax error "cacerts1.pem=" in the last line:

The DAC generated customer-provided-ca-certificates.yaml file I have contains the same format. There are other .yaml files that contain a similar

files:
  - token=filespec

format as well so I do not expect that is causing the problem.

dhoucgitter commented 7 months ago

Hi @miaeyg, thanks for noticing that and good catch. The description for the storage_size input variable should indicate that the units are in GB, not MB. I will open a ticket to fix that doc reference. Will also handle adjusting the default size if needed.

  1. I think there is an error in the description of the "storage_size" parameter described here: https://github.com/sassoftware/viya4-iac-aws/blob/main/docs/CONFIG-VARS.md#postgresql-server The description says: "Max storage allowed for the PostgreSQL server in MB", but looking at https://registry.terraform.io/modules/terraform-aws-modules/rds/aws/latest, the parameter "allocated_storage" (which is fed by the "storage_size" input variable) is measured in GB, not MB: "The allocated storage in gigabytes".

miaeyg commented 7 months ago

Thanks.

Note that the default PVC size for the internal PG is 128 GB, so one would expect the default size for the external PG to be the same.

About the SSL problem: I think I know the reason, but I can be sure only later today. I believe I downloaded the CAs of the wrong region... I have two accounts in different regions and I accidentally downloaded the CAs of the other region. I am quite sure this is the root cause. I will update again once I test with the correct region.

miaeyg commented 7 months ago

Hi @dhoucgitter, about the SSL problem: my bad! I was using the wrong AWS CA PEM file, as I downloaded by mistake the PEM for a different region than the one my RDS was running in (I have two accounts in different regions, so I downloaded the PEM appropriate for the second account). I got it to work now. Sorry for the mess.

I have another related question: the generated RDS instance is marked as "Publicly accessible=yes". What if I do not want to make it public? How can I make it private rather than public using this Terraform project? Is a non-public instance supported?

Also, how can I control which AZ the generated RDS instance will be created in using this project? SAS recommends that the RDS instance be in the same AZ as the SAS instances to avoid latency and improve performance.

Also, it seems like the RDS instance generated by this project uses the following CA from AWS: "rds-ca-2019". According to AWS (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL-certificate-rotation.html), they recommend considering newer CAs, as this CA expires in August 2024 and is not auto-rotated. How can I change the CA in use to the one AWS recommends, "rds-ca-rsa2048-g1", from this Terraform project?

dhoucgitter commented 7 months ago

Hi @miaeyg, where do you see that your RDS instance is using the rds-ca-2019 CA? In the external RDS instance I have created with IAC AWS, I am seeing rds-ca-rsa2048-g1 as the CA that is being used. I am wondering if this could be a regional difference. This is the CA shown for my RDS instance:

image

miaeyg commented 7 months ago

Hi @dhoucgitter, I have already destroyed my environment, so I do not have a screenshot for you, but I went to RDS, saw the DB instance, clicked on it, and in the details screen there was a "Certificate Authority" option whose value was "rds-ca-2019". I am sure about it. I created the RDS instance in the "us-east-1" region. Is this the same region you are testing in?

Is there an option to control the CA used for the RDS instance within any of the parameters/options of the "postgres_servers" variable in the terraform.tfvars file?

Here is my "postgres_servers" section from terraform.tfvars:

postgres_servers = {
  default = {
    "server_version"          = "15"
    "instance_type"           = "db.r6idn.xlarge"
    "storage_size"            = 128
    "ssl_enforcement_enabled" = true
    "administrator_login"     = "pgadmin"
    "administrator_password"  = "my$up3rS3cretPassw00rd"
  }
}

dhoucgitter commented 7 months ago

Found an AWS blog entry that seems relevant to the discussion:

Now through January 25, 2024 – new RDS DB instances will have the rds-ca-2019 certificate by default, unless you specify a different CA via the ca-certificate-identifier option on the create-db-instance API; or you specify a default CA override for your account like mentioned in the above section. Starting January 26, 2024 – any new database instances will default to using the rds-ca-rsa2048-g1 certificate. If you wish for new instances to use a different certificate, you can specify which certificate to use with the AWS console or the AWS CLI.

miaeyg commented 7 months ago

Thanks @dhoucgitter! Just to mention that I created the RDS database instance after January 26th and still got rds-ca-2019, but I can try again. What about my other two questions? Here they are again:

I have another related question: the generated RDS instance is marked as "Publicly accessible=yes". What if I do not want to make it public? How can I make it private rather than public using this Terraform project? Is a non-public instance supported?

Also, how can I control which AZ the generated RDS instance will be created in using this project? SAS recommends that the RDS instance be in the same AZ as the SAS instances to avoid latency and improve performance.

dhoucgitter commented 7 months ago

Hi @miaeyg, the CA being assigned to new RDS instances seems to differ between regions. I have opened an AWS case asking about the discrepancy between RDS instances created in us-west-1, where they are assigned the rds-ca-rsa2048-g1 CA, and us-east-1, where they were still getting the rds-ca-2019 CA as of yesterday.

how can I control which AZ the generated RDS instance will be created in using this project? SAS recommends that the RDS instance be in the same AZ as the SAS instances to avoid latency and improve performance.

The RDS instance AZs are controlled by the AZs of the subnets that the RDS instance is using. The last few sentences of the first paragraph in the Subnet Requirements section help explain the possibilities.
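
As an illustration only, here is a sketch of what this could look like in terraform.tfvars, assuming a BYON-style configuration that uses the subnet_ids map from CONFIG-VARS.md; all subnet IDs and the il-central-1a/1b AZs are placeholders, not values from this thread.

# Hypothetical BYON subnet layout (alongside your other BYON settings).
# Subnet IDs and AZ assignments below are placeholders.
subnet_ids = {
  "public"   = ["subnet-public-a"]                         # il-central-1a
  "private"  = ["subnet-private-a"]                        # il-central-1a (SAS node pools)
  # First database subnet shares the AZ of the private subnet; the second one
  # sits in another AZ only to satisfy the RDS two-AZ subnet group requirement.
  "database" = ["subnet-database-a", "subnet-database-b"]  # il-central-1a, il-central-1b
}

Per the observation above, RDS is expected to place the instance in the AZ of one of the database subnets; the ordering shown here is only an attempt to steer it toward the private subnet's AZ, not a guarantee.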

miaeyg commented 7 months ago

Hi @dhoucgitter thanks.

Reading the doc you pointed me to says:

Amazon RDS relies on having two subnets in separate AZs to provide database redundancy should a failure occur in one AZ.

Can I assume that the first database subnet (and hence the corresponding AZ) in the list provided in the terraform.tfvars file will be used for the RDS master node and that the second database subnet in the list will be used for the RDS standby node? If not, then I do not understand how I can force the RDS master to be in the same AZ as the private subnet (where all SAS components reside).

dhoucgitter commented 7 months ago

Hi @miaeyg, as you described, the behavior that I have observed is that RDS deploys the PostgreSQL database instance to the AZ corresponding to the first subnet in the list, whether that is the public subnet list or the database subnet list. I have not experienced or forced a DB failover; however, my expectation is that if a local AZ event caused the running PostgreSQL db to fail, RDS would start up a new instance in the AZ corresponding to the second subnet in the list, as you describe.

miaeyg commented 7 months ago

Hi @dhoucgitter, thanks. I think this info should be reflected in the project's docs: to keep the RDS instance in the same AZ as the SAS components for performance reasons, arrange the list so that the first database subnet is in the same AZ as the private subnet.

I am still waiting on the last question, about how I can make the RDS instance non-public.

Thanks for all your kind help!

miaeyg commented 7 months ago

Hi @dhoucgitter, in addition to my previous comment from yesterday, there is one more question that came up: when setting up the "default" and "CDS" PostgreSQL servers as external PostgreSQL, meaning as AWS RDS database instances, is it possible to set up only one database instance for both and have the two databases (default + CDS) reside within this single database instance?

I believe that using the supplied examples in this project will result in two database instances, each with one database - correct me if I am wrong.

dhoucgitter commented 7 months ago

Hi @miaeyg,

I believe that using the supplied examples in this project will result in two database instances, each with one database - correct me if I am wrong.

You are right, the intent is to set up two separate database instances, each with its own database. There may be additional considerations, but one reason for the separate instance is to provide a dedicated and thus higher-performance path for the applications that rely on the CDS instance, without the latencies those apps might experience when using the default DB instance that most other Viya applications may already be using.
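
For illustration, a postgres_servers block that provisions the two separate RDS instances (one for the default platform databases, one for CDS) might look like the sketch below. The second key name, shown here as "cds-postgres", and all of the values are assumptions; adjust them to match the server names and sizing your deployment actually expects.

postgres_servers = {
  default = {
    "server_version"          = "15"
    "instance_type"           = "db.r6idn.xlarge"
    "storage_size"            = 128                  # GB
    "ssl_enforcement_enabled" = true
    "administrator_login"     = "pgadmin"
    "administrator_password"  = "examplePassword1!"  # placeholder
  }
  "cds-postgres" = {                                 # hypothetical key name for the CDS instance
    "server_version"          = "15"
    "instance_type"           = "db.r6idn.xlarge"
    "storage_size"            = 128                  # GB
    "ssl_enforcement_enabled" = true
    "administrator_login"     = "pgadmin"
    "administrator_password"  = "examplePassword2!"  # placeholder
  }
}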

dhoucgitter commented 6 months ago

Hi @miaeyg, circling back around on the unanswered question.

I have another related question: the generated RDS instance is marked as "Publicly accessible=yes". What if I do not want to make it public? How can I make it private rather than public using this Terraform project? Is a non-public instance supported?

Yes, IAC supports creation of a private, or publicly_accessible=false, RDS instance. If your .tfvars content specifies a CIDR range for either default_public_access_cidrs or postgres_public_access_cidrs, then publicly_accessible=true will get set. Setting both of those variables to null should result in a private RDS instance where publicly_accessible=false.
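
A minimal .tfvars sketch of this, based only on the variable names mentioned above (whether null or an empty list is the right "unset" value may depend on your IAC version):

# With no public CIDR ranges supplied, the RDS instance should be created as private
# (publicly_accessible = false). Keep any cluster/VM access CIDRs you need elsewhere.
default_public_access_cidrs  = null
postgres_public_access_cidrs = null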

miaeyg commented 6 months ago

Thanks @dhoucgitter for answering this leftover question and all my other previous RDS questions!

miaeyg commented 2 months ago

Hi @dhoucgitter,

I am reopening this issue - maybe you will have an idea. I just deployed SAS Viya on AWS using this project and noticed that the RDS instance was deployed in the AZ corresponding to the second (not the first, as the assumption was) subnet in the "database" list. I am using BYON scenario #2. The question is if and how to control which AZ the RDS instance deploys to.

Here are screenshots of what I noticed:

Note RDS instance was deployed to the il-central-1b AZ: image

The terraform.tfvars includes this: image

And you can see that the first subnet in the "database" list is in the il-central-1a AZ: image

I guess this is bad from a performance perspective. Any idea what happened and how to control this behavior of the RDS deployment?

Here is your reply about this some time ago:

Hi @miaeyg, as you described, the behavior that I have observed is that RDS deploys the PostgreSQL database instance to the AZ corresponding to the first subnet in the list, whether that is the public subnet list or the database subnet list. I have not experienced or forced a DB failover; however, my expectation is that if a local AZ event caused the running PostgreSQL db to fail, RDS would start up a new instance in the AZ corresponding to the second subnet in the list, as you describe.

dhoucgitter commented 2 months ago

Hi @miaeyg, what happens if you remove the second subnet and only supply the first one? Is RDS able to successfully place the Postgres instance there?

This is only to verify that RDS can create the instance there and that your subnet ID is correct, along with the other bits; it is not a permanent solution.

miaeyg commented 2 months ago

Hi @dhoucgitter, your suggestion to supply only one database subnet fails to create an RDS instance, with this error:

DBSubnetGroupDoesNotCoverEnoughAZs: The DB subnet group doesn't meet Availability Zone (AZ) coverage requirement. 
Current AZ coverage: il-central-1a. Add subnets to cover at least 2 AZs.

It seems that supplying two subnets in two different AZs is an AWS requirement, but the RDS instance is not necessarily placed in the first database subnet provided to Terraform, as I showed in my previous comment. This can result in the SAS private subnet being placed in AZ1 and the RDS instance in AZ2, which I understand is bad performance-wise. How can we force AWS to place the RDS instance in the same AZ as the SAS private subnet?

miaeyg commented 1 month ago

Hi @dhoucgitter, did you see my previous update from Jun 20th? Do you have any suggestion for this?