jeremydaly / data-api-client

A "DocumentClient" for the Amazon Aurora Serverless Data API
MIT License
444 stars · 63 forks

Frequent BadRequestException: Communications link failure #11

Closed tommedema closed 4 years ago

tommedema commented 5 years ago

I've created a simple CLI tool that is supposed to create a new table using the Data API:

// tslint:disable: no-console

import dataApiClient from 'data-api-client'

const aSecretArn = process.env.AURORA_SECRET_ARN
const aClusterArn = process.env.AURORA_CLUSTER_ARN
const aDbName = process.env.AURORA_DATABASE_NAME

if (
  aSecretArn === undefined ||
  aClusterArn === undefined ||
  aDbName === undefined
) {
  throw new Error('one or more env vars are undefined')
}

const data = dataApiClient({
  database: aDbName,
  resourceArn: aClusterArn,
  secretArn: aSecretArn,
  options: {
    // aurora serverless data API is only available in us-east-1 for now
    // see https://read.acloud.guru/getting-started-with-the-amazon-aurora-serverless-data-api-6b84e466b109
    region: 'us-east-1'
  }
})

;(async () => {
  try {
    const result = await data.query(`
      create table reminders
      (
        id varchar(36) not null,
        PRIMARY KEY (id)
      )
    `)

    console.log('query result')
    console.dir(result)
  }
  catch (e) {
    console.log('query error')
    console.dir(e)
  }
})()

The issue is that when I run this, about 9 out of 10 times I get the following error:

query error
{ BadRequestException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
    at Object.extractError (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/protocol/json.js:51:27)
    at Request.extractError (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/protocol/rest_json.js:55:8)
    at Request.callListeners (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/request.js:683:14)
    at Request.transition (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/tommedema/projects/prem/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/Users/tommedema/projects/prem/node_modules/aws-sdk/lib/request.js:685:12)
  message: 'Communications link failure\n\nThe last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.',
  code: 'BadRequestException',
  time: 2019-08-05T02:46:16.940Z,
  requestId: '8cc3fcb3-6928-4b52-ab24-8ac9023fcd84',
  statusCode: 400,
  retryable: false,
  retryDelay: 25.77670642672414 }

Only sometimes (after many retries) do I get a valid response (in this case "table already exists").

Using data-api-client version 1.0.0-beta

My Aurora Serverless cluster was made in cloudformation:

    RDSAuroraServerlessCluster:
      Type: AWS::RDS::DBCluster
      Properties:
        MasterUsername: ${{env:AURORA_MASTER_USERNAME}}
        MasterUserPassword: ${{env:AURORA_MASTER_PASSWORD}}
        DatabaseName: ${{env:AURORA_DATABASE_NAME}}
        Engine: aurora
        EngineMode: serverless
        ScalingConfiguration:
          AutoPause: true
          MaxCapacity: 4
          MinCapacity: 1
          SecondsUntilAutoPause: 500

    RDSAuroraClusterMasterSecret:
      Type: AWS::SecretsManager::Secret
      Properties:
        Description: This contains the RDS Master user credentials for RDS Aurora Serverless Cluster
        SecretString:
          !Sub |
            {
              "username": "${{env:AURORA_MASTER_USERNAME}}",
              "password": "${{env:AURORA_MASTER_PASSWORD}}"
            }

And I enabled the data API manually:

aws rds modify-db-cluster --db-cluster-identifier ARN --enable-http-endpoint

Note that when it does give a valid response, it seems to keep working for a while. Later it stops working again and returns the BadRequestException for many subsequent tries. This makes me believe the issue is related to Aurora Serverless cold starts. How did you take care of this?

Note that increasing the connectTimeout option does not seem to help:

options: {
  maxRetries: 10,
  httpOptions: {
    connectTimeout: 30000
  }
}
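One workaround at the call site, independent of the SDK options, is to wrap the query in a small retry helper. A minimal sketch — the `withRetries` and `isWakingUp` names and the fixed 5-second delay are illustrative, not part of this lib:

```typescript
// Illustrative helper (not part of data-api-client): retry an async
// operation while Aurora Serverless resumes from its paused state.
const isWakingUp = (e: unknown): boolean =>
  e instanceof Error && e.message.includes('Communications link failure')

async function withRetries<T> (
  op: () => Promise<T>,
  attempts = 5,
  delayMs = 5000
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await op()
    } catch (e) {
      // give up on the last attempt, or on any unrelated error
      if (i >= attempts - 1 || !isWakingUp(e)) throw e
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}
```

Used as `await withRetries(() => data.query('select 1'))`; the delay should roughly match how long the cluster takes to resume.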
Schavras commented 5 years ago

I guess you have configured Aurora to shut down after 5 minutes. For me this happens when the cluster is paused: the initial requests fail until the service is fully up, then it works like a charm until 5 minutes of inactivity pause it again.

cbschuld commented 5 years ago

@Schavras is correct; AWS states this is "normal and expected" behavior while serverless Aurora is "waking up." Serverless Aurora is not the best name for it, IMO; maybe "Sometimes Cold Aurora" would be better. This is not an issue with this lib, but rather an experience of running Aurora in serverless mode.

Can be closed IMO.

jeremydaly commented 4 years ago

Sorry for the late reply. This is definitely normal behavior for the Data API and not specific to this lib. I suppose I could make a more graceful error.

tommedema commented 4 years ago

@jeremydaly isn't the purpose of this lib to make working with the Data API easier? A retry mechanism should therefore be in scope, IMO.

jeremydaly commented 4 years ago

It is, but a cold start to an Aurora Serverless cluster can take more than 30 seconds. How many times do you want it to retry?

tommedema commented 4 years ago

I'd use exponential backoff, starting with 30 seconds, a backoff rate of 2, and up to 5 retries: http://backoffcalculator.com/?interval=30&attempts=5&rate=2

| Retry | Seconds | Timestamp |
|-------|---------|---------------------|
| 1     | 30      | 2019-11-14 10:32:07 |
| 2     | 90      | 2019-11-14 10:33:07 |
| 3     | 210     | 2019-11-14 10:35:07 |
| 4     | 450     | 2019-11-14 10:39:07 |
| 5     | 930     | 2019-11-14 10:47:07 |

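The arithmetic behind that schedule, as a quick sketch (the Seconds column is cumulative elapsed time, not the per-retry wait):

```typescript
// Wait interval * rate^(n - 1) before retry n, so the cumulative
// elapsed seconds at each retry are:
const interval = 30
const rate = 2
const attempts = 5

const elapsed: number[] = []
let total = 0
for (let n = 1; n <= attempts; n++) {
  total += interval * rate ** (n - 1)
  elapsed.push(total)
}
// elapsed === [30, 90, 210, 450, 930]
```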
jeremydaly commented 4 years ago

I think this heavily depends on your use case. Are you using this with containers/VMs or from Lambda?

tommedema commented 4 years ago

From Lambda; you make a good point that a long timeout would not be helpful there. I switched away from Aurora Serverless because it was many times more expensive than DynamoDB in our case, so feel free to close this.

Thanks for thinking along.

coyoteecd commented 4 years ago

@tommedema @jeremydaly The 'Communications link failure' error has just been fixed in the latest AWS SDK release, 2.601.0. The related issue is aws-sdk-js#2914. To enable retries, you can use the maxRetries and retryDelayOptions parameters, which are passed through to the RDSDataService constructor (see the docs). Since dataApiClient passes its 'options' parameter through as-is, this works with the existing release as well.

Example:

const data = dataApiClient({
  database: aDbName,
  resourceArn: aClusterArn,
  secretArn: aSecretArn,
  options: {
    // aurora serverless data API is only available in us-east-1 for now
    // see https://read.acloud.guru/getting-started-with-the-amazon-aurora-serverless-data-api-6b84e466b109
    region: 'us-east-1',
    // Retry 10 times, waiting 5 seconds between retries
    maxRetries: 10,
    retryDelayOptions: { base: 5000 }
  }
})

Obviously this is still subject to the discussion above that it doesn't make sense when executed in a Lambda, but in our case we needed this to run database migrations locally against a development cluster that may or may not be paused, so the retry was useful.
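For finer control than a fixed base, the SDK's retryDelayOptions also accepts a customBackoff hook (0-based retry count in, delay in milliseconds out). A sketch of the exponential schedule discussed earlier in the thread — the 30-second base and 5 retries are illustrative:

```typescript
// Sketch: express the exponential schedule via customBackoff.
// retryCount is 0-based; the return value is a delay in milliseconds.
const customBackoff = (retryCount: number): number => 30000 * 2 ** retryCount

const sdkOptions = {
  region: 'us-east-1',
  maxRetries: 5,
  retryDelayOptions: { customBackoff }
}
```

Passed as the options parameter to dataApiClient, this waits 30 s, 60 s, 120 s, ... between attempts; whether that is sensible inside a Lambda is, as discussed above, doubtful.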

I think you can close this.