ustaxcourt / ef-cms

An Electronic Filing / Case Management System.
https://dawson.ustaxcourt.gov/

Fix Dynamsoft Health Check for Prod #712

Closed mmarcotte closed 3 years ago

mmarcotte commented 3 years ago

Currently the Dynamsoft check is hardcoded for the Staging environment. We need to make this work for the Production environment that uses its own set of URLs. Additionally, these are frequently returning an unhealthy status. We should investigate why we receive so many 503 errors.
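One way to remove the hardcoded staging URLs is a per-stage lookup. A minimal sketch, assuming hypothetical names (`DYNAMSOFT_URLS`, `getDynamsoftUrl`, and the example hostnames are all illustrative, not the actual ef-cms implementation):

```javascript
// Hypothetical sketch: resolve the Dynamsoft URL from the deploy stage
// instead of hardcoding the staging hosts. Hostnames are placeholders.
const DYNAMSOFT_URLS = {
  prod: 'https://dynamsoft-lib.example.gov/dynamic-web-twain/',
  stg: 'https://dynamsoft-lib.stg.example.gov/dynamic-web-twain/',
};

const getDynamsoftUrl = stage => {
  const url = DYNAMSOFT_URLS[stage];
  if (!url) {
    // Fail loudly rather than silently health-checking the wrong environment.
    throw new Error(`No Dynamsoft URL configured for stage "${stage}"`);
  }
  return url;
};

module.exports = { getDynamsoftUrl };
```

With a map like this, the health check code stays identical across environments and only the configuration differs.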

Tasks:

adunkman commented 3 years ago

Some more information about the intermittent failures:

adunkman commented 3 years ago

I’ve discovered that the us-west-1 EC2 instances were stopped, and there is a latency routing rule in effect — meaning that some requests were routed to these stopped instances.
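To illustrate the failure mode: latency-based routing answers with the lowest-latency record for the caller, and a record whose associated health check is stale (or absent) can still win even when its backing instance is stopped. A toy simulation with made-up data:

```javascript
// Toy model of latency-based routing. All records and numbers are invented
// for illustration; this is not Route 53's actual implementation.
const records = [
  { region: 'us-east-1', latencyMs: 70, instanceRunning: true, healthCheckPassing: true },
  // Stopped instance, but its health check status hasn't flipped to failing.
  { region: 'us-west-1', latencyMs: 20, instanceRunning: false, healthCheckPassing: true },
];

// Latency routing: lowest-latency record among those considered healthy.
const pickRecord = recs =>
  recs
    .filter(r => r.healthCheckPassing)
    .sort((a, b) => a.latencyMs - b.latencyMs)[0];

const chosen = pickRecord(records);
// chosen is the us-west-1 record despite its instance being stopped,
// so requests routed there fail until the instance is restarted.
```

This matches the observed behavior: some fraction of traffic was steered to the stopped us-west-1 instances and returned errors.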

adunkman commented 3 years ago

Restarting these instances significantly reduced the error rate (upper line is total health checks, bottom line is health check failures due to the EC2 instance):

Screenshot of health checks and error rates

Changing scales to highlight the EC2 instance failures since the drop, we are still seeing intermittent failures:

Screenshot of error rates since Jan 9

Next up:

adunkman commented 3 years ago

Confirmed errors are happening equally across environments, which is expected given they are sharing the same EC2 configuration. (Note that dev has been partially down during the time period of this chart).

Screenshot of error rates across environments

adunkman commented 3 years ago

On the subject of request volume, it’s by design; my confusion stemmed from a misunderstanding of the health check interval, which is currently configured to 30 seconds. This interval is not the time between health checks of the endpoint — it’s the interval between checks from each health check worker, and there are "typically about 15".

From the Route 53 documentation:

The number of seconds between the time that Route 53 gets a response from your endpoint and the time that it sends the next health check request. Typically, about 15 health checkers check the health of a specified endpoint. If you choose an interval of 30 seconds, the endpoint will receive a health check request every two to three seconds. If you choose an interval of 10 seconds, the endpoint will receive a request more than once per second.
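A quick back-of-envelope check of the quoted behavior, taking the documentation's "typically about 15" workers as an assumption rather than a configured value:

```javascript
// Effective request rate seen by the endpoint when `workers` independent
// health checkers each send a request every `intervalSeconds`.
const effectiveRequestRate = (workers, intervalSeconds) => ({
  secondsBetweenRequests: intervalSeconds / workers,
  requestsPerHour: (workers * 3600) / intervalSeconds,
});

const thirty = effectiveRequestRate(15, 30);
// thirty.secondsBetweenRequests === 2  (matches "every two to three seconds")
// thirty.requestsPerHour === 1800     (close to the ~1,779 req/hour observed later)

const ten = effectiveRequestRate(15, 10);
// ten.secondsBetweenRequests < 1      (matches "more than once per second")
```

So even at the slower 30-second setting, 15 workers produce on the order of 1,800 requests per hour per environment.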

adunkman commented 3 years ago

Furthermore, we cannot configure it to perform fewer requests: the request interval can be either 10 or 30 seconds, and we’re already using 30, the slower interval.

I’m going to manually change the region configuration to see whether reducing the number of regions the health checks run in will reduce the number of workers performing checks (the documentation is unclear on this).

adunkman commented 3 years ago

I reconfigured the IRS environment’s health checker. The documentation states that health check configurations may take up to an hour to propagate. This Kibana graph (a saved line-chart visualization filtered to environment.stage: irs, counting "Request ended: GET /public-api/health" and "Dynamsoft health check failed" log messages over the last two days) will indicate if customizing AWS regions results in a reduction of overall health check requests.

adunkman commented 3 years ago

I’ve hit the end of my timebox on this issue — given that this is not a user-reported issue, I’m going to move forward with:

adunkman commented 3 years ago

Screenshot of reduction in request volume

Overriding the regions to which the health check is deployed reduced the request volume per environment from 1,779 req/hour/env to 671 req/hour/env (note that the graph uses a logarithmic y-axis). I’ll prepare a pull request to make this change.
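The reduction is roughly proportional to the region count, under two assumptions not stated above: that the health check previously ran from Route 53's default set of 8 checker regions, and that the override uses the 3-region minimum. A quick sanity check:

```javascript
// Assumptions (not confirmed in this thread): 8 default checker regions,
// overridden down to the 3-region minimum.
const defaultRegions = 8;
const overriddenRegions = 3;

const observedBefore = 1779; // req/hour/env, from the graph above
const observedAfter = 671;   // req/hour/env, from the graph above

// If request volume scales linearly with checker regions:
const predictedAfter = observedBefore * (overriddenRegions / defaultRegions);
// predictedAfter ≈ 667, close to the observed 671 req/hour/env.
```

The close agreement suggests the worker count (and hence request volume) does scale with the number of configured checker regions.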