ustaxcourt / ef-cms

An Electronic Filing / Case Management System.
https://dawson.ustaxcourt.gov/

Fix Dynamsoft Health Check for Prod #712

Closed mmarcotte closed 3 years ago

mmarcotte commented 3 years ago

Currently the Dynamsoft check is hardcoded for the Staging environment. We need to make this work for the Production environment that uses its own set of URLs. Additionally, these are frequently returning an unhealthy status. We should investigate why we receive so many 503 errors.
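One way to remove the hardcoded staging URLs is a per-stage lookup. A minimal sketch, assuming hypothetical names (`DYNAMSOFT_URLS`, `getDynamsoftUrl`, and the example hostnames are all illustrative, not the actual ef-cms implementation):

```javascript
// Hypothetical sketch: resolve the Dynamsoft URL from the deploy stage
// instead of hardcoding the staging hosts. Hostnames are placeholders.
const DYNAMSOFT_URLS = {
  prod: 'https://dynamsoft-lib.example.gov/dynamic-web-twain/',
  stg: 'https://dynamsoft-lib.stg.example.gov/dynamic-web-twain/',
};

const getDynamsoftUrl = stage => {
  const url = DYNAMSOFT_URLS[stage];
  if (!url) {
    // Fail loudly rather than silently health-checking the wrong environment.
    throw new Error(`No Dynamsoft URL configured for stage "${stage}"`);
  }
  return url;
};

module.exports = { getDynamsoftUrl };
```

With a map like this, the health check code stays identical across environments and only the configuration differs.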

Tasks:

adunkman commented 3 years ago

Some more information about the intermittent failures:

adunkman commented 3 years ago

I’ve discovered that the us-west-1 EC2 instances were stopped, and there is a latency routing rule in effect — meaning that some requests were routed to these stopped instances.
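To illustrate the failure mode: latency-based routing answers with the lowest-latency record for the caller, and a record whose associated health check is stale (or absent) can still win even when its backing instance is stopped. A toy simulation with made-up data:

```javascript
// Toy model of latency-based routing. All records and numbers are invented
// for illustration; this is not Route 53's actual implementation.
const records = [
  { region: 'us-east-1', latencyMs: 70, instanceRunning: true, healthCheckPassing: true },
  // Stopped instance, but its health check status hasn't flipped to failing.
  { region: 'us-west-1', latencyMs: 20, instanceRunning: false, healthCheckPassing: true },
];

// Latency routing: lowest-latency record among those considered healthy.
const pickRecord = recs =>
  recs
    .filter(r => r.healthCheckPassing)
    .sort((a, b) => a.latencyMs - b.latencyMs)[0];

const chosen = pickRecord(records);
// chosen is the us-west-1 record despite its instance being stopped,
// so requests routed there fail until the instance is restarted.
```

This matches the observed behavior: some fraction of traffic was steered to the stopped us-west-1 instances and returned errors.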

adunkman commented 3 years ago

Restarting these instances significantly reduced the error rate (upper line is total health checks, bottom line is health check failures due to the EC2 instance):

Screenshot of health checks and error rates

Changing scales to highlight the EC2 instance failures since the drop, we are still seeing intermittent failures:

Screenshot of error rates since Jan 9

Next up:

adunkman commented 3 years ago

Confirmed errors are happening equally across environments, which is expected given they are sharing the same EC2 configuration. (Note that dev has been partially down during the time period of this chart).

Screenshot of error rates across environments

adunkman commented 3 years ago

On the subject of request volume, it’s by design; my confusion stemmed from a misunderstanding of the health check interval, which is currently configured to 30 seconds. This interval is not the time between health checks of the endpoint — it’s the interval between checks from each health check worker, and there are "typically about 15".

From the Route 53 documentation:

The number of seconds between the time that Route 53 gets a response from your endpoint and the time that it sends the next health check request. Typically, about 15 health checkers check the health of a specified endpoint. If you choose an interval of 30 seconds, the endpoint will receive a health check request every two to three seconds. If you choose an interval of 10 seconds, the endpoint will receive a request more than once per second.
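A quick back-of-envelope check of the quoted behavior, taking the documentation's "typically about 15" workers as an assumption rather than a configured value:

```javascript
// Effective request rate seen by the endpoint when `workers` independent
// health checkers each send a request every `intervalSeconds`.
const effectiveRequestRate = (workers, intervalSeconds) => ({
  secondsBetweenRequests: intervalSeconds / workers,
  requestsPerHour: (workers * 3600) / intervalSeconds,
});

const thirty = effectiveRequestRate(15, 30);
// thirty.secondsBetweenRequests === 2  (matches "every two to three seconds")
// thirty.requestsPerHour === 1800     (close to the ~1,779 req/hour observed later)

const ten = effectiveRequestRate(15, 10);
// ten.secondsBetweenRequests < 1      (matches "more than once per second")
```

So even at the slower 30-second setting, 15 workers produce on the order of 1,800 requests per hour per environment.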

adunkman commented 3 years ago

Furthermore, we cannot configure it to perform fewer requests: the request interval can be either 10 or 30 seconds, and we’re already using 30, the slower interval.

I’m going to manually change the region configuration to see whether reducing the number of regions the health checks run in will reduce the number of workers performing checks (the documentation is unclear on this).

adunkman commented 3 years ago

I reconfigured the IRS environment’s health checker. The documentation states that health check configurations may take up to an hour to propagate. This Kibana graph (a saved line-chart visualization filtered to environment.stage: irs, counting "Request ended: GET /public-api/health" and "Dynamsoft health check failed" log messages over the last two days) will indicate if customizing AWS regions results in a reduction of overall health check requests.

adunkman commented 3 years ago

I’ve hit the end of my timebox on this issue — given that this is not a user-reported issue, I’m going to move forward with:

adunkman commented 3 years ago

Screenshot of reduction in request volume

Overriding the regions to which the health check is deployed reduced the request volume per environment from 1,779 req/hour/env to 671 req/hour/env (note that the graph uses a logarithmic y-axis). I’ll prepare a pull request to make this change.
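The reduction is roughly proportional to the region count, under two assumptions not stated above: that the health check previously ran from Route 53's default set of 8 checker regions, and that the override uses the 3-region minimum. A quick sanity check:

```javascript
// Assumptions (not confirmed in this thread): 8 default checker regions,
// overridden down to the 3-region minimum.
const defaultRegions = 8;
const overriddenRegions = 3;

const observedBefore = 1779; // req/hour/env, from the graph above
const observedAfter = 671;   // req/hour/env, from the graph above

// If request volume scales linearly with checker regions:
const predictedAfter = observedBefore * (overriddenRegions / defaultRegions);
// predictedAfter ≈ 667, close to the observed 671 req/hour/env.
```

The close agreement suggests the worker count (and hence request volume) does scale with the number of configured checker regions.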