Some more information about the intermittent failures:
The error rate is very consistent:
It happens seemingly randomly across the four assets checked (dynamsoft.webtwain.initiate.js, /dynamsoft.webtwain.config.js, dynamsoft.webtwain.install.js, and dynamsoft.webtwain.css):
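For reference, a minimal sketch of what checking those four assets could look like, assuming they are served from a single base URL (the base URL below is a placeholder, not the real one):

```python
import requests

# Placeholder base URL; the real Dynamsoft resources path differs per environment.
DYNAMSOFT_BASE = "https://scanning.example.com/dwt-resources"

ASSETS = [
    "dynamsoft.webtwain.initiate.js",
    "dynamsoft.webtwain.config.js",
    "dynamsoft.webtwain.install.js",
    "dynamsoft.webtwain.css",
]

def check_dynamsoft_assets() -> dict:
    """Return a per-asset healthy/unhealthy map; any non-200 (or network error) marks the asset unhealthy."""
    results = {}
    for asset in ASSETS:
        try:
            resp = requests.get(f"{DYNAMSOFT_BASE}/{asset}", timeout=5)
            results[asset] = resp.status_code == 200
        except requests.RequestException:
            results[asset] = False
    return results
```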
I’ve discovered that the us-west-1 EC2 instances were stopped, and there is a latency routing rule in effect, meaning that some requests were routed to these stopped instances.
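For the record, a hedged boto3 sketch of how to confirm that state: stopped instances in us-west-1, plus the latency-based record sets that could still be routing traffic toward them (the hosted zone ID is a placeholder):

```python
import boto3

# List stopped instances in us-west-1.
ec2 = boto3.client("ec2", region_name="us-west-1")
stopped = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
)
for reservation in stopped["Reservations"]:
    for instance in reservation["Instances"]:
        print("stopped:", instance["InstanceId"])

# List latency-based records in the zone; these carry a Region and SetIdentifier.
route53 = boto3.client("route53")
records = route53.list_resource_record_sets(HostedZoneId="Z0000000000000")  # placeholder zone ID
for record in records["ResourceRecordSets"]:
    if record.get("Region"):
        print(record["Name"], record["Region"], record.get("SetIdentifier"))
```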
Restarting these instances significantly reduced the error rate (upper line is total health checks, bottom line is health check failures due to the EC2 instances):
Changing the scale to highlight the EC2 instance failures since the drop, we are still seeing intermittent failures:
Next up:
Confirmed that errors are happening equally across environments, which is expected given they share the same EC2 configuration. (Note that dev has been partially down during the time period of this chart.)
On the subject of request volume: the volume is by design, and the confusion was caused by my misunderstanding of the health check interval, which is currently configured to 30 seconds. This interval is not the interval between health checks arriving at the endpoint; it is the interval between checks from each individual health check worker, and there are "typically about 15" workers. From the Route 53 documentation:
The number of seconds between the time that Route 53 gets a response from your endpoint and the time that it sends the next health check request. Typically, about 15 health checkers check the health of a specified endpoint. If you choose an interval of 30 seconds, the endpoint will receive a health check request every two to three seconds. If you choose an interval of 10 seconds, the endpoint will receive a request more than once per second.
Furthermore, we cannot configure it to perform fewer requests. The request interval can be either 10 or 30 seconds, and we’re already using 30, the slower interval.
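A quick sanity check of that arithmetic (the 15-worker figure is the documentation's "typically about" number, not something I've confirmed):

```python
workers = 15            # "typically about 15" health checkers, per the Route 53 docs
interval_seconds = 30   # our configured request interval (the slower of the two options)

seconds_between_requests = interval_seconds / workers    # ~2 s between requests at the endpoint
requests_per_hour = workers * 3600 / interval_seconds    # 1800/hour, close to the volume we observe
print(seconds_between_requests, requests_per_hour)
```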
I’m going to manually change the region configuration to see whether reducing the regions the health checks run from will reduce the number of workers performing checks (the documentation is unclear on this point).
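If the console route doesn't pan out, this is roughly what the override would look like via boto3. The health check ID and the specific regions are placeholders; I believe three is the minimum Route 53 accepts for the Regions override, but that needs confirming:

```python
import boto3

route53 = boto3.client("route53")

# Restrict the health checkers to an explicit set of regions instead of the default set.
# Placeholder health check ID and region choices.
route53.update_health_check(
    HealthCheckId="00000000-0000-0000-0000-000000000000",
    Regions=["us-east-1", "us-west-2", "eu-west-1"],
)
```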
I reconfigured the IRS environment’s health checker. The documentation states that health check configurations may take up to an hour to propagate. A Kibana graph (filtered to environment.stage: irs, counting "Request ended: GET /public-api/health" and "Dynamsoft health check failed" messages per hour) will indicate whether customizing AWS regions results in a reduction of overall health check requests.
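For reference, roughly the query behind that graph, expressed directly against Elasticsearch (the host, index pattern, and a reasonably recent Elasticsearch version are assumptions):

```python
import requests

# Hourly counts of health-check request logs vs. Dynamsoft failures for the irs stage.
query = {
    "size": 0,
    "query": {"bool": {"filter": [{"match_phrase": {"environment.stage": "irs"}}]}},
    "aggs": {
        "series": {
            "filters": {
                "filters": {
                    "health_requests": {"match_phrase": {"message": "Request ended: GET /public-api/health"}},
                    "dynamsoft_failures": {"match_phrase": {"message": "Dynamsoft health check failed"}},
                }
            },
            "aggs": {
                "per_hour": {"date_histogram": {"field": "timestamp", "fixed_interval": "1h"}}
            },
        }
    },
}

resp = requests.post("https://elasticsearch.example.com/logs-*/_search", json=query, timeout=10)
buckets = resp.json()["aggregations"]["series"]["buckets"]
print(buckets["health_requests"]["per_hour"]["buckets"])
print(buckets["dynamsoft_failures"]["per_hour"]["buckets"])
```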
I’ve hit the end of my timebox here. Given that this is not a user-reported issue, I’m going to move forward with:
Overriding the regions to which the health check is deployed reduces the request volume per environment from 1,779 req/hour/env to 671 req/hour/env (the graph has a logarithmic y-axis). I’ll prepare a pull request to make this change.
Currently the Dynamsoft check is hardcoded for the Staging environment. We need to make this work for the Production environment, which uses its own set of URLs. Additionally, these checks are frequently returning an unhealthy status; we should investigate why we receive so many 503 errors.
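A minimal sketch of how the check could be parameterized per environment instead of hardcoding the Staging URLs; the stage names, environment variable, and URLs are assumptions, not the real values:

```python
import os

# Hypothetical per-environment base URLs; the real Production URLs still need to be filled in.
DYNAMSOFT_BASE_URLS = {
    "staging": "https://scanning-staging.example.com/dwt-resources",
    "production": "https://scanning.example.com/dwt-resources",
}

def dynamsoft_base_url() -> str:
    # Pick the URL set for the current stage rather than always using Staging.
    stage = os.environ.get("ENVIRONMENT_STAGE", "staging")
    return DYNAMSOFT_BASE_URLS[stage]
```

The asset-fetch loop sketched earlier could then take this base URL as a parameter; logging the response body and headers on failure would also help with the 503 investigation.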
Tasks: