I added socket.tag.activetunnels to the CHANNEL_OPEN_FAILURE log message. In almost all cases, activetunnels was 8 when a failure occurred with reasonCode 4. There were a few instances of 7, but that can be explained by the fact that activetunnels isn't incremented until after a channel is set up, so an eighth channel was likely being set up at the time of the failure.
Given this, it seems that this particular AMT BIOS has a limit of 8 simultaneously open channels.
I suggest limiting the number of simultaneous channels to fewer than 8. This can easily be done with an "await semaphore" in CIRAHandler.ExecRequest() as follows:
```sh
npm install await-semaphore
```
CIRAHandler.ts:
```ts
import { Semaphore } from 'await-semaphore'
// ...
export class CIRAHandler {
  xml: string
  httpHandler: HttpHandler
  username: string
  password: string
  channel: CIRAChannel
  channelState: number = 0
  connectAttempts: number = 0
  socket: CIRASocket
  semaphore: Semaphore
  constructor (httpHandler: HttpHandler, username: string, password: string) {
    this.username = username
    this.password = password
    this.httpHandler = httpHandler
    // I chose 4 simultaneous channels, half the assumed maximum
    this.semaphore = new Semaphore(4)
  }
  // ...
  async ExecRequest (xml: string): Promise<any> {
    // If it weren't for the recursive call, we could probably
    // wrap the entire function contents with this.semaphore.use()
    const release = await this.semaphore.acquire()
    if (this.channelState === 0) {
      this.channelState = await this.Connect()
    }
    if (this.channelState === 2) {
      try {
        if (this.httpHandler.isAuthInProgress == null) {
          this.httpHandler.isAuthInProgress = new Promise((resolve, reject) => {
            this.httpHandler.authResolve = resolve
          })
        } else {
          await this.httpHandler.isAuthInProgress
        }
        const data = await this.channel.write(xml)
        const parsedData = this.handleResult(data)
        release()
        return parsedData
      } catch (error) {
        if (error?.message === 'Unauthorized' || error?.message === 'Closed') {
          this.channelState = this.channel.CloseChannel()
          release()
          return await this.ExecRequest(xml)
        } else {
          release()
          throw error
        }
      }
    }
    release()
    return null
  }
}
```
Limited testing so far indicates that this works.
The reasonCode 2 failures are likely due to a limit on the number of wsman connections the BIOS can handle simultaneously; if only one, then perhaps a limit on the number of pending connections allowed. Either way, reducing the number of simultaneous channels also reduces the number of connections to wsman, and this error hasn't been seen with the above change.
Well, that is a rock I wish I hadn't turned over...
There are still occasional reasonCode 2 channel open failures that will need to be retried.
In addition, there are some enumeration failures claiming (see below): "The action could not be completed due to concurrency or locking problems.", so it looks like the number of simultaneous enumerations needs to be limited.
Furthermore, the handling of Enumerate() failures in DeviceAction.ts is inconsistent. Some cases check the response for null, then throw an exception when accessing enumResponse.Envelope.Body.EnumerateResponse.EnumerationContext, which doesn't exist when the call fails; others check for the existence of the enumeration context with something like:
```ts
const enumContext: string = result?.Envelope?.Body?.EnumerateResponse?.EnumerationContext
if (enumContext == null) {
  logger.error(`getPowerState failed. Reason: ${messages.ENUMERATION_RESPONSE_NULL}`)
  return null
}
```
This prevents the exception, but loses the reason for the failure.
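One possible way to keep the null check without losing the cause (a sketch only; it assumes failures carry the WSMAN Fault shape shown in the example failure response below):

```ts
const enumContext: string = result?.Envelope?.Body?.EnumerateResponse?.EnumerationContext
if (enumContext == null) {
  // Surface the fault reason, if the response includes one, before giving up.
  const faultReason: string = result?.Envelope?.Body?.Fault?.Reason?.Text
  logger.error(`getPowerState failed. Reason: ${faultReason ?? messages.ENUMERATION_RESPONSE_NULL}`)
  return null
}
```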
Example failure response:
```
2022-08-07T05:19:11.429Z info: getSoftwareIdentity enumeration result: {
  "Envelope": {
    "Header": {
      "To": "http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous",
      "RelatesTo": 0,
      "Action": "http://schemas.dmtf.org/wbem/wsman/1/wsman/fault",
      "MessageID": "uuid:00000000-8086-8086-8086-000000000400"
    },
    "Body": {
      "Fault": {
        "Code": {
          "Value": "a:Sender",
          "Subcode": {
            "Value": "e:Concurrency"
          }
        },
        "Reason": {
          "Text": "The action could not be completed due to concurrency or locking problems."
        },
        "Detail": ""
      }
    }
  },
  "statusCode": 400
}
```
It seems that CIRAHandler isn't a good place for the semaphore - it needs to live alongside activetunnels, i.e. in CIRASocket.tag.
There are also problems around when to release the semaphore, especially since the browser can 'cancel' the request at any point in time - which results in the CIRA channel being closed and, as far as I can tell, any outstanding request being orphaned.
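One way to make the release more robust (a sketch, not the branch code; the auth-in-progress handling is elided for brevity) is to replace the recursive retry with a loop so a single try/finally covers every exit path. Note this only guarantees the release when channel.write() settles; a write whose promise never settles - the orphaned case above - would still hold the permit, so closing the channel must also reject any pending writes:

```ts
// Loop instead of recursion so one try/finally guarantees the release on
// return, throw, and retry alike.
async ExecRequest (xml: string): Promise<any> {
  while (true) {
    const release = await this.semaphore.acquire()
    try {
      if (this.channelState === 0) {
        this.channelState = await this.Connect()
      }
      if (this.channelState !== 2) {
        return null
      }
      const data = await this.channel.write(xml)
      return this.handleResult(data)
    } catch (error) {
      if (error?.message === 'Unauthorized' || error?.message === 'Closed') {
        this.channelState = this.channel.CloseChannel()
        continue // retry on a fresh channel
      }
      throw error
    } finally {
      release() // runs before the return, throw, or continue takes effect
    }
  }
}
```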
See the channel_semaphore branch on https://github.com/orinem/mps for work in progress on this issue.
The semaphore basically works, and extra logging indicates that all the acquisitions are released, but the concurrency problem with enumerations remains.
The WS-Enumeration section here (see the table below) would indicate that 3 is a good maximum for the number of concurrent enumeration requests. If we don't differentiate between enumeration requests and other requests, then for now, 3 would be a good initial value for the semaphore.
Release | Number of Supported Concurrent Enumeration Flows | Total Number of Concurrent Enumeration Contexts
---|---|---
Pre-7.0 | 3 | 1
7.0 | 3 | 3
8.0 | 6 | 3
Note that AMT V16 on a Dell Precision 3571 returns channel numbers 0 through 7 only, indicating a maximum of 8 channels can be open at the same time.
It is indeed much better with pull request #726 - but, as mentioned above with my semaphore solution, the concurrency problem with enumerations remains, since each enumeration requires at least an ENUMERATE followed by a PULL*. Both that solution and my semaphore solution throttle the ENUMERATE and PULL individually, so it's possible for other enumeration requests to sneak in between the ENUMERATE and the PULL and fail.
So, in addition, the number of concurrent ENUMERATE/PULL pairs needs to be limited - a sketch follows.
*The case where more than one PULL is required isn't considered here.
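A minimal sketch of that idea (the enumeratePull helper and its xml-builder parameters are hypothetical, not the MPS API): hold one permit across the whole ENUMERATE/PULL pair, so other enumerations can't start in between and exhaust the contexts.

```ts
import { Semaphore } from 'await-semaphore'

// 3 concurrent enumerations, per the pre-8.0 figures in the table above.
const enumSemaphore = new Semaphore(3)

async function enumeratePull (
  handler: CIRAHandler,
  buildEnumerateXml: () => string,
  buildPullXml: (enumContext: string) => string
): Promise<any> {
  // use() acquires a permit, runs the callback, and releases on the way
  // out - so one permit covers both the ENUMERATE and its PULL.
  return await enumSemaphore.use(async () => {
    const enumResponse = await handler.ExecRequest(buildEnumerateXml())
    const enumContext: string = enumResponse?.Envelope?.Body?.EnumerateResponse?.EnumerationContext
    if (enumContext == null) return null
    return await handler.ExecRequest(buildPullXml(enumContext))
  })
}
```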
Tracking those pairs correctly will be more involved. Are you OK with seeing if the current solution solves the issue and, if we find cases where it isn't sufficient, re-opening it?
Sure. Once everything gets merged I'll test it again.
It is merged now in main.
I did just get an enumeration failure...
```
2022-11-22T22:48:30.891Z error: Exception during alarm occurrences request : TypeError: Cannot read properties of undefined (reading 'EnumerationContext')
```
It worked on a 'reload'.
I think we should close this issue and raise a separate issue for enumerations.
Closing per owner's request.
Describe the bug 🪲
AMT resources (enumeration contexts, for example) are limited. If too many requests are sent to a computer over CIRA in a given time frame, some will fail, typically with a CHANNEL_OPEN_FAILURE with either reason code 2 or 4. This results in an exception, and the API call fails.
Code 2 is OPEN_CONNECT_FAILED and code 4 is OPEN_RESOURCE_SHORTAGE.
It is particularly bad if the PULL for an enumeration fails, as the enumeration context will be lost until it times out. One fewer context makes the problem more likely.
This isn't the fault of the caller. MPS needs to do some kind of flow control and only allow a limited number of requests over CIRA to be active at once - especially with enumerations. Failures need to be retried - preferably after a delay.
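For the retry part, one possible shape (a sketch; execWithRetry and its parameters are hypothetical, not existing MPS code): retry transient failures a few times with a growing delay between attempts.

```ts
const delay = async (ms: number): Promise<void> =>
  await new Promise(resolve => setTimeout(resolve, ms))

// Retry a request with backoff so transient CHANNEL_OPEN_FAILURE errors
// don't immediately fail the API call.
async function execWithRetry (handler: CIRAHandler, xml: string, retries = 3, baseDelayMs = 250): Promise<any> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await handler.ExecRequest(xml)
    } catch (error) {
      if (attempt >= retries) throw error
      await delay(baseDelayMs * 2 ** attempt) // 250 ms, 500 ms, 1000 ms, ...
    }
  }
}
```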
To Reproduce 🪜
Steps to reproduce the behavior:
These exceptions do occur randomly, but the following is almost certain to generate these errors.
Note that error messages will flash on the page, and some data will not be filled in. Further attempts to refresh the page may generate more errors and different data will not be filled in.
The MPS logs will have errors like this:
and later, after restarting everything:
Expected behavior
The page should reload with no errors.
AMT Device (please complete the following information): 🖥️
Service Deployment (please complete the following information): ⛈️