open-amt-cloud-toolkit / mps

The Management Presence Server (MPS) is a cloud-agnostic microservice that enables platforms featuring Intel® AMT to be managed over the internet.
https://open-amt-cloud-toolkit.github.io/docs/
Apache License 2.0

Exceptions in mps due to limited AMT resources #664

Closed orinem closed 1 year ago

orinem commented 2 years ago

Describe the bug 🪲

AMT resources (enumeration contexts for example) are limited. If too many requests are sent to a computer over CIRA in a given time frame, some will fail, typically with a CHANNEL_OPEN_FAILURE with either code 2 or 4. This results in an exception and the API call fails.

Code 2 is OPEN_CONNECT_FAILED and code 4 is OPEN_RESOURCE_SHORTAGE.

It is particularly bad if the Pull for an enumeration fails as the enumeration context will be lost until it times out. One fewer context makes the problem more likely.

This isn't the fault of the caller. MPS needs to do some kind of flow control and only let so many requests over CIRA be active at once - especially with enumerations. Failures need to be retried - preferably after a delay.
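A retry-with-delay wrapper along these lines could sit in front of the failing requests. This is only a sketch: the reason-code constants mirror the CHANNEL_OPEN_FAILURE codes described above, but the `reasonCode` error field, function names, and defaults are illustrative, not actual MPS code.

```typescript
// Sketch: generic retry-with-delay wrapper for requests that can fail
// transiently when AMT runs out of channel resources.
const OPEN_CONNECT_FAILED = 2
const OPEN_RESOURCE_SHORTAGE = 4

const delay = async (ms: number): Promise<void> =>
  await new Promise((resolve) => setTimeout(resolve, ms))

async function withRetry<T> (
  request: () => Promise<T>,
  isRetryable: (err: any) => boolean,
  attempts = 3,
  delayMs = 250
): Promise<T> {
  for (let i = 1; ; i++) {
    try {
      return await request()
    } catch (err) {
      // Give up after the last attempt, or on non-retryable failures
      if (i >= attempts || !isRetryable(err)) throw err
      await delay(delayMs)
    }
  }
}

// Example predicate: retry only on the resource-related reason codes.
// The err.reasonCode field is hypothetical - real code would have to
// surface the CHANNEL_OPEN_FAILURE reason on the thrown error somehow.
const isChannelShortage = (err: any): boolean =>
  err?.reasonCode === OPEN_CONNECT_FAILED ||
  err?.reasonCode === OPEN_RESOURCE_SHORTAGE
```

Combined with flow control, this would turn most transient channel-open failures into slightly slower successes instead of failed API calls.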

To Reproduce 🪜 Steps to reproduce the behavior:

These exceptions do occur randomly, but the following is almost certain to generate these errors.

  1. Set MPS logging to "silly" and restart MPS. Wait for devices to reconnect with CIRA.
  2. Use the sample UI to go to the device details page for a device connected using CIRA.
  3. Reload the page several times quickly in succession.

Note that error messages will flash on the page, and some data will not be filled in. Further attempts to refresh the page may generate more errors and different data will not be filled in.

The MPS logs will have errors like this:

2022-08-03T06:47:47.739Z silly: MPS: CHANNEL_OPEN_FAILURE, 794, 4
2022-08-03T06:47:47.740Z info: getComputerSystemPackage getResponse :
2022-08-03T06:47:47.740Z silly: getComputerSystemPackage Complete
2022-08-03T06:47:47.740Z error: Exception during AMT Hardware Information request : TypeError: Cannot read properties of null (reading 'Envelope')
2022-08-03T06:47:47.741Z debug: Event message published
2022-08-03T06:47:47.749Z debug: end of request, closing channel
2022-08-03T06:47:47.750Z debug: end of request, closing channel
2022-08-03T06:47:47.750Z silly: MPS: CHANNEL_OPEN_FAILURE, 795, 4
2022-08-03T06:47:47.751Z silly: getRedirectionService Complete
2022-08-03T06:47:47.752Z error: Exception during AMT Features request: TypeError: Cannot read properties of null (reading 'Envelope')

and later, after restarting everything:

2022-08-03T20:57:04.637Z silly: MPS: CHANNEL_OPEN_FAILURE, 41, 2
2022-08-03T20:57:04.638Z info: getChassis getChassis :
2022-08-03T20:57:04.638Z silly: getChassis Complete
2022-08-03T20:57:04.638Z error: Exception during AMT Hardware Information request : TypeError: Cannot read properties of null (reading 'Envelope')

Expected behavior

The page should reload with no errors.

AMT Device (please complete the following information): 🖥️

Service Deployment (please complete the following information): ⛈️

orinem commented 2 years ago

I added socket.tag.activetunnels to the CHANNEL_OPEN_FAILURE log message. In almost all cases, activetunnels was 8 when a failure occurred with reasonCode 4. There were a few at 7, but that can be explained by the fact that activetunnels isn't incremented until after a channel is set up, so an eighth channel was likely being set up at the time of the failure.

Given this, it seems that this particular AMT BIOS has a limit of 8 simultaneously open channels.

I suggest only allowing some number of simultaneous channels less than 8. It can easily be done with a semaphore from the `await-semaphore` package in CIRAHandler.ExecRequest() as follows:

npm install await-semaphore

CIRAHandler.ts:

import { Semaphore } from 'await-semaphore'

...

export class CIRAHandler {
  xml: string
  httpHandler: HttpHandler
  username: string
  password: string
  channel: CIRAChannel
  channelState: number = 0
  connectAttempts: number = 0
  socket: CIRASocket
  semaphore: Semaphore
  constructor (httpHandler: HttpHandler, username: string, password: string) {
    this.username = username
    this.password = password
    this.httpHandler = httpHandler
    // I chose 4 simultaneous channels, half the assumed maximum
    this.semaphore = new Semaphore(4)
  }

...

  async ExecRequest (xml: string): Promise<any> {
    // If it weren't for the recursive call, we could probably
    // wrap the entire function contents with this.semaphore.use()
    const release = await this.semaphore.acquire()
    if (this.channelState === 0) {
      this.channelState = await this.Connect()
    }
    if (this.channelState === 2) {
      try {
        if (this.httpHandler.isAuthInProgress == null) {
          this.httpHandler.isAuthInProgress = new Promise((resolve, reject) => {
            this.httpHandler.authResolve = resolve
          })
        } else {
          await this.httpHandler.isAuthInProgress
        }
        const data = await this.channel.write(xml)
        const parsedData = this.handleResult(data)
        release()
        return parsedData
      } catch (error) {
        if (error?.message === 'Unauthorized' || error?.message === 'Closed') {
          this.channelState = this.channel.CloseChannel()
          release()
          return await this.ExecRequest(xml)
        } else {
          release()
          throw error
        }
      }
    }

    release()
    return null
  }

Limited testing so far indicates that this works.

The reasonCode 2 failures are likely due to a limit on the number of wsman connections the BIOS can handle simultaneously; if it only allows one, then perhaps a limit on the number of pending connections. Either way, reducing the number of simultaneous channels also reduces the number of connections to wsman, and this error hasn't been seen with the above change.

orinem commented 2 years ago

Well, that is a rock I wish I hadn't turned over...

There are still occasional reasonCode 2 channel open failures that will need to be retried.

In addition, there are some enumeration failures claiming (see below): "The action could not be completed due to concurrency or locking problems.", so it looks like the number of simultaneous enumerations needs to be limited.

Furthermore, the handling of "Enumerate()" failures in DeviceAction.ts is inconsistent. Some cases check the response for null but then throw an exception when accessing enumResponse.Envelope.Body.EnumerateResponse.EnumerationContext, which doesn't exist when the call fails; others check for the existence of the enumeration context with something like:

    const enumContext: string = result?.Envelope?.Body?.EnumerateResponse?.EnumerationContext
    if (enumContext == null) {
      logger.error(`getPowerState failed. Reason: ${messages.ENUMERATION_RESPONSE_NULL}`)
      return null
    }

This prevents the exception, but loses the reason for the failure.
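A check that keeps the null guard but surfaces the WS-Management fault text when present could resolve the inconsistency. This is a sketch, not the actual DeviceAction.ts code: the field names follow the fault response example below, and `console.error` stands in for the project's logger.

```typescript
// Sketch: null-safe enumeration-context check that preserves the
// failure reason from a WS-Management Fault response, when one exists.
interface EnumResult {
  Envelope?: {
    Body?: {
      EnumerateResponse?: { EnumerationContext?: string }
      Fault?: { Reason?: { Text?: string } }
    }
  }
}

function getEnumContext (result: EnumResult | null, caller: string): string | null {
  const enumContext = result?.Envelope?.Body?.EnumerateResponse?.EnumerationContext
  if (enumContext == null) {
    // Log the fault text instead of a generic "response was null"
    const reason = result?.Envelope?.Body?.Fault?.Reason?.Text ??
      'enumeration response was null'
    console.error(`${caller} failed. Reason: ${reason}`)
    return null
  }
  return enumContext
}
```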

Example failure response:

2022-08-07T05:19:11.429Z info: getSoftwareIdentity enumeration result: {
        "Envelope": {
                "Header": {
                        "To": "http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous",
                        "RelatesTo": 0,
                        "Action": "http://schemas.dmtf.org/wbem/wsman/1/wsman/fault",
                        "MessageID": "uuid:00000000-8086-8086-8086-000000000400"
                },
                "Body": {
                        "Fault": {
                                "Code": {
                                        "Value": "a:Sender",
                                        "Subcode": {
                                                "Value": "e:Concurrency"
                                        }
                                },
                                "Reason": {
                                        "Text": "The action could not be completed due to concurrency or locking problems."
                                },
                                "Detail": ""
                        }
                }
        },
        "statusCode": 400
}
orinem commented 2 years ago

Seems that CIRAHandler isn't a good place for the semaphore - it needs to be alongside activetunnels, i.e. in CIRASocket.tag.

There are problems as to when to release the semaphore, especially considering that the browser can 'cancel' the request at any point in time - which results in the CIRA Channel being closed and, as far as I can tell, any outstanding request being orphaned.

orinem commented 2 years ago

See the channel_semaphore branch on https://github.com/orinem/mps for work in progress on this issue.

The semaphore basically works, and extra logging indicates that all the acquisitions are released, but the concurrency problem with enumerations remains.

orinem commented 2 years ago

The WS-Enumeration section here:

https://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide/default.htm?turl=WordDocuments%2Fdsp0226webservicesformanagementwsmanagementspecification.htm

would indicate that 3 would be a good number for the maximum number of concurrent enumeration requests. If we don't differentiate between enumeration requests and other requests, then for now, 3 would be a good initial value for the semaphore.

| Release | Number of Supported Concurrent Enumeration Flows | Total Number of Concurrent Enumeration Contexts |
|---------|--------------------------------------------------|-------------------------------------------------|
| Pre-7.0 | 3                                                | 1                                               |
| 7.0     | 3                                                | 3                                               |
| 8.0     | 6                                                | 3                                               |
orinem commented 2 years ago

Note that AMT V16 on a Dell Precision 3571 returns channel numbers 0 through 7 only, indicating a maximum of 8 channels can be open at the same time.

orinem commented 1 year ago

It is indeed much better with pull request #726 - but as mentioned above with my semaphore solution, the concurrency problem with enumerations remains since each enumeration requires at least an ENUMERATE then a PULL*. The solution (and the semaphore solution) throttles the ENUMERATE and PULL individually, so it's possible for other enumeration requests to sneak in between the ENUMERATE and the PULL and fail.

So in addition, the number of concurrent ENUMERATE/PULL pairs needs to be limited.

*the case where more than one PULL is required isn't considered here.
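Limiting the pairs could look like the following sketch: hold one enumeration slot across the whole ENUMERATE/PULL pair so other enumerations can't sneak in between the two requests. The inline semaphore is a stand-in for the `await-semaphore` package used earlier, and the wrapper names are illustrative, not MPS code.

```typescript
// Minimal counting semaphore (stand-in for the await-semaphore package)
class SimpleSemaphore {
  private active = 0
  private readonly waiting: Array<() => void> = []
  constructor (private readonly max: number) {}

  async acquire (): Promise<() => void> {
    if (this.active >= this.max) {
      // Queue until a release hands us the slot
      await new Promise<void>((resolve) => this.waiting.push(resolve))
    }
    this.active++
    return () => {
      this.active--
      this.waiting.shift()?.() // wake exactly one waiter, if any
    }
  }
}

// 3 concurrent enumeration flows, per the AMT limits in the table above
const enumSemaphore = new SimpleSemaphore(3)

// Hypothetical wrapper: the slot is held for the entire pair, so at most
// 3 enumerations are ever between their ENUMERATE and their PULL.
async function enumerateAndPull<T> (
  enumerate: () => Promise<string | null>, // returns the enumeration context
  pull: (context: string) => Promise<T>
): Promise<T | null> {
  const release = await enumSemaphore.acquire()
  try {
    const context = await enumerate()
    if (context == null) return null
    return await pull(context)
  } finally {
    release()
  }
}
```

A per-request semaphore for channels could still sit underneath this one; the two limits address different resources.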

matt-primrose commented 1 year ago

Tracking those pairs correctly will be more involved. Are you OK with seeing whether the current solution resolves the issue and, if we find cases where it isn't sufficient, re-opening it?

orinem commented 1 year ago

Sure. Once everything gets merged I'll test it again.

matt-primrose commented 1 year ago

it is merged now in main.

orinem commented 1 year ago

I did just get an enumeration failure...

2022-11-22T22:48:30.891Z error: Exception during alarm occurrences request : TypeError: Cannot read properties of undefined (reading 'EnumerationContext')

It worked on a 'reload'.

I think we should close this issue and raise a separate issue for enumerations.

matt-primrose commented 1 year ago

Closing per owner's request.