vmware / esx-boot

The ESXi bootloader

At startup, the bootloader intermittently enters an endless loop in efi_get_memory_map #11

Open Chancel-xFusion opened 1 year ago

Chancel-xFusion commented 1 year ago

Describe the bug

There are three calls to GetMemoryMap in efi_get_memory_map. The function that services them in EDKII is CoreGetMemoryMap. If the BIOS returns the following sequence of values, an endless loop occurs.

  1. BufLen=0. GetMemoryMap returns BufLen=10704, Status=EFI_BUFFER_TOO_SMALL;
  2. BufLen=10704*2 (21408). GetMemoryMap returns BufLen=5136, Status=EFI_SUCCESS;
  3. BufLen=5136*2 (10272). This time CoreGetMemoryMap calculates BufferSize=10704, finds that the BufLen passed in (10272) is smaller than BufferSize (10704), and returns BufLen=10704, Status=EFI_BUFFER_TOO_SMALL.

At this point, in the efi_get_memory_map function, the loop is re-entered because Status == EFI_BUFFER_TOO_SMALL.
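
The sequence above can be modeled in a few lines of C. This is only a sketch built from the numbers in this report, not the actual esx-boot or EDKII code: `mock_get_memory_map`, `naive_retry_passes`, and the `MOCK_*` names are hypothetical, and the mock assumes the firmware checks the buffer against the unmerged size (10704) but reports the merged size (5136) on success.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the report: 10704 bytes before merging, 5136 after. */
#define UNMERGED_SIZE 10704u
#define MERGED_SIZE    5136u

typedef enum { MOCK_SUCCESS, MOCK_BUFFER_TOO_SMALL } mock_status;

/* Hypothetical stand-in for GetMemoryMap (assumption, not firmware code):
 * the size check uses the unmerged size, success reports the merged size. */
static mock_status mock_get_memory_map(size_t *buf_len)
{
    if (*buf_len < UNMERGED_SIZE) {
        *buf_len = UNMERGED_SIZE;
        return MOCK_BUFFER_TOO_SMALL;
    }
    *buf_len = MERGED_SIZE;
    return MOCK_SUCCESS;
}

/* Simplified model of the three-call pattern described above.
 * Returns the pass on which the third call succeeded, or -1 if
 * max_passes ran out first (the real loop has no such cap). */
static int naive_retry_passes(int max_passes)
{
    assert(max_passes > 0);
    for (int pass = 1; pass <= max_passes; pass++) {
        size_t len = 0;
        mock_status st;

        (void)mock_get_memory_map(&len);   /* call 1: len becomes 10704 */
        len *= 2;                          /* 21408 */
        st = mock_get_memory_map(&len);    /* call 2: success, len = 5136 */
        if (st != MOCK_SUCCESS)
            continue;
        len *= 2;                          /* 10272, less than 10704 */
        st = mock_get_memory_map(&len);    /* call 3: too small again */
        if (st == MOCK_SUCCESS)
            return pass;
    }
    return -1;
}
```

Under these assumptions `naive_retry_passes` never succeeds, whatever cap is passed in, because doubling the merged size (10272) always lands below the unmerged size (10704).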

In contrast, Linux calls GetMemoryMap only twice.

This problem may occur because EDKII calls MergeMemoryMap in CoreGetMemoryMap. What are some good solutions to this problem?

Reproduction steps

Upgrade to the latest BIOS firmware on the newly released xFusion 2288H V7, then power-cycle the server 2 or 3 times or so.

Expected behavior

ESXi can start properly.

Additional context

No response

TimothyPMann commented 1 year ago

The code in efi_get_memory_map got complicated when one of our engineers added a workaround for a system that doesn't return a value in DescriptorSize when returning EFI_BUFFER_TOO_SMALL. Previously the loop was much simpler; see the older code at https://github.com/vmware/esx-boot/blob/52bdb5059a46c6c35af5fd8c042ae91db0fa6699/uefi/efiutils/memory.c

I can't say I fully understand why an infinite loop occurs with your system. It looks like the problem basically is that the allocation and freeing that our code is doing, while trying to get a buffer that is sufficiently larger than the memory map for our purposes, can sometimes create a pattern where the memory map bounces back and forth between two sizes that differ by more than a factor of 2. I am not totally sure whether even the older, simpler code would necessarily be free from the danger of this happening.

I suspect there is a way to rewrite our code that would both make it simpler and make it robust against this issue. Not sure when I would personally find time to work on it, though. If you have the ability to file SRs or DCPN cases, that's a better way of reporting this issue than filing a bug on GitHub against the open source release of esx-boot. Then it can go through our regular process for product code and get someone assigned to work on it.

TimothyPMann commented 1 year ago

Here are a few ideas I have that may help improve our code:

(1) Remember the largest MemoryMapSize that has been returned so far and never request less than that, regardless of whether a retry returned a smaller size.

(2) To better work around implementations that don't set DescriptorSize: (a) Assume this value doesn't change dynamically (it shouldn't, since it depends only on DescriptorVersion), so if GetMemoryMap has ever succeeded, assume DescriptorSize continues to be the size that was returned then. (b) If GetMemoryMap fails the first time, use a reasonable guess for DescriptorSize in the subsequent computation of how much bigger a buffer to allocate -- namely, sizeof EFI_MEMORY_DESCRIPTOR version 1. This will usually (in practice always, because there has only been one version so far) result in allocating a big enough buffer on the next try.
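
A rough sketch of how ideas (1) and (2) might combine, written against a mock rather than real firmware. Everything here is hypothetical and for illustration only: `mock_get_memory_map`, `robust_retry_calls`, `mock_memory_descriptor_v1` (a local mirror of the version 1 descriptor layout, used just for `sizeof`), and the slack of `EXTRA_DESCRIPTORS` entries, which is an arbitrary choice. The mock uses the sizes from this report and assumes DescriptorSize is not filled in on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNMERGED_SIZE     10704u  /* size demanded on failure (from the report) */
#define MERGED_SIZE        5136u  /* size reported on success (from the report) */
#define EXTRA_DESCRIPTORS     8u  /* arbitrary slack, an assumption */

typedef enum { MOCK_SUCCESS, MOCK_BUFFER_TOO_SMALL } mock_status;

/* Local mirror of the version 1 memory descriptor, used only for sizeof. */
typedef struct {
    uint32_t type;
    uint64_t physical_start;
    uint64_t virtual_start;
    uint64_t number_of_pages;
    uint64_t attribute;
} mock_memory_descriptor_v1;

/* Hypothetical firmware stand-in: leaves desc_size untouched on failure. */
static mock_status mock_get_memory_map(size_t *buf_len, size_t *desc_size)
{
    if (*buf_len < UNMERGED_SIZE) {
        *buf_len = UNMERGED_SIZE;
        return MOCK_BUFFER_TOO_SMALL;
    }
    *buf_len = MERGED_SIZE;
    *desc_size = sizeof(mock_memory_descriptor_v1);
    return MOCK_SUCCESS;
}

/* Retry loop that never requests less than the largest size seen so far.
 * Returns the number of calls made, or -1 if max_calls ran out.
 * On success, *final_len is the buffer size that was accepted. */
static int robust_retry_calls(int max_calls, size_t *final_len)
{
    size_t high_water = 0;  /* idea (1): largest MemoryMapSize returned */
    size_t desc_size = 0;   /* idea (2a): kept once a call has set it */
    size_t request = 0;

    assert(max_calls > 0);
    for (int call = 1; call <= max_calls; call++) {
        size_t len = request;
        mock_status st = mock_get_memory_map(&len, &desc_size);

        if (len > high_water)
            high_water = len;
        if (desc_size == 0)  /* idea (2b): guess the version 1 size */
            desc_size = sizeof(mock_memory_descriptor_v1);
        if (st == MOCK_SUCCESS) {
            *final_len = request;
            return call;
        }
        /* Ask for the largest size seen plus some slack. */
        request = high_water + EXTRA_DESCRIPTORS * desc_size;
    }
    return -1;
}
```

Against this mock the loop finishes on the second call, because the request is derived from the high-water mark (10704) and is never shrunk by the smaller size reported on success. Real firmware could still grow the map between calls, which is why the sketch keeps the retry loop and the slack.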

Chancel-xFusion commented 1 year ago

Thank you for your reply. I will open an SR case to request support.

TimothyPMann commented 1 year ago

Let me know what the SR number is, in case it gets stuck with support and they don't file a bugzilla ticket with engineering.

Chancel-xFusion commented 1 year ago

Sorry. I only have permission to submit certification requests, so I submitted one. The reply was "As per our engineering team, you need to file this issue as an interop issue, not in the server_CR project."

TimothyPMann commented 1 year ago

Were you able to file it somewhere, then? I don't know what they mean by "file this issue as an interop issue". Would that be in DCPN? Or did you take that to mean you can't file it anywhere? Also, I am wondering who said that and who in engineering they are referring to. If you have name(s), I can talk to the person/people from my side.