raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Change default page table size to 4 #4375

Open PKizzle opened 3 years ago

PKizzle commented 3 years ago

Describe the bug Google's tcmalloc assumes four page table levels (which appears to be treated as the default) when allocating its heap, and fails because the kernel is compiled with CONFIG_PGTABLE_LEVELS=3 by default.

To reproduce As an example, you can try to run the Envoy proxy, which uses tcmalloc under the hood: docker run --rm envoyproxy/envoy:v1.17.0 --version This will fail with a memory allocation error.

Expected behaviour The heap allocation should not fail, and Envoy should print its version information. This can be achieved by setting CONFIG_PGTABLE_LEVELS=4.

Actual behaviour The heap allocation fails, so Envoy cannot start up and exits with an error message.

System

Raspberry Pi 4 Model B Rev 1.4
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"

Raspberry Pi reference 2020-08-20
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 7252c154838ec5b4576f29c996ac8fe3750cae12, stage2

Linux raspberrypi-4 5.10.17-v8+ #1421 SMP PREEMPT Thu May 27 14:01:37 BST 2021 aarch64 GNU/Linux
Revision    : d03114
Model       : Raspberry Pi 4 Model B Rev 1.4
Throttled flag  : throttled=0x0
Camera          : supported=0 detected=0

version 7d9a298cda813f747b51fe17e1e417e7bf5ca94d (clean) (release) (start)

Logs This is the error message you will receive from envoy / tcmalloc when the kernel is compiled with CONFIG_PGTABLE_LEVELS=3:

external/com_github_google_tcmalloc/tcmalloc/system-alloc.cc:550] MmapAligned() failed (size, alignment) 1073741824 1073741824 @ 0x55867b9470 0x55867aba14 0x55867ab454 0x5586794e94 0x55867a8684
external/com_github_google_tcmalloc/tcmalloc/arena.cc:34] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48 @ 0x55867b9780 0x5586794f10 0x55867a8684

Additional context This is a conversation from the OpenJDK project, which had to deal with the same situation: http://openjdk.5641.n7.nabble.com/ZGC-aarch64-Unable-to-allocate-heap-for-certain-Linux-kernel-configurations-td420728.html Basically, their solution is to try allocating memory multiple times until it succeeds, which does not seem ideal.
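The trial-and-error approach described above can be sketched roughly as follows. This is an illustrative Python sketch (not code from OpenJDK, tcmalloc, or any project mentioned here): it asks the kernel for a mapping near the top of each candidate address-space size, from largest to smallest, and reports the first hint the kernel honors. `MAP_FIXED_NOREPLACE` (Linux 4.17+) makes the kernel either use the hint or fail, rather than silently relocating the mapping.

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

PROT_NONE = 0
MAP_PRIVATE = 0x02
MAP_ANONYMOUS = 0x20
MAP_FIXED_NOREPLACE = 0x100000  # Linux >= 4.17: honor the hint or fail
PAGE = 4096

def probe_va_bits(max_bits=48, min_bits=36):
    """Largest n for which a mapping at address 1 << (n - 1) is granted,
    i.e. the userspace virtual address space spans at least n bits."""
    for bits in range(max_bits, min_bits - 1, -1):
        hint = 1 << (bits - 1)
        addr = libc.mmap(hint, PAGE, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                         -1, 0)
        if addr == hint:
            libc.munmap(addr, PAGE)
            return bits
    return None
```

On a CONFIG_ARM64_VA_BITS_39 kernel a probe like this should report 39, and 48 on a 48-bit kernel; the downside, as noted above, is that every tcmalloc-style consumer has to repeat this dance at startup.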

This is the excerpt from the kernel documentation:

AArch64 Linux uses either 3 levels or 4 levels of translation tables
with the 4KB page configuration, allowing 39-bit (512GB) or 48-bit
(256TB) virtual addresses, respectively, for both user and kernel. With
64KB pages, only 2 levels of translation tables, allowing 42-bit (4TB)
virtual address, are used but the memory layout is the same.
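The numbers in the excerpt follow from simple page-table arithmetic: each translation level resolves page_shift - 3 bits (a page holds page_size / 8 eight-byte entries), on top of page_shift bits of in-page offset. A small sketch reproducing the quoted figures:

```python
def pgtable_levels(page_shift, va_bits):
    """Translation levels needed for a given page size (2**page_shift bytes)
    and virtual address width: each level resolves (page_shift - 3) bits."""
    bits_per_level = page_shift - 3          # entries per table = page_size / 8
    return -(-(va_bits - page_shift) // bits_per_level)  # ceiling division

# 4KB pages (shift 12): 39-bit VA -> 3 levels (512GB), 48-bit VA -> 4 levels (256TB)
# 64KB pages (shift 16): 42-bit VA -> 2 levels (4TB)
```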

These are the open issues for envoy and tcmalloc

vescoc commented 3 years ago

I also have a similar problem running Redpanda, because Redpanda uses Seastar code and Seastar assumes 48-bit virtual addresses.

vescoc commented 3 years ago

IMHO the right request is to enable CONFIG_ARM64_VA_BITS_48. It is defined in neither bcm2711_defconfig nor bcmrpi3_defconfig, so the default CONFIG_ARM64_VA_BITS_39=y applies; in the generic arm64 defconfig it is enabled. The CONFIG_PGTABLE_LEVELS parameter is not configurable directly but derived. Ubuntu Server arm64 has 48-bit virtual addresses enabled.
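You can check which value your running kernel was built with, without recompiling anything, by reading its build config. A minimal sketch; it looks in /proc/config.gz (only present when the kernel was built with CONFIG_IKCONFIG_PROC) and in /boot/config-* files (a Debian convention that Raspberry Pi OS does not necessarily follow), so it may find nothing on some systems:

```python
import glob
import gzip
import os
import re

def configured_va_bits():
    """Return CONFIG_ARM64_VA_BITS from the running kernel's build config,
    or None if no config source is available (e.g. CONFIG_IKCONFIG is unset)."""
    sources = ["/proc/config.gz"] + sorted(glob.glob("/boot/config-*"))
    for path in sources:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            for line in f:
                m = re.match(r"CONFIG_ARM64_VA_BITS=(\d+)", line)
                if m:
                    return int(m.group(1))
    return None
```

On a stock Raspberry Pi OS arm64 kernel this would report 39 if a config source exists; on Ubuntu Server arm64, 48.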

vescoc commented 3 years ago

I recompiled the kernel with CONFIG_ARM64_VA_BITS_48=y. It works. I am currently testing it as a Kubernetes worker node (k8s 1.21.1, containerd 1.4.4) with 32-bit Raspberry Pi OS on an ARMv8 core; the node runs 32- and 64-bit containers. I will monitor the system for a few days, and if there are no problems I will install the new kernel on the other RPi 3/4 nodes. So far I have not noticed any performance issues or failing services.

mazzy89 commented 2 years ago

Any updates here? Is CONFIG_ARM64_VA_BITS_48=y going to be approved by the Raspberry Pi maintainers and land in their kernel anytime soon?

pelwell commented 2 years ago

As with all kernel configuration changes, the requester is going to have to justify the change to everyone's kernel beyond "it makes my application work". Increasing the granularity is likely to lead to more wasted RAM - up to 64KB per page table, but on average an additional 30KB per page table.

Make the case.

vescoc commented 2 years ago

Both Ubuntu and Debian have this configuration enabled for Raspberry Pi 3 and 4. This means that existing packages that assume this configuration cannot run on Raspberry Pi OS. Typically these packages belong to the server world, in particular the Kubernetes ecosystem (Envoy, Redpanda, etc.). If Raspberry Pi OS wants to keep its focus solely on desktop users, this configuration is not necessary. If, on the other hand, Raspberry Pi OS also wants a foothold in the server world, this configuration is mandatory; otherwise users will either have to recompile their kernel or migrate to another distribution. If it is not possible to enable this configuration for every Raspberry Pi model, would it be possible to have it exclusively for the Raspberry Pi 4, where the impact on memory is much smaller?

pelwell commented 2 years ago

If we were to switch to 64KB page tables then it would only be in the arm64 Pi 4 defconfig, but note that currently we ship a single 64-bit kernel, built from arm64 bcm2711_defconfig, so it would affect 64-bit Pi 3 users as well.

PKizzle commented 2 years ago

My intention is not to decrease everyone's available RAM, but rather to find a solution that makes dealing with different page table sizes easier. Since currently the only fast way to detect this value is trial and error, a better detection method is needed. Other distributions (and tools like tcmalloc) have chosen to dodge this challenge by simply going with the most common value. Since the Raspberry Pi 4 has enough RAM that the effect of using CONFIG_ARM64_VA_BITS_48 is negligible, it might make sense to provide a separate kernel build with this adjustment until an easier detection method is available for everyone.

vescoc commented 2 years ago

If we were to switch to 64KB page tables then it would only be in the arm64 Pi 4 defconfig, but note that currently we ship a single 64-bit kernel, built from arm64 bcm2711_defconfig, so it would affect 64-bit Pi 3 users as well.

You are absolutely right; currently only one 64-bit kernel is distributed. I was wondering if it would be possible to give users a choice of which kernel to use, for example by adding a raspberry-kernel-64 package with the configuration enabled and only kernel8. Obviously I don't know the Foundation's directives, but I imagine that a user who wants an arm64 kernel expects a more demanding workload, so it might make sense for even the plain kernel8 in raspberrypi-kernel to have this configuration enabled, with the obvious memory impact you rightly noted for the RPi 3 as well.

fmunteanu commented 1 year ago

@pelwell any update on the decision? I believe @vescoc provided a good argument.

pelwell commented 1 year ago

That's not an argument for supporting CONFIG_ARM64_VA_BITS_48, it's just a suggestion of how it could be done, i.e. another kernel build.

4kB pages are the default for ARM64, and CONFIG_ARM64_VA_BITS_39=y is the default for 4kB pages. It's not like we're doing something strange here, and there is no compelling reason to change since there are plenty of other kernels available.

lesiw commented 8 months ago

Ran into this issue trying to get Cloudflare's wrangler working on my Raspberry Pi.

I realize that the Raspberry Pi Foundation doesn't want to make arbitrary kernel changes for the sake of apps which, frankly, shouldn't require this setting for compatibility. But as an end user, I will say that this was a complete nightmare to debug. It took multiple days of trial and error and a lot of searching to arrive at this issue, and then several more hours to recompile the kernel to validate the fix.

Leaving this setting as-is is going to leave developers with the impression that a weird smattering of server-side software does not work on the Pi. When this discrepancy rears its head, it causes software to fail with seemingly unrelated error messages, if you're lucky enough to get user-visible errors at all. It makes more sense to match x86 Debian and accept a small amount of inefficiency in exchange for greater out-of-the-box compatibility.

mrdomino commented 6 months ago

Ran into this as well trying to get cosmopolitan libc working on my Pi 4. So one more vote for "would be nice to have this"; I'm now trying Ubuntu Server.

abasu0713 commented 1 month ago

Any updates on this? Envoy is broken even on latest versions in both Debian bookworm and Ubuntu Jammy.

pelwell commented 1 month ago

64KB pages interact badly with our CMA usage: the alignment requirements cause it to use a huge amount of memory, and then things break. If somebody has a configuration that works with both, then please let us know; otherwise this issue is going nowhere.

abasu0713 commented 1 month ago

Issue#1861 is many-fold. There's a blog post that gives a rough idea of how to build Envoy images on arm64/aarch64 architectures, but it doesn't work on any of the new boards. I have a whole set of Orange Pis and Raspberry Pis and have tried Ubuntu and Debian across different Linux kernel versions, some of which I compiled myself with module support. To be honest, compiling Envoy in Docker using Bazel is not a straightforward thing. My project is an IoT-based ML project, and I was really championing Envoy, and especially Envoy Gateway, for distributed hybrid deployments (since one of the big pitches for the product is heterogeneous deployments). Don't get me wrong: I love Envoy. I use it heavily on three different cloud providers and on bare-metal systems that are not ARM. With all the great work happening in the ARM Neoverse space, and now Linus Torvalds moving to Ampere ARM hardware for all his kernel testing, this is a very inspiring time for the Linux community around edge- and IoT-capable systems, and it would be great if we could figure out an upstream solution that tackles this at the kernel level. I am happy to set up nightly builds of the kernel or ISO images for the ARM architecture as well. Please let me know how I can be of any assistance.

tlindi commented 6 days ago

openDB's LMDB fails due to this issue too: it requests 4TB of virtual address space, but with 39 bits it can only allocate 512GB.