openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

Running postgresql on k8s with mayastor causes an error "running bootstrap script ... Bus error (core dumped)" #1124

Closed labdidi closed 2 years ago

labdidi commented 2 years ago

Describe the bug
When I attempt to run bitnami/postgresql or the official postgres image on k8s with openebs mayastor storage, I get the error "Bus error (core dumped)" and the deployment fails. Searching online, I found that the error is related to "huge_pages". I tried changing the postgresql image to apply the workaround "huge_pages = off", but it doesn't help, and I don't think changing the image is the right solution anyway: other images that need to set "huge_pages" will fail with the same "Bus error (core dumped)". I think this happens because of the mayastor pod prerequisites ("HugePage support / A minimum of 2GiB of 2MiB-sized pages"). Is there any chance to change the prerequisites of mayastor so that images which need to set "huge_pages" work with mayastor storage on k8s?
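For anyone reproducing: rather than rebuilding the image, the same setting can probably be injected through chart values. A sketch of that route, assuming the bitnami chart's `primary.extendedConfiguration` parameter (the exact key may differ between chart versions):

```shell
# Hypothetical alternative to rebuilding the image: append "huge_pages = off"
# to postgresql.conf via the bitnami chart's extended configuration value.
helm install my-release bitnami/postgresql \
  --set global.storageClass=mayastor-3 \
  --set primary.extendedConfiguration="huge_pages = off"
```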

To Reproduce

```shell
$ helm install my-release bitnami/postgresql \
    --set architecture=replication \
    --set volumePermissions.enabled=true \
    --set global.storageClass=mayastor-3 \
    --set auth.password="password" \
    --set auth.replicationPassword="password" > my-release.yaml
```

```
$ kubectl logs -f my-release-postgresql-primary-0
postgresql 21:05:35.14 Welcome to the Bitnami postgresql container
postgresql 21:05:35.14 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql
postgresql 21:05:35.14 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql/issues
postgresql 21:05:35.15
postgresql 21:05:35.15 INFO  ==> Starting PostgreSQL setup
postgresql 21:05:35.17 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql 21:05:35.18 INFO  ==> Loading custom pre-init scripts...
postgresql 21:05:35.18 INFO  ==> Initializing PostgreSQL database...
postgresql 21:05:35.19 DEBUG ==> Ensuring expected directories/files exist...
postgresql 21:05:35.21 INFO  ==> pg_hba.conf file not detected. Generating it...
postgresql 21:05:35.21 INFO  ==> Generating local authentication configuration
The files belonging to this database system will be owned by user "1001".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /bitnami/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/bitnami/postgresql/data"
```

OS info

gila commented 2 years ago

This is an actual bug that goes all the way up into the kernel, in terms of how cgroups work.

Before I try to explain it: rebuilding with huge_pages=off should work for sure. You can run initdb with the -d flag to collect extra debug output.
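For example, a hypothetical debug run from inside the failing pod (the binary and data paths assume the Bitnami image layout; adjust for the official image):

```shell
# Re-run initdb by hand with -d (debug) to see exactly where the bootstrap
# script receives the SIGBUS; paths assume the Bitnami postgresql image.
kubectl exec -it my-release-postgresql-primary-0 -- \
  /opt/bitnami/postgresql/bin/initdb -d -D /bitnami/postgresql/data
```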

As for the problem: it actually exists at various levels, but the kernel fix is here: https://lkml.org/lkml/2020/2/3/1153

You will need a recent kernel, roughly 5.8 or newer, for this fix to be included. However, there is still an open issue for k8s itself (or the container runtime, to be more exact) that would need to pick up these changes.
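A quick way to check which kernel each node is running (plain kubectl, nothing mayastor-specific):

```shell
# List every node's kernel version; the hugetlb cgroup fix landed around 5.8.
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```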

The problem, in short, has to do with the dynamic nature of using the pages after the container itself has been initialized. IOW, during startup the memory gets reserved, but when the process actually allocates it, even though it stays within bounds, it receives a SIGBUS, as per the HugeTLB docs from the kernel:

> The HugeTLB controller allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use.
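To see the reservation versus actual usage that this accounting is based on, something like the following works (the node name is a placeholder):

```shell
# On the node itself: total, free, and reserved huge pages.
grep Huge /proc/meminfo

# Or what kubernetes believes the node can allocate.
kubectl describe node <node-name> | grep -i hugepages
```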

Having mayastor not use huge pages would not solve this problem; the two are unrelated, though mayastor exacerbates it.

You could run Postgres on a non-mayastor node in the meantime. As for your last question: could mayastor do without huge pages? Yes, it could, but it would not be able to write to PCIe devices directly anymore. Arguably that's not all that bad; it would reduce performance a bit, but I don't think it would be noticeable for the majority of workloads.
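A sketch of that interim approach (the label key/value here are made up, and `primary.nodeSelector` is the bitnami chart value as I read its docs):

```shell
# Label the nodes that do NOT run mayastor (illustrative label).
kubectl label node worker-2 storage-tier=generic

# Pin the Postgres pods to those nodes through the chart's nodeSelector.
helm upgrade my-release bitnami/postgresql \
  --reuse-values \
  --set primary.nodeSelector.storage-tier=generic
```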

Lastly "simply" having a lot more huge pages would also work but ofcourse that's not really realistic for none server environments. (note that just purely for accounting, not actual usage)

vyas-n commented 2 years ago

Hi @gila,

Would you mind linking the open issue in the k8s container runtime?