roadrunner-server / roadrunner

🀯 High-performance PHP application server, process manager written in Go and powered with plugins
https://docs.roadrunner.dev
MIT License

[πŸ’‘ FEATURE REQUEST]: Use opcache shared memory by forking workers #1271

Open iluuu1994 opened 2 years ago

iluuu1994 commented 2 years ago

Plugin

No response

I have an idea!

Hi rr team! :wave::blush:

As far as I can tell, right now RoadRunner always spawns new php-cli processes per worker. When using opcache, this has the significant limitation that each process will have its own shared memory segment. Opcache caches scripts, classes and functions, and also de-duplicates constant strings and arrays in this shared memory segment, to be used by all processes. This segment is (usually) created with mmap() and MAP_SHARED|MAP_ANONYMOUS. For this mechanism to be used across workers, they would need to be forked, or threads (ZTS) would need to be used.

This approach could reduce memory usage by multiple factors, depending on the number of workers. Another benefit is that workers could warm each other's cache as they put compiled scripts into shm. A potential downside is that some locking occurs when shared memory is modified, and stability could be compromised if bugs in opcache corrupt shm, bringing down all workers (although php-fpm would be affected here too, which makes it much less likely).

A (seemingly) simple solution could be to have a master worker that accepts fork messages but doesn't handle any requests itself (to avoid accumulating memory leaks that would pollute newly forked workers). The new child process would become an actual worker, send its PID back to rr, and start listening for messages. If the shared memory segment fills up, the master could be replaced, followed by all workers. That shouldn't usually happen, though, as shm should be configured big enough not to cause any restarts. That's just a rough idea; I haven't done any testing to verify this will work.

I'm currently only allowed to work on php-src itself. Let me know if this is something you're interested in working on, if not I might try something in my free time at some point.

Thanks again for rr!

rustatian commented 2 years ago

Hey, @iluuu1994 πŸ‘‹πŸ» Thanks for the contribution and a very nice idea πŸ‘πŸ»

Yeah, we had some internal discussion about that. I'm not a PHP dev, so, waiting for our PHP team, which is currently busy with the SF v3.

That's just a rough idea; I haven't done any testing to verify this will work.

We had a similar idea, and I'm pretty sure that approach will work from RR's point of view.

iluuu1994 commented 2 years ago

Hi @rustatian! Thanks for your very speedy reply! Great to know this is on the roadmap :slightly_smiling_face:

Luckily, I don't think there's anything in php-src that needs to change. Running PHP with opcache.enable_cli=1 will automatically create the shared memory segment. The mapping of the segment is handled by the operating system when the process is forked. RR could then notify the master worker that a new worker needs to be spawned, and the master worker would respond by forking itself and sending back the child PID. This could happen somewhere in https://github.com/spiral/roadrunner-worker. The master worker's (non-shared, per-request) memory would stay low, as it does not handle any HTTP requests, so it could always be used to replace workers that have gone over the memory limit.

If there's more information you need about PHP internals, I'm happy to help if I can! My e-mail is ilija.tovilo[at]me.com. Or on Twitter.

(And it just occurred to me that by "I'm not a PHP dev" you probably meant that you don't work on the PHP part of RR, but I decided not to delete my comment in case it provides any additional information).

rustatian commented 2 years ago

Thank you πŸ‘πŸ»

rustatian commented 2 years ago

Hey guys πŸ‘‹πŸ» The RR part will be ready in v2.12.0. I'll put the spec for the protocol that allocates new forked workers in the docs and share the link here for further discussion πŸ˜ƒ

iluuu1994 commented 2 years ago

@rustatian That's fantastic to hear! Thank you for your continued dedication to RoadRunner :hearts: Were you able to observe improvements in memory consumption?

rustatian commented 2 years ago

My pleasure ❀️

Were you able to observe improvements in memory consumption?

Not at the moment. For v2.12.0, I'm planning to finish the POC. Since I have zero knowledge of PHP, I need to create an async worker in Rust (guess why πŸ˜„), create a simple protocol, and test the RR part. Then our PHP team will create a PHP master worker, and we will be able to see the results of our experiment πŸ˜ƒ

rustatian commented 1 year ago

Hey @iluuu1994 πŸ‘‹πŸ»

As far as I understand, PHP doesn't have a bundled fork syscall, only pcntl_fork, am I right?

iluuu1994 commented 1 year ago

@rustatian I'm afraid so, yes πŸ˜•

rustatian commented 1 year ago

Good πŸ˜ƒ It's not a problem since our target platforms for this feature are UNIX platforms (Linux, macOS, WSL2, etc). I checked Ubuntu/Fedora and Arch, and they all have this extension enabled and included by default.

rustatian commented 1 year ago

First tests (screenshot of process memory usage):

The first process is the master process (22M). The second and third are forks, connected to RR via sockets (8.6M).

rustatian commented 1 year ago

Hey guys πŸ‘‹πŸ», here are some updates from my side:

As we saw earlier, forking one worker is a promising technique, for memory consumption specifically. However, we have some limitations on the PHP side:

  1. We don't have threads in PHP :( So it's impossible to use the wait4 syscall (pcntl_wait) without blocking our master process.
  2. We can skip pcntl_wait, but then the parent process would accumulate zombies in its process table.
  3. We can use signals to clean up the process table, say, periodically. But it would require complex logic on the PHP side, and it would be hazardous to use in production.
  4. We can kill the parent process. Then all children will be reparented to PID 1 (init), so if we kill a child, it will not become a zombie. But in that case, we would have to kill our controller process on every child reallocation, the master process would become a brand-new PHP CLI process, and previous forks would not be the same as the current fork.

But the good news is that this POC showed that we can significantly reduce memory usage (thank you very much @iluuu1994 πŸ‘πŸ»). And we're already working on a secret project to support a similar scenario πŸ˜ƒ

iluuu1994 commented 1 year ago

@rustatian Could SIGCHLD help here (in combination with pcntl_wait(WNOHANG))?

rustatian commented 1 year ago

I tried that (point 3 in my message). But it would require more complex logic, and honestly, I don't want to overcomplicate the worker 😒 with that solution (you're right, though: it's possible to use the WNOHANG flag to return immediately and rely on a signal to notify the parent when a child process dies). We're preparing a more elegant solution, which would be cross-platform.

iluuu1994 commented 1 year ago

Of course, if you have a more elegant solution, that's even better! Thank you :slightly_smiling_face:

rustatian commented 1 year ago

Thank you for your involvement. I appreciate that πŸ‘πŸ». If you check the @wolfy-j Twitter, you may guess about that solution πŸ˜„

rustatian commented 1 year ago

Good old thread πŸ˜ƒ One of the problems I faced when implementing this feature is that it's impossible to wait for a non-RR-child process (a child of the master PHP process) from RR. But in Linux kernel 5.10 a new syscall was introduced, and here is a sample Go program that waits for such a process to exit. I'll leave it here for when I return to this feature:

package main

import (
  "errors"
  "log"
  "syscall"

  "golang.org/x/sys/unix"
)

const syscallPidfdOpen = 434 // pidfd_open; newer golang.org/x/sys/unix versions also expose unix.PidfdOpen

type pidFD int // file descriptor that refers to a process

func pidfdOpen(pid int, flags uint) (pidFD, error) {
  fd, _, errno := syscall.Syscall(syscallPidfdOpen, uintptr(pid), uintptr(flags), 0)
  if errno != 0 {
    return 0, errno
  }
  return pidFD(fd), nil
}

func (fd pidFD) waitForExit() error {
  fds := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
  _, err := unix.Poll(fds, -1)
  if err != nil {
    return err
  }
  // Poll reports results in Revents, not Events.
  if fds[0].Revents&unix.POLLIN != unix.POLLIN {
    return errors.New("unexpected poll event")
  }
  // Process exited
  return nil
}

func main() {
  pid := 5768 // Example pid

  pidfd, err := pidfdOpen(pid, 0)
  if err != nil {
    log.Fatalf("opening pid fd: %v\n", err)
  }
  defer syscall.Close(int(pidfd))

  err = pidfd.waitForExit() // blocks until the process exits
  if err != nil {
    log.Fatalf("polling pid %d: %v\n", pid, err)
  }
  // Process exited
}

rustatian commented 1 year ago

Just FYI folks, work on this ticket has been resumed.

rustatian commented 1 year ago

We need a new transport to connect the child of our parent process to RR. With the code snippet above, it is possible to wait for a process that is not our child in a blocking manner.

I also created a, let's say, experimental design for clearing zombie processes out of the kernel process table. Since we can't block the master worker with the pcntl_wait/waitid system calls, when a process finishes, RR will send a special request to the master worker to make a pcntl_waitid(dead_pid_here) syscall and clear the kernel process table (we don't need to do this when the workers are reallocated, because any child of them will be inherited by PID 1). And voilΓ , no blocking, because the process is already dead πŸ˜†.

The last problem we need to solve with @wolfy-j is to completely redirect the process's stdin/out/err pipes via the new transport, because it's still not possible to easily read the pipes of a process that is not our child. We will probably use Unix sockets for this. Keeping you informed, your humble servant @rustatian πŸ˜ƒ
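The Unix-socket transport idea can be sketched minimally like this. The socket path and the "pid=..." handshake line are invented for illustration (not RR's actual protocol), and a goroutine stands in for the forked grandchild worker that would dial in.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// runDemo wires up the sketch: the RR side listens on a Unix socket,
// the worker side (simulated by a goroutine; a real forked grandchild
// would dial the same path) connects and announces its PID, and the
// RR side reads the handshake line.
func runDemo() (string, error) {
	sock := filepath.Join(os.TempDir(), "rr-demo.sock")
	_ = os.Remove(sock) // clean up any stale socket file

	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()
	defer os.Remove(sock)

	go func() {
		conn, err := net.Dial("unix", sock)
		if err != nil {
			return
		}
		defer conn.Close()
		fmt.Fprintf(conn, "pid=%d\n", os.Getpid())
	}()

	conn, err := ln.Accept()
	if err != nil {
		return "", err
	}
	defer conn.Close()

	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	line, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Print("worker connected: ", line)
}
```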

MaxSem commented 1 year ago

@iluuu1994 what about making an option for Opcache to mmap to actual files, as opposed to always using MAP_ANONYMOUS?

iluuu1994 commented 1 year ago

@MaxSem There are other mechanisms to share memory on Linux between processes (e.g. memfd_create or shm_open). The issue is that the file needs to be mapped to the same address for every process since the data structures in shared memory reference each other through user-space pointers. However, we cannot guarantee the same address will be available for other processes because any of the addresses that would belong to the shared memory segment could already have been allocated for that process for some other purpose. I'm not a Linux expert, maybe there are some tricks that could be used.

rustatian commented 1 year ago

Hey hey guys πŸ‘‹πŸ» Just a few notes on this:

There is no problem opening and using a shared memory segment (with POSIX shm_open or the older System V shmget), ftruncate-ing it to the needed size, and then mmap-ing it.

The main problem is that we don't have access to that shared memory from PHP. I mean the shared memory address that php-src allocates during its startup routine: we can't set it via configuration or any mechanism from within PHP code. The funny thing is that on Windows, due to a platform limitation (no forks), the shared memory segment is the same for all PHP scripts (and can be configured via php.ini).

So if we could somehow point every script to the same shared memory key (using OpCache), then we wouldn't have to reinvent the wheel with forking PHP processes πŸ˜ƒ But unfortunately we can't (unless someone knows some secret php.ini configuration key or hidden PHP method...).

And that's why this ticket is about forking the PHP-CLI process with an already initialized shared memory key that would be shared between the children (we don't need to mmap anything ourselves).

And here we come to the second problem: transport. Each process can easily communicate with and wait (waitpid) for its own children. But we need to communicate with our child's child, and our child's child is not our child, so we can't apply the same rules to it (oh my). With Linux kernel 5.10 we can use a new system call, pidfd_open, so the first part of the problem is solved: we can wait for the child's child and then send a special request to the master worker to use pcntl_waitpid and remove the dead worker (zombie) from the kernel process table. Also, imagine our master worker dies πŸ˜…: all its children will be inherited by PID 1 (poor orphans), and we don't need to send waitpid requests for these PIDs, since PID 1 will handle that. So we need to track the child PIDs of the master worker.

The second part is: how do we communicate with that child (remember, it is not our child)? I decided to write a Unix socket transport for this. It's completely independent of the current transport we have, but solves the problem pretty well, with almost the same speed as pipes.

Where we are now: as I mentioned in previous messages, I'm a complete noob in PHP πŸ₯² So I'm waiting for our PHP team to help me write the PHP part. The Golang part is pretty much done.

iluuu1994 commented 1 year ago

@rustatian Hello :wave::blush: I think what @MaxSem suggested is to solve the issue completely on php-src's side, without the need to fork processes to attach to the shm segment. That could be difficult due to the reasons mentioned above. AFAIK, for the same reason, Opcache support on Windows is considered somewhat broken.

https://github.com/php/php-src/blob/e8fb0edc69598e7d9380f61a1ab551b5ec6c27ca/ext/opcache/shared_alloc_win32.c#LL214C1-L214C157

We're looping through a predefined list of addresses until we find one that is free. With opcache.mmap_base (I assume that's the php.ini config you're referencing above), the configured segment needs to be free, or we fatal-error (if I read the code correctly). ASLR can increase the chances of this happening. In the past, I've heard the unofficial suggestion to use opcache.file_cache with opcache.file_cache_only=1 on Windows instead. But again, I'm not an expert on this topic, so I might be missing some key information.

rustatian commented 1 year ago

Hey @iluuu1994 πŸ‘‹πŸ»πŸ˜Š, yeah, got u. This would be a preferred way to solve the problem on the PHP side. The comment is legendary:
zend refused but did not remember the exact reason, pls add info here if one of you know why :) πŸ˜„

Yes, I meant the opcache.mmap_base option on Windows. Unfortunately we don't have the same option to redefine these predefined addresses for Linux 😒

michael-rubel commented 1 year ago

  1. We don't have threads in PHP :(

@rustatian Have you tried krakjoe/parallel extension?

rustatian commented 1 year ago

Hey @michael-rubel πŸ‘‹πŸ» No, I haven't πŸ˜„ I haven't even tried PHP, since I'm not a PHP dev πŸ˜†

michael-rubel commented 1 year ago

@rustatian Ask a PHP dev from the team to dig into this extension. It seems to work better than the old pthreads extension. Maybe it's a new field for optimizations (spawn threads instead of processes under the hood?)

P.S. The extension philosophy was taken from Golang: https://www.php.net/manual/en/philosophy.parallel.php

rustatian commented 1 year ago

@michael-rubel The idea is not about threads vs. processes; we already have workers to fulfill that pattern. The idea was to have forks, which are not the same thing.

Kaspiman commented 3 weeks ago

Hi guys. Maybe we can revive the idea of process forks? It would probably help to greatly reduce the cost of process creation and reduce RAM usage. It looks relatively simple to implement; I found mentions of it in the PHP package.

rustatian commented 3 weeks ago

This is our old experiment with @wolfy-j. This problem will be solved by Rapira, not by RR. Just to keep everyone in this thread posted: I'm working on this. Not as fast as I initially thought, but this idea is not dead for sure. New Rapira workers use approximately 5-7 MB RSS on a cold start and ~10 MB RSS with OpCache after a few minutes of work.