nginx / unit

NGINX Unit - universal web app server - a lightweight and versatile open source server that simplifies the application stack by natively executing application code across eight different programming language runtimes.
https://unit.nginx.org
Apache License 2.0
5.4k stars 327 forks source link

The last versions of Nginx Unit fails in the TechEmpower benchmark #794

Open joanhey opened 1 year ago

joanhey commented 1 year ago

Hi, I added Nginx and Nginx Unit to the TechEmpower benchmark. But in the last versions it's failing with PHP: https://tfb-status.techempower.com/unzip/results.2022-11-25-12-16-16-174.zip/results/20221119184855/php-unit/run/php-unit.log

Old config without problems, and faster than Nginx: https://github.com/TechEmpower/FrameworkBenchmarks/blob/R20/frameworks/PHP/php/php-unit.dockerfile https://github.com/TechEmpower/FrameworkBenchmarks/blob/R20/frameworks/PHP/php/deploy/nginx-unit.json

Actual config with problems: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/PHP/php/php-unit.dockerfile https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/PHP/php/deploy/nginx-unit.json

The last versions fail a lot.

Please, could anybody review and help with this problem. Thank you.

PD: Welcome help with the Nginx config too, only with the plaintext test. https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/C/nginx/nginx.conf

Be careful, as the bench changed a lot after the spectre/meldown patches. https://github.com/TechEmpower/FrameworkBenchmarks/issues/7321

If you need I'll send you the benchs results. And ask me for any doubt.

ac000 commented 1 year ago

Looking at the log it seems

php-unit: /usr/local/bin/docker-entrypoint.sh: Applying configuration /docker-entrypoint.d/nginx-unit.json
php-unit: 2022/11/23 16:50:03 [info] 21#21 "benchmark" prototype started

Unit was started with some initial config.

php-unit: 2022/11/23 16:50:03 [info] 105#105 "benchmark" application started
php-unit: /usr/local/bin/docker-entrypoint.sh: OK: HTTP response status code is '200'
php-unit: {
php-unit:   "success": "Reconfiguration done."
php-unit: }
php-unit: /usr/local/bin/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/...
php-unit: /usr/local/bin/docker-entrypoint.sh: Stopping Unit daemon after initial configuration...
php-unit: /usr/local/bin/docker-entrypoint.sh: Waiting for control socket to be removed...
php-unit: 2022/11/23 16:50:03 [alert] 21#21 sendmsg(20, -1, -1, 1) failed (104: Connection reset by peer)

After it was started it was reconfigured, then shutdown. All that happened within the space of a second.

From your point of view, what was the actual sequence of events?

joanhey commented 1 year ago

The strange thing is that in local work. Show all that verbose output, but work.

Finish here and start to verify the tests: image

When finish the tests and close the docker: image

Before show the summary: image

But not in github actions or the bench server.

Github actions log failing: https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/3567839471/jobs/5996024544#step:9:4375

joanhey commented 1 year ago

It's failing from the last 2 versions of Nginx Unit. And sometimes work !!

Here are the logs from an old run, using unit_1.21.0 and php7.4 working without problems. https://tfb-status.techempower.com/unzip/results.2021-01-13-13-38-00-107.zip/results/20201229183947/php-unit

And an actual, using FROM nginx/unit:1.28.0-php8.1 https://tfb-status.techempower.com/unzip/results.2022-11-25-12-16-16-174.zip/results/20221119184855/php-unit

A recent run that half worked, using FROM nginx/unit:1.27.0-php8.1 https://tfb-status.techempower.com/unzip/results.2022-11-13-10-29-24-097.zip/php-unit A picture from the fortunes results, from this run. image But json, db and query tests failed the verification and the log show a lot of errors: https://tfb-status.techempower.com/unzip/results.2022-11-13-10-29-24-097.zip/results/20221107183025/php-unit/run/php-unit.log I updated for that to v1.28, but now never work.

joanhey commented 1 year ago

When I send the PR for Nginx Unit 1.28 was passing the github actions tests. But never in the bench server, and later again failed in github actions.

https://github.com/TechEmpower/FrameworkBenchmarks/pull/7691

Github actions log working: https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/3447671893/jobs/5753927577#step:9:2154

tippexs commented 1 year ago

As discussed on the community slack https://nginxcommunity.slack.com/archives/C02SS9UB85C/p1670008983875989?thread_ts=1669810946.221959&cid=C02SS9UB85C

This issue is related to #728 and friends. A manual patch of the docker-entrypoint.sh file seems to fix the issue.

Linking the external issue as well.

https://github.com/TechEmpower/FrameworkBenchmarks/pull/7757#pullrequestreview-1203015505

As said, we will ship a fix for the entrypoint script with the next Unit release 1.29.

joanhey commented 1 year ago

If work in the next run. I'll close the issue.

tippexs commented 1 year ago

I will close it once the fix was merged if that's okay for you

joanhey commented 1 year ago

Failed in the last run:

image

image

When finish the run, I'll send the logs.

tippexs commented 1 year ago

Yes, please share the logs. The new images should have the new docker entrypoint script. In case this is another kind of issue we should Create a new one.

joanhey commented 1 year ago

It's using the patch in v1.28, and looks like the same issue. https://tfb-status.techempower.com/unzip/results.2022-12-28-17-59-18-855.zip/results/20221222172047/php-unit/build/php-unit.log

https://tfb-status.techempower.com/unzip/results.2022-12-28-17-59-18-855.zip/results/20221222172047/php-unit/run/php-unit.log

I'll update to v1.29

joanhey commented 1 year ago

Failed in the last run, using v1.29. https://github.com/TechEmpower/FrameworkBenchmarks/pull/7839

Sunit (scala) from @lolgab, also fail with v1.28 and the last commit with v1.29. https://github.com/TechEmpower/FrameworkBenchmarks/pull/7815 Sunit log with v1.29: https://tfb-status.techempower.com/unzip/results.2023-01-04-02-03-49-510.zip/snunit

The only framework using v1.28 and working is Fastapi (python): https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Python/fastapi Log: https://tfb-status.techempower.com/unzip/results.2023-01-04-02-03-49-510.zip/results/20221229020235/fastapi-nginx-unit

joanhey commented 1 year ago

It's the same issue. It is not necessary to open a new one.

joanhey commented 1 year ago

Update, Fastapi (python) fail with v1.29 in the tests.

https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/3745949871/jobs/6360874318#step:9:1796

tippexs commented 1 year ago

It's using the patch in v1.28, and looks like the same issue. https://tfb-status.techempower.com/unzip/results.2022-12-28-17-59-18-855.zip/results/20221222172047/php-unit/build/php-unit.log

https://tfb-status.techempower.com/unzip/results.2022-12-28-17-59-18-855.zip/results/20221222172047/php-unit/run/php-unit.log

I'll update to v1.29

This issue is different from the other once. It is not related to the control socket. Will have a look on it.

travisbell commented 1 year ago

Update, Fastapi (python) fail with v1.29 in the tests.

https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/3745949871/jobs/6360874318#step:9:1796

This log has these lines in it:

fastapi-nginx-unit-orjson: 2022/12/21 02:57:25 [alert] 0#25 [unit] Python failed to call 'asyncio.get_event_loop' Notice: fastapi-nginx-unit-orjson: 2022/12/21 02:57:25 [notice] 24#24 app process 25 exited with code 1

Which seems like it's the same issue as outlined in #815 perhaps?

tippexs commented 1 year ago

Yes indeed! This is exactly this issue. We debugged the Python code last night. 1.29 is working with Python 3.10.8. anything newer than this will currently fail. We are working on a fix and I will let you know once available. Thanks for the links.

micaelmalta commented 1 year ago

Yes indeed! This is exactly this issue. We debugged the Python code last night. 1.29 is working with Python 3.10.8. anything newer than this will currently fail. We are working on a fix and I will let you know once available. Thanks for the links.

Any progress on this issue?

ac000 commented 1 year ago

@micaelmalta (and anyone else) feel free to try the patch here https://github.com/nginx/unit/issues/815#issuecomment-1396132323

ac000 commented 1 year ago

I have pushed a branch that contains the above fix, if that facilitates easier testing.

joanhey commented 1 year ago

V1.29.1 still fail in the benchmark tests, almost with PHP. When finish the run, I'll send the logs.

joanhey commented 1 year ago

Logs: https://tfb-status.techempower.com/unzip/results.2023-03-10-06-06-32-467.zip/php-unit

https://tfb-status.techempower.com/unzip/results.2023-03-10-06-06-32-467.zip/fastapi-nginx-unit

Check build and run dirs

tippexs commented 1 year ago

@joanhey Thanks. We will have a look into this. More details on Monday.

lolgab commented 1 year ago

My Scala Native benchmark using libunit works using NGINX Unit 1.29.1.

joanhey commented 1 year ago

Work but with a lot of errors.

Run logs:

joanhey commented 1 year ago

With unit v1.31.0 also fail in the GH Actions tests in the bench. FROM unit:1.31.0-php8.2

https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/6079079817/job/16491119686#step:9:5995

As now also file in the test, it's easier to test it isolated to fix the problem.

ac000 commented 1 year ago

What is failing exactly?

ac000 commented 1 year ago

I mean the only errors I'm seeing for php-unit are a bunch of socket connection errors

Verifying test db for php-unit caused an exception: HTTPConnectionPool(host='tfb-server', port=8080): Max retries exceeded with url: /dbraw.php (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff2bb167670>: Failed to establish a new connection: [Errno 111] Connection refused'))
   FAIL for http://tfb-server:8080
     Server did not respond to request
     See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
--------------------------------------------------------------------------------
VERIFYING JSON
--------------------------------------------------------------------------------
Accessing URL http://tfb-server:8080/json.php: 
Verifying test json for php-unit caused an exception: HTTPConnectionPool(host='tfb-server', port=8080): Max retries exceeded with url: /json.php (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff2bb166bc0>: Failed to establish a new connection: [Errno 111] Connection refused'))
   FAIL for http://tfb-server:8080
     Server did not respond to request
     See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
--------------------------------------------------------------------------------
VERIFYING QUERY
--------------------------------------------------------------------------------
Accessing URL http://tfb-server:8080/dbquery.php?queries=2: 
Verifying test query for php-unit caused an exception: HTTPConnectionPool(host='tfb-server', port=8080): Max retries exceeded with url: /dbquery.php?queries=2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff2bb54be80>: Failed to establish a new connection: [Errno 111] Connection refused'))
   FAIL for http://tfb-server:8080
     Server did not respond to request
     See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
--------------------------------------------------------------------------------
VERIFYING UPDATE
--------------------------------------------------------------------------------
Accessing URL http://tfb-server:8080/updateraw.php?queries=2: 
Verifying test update for php-unit caused an exception: HTTPConnectionPool(host='tfb-server', port=8080): Max retries exceeded with url: /updateraw.php?queries=2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff2bb533b80>: Failed to establish a new connection: [Errno 111] Connection refused'))
   FAIL for http://tfb-server:8080
     Server did not respond to request
     See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
--------------------------------------------------------------------------------
VERIFYING FORTUNE
--------------------------------------------------------------------------------
Accessing URL http://tfb-server:8080/fortune.php: 
Verifying test fortune for php-unit caused an exception: HTTPConnectionPool(host='tfb-server', port=8080): Max retries exceeded with url: /fortune.php (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff2bb56e3e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
   FAIL for http://tfb-server:8080
     Server did not respond to request
     See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
--------------------------------------------------------------------------------
VERIFYING PLAINTEXT
--------------------------------------------------------------------------------
Accessing URL http://tfb-server:8080/plaintext.php: 
Verifying test plaintext for php-unit caused an exception: HTTPConnectionPool(host='tfb-server', port=8080): Max retries exceeded with url: /plaintext.php (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff2bb56ee00>: Failed to establish a new connection: [Errno 111] Connection refused'))
   FAIL for http://tfb-server:8080
     Server did not respond to request
     See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements

What is tfb-server? Is that what's running Unit? Could it be a file descriptor exhaustion issue?

joanhey commented 1 year ago

The last one is only a GH actions test: https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/6079079817/job/16491119686#step:9:5995

I think the problem is that Unit needs too much time to initialize all the forks(processes|workers).

joanhey commented 1 year ago

Before Unit was working without problems in the benchmark. Later it start to fail in the benchmark, but now it also fail in the GitHub action test.

ac000 commented 1 year ago

I think the problem is that Unit needs too much time to initialize all the forks(processes|workers).

Are you saying that Unit itself is failing somehow due to trying to start X amount of processes?

What is X in your case?

Do you have a simple reproducer config?

joanhey commented 1 year ago

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/PHP/php/deploy/nginx-unit.json

It's a benchmark that use 3 xeon servers, each with 24 cores.

We test the same PHP code with different runtimes and servers. The only failing is Unit, in the last versions of Unit.

You can download and run the tests locally. Before pass the GH action tests and fail only in the bench. But now fail in both.

ac000 commented 1 year ago

I'm just trying to understand exactly what it is that's failing...

Let me try it another way.

1) How many PHP application processes are you trying to start?

2) Do they all start correctly?

joanhey commented 1 year ago

I send it the link, but I pass it here:

{
    "listeners": {
        "*:8080": {
            "pass": "applications/benchmark"
        }
    },

    "applications": {
        "benchmark": {
            "type": "php",
            "processes": 84,
            "user": "www-data",
            "group": "www-data",
            "root": "/php/",
            "options": {
                "file": "/php/deploy/conf/php.ini"
            },
            "limits": {
                "requests": 10000000
            }
        }
    }
}
joanhey commented 1 year ago

And when try to start:

https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/6079079817/job/16491119686#step:9:5995

https://github.com/TechEmpower/FrameworkBenchmarks/actions/runs/6079079817/job/16491119686#step:9:6419

image

And it was working without problems, till Unit v1.29.

joanhey commented 1 year ago

image

Pre Unit v1.29

joanhey commented 1 year ago

You can clone the repo. And later try it locally, because now also fail with the tests.

Only run in the root repo: ./tfb --mode verify --test php-unit image

PD: only need docker

ac000 commented 1 year ago

OK, so the issue started with 1.29.0...

If it was after that then I would maybe point to b0e2d9d0 ("Isolation: Switch to fork(2) & unshare(2) on Linux."), but it started failing before that landed.

Looking at the changes between 1.28.0 and 1.29.0, there's nothing that really stands out here.

There were a couple of PHP module updates, 48b6a7b3 ("PHP: Fixed php_module_startup() call for PHP 8.2.") and a0327445 ("PHP: allowed to specify URLs without a trailing '/'.")

The only obvious process related changes was the introduction of per application cgroup support, but you're not even using that so it should basically have no impact.

The thing I find confusing about the logs is it always looks like unit is shutdown before the benchmark even runs.

Does it work if you reduce the number of processes?

As a quick test, I was able to start 84 PHP application processes in < 1 second on a 2 vCPU VM.

2023/09/14 17:49:17 [info] 882#882 "php" prototype started
2023/09/14 17:49:17 [info] 883#883 "php" application started
2023/09/14 17:49:17 [info] 884#884 "php" application started
2023/09/14 17:49:17 [info] 885#885 "php" application started
2023/09/14 17:49:17 [info] 886#886 "php" application started
...
2023/09/14 17:49:17 [info] 964#964 "php" application started
2023/09/14 17:49:17 [info] 965#965 "php" application started
2023/09/14 17:49:17 [info] 966#966 "php" application started
joanhey commented 1 year ago

I don't remember the exact Unit version than start to fail. But was just when I started this issue, so the version released before Nov 29, 2022. image

I'll try if fail with less processes.

PD: perhaps only fail with docker.

joanhey commented 5 months ago

Today I have some free time, and I try to pinpoint this issue.

The problem is that it needs too much time to start "Unit", and it isn't ready when start the verification test or any other CI command.

Actual unit docker build process

  1. Build the container
  2. Finish the build
  3. RunENTRYPOINT( start unit | /docker-entrypoint.d/{.sh, .js, .pem, .json} | kill unit )
  4. Run CMD(start unit)

The ENTRYPOINT and CMD are run after the build and need a lot of time, with ~70 processes it isn't fast enough to run the next CI command. The actual servers in the bench have a CPU with 56 threads.

No ENTRYPOINT and config with RUN

If we bypass the ENTRYPOINT and config unit with RUN and tested with 168 processes is really fast, and the next CI command run without problems.

RUN unitd && \
    curl -X PUT --data-binary @/php/deploy/nginx-unit.json --unix-socket \
        /var/run/control.unit.sock http://localhost/config

ENTRYPOINT [ ]

CMD ["unitd", "--no-daemon", "--control", "unix:/var/run/control.unit.sock"]

Perhaps I need to kill unit in the RUN.

Welcome any feedback with another solution.

ac000 commented 5 months ago

If we bypass the ENTRYPOINT and config unit with RUN and tested with 168 processes is really fast, and the next CI command run without problems.

Yes, that matches my previous comment where you can see starting Unit with a load load of PHP processes took < 1 second... so I suspect the problem is not Unit itself but around the machinery to start it...

I don't know what ENTRYPOINT is doing but perhaps you can instrument things to see where the time going?

joanhey commented 5 months ago

The ENTRYPOINT https://github.com/nginx/unit/blob/master/pkg/docker/docker-entrypoint.sh

The log forENTRYPOINT execution ( start unit | /docker-entrypoint.d/{.sh, .js, .pem, .json} | kill unit ) + CMD (start unit again) execution, both executed after the container build: https://tfb-status.techempower.com/unzip/results.2024-05-07-03-20-53-233.zip/results/20240501071818/php-unit/run/php-unit.log

We can also run the tests in local, and fail because use too much time with ~+70 processes. Why need to start unit - config - kill - start again, when unit configuration is dynamic ? Or better, why not start with the config.json in /docker-entrypoint.d/ directly ?

lcrilly commented 5 months ago

See previous discussion: https://github.com/nginx/unit/discussions/1114

The entrypoint script works well for simple configurations/applications but does hurt container startup time.

The reason for the start-config-kill-restart process is because the Control API is the means of configuring Unit and so we need a running instance to actually do the configuration.

For my own Unit-based Docker images I take a similar approach with RUN. The entrypoint script will notice that a configuration is already set and will not overwrite it. So you don't need to set it empty.

COPY unitconf.json init.json
RUN nohup /bin/sh -c "unitd --no-daemon --pid init.pid --log /dev/stdout --control unix:init.unit.sock &" && \
    curl -fsX PUT -d@init.json --unix-socket init.unit.sock _/config && \
    rm init.*

I have a proposal for unitd to start in a non-daemonizing, "configuration mode" and read configuration from stdin. Which would also remove the dependency on having curl(1) in the container. I'll write an issue about it (edit: https://github.com/nginx/unit/issues/1274).