php / php-src

The PHP Interpreter
https://www.php.net

ERROR: scoreboard: failed to lock (already locked) #13437

Open jkabat opened 6 months ago

jkabat commented 6 months ago

Description

After a recent upgrade of our application containers to PHP 8.2.14, I see plenty of these error messages in the log:

ERROR: scoreboard: failed to lock (already locked)
ERROR: status: unable to find or access status shared memory

Is there something I can do to prevent this from happening? End users do not seem to be affected, as far as I can tell. These log messages appear at random and only for the PHP-FPM containers. Long-running PHP-CLI commands operating under Supervisor do not have these issues.

opcache: enabled, JIT: disabled

Edit: worth mentioning that the errors occur randomly even on a container that is barely used during the waiting period.

PHP Version

PHP 8.2.14

Operating System

Debian GNU/Linux 12 (bookworm)

bukka commented 6 months ago

I just checked, and this happens when the scoreboard is already in use by another process. That could happen due to the lock being held longer (there was potentially one such change, though I think a bit earlier). What was the previous version that you updated from?

Another thing that could play a role is a high number of configured children. Could you provide your FPM config?

jkabat commented 6 months ago

I have upgraded from 8.1.18-fpm-bullseye to 8.2.14-fpm-bookworm

FPM config:

[global]
daemonize = no

log_level = notice

emergency_restart_threshold = 3
emergency_restart_interval = 1m
process_control_timeout = 10s

[www]
access.log = /dev/null

listen = 9000

request_terminate_timeout = 180s

pm = static
pm.max_children = 10
pm.max_requests = 200
pm.process_idle_timeout = 10s

pm.status_path = /status

bukka commented 6 months ago

Could you try to increase (significantly) or remove pm.max_requests and see if that helps?
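
For example, in the [www] pool from the config above, either option would look like this (a sketch):

; raise the limit significantly...
pm.max_requests = 10000
; ...or drop the directive entirely (0 is the default, meaning children are never recycled)
;pm.max_requests = 200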

bukka commented 6 months ago

You should generally not need this, and it seems it didn't work correctly before 8.2 (we had a couple of reports regarding that) - it means children didn't restart correctly, which was somehow fixed in 8.2.

jkabat commented 5 months ago

I have removed the pm.max_requests option, but the error is still present even on the newest 8.2.17-fpm-bookworm. EDIT: could it have something to do with OPcache being enabled? But it still happens only now and then.

Used memory | 83587760
Free memory | 84184400

bukka commented 5 months ago

I have been thinking about this and looking through the code and changes. I don't see any specific FPM change that should impact locking. The last fix in that area was also merged into 8.1.

What most likely has the biggest impact on scoreboard locking is status collection. Do you have an external collector for the status? If so, how often does it run? If it runs often, try decreasing the frequency or temporarily disabling it just to see if that helps.

Otherwise, there are some other factors that could be the cause. Could you also try it with a minimal config just to see if that helps, and then potentially increase pm.max_children (up to 100) if it is still happening:

[global]
daemonize = no
log_level = notice

[www]
listen = 9000

pm = static
pm.max_children = 10

pm.status_path = /status

If it disappears, try adding the removed options back one by one and see when it happens again.

jkabat commented 5 months ago

@bukka thank you for looking into the issue.

I use a healthcheck that periodically calls the /status endpoint: https://raw.githubusercontent.com/renatomefi/php-fpm-healthcheck/master/php-fpm-healthcheck

It's set to a 10s interval. Let me test the settings and I will let you know if it helps.

ToshY commented 4 months ago

@jkabat I'm using the same healthcheck, also with the interval set to 10s, and I am getting the same messages in the log. Have you been able to resolve this?

realcnbs commented 3 months ago

If you're running this in a k8s pod, this is probably caused by your liveness and readiness probes being executed at the exact same moment.

jkabat commented 2 months ago

@realcnbs yes, I run these in k8s pods - what would be the solution? Do you have one?

realcnbs commented 2 months ago

Play with the probes' delays and periods so there are a couple of seconds between them.
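
For example (a sketch, assuming exec probes that run the php-fpm-healthcheck script mentioned earlier):

          livenessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            initialDelaySeconds: 0
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            initialDelaySeconds: 5   # offset by half a period from the liveness probe
            periodSeconds: 10        # same period, so the offset is roughly preserved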

pimjansen commented 2 months ago

But still, it should be able to handle both at once, right? My logs are being spammed with this (300 times in the last 24 hours).

realcnbs commented 2 months ago

It should, but apparently it doesn't.

adamkral12 commented 1 month ago

I tried 5 seconds between readiness and liveness and I am still hitting the same issue. I also tried to reproduce it locally without any luck :/

bukka commented 1 month ago

It would be useful to get more info about the setup (including the k8s deployment spec and possibly a bit of info about the cluster) and the traffic where this happens. I'm building a new testing tool, https://github.com/wstool/wst/, that will allow me to better test high-traffic issues including k8s setups, so the more info I get, the better chance I have of recreating it...

pimjansen commented 1 month ago

I will send our Dockerfile tomorrow. In general, it is not really related to high traffic, if I'm correct.

jkabat commented 1 month ago

I can provide our k8s setup too; I agree with @pimjansen that it is not related to high traffic.

pimjansen commented 1 month ago

@bukka here is a gist for the Dockerfile and related config. If I'm correct, there is not much "special" in this use case except maybe the healthcheck:

https://gist.github.com/pimjansen/5bc6b3222cf26f505b608207cf350e31

EDIT: And some deployment probing:

          livenessProbe:
            exec:
              command:
                - php-fpm-healthcheck
                - --listen-queue=10
            initialDelaySeconds: 0
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            initialDelaySeconds: 1
            periodSeconds: 5

realcnbs commented 1 month ago

I tried 5 seconds between readiness and liveness and I am still hitting the same issue. I also tried to reproduce it locally without any luck :/

Try to disable one of the probes and see if that fixes it. If it does, then 5 seconds is not enough to guarantee the time gap. As an option, you can try using a startupProbe instead of the readiness probe.
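
For example (a sketch with the same php-fpm-healthcheck exec command); a startupProbe only runs until it first succeeds, so it never overlaps with the liveness probe once the container is up:

          startupProbe:
            exec:
              command:
                - php-fpm-healthcheck
            periodSeconds: 5
            failureThreshold: 30   # allow up to ~150s for FPM to come up
          livenessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            periodSeconds: 10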

jkabat commented 1 month ago

I tried 5 seconds between readiness and liveness and I am still hitting the same issue. I also tried to reproduce it locally without any luck :/

Try to disable one of the probes and see if that fixes it. If it does, then 5 seconds is not enough to guarantee the time gap. As an option, you can try using a startupProbe instead of the readiness probe.

When I commented out the readiness probe, it really helped...

EDIT: as per @realcnbs's suggestion, I have replaced the readiness probe with a startupProbe, and I have seen only one occurrence so far. I'm not sure it's safe to get rid of the readinessProbe, so I will keep this as a temporary solution until it's fixed. Thank you, guys.

karabanov commented 1 month ago

After upgrading from PHP 7.4.33 to PHP 8.0.30, I also started noticing such messages in the logs:

[11-Aug-2024 17:13:42] WARNING: [pool big_app_prod] child 667 said into stderr: "ERROR: scoreboard: failed to lock (already locked)"
[11-Aug-2024 17:13:42] WARNING: [pool big_app_prod] child 667 said into stderr: "ERROR: status: unable to find or access status shared memory"
[11-Aug-2024 18:12:01] NOTICE: error log file re-opened
[11-Aug-2024 18:57:01] WARNING: [pool big_app_prod] child 461 said into stderr: "ERROR: scoreboard: failed to lock (already locked)"
[11-Aug-2024 18:57:01] WARNING: [pool big_app_prod] child 461 said into stderr: "ERROR: status: unable to find or access status shared memory"

PHP is installed from the repository https://packages.sury.org/php/ and runs in a container.

The config file looks something like this:

[big_app_prod]
; Standard
listen = /www/php_sockets/php-fpm-docker-big_app_prod.sock
listen.backlog = 4096
listen.group = www-data
listen.mode = 0660
rlimit_files = 10240
chdir = /
catch_workers_output = yes
security.limit_extensions = .php

; Deployment-specific
pm = static
pm.max_children = 1500
pm.status_path = /fpm-status

php_admin_flag[log_errors] = on
php_admin_flag[display_errors] = off
php_admin_value[error_log] = /var/log/php/$pool/$pool.log
php_admin_value[sendmail_path] = /usr/bin/msmtp -C /etc/msmtprc -t --read-envelope-from

; Project-specific
php_admin_value[precision] = 12
php_admin_flag[ignore_user_abort] = on
php_admin_flag[expose_php] = off
php_admin_value[max_input_nesting_level] = 8
php_admin_value[max_execution_time] = 600
php_admin_value[memory_limit] = 300M
php_admin_value[error_reporting] = E_ALL & ~E_NOTICE & ~E_DEPRECATED & ~E_STRICT
php_admin_flag[html_errors] = off
php_admin_value[post_max_size] = 32M
php_admin_value[upload_max_filesize] = 32M
php_admin_flag[file_uploads] = True
php_admin_value[cgi.fix_pathinfo] = 0
php_admin_value[session.name] = big_app_prod
php_admin_value[session.save_handler] = redis
php_admin_value[session.save_path] = "tcp://127.0.0.1:6379?timeout=10&prefix=big_app_prod_SESS_"
php_admin_value[session.cookie_httponly] = On
php_admin_value[apc.shm_size] = 2048M
php_admin_value[max_input_vars] = 1000
php_admin_value[session.gc_maxlifetime] = 1440

Metrics are collected using:

# https://github.com/Lusitaniae/phpfpm_exporter
prometheus-php-fpm-exporter \
    --web.listen-address=127.0.0.1:5584 \
    --phpfpm.socket-paths=/www/php_sockets/php-fpm-docker-big_app_prod.sock \
    --phpfpm.status-path=/fpm-status \
    --phpfpm.script-collector-paths=/var/lib/prometheus/php_opcache_exporter.php

PHP Version

8.0.30

Operating System

Debian 11 (bullseye)

verfriemelt-dot-org commented 3 weeks ago

I think the proper solution here is to fix the locking issue. This is also reproducible by spamming the FPM process with multiple concurrent

$ SCRIPT_NAME=/status SCRIPT_FILENAME=/status QUERY_STRING=full REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000

requests. You'll eventually end up with the error described here:

127.0.0.1 -  21/Aug/2024:08:36:16 +0000 "GET /status" 200
127.0.0.1 -  21/Aug/2024:08:36:16 +0000 "GET /status" 500
ERROR: scoreboard: failed to lock (already locked)
ERROR: status: unable to find or access status shared memory
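
For instance, one way to generate that concurrency (a sketch that simply runs the command above in parallel background jobs):

for i in $(seq 1 50); do
    SCRIPT_NAME=/status SCRIPT_FILENAME=/status QUERY_STRING=full \
        REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000 >/dev/null &
done
wait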

Also, I think the readinessProbe should just use /ping instead of the full /status... The script mentioned here (https://raw.githubusercontent.com/renatomefi/php-fpm-healthcheck/master/php-fpm-healthcheck) should be adapted to use /ping for the ping mode, i.e. when invoked without arguments, to avoid this altogether. This would be a viable workaround for the time being.
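
For reference, /ping first has to be enabled in the pool configuration; a minimal sketch using the standard FPM pool directives:

; enable the lightweight ping endpoint
ping.path = /ping
; optional - "pong" is the default response
ping.response = pong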

pimjansen commented 3 weeks ago

Seems fair, but wouldn't it be good for the PHP-FPM Docker images to provide this out of the box? It would reduce these errors and give better control and a better user experience, since it is already embedded.

bukka commented 4 days ago

So I finally managed to reliably recreate the issue with my new tool called WST, funded by the STF (Sovereign Tech Fund), which allows me to spin up multiple services and test various things in its pipeline - specifically, to run a load test and check the logs. The actual test for this can be seen here.

After that, I looked properly into the scoreboard locking and found a couple of issues. I have an initial fix in https://github.com/php/php-src/pull/15805, after which I no longer see the locking error, but it might be changing a bit too much, so I'm not sure if I should go for a lighter version for PHP 8.2 and do the rest in master, as it might be a bit risky.

goffyara commented 1 day ago

The same happens with PHP 8.3.10, using php-fpm-healthcheck with k8s.