vmihalko / t2_polkit

Polkitd: The utils_spawn_data_free reap timeout subprocess did not work resulting in a large number of zombie processes #15

Open vmihalko opened 6 years ago

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on Apr 13, 2018, 07:22

Submitted by lin..@..ud.com

Assigned to David Zeuthen @david

Link to original bug (#106021)

Description

Hi,

We found a problem in polkitd.

When a subprocess spawned from a rule times out, the code in utils_spawn_data_free that should reap the timed-out subprocess does not work, which results in a large number of zombie processes.

It can be reproduced on Fedora 27, and upstream has not fixed it.

How to reproduce:

  1. Add a debug rule. This rule spawns a process that runs for more than 10 s, which results in a timeout:

    [root@localhost ~]# cat /etc/polkit-1/rules.d/01-test.rules
    polkit.addRule(function(action, subject) {
        polkit.log("debug start");
        try {
            polkit.spawn(["/usr/bin/sleep", "15"]);
        } catch (error) {
            // polkit.log(error)
        }
    });

  2. Look at the polkitd processes:

    [root@localhost ~]# ps -ef |grep polkit |grep -v polkit
    polkitd 1501 1 0 Mar31 ? 00:02:51 /usr/lib/polkit-1/polkitd --no-debug
    polkitd 5060 1501 0 12:37 ? 00:00:00 [sleep] <defunct>
    polkitd 5367 1501 0 12:38 ? 00:00:00 [sleep] <defunct>
    polkitd 5631 1501 0 12:38 ? 00:00:00 [sleep] <defunct>
    polkitd 5915 1501 0 12:38 ? 00:00:00 [sleep] <defunct>
    polkitd 14052 1501 0 12:42 ? 00:00:00 sleep 20

    [root@localhost ~]# journalctl -fu polkit
    -- Logs begin at Sat 2018-03-31 14:36:03 CST. --
    Apr 03 12:39:11 2-3 polkitd[1501]: /etc/polkit-1/rules.d/01-test.rules:5: Error: Error spawning helper: Timed out after 10 seconds (g-io-error-quark, 24)
    Apr 03 12:39:21 2-3 polkitd[1501]: /etc/polkit-1/rules.d/01-test.rules:5: Error: Error spawning helper: Timed out after 10 seconds (g-io-error-quark, 24)
    Apr 03 12:40:11 2-3 polkitd[1501]: /etc/polkit-1/rules.d/01-test.rules:5: Error: Error spawning helper: Timed out after 10 seconds (g-io-error-quark, 24)
    Apr 03 12:40:21 2-3 polkitd[1501]: /etc/polkit-1/rules.d/01-test.rules:5: Error: Error spawning helper: Timed out after 10 seconds (g-io-error-quark, 24)

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on Apr 13, 2018, 10:28

:hammer_and_wrench: lin..@..ud.com submitted a patch:

Patch 138819, "0001-add-child-reaper-thread-to-fix-zombies":
file_106021.txt

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on Apr 13, 2018, 10:34

:speech_balloon: lin..@..ud.com said:

I made a patch to fix this issue.

The root cause is :

    static void
    utils_spawn_data_free (UtilsSpawnData *data)
    {
      if (data->child_pid != 0)
        {
          GSource *source;
          kill (data->child_pid, SIGTERM);
          /* OK, we need to reap for the child ourselves - we don't want

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on Apr 13, 2018, 10:38

:speech_balloon: lin..@..ud.com said:

The GChildWatch in utils_spawn_data_free didn't work due to the release of main_loop and main context outside.
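
To illustrate the effect described above, here is a minimal sketch in plain POSIX C. It is not polkit's code and does not use GLib. The helper names (`spawn_sleeper`, `is_unreaped_zombie`, `kill_without_reaping`, `reap`) are hypothetical. It only shows that a child which has been sent SIGTERM stays behind as a zombie until its parent actually collects it, which is what happens when the child watch is never dispatched.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the helper started by polkit.spawn():
 * a child that simply waits until a signal terminates it. */
static pid_t spawn_sleeper(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        signal(SIGTERM, SIG_DFL);   /* make sure SIGTERM terminates us */
        for (;;)
            pause();
    }
    return pid;                     /* child pid, or -1 if fork() failed */
}

/* Returns 1 if `pid` has terminated but has not been reaped yet. */
static int is_unreaped_zombie(pid_t pid)
{
    siginfo_t info;

    info.si_pid = 0;
    /* WNOHANG: do not block. WNOWAIT: look, but leave the child waitable. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOHANG | WNOWAIT) != 0)
        return 0;                   /* for example ECHILD: already reaped */
    return info.si_pid == pid;
}

/* Mimics the failing path: signal the child but never collect it.
 * Returns 1 if the child is left behind as a zombie, -1 on error. */
static int kill_without_reaping(pid_t pid)
{
    siginfo_t info;

    assert(pid > 0);                /* pid <= 0 would signal a process group */
    if (kill(pid, SIGTERM) != 0)
        return -1;
    /* Wait until the child has died, but do not reap it (WNOWAIT). */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;
    return is_unreaped_zombie(pid);
}

/* What a working child watch ultimately has to do: collect the status. */
static int reap(pid_t pid)
{
    int status;
    return waitpid(pid, &status, 0) == pid;
}
```

Calling `kill_without_reaping()` on a fresh child leaves a `<defunct>` entry like the ones in the `ps` output above; only after `reap()` does the entry disappear.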

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 3, 2018, 18:42

:speech_balloon: David Herrmann @dvdhrm said:

Why not simply turn SIGTERM into SIGKILL and use waitid(2)? I mean, we are dealing with a timeout here, no reason to try to be graceful.
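
A minimal sketch of this suggestion, assuming plain POSIX calls and hypothetical helper names (`spawn_stubborn_child`, `kill_and_reap`); it is not taken from any polkit patch. Because SIGKILL cannot be caught or ignored, the blocking `waitid()` that follows is normally short, even for a child that ignores SIGTERM.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical test child that ignores SIGTERM, the worst case for a
 * graceful shutdown. */
static pid_t spawn_stubborn_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        signal(SIGTERM, SIG_IGN);
        for (;;)
            pause();
    }
    return pid;                     /* child pid, or -1 if fork() failed */
}

/* Kill the child unconditionally and block until it has been collected.
 * On success returns 0 and fills `info` with the child's exit details. */
static int kill_and_reap(pid_t pid, siginfo_t *info)
{
    assert(pid > 0);                /* pid <= 0 would signal a process group */
    if (kill(pid, SIGKILL) != 0)
        return -1;
    /* Retry if the wait is interrupted by an unrelated signal handler. */
    while (waitid(P_PID, (id_t)pid, info, WEXITED) != 0) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

After `kill_and_reap()` returns 0, `info.si_code` is `CLD_KILLED` and no zombie remains. The trade-off is that the child gets no chance to clean up, which is the point being argued here: after a timeout there is little reason to be graceful.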

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 5, 2018, 14:10

:speech_balloon: lin..@..ud.com said:

(In reply to David Herrmann from comment 4)
> Why not simply turn SIGTERM into SIGKILL and use waitid(2)? I mean, we are dealing with a timeout here, no reason to try to be graceful.

Hi, I made a better patch to fix this. I will send it out next Monday. :)

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 8, 2018, 04:14

:paperclip: lin..@..ud.com uploaded an attachment:

Attachment 139417, "0001-polkitd-make-sure-child-process-exits-will-be-proces.patch":
file_106021.txt

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 8, 2018, 04:28

:speech_balloon: lin..@..ud.com said:

Hi, I made a better patch to make sure child process exits are processed. This patch seems simpler.

I added 3 timeout sources.
The first one sends SIGTERM at 10s, the second one sends SIGKILL at 15s, and the last one quits the main loop. Once the child process has exited and the child watch source has been processed, the main loop quits. Otherwise we quit the main loop at 20s.

Timer1: 10s send SIGTERM.
Timer2: 15s send SIGKILL.
Timer3: 20s exit the main loop.

0 ~ 10s: child exits normally.
10 ~ 15s: child exits by SIGTERM.
15 ~ 20s: child exits by SIGKILL.
20s ~ : child seems to be abnormal; we quit the main loop.

Please give me some comments or suggestions on fixing the issue. : )
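
The staged timeline above can be sketched as follows. This is only an illustration under stated assumptions: the patch itself uses GLib timeout sources on a main loop, whereas this sketch swaps in a simple POSIX polling loop so that it stays self-contained. All names (`reap_outcome`, `spawn_child`, `wait_with_escalation`) are hypothetical, and elapsed time is approximated by counting 10 ms ticks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

typedef enum {
    CHILD_EXITED,       /* ended on its own, no signal from us was needed */
    CHILD_TERMINATED,   /* died from our SIGTERM */
    CHILD_KILLED,       /* died from our SIGKILL */
    CHILD_ABANDONED     /* not collected by the final deadline, or wait failed */
} reap_outcome;

/* Hypothetical test child. mode 0: exit at once; mode 1: wait for a
 * signal; mode 2: ignore SIGTERM and wait. */
static pid_t spawn_child(int mode)
{
    pid_t pid = fork();
    if (pid == 0) {
        if (mode == 0)
            _exit(0);
        signal(SIGTERM, mode == 2 ? SIG_IGN : SIG_DFL);
        for (;;)
            pause();
    }
    return pid;                     /* child pid, or -1 if fork() failed */
}

/* Poll for the child's exit; send SIGTERM after term_ms, SIGKILL after
 * kill_ms, and stop waiting after quit_ms (all approximate). */
static reap_outcome wait_with_escalation(pid_t pid, int term_ms,
                                         int kill_ms, int quit_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int elapsed_ms, sent_term = 0, sent_kill = 0;

    assert(pid > 0);                /* pid <= 0 would signal a process group */
    for (elapsed_ms = 0; elapsed_ms < quit_ms; elapsed_ms += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid) {             /* child collected: classify how it ended */
            if (sent_kill && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
                return CHILD_KILLED;
            if (sent_term && WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
                return CHILD_TERMINATED;
            return CHILD_EXITED;
        }
        if (r == -1)
            return CHILD_ABANDONED; /* waitpid failed, for example ECHILD */

        if (!sent_term && elapsed_ms >= term_ms) {
            kill(pid, SIGTERM);
            sent_term = 1;
        }
        if (!sent_kill && elapsed_ms >= kill_ms) {
            kill(pid, SIGKILL);
            sent_kill = 1;
        }
        nanosleep(&tick, NULL);
    }
    return CHILD_ABANDONED;
}
```

With the values from this comment the call would be `wait_with_escalation(pid, 10000, 15000, 20000)`. Unlike a main-loop based version, this loop blocks the calling thread while it polls, so it is only meant to make the timeline concrete.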

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 9, 2018, 12:35

:speech_balloon: David Herrmann @dvdhrm said:

I would prefer to send SIGKILL straight away and use waitid(2) to guarantee it is collected.

Anyway, your patch looks fine. Let's see whether a polkit maintainer can apply it.

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 10, 2018, 11:14

:hammer_and_wrench: lin..@..ud.com submitted a patch:

This patch seems to be much simpler and better.

Patch 139459, "polkitd-fix-zombie-not-reaped-when-js-spawned-proces.patch":
file_106021.txt

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on May 14, 2018, 10:01

:speech_balloon: lin..@..ud.com said:

(In reply to David Herrmann from comment 8)
> I would prefer to send SIGKILL straight away and use waitid(2) to guarantee it is collected.
>
> Anyway, your patch looks fine. Let's see whether a polkit maintainer can apply it.

Hi, I posted a new patch; this one seems much simpler. It attaches the source to the global default main context, and it works.

Change:

vmihalko commented 6 years ago

In GitLab by @bugzilla-migration on Aug 15, 2018, 13:02

:speech_balloon: David Herrmann @dvdhrm said:

(In reply to lining916740672 from comment 10)

    - g_source_attach (source, data->main_context);
    + /* attach source to the global default main context */
    + g_source_attach (source, NULL);

According to glib docs g_source_attach() is safe to attach to other threads. The callback we use is localized to the source itself, so I see no harm in doing that. Furthermore, no threading should be involved here, since the js-authority is executed inline, but I am not entirely sure it is invoked in the main-thread.

Regardless: I think this is safe.

I still believe sending SIGKILL is the right thing to do. But I also think this patch is the right way to reap children correctly.

Reviewed-by: David Herrmann <dh.herrmann@gmail.com>

Not sure who to ping to pick this up and merge upstream, though.

vmihalko commented 3 years ago

In GitLab by @jcpunk on Dec 1, 2020, 20:54

Any chance to get this revived and finished?