segmentation fault with httpd 2.4.54

arminabf commented 2 years ago

For regression testing we use a proprietary HTTP service that is able to simulate a CA authoritiy to mod_md. Since we have updated httpd to 2.4.54 we face segmentation faults with various backtraces.

Following are two examples:

#0  0xf7d6c21a in __pthread_rwlock_wrlock_full (abstime=0x0, rwlock=0x0) at pthread_rwlock_common.c:576
#1  __GI___pthread_rwlock_wrlock (rwlock=0x0) at pthread_rwlock_wrlock.c:27
#2  0xf3cce598 in CRYPTO_THREAD_write_lock (lock=0x0) at crypto/threads_pthread.c:78
#3  0xf3ce1467 in RAND_get_rand_method () at crypto/rand/rand_lib.c:851
#4  0xf3ce1653 in RAND_priv_bytes (buf=0xc620a310 "\260\244 \306", num=192) at crypto/rand/rand_lib.c:923
#5  0xf3d5a585 in bnrand (flag=PRIVATE, rnd=0xc6208350, bits=1536, top=-1, bottom=0) at crypto/bn/bn_rand.c:46
#6  0xf3d5a9e8 in bnrand_range (flag=PRIVATE, r=0xc6208350, range=0xc6208328) at crypto/bn/bn_rand.c:162
#7  0xf3d5aa99 in BN_priv_rand_range (r=0xc6208350, range=0xc6208328) at crypto/bn/bn_rand.c:184
#8  0xf3d5c0f8 in BN_is_prime_fasttest_ex (a=0xc6207250, checks=4, ctx_passed=0xc62067f0, do_trial_division=0, cb=0x0) at crypto/bn/bn_prime.c:220
#9  0xf3d5bc10 in BN_generate_prime_ex (ret=0xc6207250, bits=1536, safe=0, add=0x0, rem=0x0, cb=0x0) at crypto/bn/bn_prime.c:104
#10 0xf3dc49a5 in rsa_builtin_keygen (rsa=0xc62086e0, bits=3072, primes=2, e_value=0xc6208970, cb=0x0) at crypto/rsa/rsa_gen.c:164
#11 0xf3dc4572 in RSA_generate_multi_prime_key (rsa=0xc62086e0, bits=3072, primes=2, e_value=0xc6208970, cb=0x0) at crypto/rsa/rsa_gen.c:61
#12 0xf3dc06bf in pkey_rsa_keygen (ctx=0xc6208ab0, pkey=0xc6208ae0) at crypto/rsa/rsa_pmeth.c:743
#13 0xf3cfdf42 in EVP_PKEY_keygen (ctx=0xc6208ab0, ppkey=0x1f9438ec) at crypto/evp/pmeth_gn.c:108
#14 0xf3bd7c05 in gen_rsa (ppkey=0xf333cee0, p=0x1f941aa8, bits=3072) at md_crypt.c:787
#15 0xf3bd81c7 in md_pkey_gen (ppkey=0xf333cee0, p=0x1f941aa8, spec=0xf333ced4) at md_crypt.c:944
#16 0xf3bcc95e in md_acme_acct_register (acme=0x1f943548, store=0x1e3e5e50, md=0x1f941da0, p=0x1f941aa8) at md_acme_acct.c:650
#17 0xf3bcedbb in md_acme_drive_set_acct (d=0x1f941ae8, result=0x9a5deb8) at md_acme_drive.c:168
#18 0xf3bd1cd5 in md_acmev2_drive_renew (ad=0x1f941cb8, d=0x1f941ae8, result=0x9a5deb8) at md_acmev2_drive.c:104
#19 0xf3bd0cae in acme_renew (d=0x1f941ae8, result=0x9a5deb8) at md_acme_drive.c:801
#20 0xf3bd1332 in acme_driver_renew (d=0x1f941ae8, result=0x9a5deb8) at md_acme_drive.c:929
#21 0xf3be4d6b in run_renew (baton=0x1e3e6180, p=0x9a5de78, ptemp=0x1f941aa8, ap=0xf333d180 "") at md_reg.c:1156
#22 0xf3bec009 in pool_vado (cb=0xf3be4c61 <run_renew>, baton=0x1e3e6180, p=0x9a5de78, ap=0xf333d16c "\020l\274\036\210\365\256\t") at md_util.c:59
#23 0xf3bec048 in md_util_pool_vdo (cb=0xf3be4c61 <run_renew>, baton=0x1e3e6180, p=0x9a5de78) at md_util.c:71
#24 0xf3be4df9 in md_reg_renew (reg=0x1e3e6180, md=0x1ebc6c10, env=0x9aef588, reset=0, attempt=0, result=0x9a5deb8, p=0x9a5de78) at md_reg.c:1166
#25 0xf3bf7fb1 in process_drive_job (dctx=0x1f875ab8, job=0x1f85ef88, ptemp=0x9a5de78) at mod_md_drive.c:128
#26 0xf3bf879b in run_watchdog (state=2, baton=0x1f875ab8, ptemp=0x9a5de78) at mod_md_drive.c:219
#27 0x0811ec2e in wd_worker (thread=0x1f860528, data=0x1f875af0) at mod_watchdog.c:209
#28 0x080d831b in thread_start (thread=0x1f860528, data=0x1f860520) at util.c:3210
#29 0x083738b2 in dummy_worker (opaque=0x1f860528) at threadproc/unix/thread.c:142
#30 0xf7d673bd in start_thread (arg=0xf333db40) at pthread_create.c:463
#31 0xf7c7de16 in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:108

#0  _int_free (av=<optimized out>, p=0x48410e48, have_lock=<optimized out>) at malloc.c:4241
#1  0x0842efa9 in CRYPTO_free (str=0x48410e50, file=0x8660548 "crypto/ex_data.c", line=84) at crypto/mem.c:312
#2  0x08432554 in cleanup_cb (funcs=0x48410e50) at crypto/ex_data.c:84
#3  0x08407dfe in OPENSSL_sk_pop_free (st=0x11cff570, func=0x8432525 <cleanup_cb>) at crypto/stack/stack.c:368
#4  0x08432384 in sk_EX_CALLBACK_pop_free (sk=0x11cff570, freefunc=0x8432525 <cleanup_cb>) at include/internal/cryptlib.h:47
#5  0x084325a2 in crypto_cleanup_all_ex_data_int () at crypto/ex_data.c:100
#6  0x08431014 in OPENSSL_cleanup () at crypto/init.c:615
#7  0xf7bb5313 in __run_exit_handlers (status=0, listp=0xf7d5d3fc <__exit_funcs>, run_list_atexit=true, run_dtors=true) at exit.c:108
#8  0xf7bb53f1 in __GI_exit (status=0) at exit.c:139
#9  0x0823774f in clean_child_exit (code=0) at worker.c:485
#10 0x082394f6 in child_main (child_num_arg=0, child_bucket=0) at worker.c:1330
#11 0x0823964b in make_child (s=0x8af88b0, slot=0, bucket=0) at worker.c:1391
#12 0x0823a08c in server_main_loop (remaining_children_to_start=1) at worker.c:1698
#13 0x0823a6b9 in worker_run (_pconf=0x8ac0218, plog=0x8b07898, s=0x8af88b0) at worker.c:1852
#14 0x080d9500 in ap_run_mpm (pconf=0x8ac0218, plog=0x8b07898, s=0x8af88b0) at mpm_common.c:102
#15 0x080d0e5c in main (argc=5, argv=0xffd52234) at main.c:841

In our tests we use these settings

Protocols acme-tls/1
MDCAChallenges tls-alpn-01
MDCertificateAgreement accepted
MDCertificateStatus off
MDStoreDir /tmp/md
MDContactEmail test@md.test.com
MDCertificateAuthority http://auth.md.test.com:10016/directory
MDMessageCmd /path/to/md_message.sh
MDomain acme-test.md.test.com

Tracing the request-response flow between mod_md and CA authority we see following HTTP traffic for the first two requests:

> GET /directory HTTP/1.1
> Host: auth.md.test.com:10016
> User-Agent: Apache/2.4.54 mod_md/2.4.17
> Accept: */*
> 

< HTTP/1.1 200 OK
< Cache-Control: public, max-age=0, no-cache
< Content-Type: application/json
< Date: Mon Aug  1 09:07:04 CEST 2022
< Content-Length: 631
< 
< {
<   "4Fcmi-_EJzQ": "https://community.letsencrypt.org/t/adding-random-entries-to-the-directory/33417",
<   "keyChange": "http://auth.md.test.com:10016/acme/key-change",
<   "meta": {
<     "caaIdentities": [
<       "happy-hacker-ca.invalid"
<     ],
<     "termsOfService": "https://boulder:4431/terms/v7",
<     "website": "https://github.com/letsencrypt/boulder"
<   },
<   "newAccount": "http://auth.md.test.com:10016/acme/new-acct",
<   "newNonce": "http://auth.md.test.com:10016/acme/new-nonce",
<   "newOrder": "http://auth.md.test.com:10016/acme/new-order",
<   "revokeCert": "http://auth.md.test.com:10016/acme/revoke-cert"
< }

> HEAD /acme/new-nonce HTTP/1.1
> Host: sesdev.tarsec.com:10016
> User-Agent: Apache/2.4.54 mod_md/2.4.17
> Accept: */*
> 

< HTTP/1.1 200 OK
< Cache-Control: public, max-age=0, no-cache
< Link: <http://sesdev.tarsec.com:10016/directory>;rel="index"
< Replay-Nonce: taro_u20lgpRBeI9pRYF6gYir0N0saUDbZFqKPtToj60RIM
< Date: Mon Aug  1 09:21:14 CEST 2022
<

It is interesting that if our HTTP service closes the connection after each response, then the segmentation fault does not occur. However, that behaviour is new since 2.4.54.

icing commented 2 years ago

There is a known issue since the dawn of time in restart/reload behaviour and watchdog tasks. The problem is that the order in which watchdog tasks and server shut down is not coordinated well enough and OpenSSL shutdown may stumble over its feet.

One of the stack traces you show is a from a child exiting. Just to confirm that we are investigating the right thing: can you confirm crashes without the child exiting?

arminabf commented 2 years ago

One of the stack traces you show is a from a child exiting. Just to confirm that we are investigating the right thing: can you confirm crashes without the child exiting?

On opening this issue I thought there are various stack traces that end up in a segfault. Now that I look at it more closely I recognized that there are only the two cases that I have mentioned in my first post. Segfaults that arise during exiting of childs tend to be in the minority. So yes, there are lots of crashes without the child exiting.

icing commented 2 years ago

The first stack trace shows

#2  0xf3cce598 in CRYPTO_THREAD_write_lock (lock=0x0) at crypto/threads_pthread.c:78

and that means the lock pointer is NULL which is, afaict not supposed to happen. Openssl has a RUN_ONCE init that allocates locks for the RAND functions to use.

Which means either that this did not properly work or that someone wiped memory he was not supposed to. Can you compile the mod_md from 2.4.53 in the 2.4.54 into your test server to check that the problems disappear? That would lay the blame solely at my door...

icing commented 2 years ago

Hmm, looking at the differences between 2.4.53 and 2.4.54, there was a change in JSON reference handling. Can you try the following patch?

Index: modules/md/md_json.c
===================================================================
--- modules/md/md_json.c    (Revision 1903191)
+++ modules/md/md_json.c    (Arbeitskopie)
@@ -195,7 +195,6 @@
     j = jselect_parent(&key, 1, json, ap);

     if (!j || !json_is_object(j)) {
-        json_decref(val);
         return APR_EINVAL;
     }

@@ -206,7 +205,6 @@
     }

     if (!json_is_array(aj)) {
-        json_decref(val);
         return APR_EINVAL;
     }

@@ -660,8 +658,7 @@
 static int object_set(void *data, const char *key, const char *val)
 {
     json_t *j = data, *nj = json_string(val);
-    json_object_set(j, key, nj);
-    json_decref(nj);
+    json_object_set_new(j, key, nj);
     return 1;
 }

arminabf commented 2 years ago

Can you compile the mod_md from 2.4.53 in the 2.4.54 into your test server to check that the problems disappear? That would lay the blame solely at my door...

I tested this out.... the segfaults occur also with mod_md from 2.4.53.

Can you try the following patch?

Unfortunately, the patch does not solve the problem.

icing commented 2 years ago

Well, that means the latest changes are not causing the issues on your system.

Have you switched/updated other components? Maybe from OpenSSL 1.1 to 3.0? Trying to circle things down here...

icing commented 2 years ago

Ok, scanning the code and the report again. Can you produce a log at LogLevel md:trace4 of what happens right before the crash? You can mail it to me at stefan at eissing.org

Since the bug also happens with the 2.4.43 version, this points to a mistake in my handling of libcurl.

arminabf commented 2 years ago

Oh, I made a mistake building httpd 2.4.54 with mod_md from 2.4.53... I did it again and I have to revise my statement: indeed, with mod_md 2.4.53 the problem did not exist yet.

I'll send you next the logs file via mail...

whereisaaron commented 2 years ago

I'm using 2.4.54 but only with http-01 and not tls-alpn-01 challenges. I haven't seen this problem yet, is it likely limited to tls-alpn-01 challenges?

@arminabf thank you for the regression testing!

icing commented 2 years ago

@whereisaaron I am investigating the issue with @arminabf and so far, he is the only one seeing it. I proposed a fix in the current master that restores the old behaviour in handling curl instances. Maybe it is related to the curl version he uses. We'll see.

icing commented 2 years ago

@arminabf confirmed the issue to be fixed in the last patch which I now release as v2.4.19.

icing / mod_md

segmentation fault with httpd 2.4.54 #293