python / cpython

The Python programming language
https://www.python.org

SSL session content bleeds into stdout with lots of threads #118138

Open sterwill opened 5 months ago

sterwill commented 5 months ago

Bug report

Bug description:

I've been trying to track down a rare segfault I'm seeing with Python 3 (several versions) in AWS Lambda and AWS CodeBuild. It's been very hard to reproduce, but I have a few stack traces that show the crash happens in OpenSSL's certificate verification code below Python's ssl module. While trying to isolate the problem in OpenSSL, I stumbled on a different issue that might be related, which I'm reporting here.

The program below uses the standard library to execute a lot of HTTPS requests concurrently with threads. When run on my Linux desktop in gnome-terminal, the program doesn't exhibit any weird behavior: it prints a bunch of X characters for each request it completes. However, if I pipe its output through a program like less or tee, I see additional output: raw bytes from the SSL network session bleeding into stdout. It feels like OpenSSL is writing into memory that Python is using to prepare strings or print to stdout, due to a lack of synchronization or refcounting, but I don't know OpenSSL or Python internals well, so this is just a guess.

Running python3 test-program.py | tee /dev/null makes it happen every time on my workstation. It never crashes; it just bleeds SSL contents into the output.

It happens on Ubuntu 22.04 x86_64 with Python 3.12 and OpenSSL 3.0.2, and on AWS CloudShell with Python 3.9.16 and OpenSSL 3.0.8.

Feel free to run this program as often as you want with the URL I put in there (it's a small static web page on my personal site that won't mind a few thousand hits).

#!/usr/bin/env python3
import http.client
import logging
from concurrent.futures import ThreadPoolExecutor
from ssl import create_default_context, Purpose
from threading import Barrier

logger = logging.getLogger(__name__)

# Point this at your system's CA cert store.
CERT_FILE = '/tmp/cacert.pem'

# The problem reproduces reliably with 100 threads for me, and
# with more threads it happens even more times per execution.
#
# It also happens with a small number of threads (like 10) if
# you make do_one_request process lots of requests instead of one.
NUM_THREADS = 200

# The barrier is helpful for getting all the requests lined up
# to start together.  The problem still happens without a barrier
# if you run lots of requests in each thread instead of one.
b = Barrier(NUM_THREADS)

def do_one_request():
    try:
        # SSLContext is supposed to be thread-safe, but I allocate a new
        # one each time to rule out concurrent access to it being the problem.
        ssl_context = create_default_context(purpose=Purpose.SERVER_AUTH, cafile=CERT_FILE)
        b.wait()

        # Do a simple request and close the connection.  This URL is a small
        # static page behind AWS CloudFront owned by the bug reporter, so
        # don't feel bad about sending a few thousand requests there.  The
        # bug seems to happen with any valid HTTPS request.
        c = http.client.HTTPSConnection('tinfig.com', context=ssl_context)
        c.request('GET', 'https://tinfig.com/')
        _ = c.getresponse()
        c.close()

        # Print enough output so a pager or reading process might have to buffer.
        print(' ' + ('X' * 200))
    except Exception as e:
        logger.exception(e)

with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    for i in range(NUM_THREADS):
        executor.submit(do_one_request)
    executor.shutdown(wait=True)
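Not part of the original report, but for comparison: a minimal sketch (all names mine) that drives the same kind of threaded writes through a TextIOWrapper over an in-memory buffer, serialized with a Python-level lock. The assumption here is that holding a lock across each write keeps one thread's data from racing another thread's flush; it is a workaround sketch, not a fix for the underlying issue.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 20
LINES_PER_THREAD = 10

raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='ascii', line_buffering=True)
write_lock = threading.Lock()

def worker():
    for _ in range(LINES_PER_THREAD):
        # Holding the lock across the whole write keeps one thread's data
        # from interleaving with (or being lost to) another thread's flush.
        with write_lock:
            out.write('x' * 99 + '\n')

with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    for _ in range(NUM_THREADS):
        executor.submit(worker)

out.flush()
# With the lock, every byte written is accounted for in the buffer.
assert len(raw.getvalue()) == NUM_THREADS * LINES_PER_THREAD * 100
```

With the lock in place the byte count always matches; without it, on an affected build, some writes can go missing as described above.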

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux

Linked PRs

sterwill commented 5 months ago

I built Python 3.12.3 --with-pydebug, and now I get a crash before any corrupt output shows. Backtrace from gdb (happy to provide more if it would be helpful):

python: ./Modules/_io/textio.c:1620: _textiowrapper_writeflush: Assertion `pos == self->pending_bytes_count' failed.
[Thread 0x7ffd957ca640 (LWP 1712755) exited]
44  ./nptl/pthread_kill.c
Thread 14 "python" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffbffff640 (LWP 1712583)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736414611008) at ./nptl/pthread_kill.c:44
: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736414611008) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736414611008) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736414611008, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7c2871b in __assert_fail_base (fmt=0x7ffff7ddd130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555555a6f4c8 "pos == self->pending_bytes_count", file=0x555555a6fa7a "./Modules/_io/textio.c", line=1620, 
    function=<optimized out>) at ./assert/assert.c:92
#6  0x00007ffff7c39e96 in __GI___assert_fail (assertion=assertion@entry=0x555555a6f4c8 "pos == self->pending_bytes_count", file=file@entry=0x555555a6fa7a "./Modules/_io/textio.c", line=line@entry=1620, 
    function=function@entry=0x555555a6fdd0 <__PRETTY_FUNCTION__.9> "_textiowrapper_writeflush") at ./assert/assert.c:101
#7  0x00005555559639a2 in _textiowrapper_writeflush (self=self@entry=0x7ffff7402b70) at ./Modules/_io/textio.c:1620
#8  0x00005555559645da in _io_TextIOWrapper_write_impl (self=0x7ffff7402b70, text=<optimized out>) at ./Modules/_io/textio.c:1732
#9  0x0000555555964858 in _io_TextIOWrapper_write (self=<optimized out>, arg=<optimized out>) at ./Modules/_io/clinic/textio.c.h:680
#10 0x00005555557747a1 in cfunction_vectorcall_O (func=0x7ffff41dbe30, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/methodobject.c:509
#11 0x0000555555705e5d in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775809, args=0x7fffbfffe978, callable=0x7ffff41dbe30, tstate=0x555555f76b30) at ./Include/internal/pycore_call.h:92
#12 PyObject_CallOneArg (func=func@entry=0x7ffff41dbe30, arg=arg@entry=0x7ffff752b9a0) at Objects/call.c:401
#13 0x000055555572d77b in PyFile_WriteObject (v=<optimized out>, f=f@entry=0x7ffff7402b70, flags=flags@entry=1) at Objects/fileobject.c:132
#14 0x000055555584e5ba in builtin_print_impl (module=module@entry=0x7ffff75966f0, args=args@entry=0x7ffff6fc3e30, sep=0x0, sep@entry=0x555555bdcac0 <_Py_NoneStruct>, end=0x0, end@entry=0x555555bdcac0 <_Py_NoneStruct>, file=0x7ffff7402b70, 
    file@entry=0x555555bdcac0 <_Py_NoneStruct>, flush=0) at Python/bltinmodule.c:2057
#15 0x000055555584e95e in builtin_print (module=0x7ffff75966f0, args=0x7fffbfffea60, args@entry=0x7ffff6ea0328, nargs=nargs@entry=1, kwnames=kwnames@entry=0x0) at Python/clinic/bltinmodule.c.h:1121
#16 0x0000555555870a25 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555f76b30, frame=0x7ffff6ea02b0, throwflag=throwflag@entry=0) at Python/bytecodes.c:2966
#17 0x000055555587581d in _PyEval_EvalFrame (throwflag=0, frame=<optimized out>, tstate=0x555555f76b30) at ./Include/internal/pycore_ceval.h:89
#18 _PyEval_Vector (tstate=0x555555f76b30, func=0x7ffff6f33e90, locals=locals@entry=0x0, args=0x7fffbfffece8, argcount=1, kwnames=0x0) at Python/ceval.c:1683
#19 0x0000555555705497 in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:419
#20 0x000055555570a17b in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0x7fffbfffece8, callable=0x7ffff6f33e90, tstate=0x555555f76b30) at ./Include/internal/pycore_call.h:92
#21 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:69
#22 0x000055555570811a in _PyVectorcall_Call (tstate=tstate@entry=0x555555f76b30, func=0x555555709e84 <method_vectorcall>, callable=callable@entry=0x7ffff6fa8470, tuple=tuple@entry=0x555555cd9268 <_PyRuntime+92264>, kwargs=kwargs@entry=0x0)
    at Objects/call.c:271
#23 0x00005555557084e3 in _PyObject_Call (tstate=0x555555f76b30, callable=0x7ffff6fa8470, args=0x555555cd9268 <_PyRuntime+92264>, kwargs=0x0) at Objects/call.c:354
#24 0x0000555555708540 in PyObject_Call (callable=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at Objects/call.c:379
#25 0x0000555555992dcf in thread_run (boot_raw=boot_raw@entry=0x555555f23a50) at ./Modules/_threadmodule.c:1114
#26 0x0000555555904443 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:233
#27 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#28 0x00007ffff7d26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
elistevens commented 4 months ago

We are seeing a similar issue (python: external/python3.10/Modules/_io/textio.c:1597: _textiowrapper_writeflush: Assertion `pos == self->pending_bytes_count' failed. in an in-house debug build of Python).

No repro for it from our side, since it's happening intermittently in a fairly complicated Ray program, so we haven't tried to minimize the issue. Just wanted to note that users unaffiliated with @sterwill are seeing the issue as well.

elistevens commented 4 months ago

https://github.com/python/cpython/blob/3.10/Modules/_io/textio.c#L1584 calls PyUnicode_DATA and PyUnicode_GET_LENGTH. https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_GET_LENGTH says "Return the length of the Unicode string, in code points. unicode has to be a Unicode object in the “canonical” representation (not checked)."

Any idea what the canonical representation is, and/or how to check it?

Applying the following patch to python 3.10.12

diff --git a/Modules/_io/textio.c b/Modules/_io/textio.c
index e1199a5691..65d0628c73 100644
--- a/Modules/_io/textio.c
+++ b/Modules/_io/textio.c
@@ -16,6 +16,8 @@
 #include "structmember.h"         // PyMemberDef
 #include "_iomodule.h"

+#include <stdio.h>
+
 /*[clinic input]
 module _io
 class _io.IncrementalNewlineDecoder "nldecoder_object *" "&PyIncrementalNewlineDecoder_Type"
@@ -1594,6 +1596,37 @@ _textiowrapper_writeflush(textio *self)
             memcpy(buf + pos, src, len);
             pos += len;
         }
+        if (pos != self->pending_bytes_count) {
+            fprintf(stderr, "_textiowrapper_writeflush %p pos %li != self->pending_bytes_count %li\n", buf, pos, self->pending_bytes_count);
+            fprintf(stderr, "_textiowrapper_writeflush %p PyList_GET_SIZE(pending) %li\n", buf, PyList_GET_SIZE(pending));
+            fflush(stderr);
+
+            Py_ssize_t pos_ = 0;
+
+            for (Py_ssize_t i = 0; i < PyList_GET_SIZE(pending); i++) {
+                PyObject *obj = PyList_GET_ITEM(pending, i);
+                char *src;
+                Py_ssize_t len;
+                if (PyUnicode_Check(obj)) {
+                    assert(PyUnicode_IS_ASCII(obj));
+                    assert(PyUnicode_IS_COMPACT_ASCII(obj));
+                    // src = PyUnicode_DATA(obj);
+                    len = PyUnicode_GET_LENGTH(obj);
+                    fprintf(stderr, "_textiowrapper_writeflush %p i %li, PyUnicode_Check, pos %li, len %li\n", buf, i, pos_, len);
+                }
+                else {
+                    assert(PyBytes_Check(obj));
+                    if (PyBytes_AsStringAndSize(obj, &src, &len) < 0) {
+                        Py_DECREF(b);
+                        return -1;
+                    }
+                    fprintf(stderr, "_textiowrapper_writeflush %p i %li, PyBytes_Check, pos %li, len %li\n", buf, i, pos_, len);
+                }
+                pos_ += len;
+            }
+            fprintf(stderr, "_textiowrapper_writeflush %p pos %li %li != self->pending_bytes_count %li\n", buf, pos, pos_, self->pending_bytes_count);
+            fflush(stderr);
+        }
         assert(pos == self->pending_bytes_count);
     }

Produces this output for a failing case:

_textiowrapper_writeflush 0x856bb50 pos 1370 != self->pending_bytes_count 2124
_textiowrapper_writeflush 0x856bb50 PyList_GET_SIZE(pending) 12
_textiowrapper_writeflush 0x856bb50 i 0, PyBytes_Check, pos 0, len 869
_textiowrapper_writeflush 0x856bb50 i 1, PyUnicode_Check, pos 869, len 1
_textiowrapper_writeflush 0x856bb50 i 2, PyUnicode_Check, pos 870, len 280
_textiowrapper_writeflush 0x856bb50 i 3, PyUnicode_Check, pos 1150, len 1
_textiowrapper_writeflush 0x856bb50 i 4, PyBytes_Check, pos 1151, len 130
_textiowrapper_writeflush 0x856bb50 i 5, PyUnicode_Check, pos 1281, len 1
_textiowrapper_writeflush 0x856bb50 i 6, PyUnicode_Check, pos 1282, len 0
_textiowrapper_writeflush 0x856bb50 i 7, PyUnicode_Check, pos 1282, len 1
_textiowrapper_writeflush 0x856bb50 i 8, PyUnicode_Check, pos 1283, len 85
_textiowrapper_writeflush 0x856bb50 i 9, PyUnicode_Check, pos 1368, len 1
_textiowrapper_writeflush 0x856bb50 i 10, PyUnicode_Check, pos 1369, len 0
_textiowrapper_writeflush 0x856bb50 i 11, PyUnicode_Check, pos 1369, len 1
_textiowrapper_writeflush 0x856bb50 pos 1370 1370 != self->pending_bytes_count 2124
python: external/python3.10/Modules/_io/textio.c:1630: _textiowrapper_writeflush: Assertion `pos == self->pending_bytes_count' failed.
*** SIGABRT received at time=1715870177 on cpu 6 ***
PC: @     0x7f5e2dfaf7bf  (unknown)  pthread_kill@@GLIBC_2.34
    @     0x7f5e2df57fd0  398712080  (unknown)
    @     0x7f5e2df57f2a         32  gsignal
    @     0x7f5e2df59090        192  abort
    @     0x7f5e2df494ae         80  __assert_fail_base
    @     0x7f5e2df49523         48  __assert_fail
    @           0x5affec        144  _textiowrapper_writeflush
    @           0x5b0531         64  _io_TextIOWrapper_flush
    @           0x6a8f0c         80  method_vectorcall_NOARGS
    @           0x453541         96  PyObject_VectorcallMethod
    @           0x548088         64  flush_io
    @           0x5490a9        112  _PyRun_SimpleFileObject
    @           0x549595         64  _PyRun_AnyFileObject
    @           0x441e0b        240  Py_RunMain
    @           0x4424bb         48  Py_BytesMain
    @     0x7f5e2df3a607        160  __libc_start_call_main
    @     0x7f5e2df3a6bd         96  __libc_start_main@GLIBC_2.2.5
    @           0x4410e1  (unknown)  _start

The length 0 entries seem suspicious to me.

While anecdotal, the issue seems to happen more often for me when I enable TQDM-style progress bars. I'm not sure if that's A) true, B) related to the code paths used like https://github.com/python/cpython/blob/3.10/Modules/_io/textio.c#L1663 , or C) just due to there being a lot more output when I do so.

elistevens commented 4 months ago

FYI I am able to reproduce the problem reliably with the following reduced script:

import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)

NUM_THREADS = 100

def do_one_request():
    for i in range(10):
        p = 'x' * 99
        print(p)

with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    for i in range(NUM_THREADS):
        executor.submit(do_one_request)
    executor.shutdown(wait=True)

I've added slightly more debugging output to the above patch, and am now getting output like:

_textiowrapper_writeflush 0x7fc1b8000e10 i 138, PyUnicode_Check, pos 6900, len 99
_textiowrapper_writeflush 0x7fc1b8000e10 i 138, src xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
_textiowrapper_writeflush 0x7fc1b8000e10 i 139, PyUnicode_Check, pos 6999, len 1
_textiowrapper_writeflush 0x7fc1b8000e10 i 139, src ;
_textiowrapper_writeflush 0x7fc1b8000e10 i 140, PyUnicode_Check, pos 7000, len 99
_textiowrapper_writeflush 0x7fc1b8000e10 i 140, src xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
_textiowrapper_writeflush 0x7fc1b8000e10 i 141, PyUnicode_Check, pos 7099, len 1
_textiowrapper_writeflush 0x7fc1b8000e10 i 141, src ;
_textiowrapper_writeflush 0x7fc1b8000e10 pos 7100 7100 != self->pending_bytes_count 8100
python: external/python3.10/Modules/_io/textio.c:1646: _textiowrapper_writeflush: Assertion `pos == self->pending_bytes_count' failed.

142 entries / 2 writes per print * (99 'x' + 1 '\n') == 7100, so I think that the zero-length entry I mentioned above probably isn't relevant. Why it's expecting 8100 bytes and not 100k isn't clear to me.
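That arithmetic can be checked directly (a sketch; the 2-writes-per-print figure assumes print() issues one write for the text and a second for end='\n', which matches the pairs of entries in the debug output above):

```python
entries = 142                 # pending-list entries seen before the abort
writes_per_print = 2          # one write for the text, one for the '\n' end
bytes_per_print = 99 + 1      # 99 'x' characters plus the trailing newline

prints_flushed = entries // writes_per_print
assert prints_flushed == 71
assert prints_flushed * bytes_per_print == 7100
```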

Interestingly, it always comes up as 7100 vs. 8100. Without the barrier it varied more, but the few times I ran it, the outcome was only ever no error, 800 vs. 8100, or 7100 vs. 8100.

My guess at this point is that there's a threading issue with appending to and clearing the list, but I'm getting out of my depth.

elistevens commented 4 months ago

I have increased confidence that it's a threading issue now. With this patch:

diff --git a/Modules/_io/textio.c b/Modules/_io/textio.c
index daa0b2e168..9bf1e96b87 100644
--- a/Modules/_io/textio.c
+++ b/Modules/_io/textio.c
@@ -16,6 +16,8 @@
 #include "structmember.h"         // PyMemberDef
 #include "_iomodule.h"

+#include <stdio.h>
+
 /*[clinic input]
 module _io
 class _io.IncrementalNewlineDecoder "nldecoder_object *" "&PyIncrementalNewlineDecoder_Type"
@@ -1780,6 +1782,24 @@ _io_TextIOWrapper_write_impl(textio *self, PyObject *text)
             Py_DECREF(b);
             return NULL;
         }
+        if (self->pending_bytes_count != 0) {
+            fprintf(stderr, "_io_TextIOWrapper_write_impl self->pending_bytes_count: %li\n", self->pending_bytes_count);
+
+            if (self->pending_bytes == NULL) {
+                fprintf(stderr, "_io_TextIOWrapper_write_impl self->pending_bytes: NULL\n");
+            }
+            else if (PyBytes_Check(self->pending_bytes)) {
+                fprintf(stderr, "_io_TextIOWrapper_write_impl self->pending_bytes: bytes %li\n", PyBytes_GET_SIZE(self->pending_bytes));
+            }
+            else if (PyUnicode_Check(self->pending_bytes)) {
+                fprintf(stderr, "_io_TextIOWrapper_write_impl self->pending_bytes: unicode %li\n", PyUnicode_GET_LENGTH(self->pending_bytes));
+            }
+            else if (PyList_Check(self->pending_bytes)) {
+                fprintf(stderr, "_io_TextIOWrapper_write_impl self->pending_bytes: list %li\n", PyList_GET_SIZE(self->pending_bytes));
+            }
+            fflush(stderr);
+        }
+        assert(self->pending_bytes == NULL);
         self->pending_bytes = b;
     }
     else if (!PyList_CheckExact(self->pending_bytes)) {

The above script produces this on stderr:

_io_TextIOWrapper_write_impl self->pending_bytes_count: 1000
_io_TextIOWrapper_write_impl self->pending_bytes: list 20
python: external/python3.10/Modules/_io/textio.c:1732: _io_TextIOWrapper_write_impl: Assertion `self->pending_bytes == NULL' failed.

Almost certainly some other thread is coming in and adding some list entries which would then get overwritten by that self->pending_bytes = b; line.

I have also confirmed that the stdout from the script is missing some expected output (I changed the "xxxx" to be the i, j of the thread and inner loop, and some of those go missing).
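The suspected interleaving can be replayed deterministically. The sketch below is a single-threaded Python model, not CPython itself: the field names mirror the C struct, the "threads" are just code run in a fixed order, and the assumption (per the backtrace) is that the GIL can be released around the buffered write inside writeflush, letting thread B run between thread A's flush and its unconditional `self->pending_bytes = b` store.

```python
class TextIOState:
    """Model of the two racing fields in the C textio struct."""
    def __init__(self):
        self.pending_bytes = None       # None, a bytes object, or a list
        self.pending_bytes_count = 0

def append_write(state, b):
    # The list branch of _io_TextIOWrapper_write_impl.
    if state.pending_bytes is None:
        state.pending_bytes = [b]
    else:
        state.pending_bytes.append(b)
    state.pending_bytes_count += len(b)

def flush(state):
    # _textiowrapper_writeflush: take everything, reset both fields.
    state.pending_bytes = None
    state.pending_bytes_count = 0

state = TextIOState()

# Thread A hits the over-chunk-size branch and flushes...
flush(state)

# ...but before A stores its own object, thread B gets in and appends
# 20 writes, rebuilding the list and pushing the count to 1000.
for _ in range(20):
    append_write(state, b'x' * 50)

# Thread A resumes and unconditionally stores its bytes object,
# clobbering B's list while B's bytes are still counted.
a_bytes = b'y' * 100
state.pending_bytes = a_bytes
state.pending_bytes_count += len(a_bytes)

# The fields now disagree, which is what the next flush's assert trips on.
stored = (len(state.pending_bytes)
          if isinstance(state.pending_bytes, bytes)
          else sum(len(x) for x in state.pending_bytes))
assert state.pending_bytes_count == 1100
assert stored == 100
```

The end state (count larger than the bytes actually pending, B's entries gone) matches both the failed `pending_bytes == NULL` assertion and the missing stdout lines observed above.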

elistevens commented 4 months ago

This patch fixes the issue for me on python 3.10.12, but not on main (3.14.0a0).

diff --git a/Modules/_io/textio.c b/Modules/_io/textio.c
index e1199a5691..1d1f2d050d 100644
--- a/Modules/_io/textio.c
+++ b/Modules/_io/textio.c
@@ -1551,31 +1551,21 @@ _textiowrapper_writeflush(textio *self)

     PyObject *pending = self->pending_bytes;
     PyObject *b;
+    assert(PyList_Check(pending));

-    if (PyBytes_Check(pending)) {
-        b = pending;
-        Py_INCREF(b);
-    }
-    else if (PyUnicode_Check(pending)) {
-        assert(PyUnicode_IS_ASCII(pending));
-        assert(PyUnicode_GET_LENGTH(pending) == self->pending_bytes_count);
-        b = PyBytes_FromStringAndSize(
-                PyUnicode_DATA(pending), PyUnicode_GET_LENGTH(pending));
-        if (b == NULL) {
-            return -1;
-        }
+    // We get pending size first
+    Py_ssize_t pending_size_at_start = PyList_GET_SIZE(pending);
+    // now we use the value in _count so that b is always >= size of bytes in pending[0:size]
+    b = PyBytes_FromStringAndSize(NULL, self->pending_bytes_count);
+    if (b == NULL) {
+        return -1;
     }
-    else {
-        assert(PyList_Check(pending));
-        b = PyBytes_FromStringAndSize(NULL, self->pending_bytes_count);
-        if (b == NULL) {
-            return -1;
-        }

+    if (pending_size_at_start > 0) {
         char *buf = PyBytes_AsString(b);
         Py_ssize_t pos = 0;

-        for (Py_ssize_t i = 0; i < PyList_GET_SIZE(pending); i++) {
+        for (Py_ssize_t i = 0; i < pending_size_at_start; i++) {
             PyObject *obj = PyList_GET_ITEM(pending, i);
             char *src;
             Py_ssize_t len;
@@ -1594,13 +1584,15 @@ _textiowrapper_writeflush(textio *self)
             memcpy(buf + pos, src, len);
             pos += len;
         }
-        assert(pos == self->pending_bytes_count);
+        // decrement after list delete so that _count is always >= the size of bytes in pending
+        if (PyList_SetSlice(pending, 0, pending_size_at_start, NULL) < 0) {
+            Py_DECREF(b);
+            return -1;
+        }
+        self->pending_bytes_count -= pos;  // this needs to be atomic, and I don't think it is by default
+        assert(self->pending_bytes_count >= 0);
     }

-    self->pending_bytes_count = 0;
-    self->pending_bytes = NULL;
-    Py_DECREF(pending);
-
     PyObject *ret;
     do {
         ret = PyObject_CallMethodOneArg(self->buffer, _PyIO_str_write, b);
@@ -1702,35 +1694,28 @@ _io_TextIOWrapper_write_impl(textio *self, PyObject *text)

     if (self->pending_bytes == NULL) {
         self->pending_bytes_count = 0;
-        self->pending_bytes = b;
-    }
-    else if (self->pending_bytes_count + bytes_len > self->chunk_size) {
-        // Prevent to concatenate more than chunk_size data.
-        if (_textiowrapper_writeflush(self) < 0) {
+        self->pending_bytes = PyList_New(0);
+        if (self->pending_bytes == NULL) {
             Py_DECREF(b);
             return NULL;
         }
-        self->pending_bytes = b;
     }
-    else if (!PyList_CheckExact(self->pending_bytes)) {
-        PyObject *list = PyList_New(2);
-        if (list == NULL) {
+    else if (self->pending_bytes_count + bytes_len > self->chunk_size) {
+        // Prevent to concatenate more than chunk_size data.
+        if (_textiowrapper_writeflush(self) < 0) {
             Py_DECREF(b);
             return NULL;
         }
-        PyList_SET_ITEM(list, 0, self->pending_bytes);
-        PyList_SET_ITEM(list, 1, b);
-        self->pending_bytes = list;
     }
-    else {
-        if (PyList_Append(self->pending_bytes, b) < 0) {
-            Py_DECREF(b);
-            return NULL;
-        }
+
+    // increment before append so that _count is always >= the size of bytes in pending
+    self->pending_bytes_count += bytes_len;  // this needs to be atomic, and I don't think it is by default
+    if (PyList_Append(self->pending_bytes, b) < 0) {
         Py_DECREF(b);
+        return NULL;
     }
+    Py_DECREF(b);

-    self->pending_bytes_count += bytes_len;
     if (self->pending_bytes_count >= self->chunk_size || needflush ||
         text_needflush) {
         if (_textiowrapper_writeflush(self) < 0)

This is probably an awful way to fix the issue.
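For what it's worth, the invariant the patch is reaching for can be modeled in a few lines of Python (a sketch, not the C code; names mine): keep pending_bytes_count always at least as large as the bytes actually in the list, by incrementing before the append and decrementing only after the flushed slice is deleted, and have flush operate on a snapshot of the list size so entries appended mid-flush survive.

```python
class State:
    def __init__(self):
        self.pending = []   # always a list in this scheme
        self.count = 0

def write(state, data):
    state.count += len(data)    # increment BEFORE the append
    state.pending.append(data)

def flush(state):
    n = len(state.pending)      # snapshot the list size first
    chunk = b''.join(state.pending[:n])
    del state.pending[:n]       # entries appended after the snapshot survive
    state.count -= len(chunk)   # decrement AFTER the delete
    assert state.count >= 0     # the invariant the patch comments describe
    return chunk

s = State()
write(s, b'x' * 99)
write(s, b'\n')
assert flush(s) == b'x' * 99 + b'\n'
assert s.count == 0 and s.pending == []
```

Even in this model the count update would still need to be atomic (or lock-protected) under free-running threads, as the patch's own comments note.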