fv-mf opened this issue 2 years ago
Before assigning RHEL 8 the blame, can you do an apples-to-apples comparison of Python 2 vs Python 3 on the same OS release?
With cx_Oracle 8.3.0? Does that still work with python 2.7? (on pypi it says tested with 3.6+)
No. cx_Oracle 8.3 requires a minimum of Python 3.6. Can you clarify the comparison? I thought you were just running cx_Oracle 8.3 on Python 3.10 on both RHEL 7.2 and RHEL 8.5. Is that not the case? If not, please run the same code (whether Python 2 with cx_Oracle 7.3 or Python 3 with cx_Oracle 8.3) on both machines. Thanks!
RHEL 7.2 vs RHEL 8.5 was using the same python version (3.10) and cx_Oracle (8.3.0)
So that would suggest that the OS is the culprit here (or some change in the OS that interacts badly with threaded Oracle clients). I don't know what that might be, unfortunately! @cjbj might be able to help further, hopefully.
Are there any variances if you try different versions of the Oracle Client libraries? Are you using session pooling? If not, have you set the threaded parameter?
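For reference, a minimal sketch of both options (connection string, credentials and pool sizing are placeholders, not from this issue):

import cx_Oracle

# Standalone connection shared across threads: the threaded parameter makes
# the Oracle client protect its internal structures with a mutex.
conn = cx_Oracle.connect(user="scott", password="tiger",
                         dsn="dbhost/orclpdb1", threaded=True)

# Session pool: each thread acquires its own session from the pool.
pool = cx_Oracle.SessionPool(user="scott", password="tiger",
                             dsn="dbhost/orclpdb1",
                             min=2, max=10, increment=1, threaded=True)

def run_query(sql):
    with pool.acquire() as connection:      # released back to the pool on exit
        with connection.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchall()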
When I asked around, it was pointed out that neither of those OS versions has a supported Python 3.10 package and that you should verify your Python binaries are compiled properly for each distro; I wasn't given any tips on how to do that!
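One rough way to compare how each interpreter was built on the two distros is to dump the build configuration Python records about itself; differing compilers or flags would at least be a starting point:

import sys
import sysconfig

print(sys.version)
# CC, CFLAGS, OPT and CONFIG_ARGS capture the compiler, optimization flags
# and ./configure options used when this interpreter was built.
for var in ("CC", "CFLAGS", "OPT", "CONFIG_ARGS"):
    print(var, "=", sysconfig.get_config_var(var))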
We have investigated this further in the context of our application. The old version of our application runs on Python 2.7.10 / cxOracle 7.3.0. The new version runs on Python 3.10.0 / cxOracle 8.3.0. However, we found another major difference in that the old version runs the Python process with CPU affinity whereas the new version does not set CPU affinity.
After setting CPU affinity for the Python 3 process of the new version, we started to get performance similar to the old version (the Python 3 process is affined to a single CPU).
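For illustration, pinning the process to one core on Linux can be done externally with taskset -c 0 or from inside the process (a minimal, Linux-specific sketch):

import os

# 0 = the current process; {0} = run only on CPU core 0
os.sched_setaffinity(0, {0})
print("affinity:", os.sched_getaffinity(0))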
It seems that calls to cxOracle Cursor.close() are much faster with CPU affinity under heavy load (a large number of threads, e.g. 50 threads). Without CPU affinity these calls are really expensive, especially when the Cursor was used to fetch a large number of columns (e.g. 15-20 columns). Maybe this is something that can be mentioned in the cxOracle docs, as in our case the performance difference is quite dramatic.
For example, 50 threads that each execute 365 queries and close the Cursor after each query result in the following accumulated time spent in Cursor.close() per thread:
with CPU affinity: 0.0065s
without CPU affinity: 17.8674s
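Those figures could be gathered with something along these lines inside each worker thread (a sketch only, not our actual application code; the query and table are placeholders):

import time

def run_queries(conn, queries=365):
    close_total = 0.0
    for _ in range(queries):
        cursor = conn.cursor()
        cursor.execute("select * from wide_table")   # hypothetical 15-20 column table
        cursor.fetchall()
        start = time.perf_counter()
        cursor.close()
        close_total += time.perf_counter() - start
    return close_total   # accumulated time spent in Cursor.close() for this thread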
That's great - thanks for digging into it and for sharing the results. It jogs a distant memory of a discussion about thread stickiness during development of our Node.js API. We can certainly make a note about CPU affinity in our tuning doc. It also gives us an opportunity to do an internal analysis of the behavior.
Before my colleague @mariustantareanu discovered that affining the python3 application to a single CPU restores performance to what we see in python2, I looked deeply into this apparent performance degradation from cxo73/python2 to cxo83/python3.
My colleague recommended that I post these findings here in case they are of interest. I'll attach a copy of the "cx.py" test harness. Perhaps it might motivate code and/or documentation changes.
What I found was that there is a general degradation of throughput performance from python2 to python3 involving the following cofactors:
With the above cofactors in place, the key difference between python2 and python3 is the significant change to the GIL implementation from being a simple mutex to involving multiple mutexes with a goal of better "fairness".
This enhanced GIL implementation is described in the comments of "https://github.com/python/cpython/blob/v3.6.8/Python/ceval_gil.h".
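For context, the fairness comes from a configurable switch interval after which a waiting thread forces the holder to drop the GIL; it can be inspected and tuned at runtime. This is only mentioned to illustrate the mechanism, not as a fix:

import sys

print(sys.getswitchinterval())   # 0.005 seconds by default in python3
sys.setswitchinterval(0.01)      # let a thread keep the GIL longer before a forced switch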
In the specific case of cx_Oracle, a high amount of GIL chatter is produced during the closing of cxoCursor objects which eventually causes all of the allocated cxoVar objects to be freed.
Below is a typical stack trace of one of these cxoVar_free calls trying to regrab the GIL in python3:
Thread 18 (Thread 0x7f51fd7fa700 (LWP 249884)):
#0 0x00007f523bc4c66d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f523bc478f7 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2 0x00007f523bc4950c in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3 0x00007f523bf53c44 in take_gil () from /lib64/libpython3.6m.so.1.0
#4 0x00007f523bf53e7d in PyEval_RestoreThread () from /lib64/libpython3.6m.so.1.0
#5 0x00007f523a45a829 in cxoVar_free (var=0x7f5239d75490) at src/cxoVar.c:90
#6 0x00007f523bf45985 in list_dealloc () from /lib64/libpython3.6m.so.1.0
#7 0x00007f523a44a7d5 in _Py_DECREF (op=<optimized out>) at /opt/python/cp310-cp310/include/python3.10/object.h:500
#8 cxoCursor_close (cursor=0x7f5239d732d0, args=<optimized out>) at src/cxoCursor.c:546
#9 0x00007f523bff82e1 in call_function () from /lib64/libpython3.6m.so.1.0
And below is the part of the cxoVar_free function in play:
85 static void cxoVar_free(cxoVar *var)
86 {
87     if (var->handle) {
88         Py_BEGIN_ALLOW_THREADS
89         dpiVar_release(var->handle);
90         Py_END_ALLOW_THREADS
91         var->handle = NULL;
92     }
Normally we might look at something like this and say this is normal GIL contention behavior. But the contention is much worse under python3's new sophisticated and fairer GIL. Releasing the GIL when it isn't really necessary can be more costly in a multi-threaded high-load python3 context.
The specific table we noticed this issue on has ~70 columns, and as a result the cxoCursor object would have ~70 cxoVar objects allocated, which will then involve ~70 GIL release/regrab cycles per cursor close.
This performance degradation can also be reproduced with multiple threads looping around "time.sleep(0)" which will release/regrab the GIL without doing any real I/O.
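A standalone sketch of that reproduction, with no cx_Oracle involved (each time.sleep(0) releases and re-acquires the GIL):

import threading
import time

def spin(iterations=100000):
    for _ in range(iterations):
        time.sleep(0)   # releases and regrabs the GIL without doing any real work

threads = [threading.Thread(target=spin) for _ in range(20)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elapsed: %.2fs" % (time.perf_counter() - start))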
I further investigated whether this GIL release around dpiVar_release is really necessary. For example, it appears that the GIL is not released for the "cxoVar_externalSetValue -> dpiVar_release" code path.
Also, from a cursory review of the dpiVar_release implementation, no obvious signs of I/O or other lock negotiations were observed that might warrant releasing the GIL.
For testing, a binary patch of the cxo shared object was prepared that replaces the calls to release and grab the GIL around this call to dpiVar_release in cxoVar_free with NOPs.
Below are python commands to replace the GIL grab/release with NOPs for cxo73 and cxo83 respectively (the file offsets were obtained with "$ objdump -d" on each shared object):
$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo73nogil/cx_Oracle.so 2af30
$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo73nogil/cx_Oracle.so 2af44
$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo83/cx_Oracle.cpython-310-x86_64-linux-gnu.so 5b810
$ python3 -c 'import sys;f=open(sys.argv[1],"br+");f.seek(int(sys.argv[2],base=16));f.write(b"\x90"*5)' cxo83/cx_Oracle.cpython-310-x86_64-linux-gnu.so 5b824
This "nogil" version showed improved performance in both python2 and python3, but especially python3.
Below are the raw results from the test harness for all eight combinations of {py2|py3},{gil|nogil},{affined|not-affined} executing 500 iterations in each of 20 threads on a Linux VM with 8 CPU cores:
# py3
$ python3 cx.py -t 20 -r 500
pid:75216
Time Taken: 28.37s
# py3,nogil
$ python3 cx.py -n -t 20 -r 500
pid:75339
Time Taken: 5.83s
# py3,affinity
$ python3 cx.py -c 0 -t 20 -r 500
pid:75387
pid 75387's current affinity list: 0-7
pid 75387's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.91s
# py3,affinity,nogil
$ python3 cx.py -c 0 -n -t 20 -r 500
pid:75420
pid 75420's current affinity list: 0-7
pid 75420's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.91s
# py2
$ python2 cx.py -t 20 -r 500
pid:75505
Time Taken: 7.73s
# py2,nogil
$ python2 cx.py -n -t 20 -r 500
pid:75543
Time Taken: 5.61s
# py2,affinity
$ python2 cx.py -c 0 -t 20 -r 500
pid:75588
pid 75588's current affinity list: 0-7
pid 75588's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.97s
# py2,affinity,nogil
$ python2 cx.py -c 0 -n -t 20 -r 500
pid:75618
pid 75618's current affinity list: 0-7
pid 75618's new affinity list: 0
cpu affinity: 0 RC: 0
Time Taken: 2.89s
Thanks for the detailed explanation. Looking at the implementation of cxoVar_free(), the release of the GIL was intended to cover the possibility of a round trip occurring due to the call to dpiVar_release(). I believe that should only happen, however, if the connection is being released. If so, the fact that the variable holds a reference to the connection should mean that that scenario should never take place! And if that is in fact confirmed, then those lines can safely be removed. We will look into that further. Thanks!
@david-west-at-micro-focus @mariustantareanu you may be interested in https://blogs.oracle.com/linux/post/ol7-vs-ol8-phoronix-performance
After upgrading various software as part of a switch from python 2.7 to 3.X, we noticed that under heavy load our multi-threaded application had many more active connections and longer execution times.
Taking our application out of the equation, a simple python script (below) still shows a slowdown when going from single-threaded to multi-threaded.
The following are some rough timings on RHEL 8.5 with python 3.10 and cxOracle 8.3.0 when performing 100k queries on a single thread vs. distributed across 10 threads (10k per thread).
The table used here isn't anything special. It only has about 200 entries, so most of the queries will return None anyway.
If I run the script on a RHEL 7.2 server (and target the same DB) I get better results compared to RHEL 8.
Before this we were not seeing any problems with python 2.7 / cxOracle 7.3.0 under RHEL 6.X or 7.X.
If you need other data from our environment to help narrow things down, let me know (@mariustantareanu can also respond to questions if I am not available).
As I said, the table itself doesn't really matter. I used the following python script (with some strings for custom paths and user/passwords removed)
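A minimal sketch of the kind of benchmark in question (not the original script; credentials, DSN, table and column names are placeholders):

import argparse
import threading
import time
import cx_Oracle

def worker(pool, queries):
    conn = pool.acquire()
    cursor = conn.cursor()
    for i in range(queries):
        cursor.execute("select value from small_table where id = :id", id=i)
        cursor.fetchone()   # returns None when the id is not present
    cursor.close()
    pool.release(conn)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--threads", type=int, default=10)
    parser.add_argument("-r", "--requests", type=int, default=10000,
                        help="queries per thread")
    args = parser.parse_args()

    pool = cx_Oracle.SessionPool(user="scott", password="tiger",
                                 dsn="dbhost/orclpdb1",
                                 min=args.threads, max=args.threads,
                                 increment=0, threaded=True)
    threads = [threading.Thread(target=worker, args=(pool, args.requests))
               for _ in range(args.threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("Time Taken: %.2fs" % (time.perf_counter() - start))

if __name__ == "__main__":
    main()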