python / cpython

The Python programming language
https://www.python.org

Convoy effect with I/O bound threads and New GIL #52194

Closed 0d6a79a7-4e74-41c2-8eed-8d04ea475e57 closed 3 years ago

0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago
BPO 7946
Nosy @gvanrossum, @loewis, @arigo, @gpshead, @jcea, @mdickinson, @ncoghlan, @pitrou, @scoder, @larryhastings, @ericvsmith, @tarekziade, @carljm, @coderanger, @phsilva, @merwok, @alex, @jab, @briancurtin, @florentx, @cool-RR, @dimaqq, @thedrow, @corona10, @Sophist-UK
Files
  • udp-ioclient.py
  • udp-iotest.py
  • gilprio2.patch
  • prioritygil.tar.gz: Priority GIL implementation from PyCON
  • linux-7946.patch: Linux (libc) patch for py32
  • gilinter2.patch
  • writes.py
  • schedtest.py: Scheduling test
  • dabeaz_gil.patch: Dave's GIL patch
  • nir-ccbench-xp32.log
  • nir-ccbench-linux.log
  • ccbench-osx.log
  • bfs.patch: Updated implementation of the BFS scheduler.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    gpshead commented 14 years ago

    Nice dabeaz.

    One potential concern with "dabeaz_gil.patch 2010-04-25 21:13" is that it appears to always leave the gil_monitor thread running. This is bad on mobile/embedded platforms where waking up at regular intervals prevents advanced sleep states and wastes power/battery. (practical example: the OLPC project has run into this issue in other code in the past)

    Could this be modified so that gil_monitor stops looping (blocks) so long as there are only IO bound Python threads running or while no python thread owns the GIL?

    In that situation a multithreaded python process that has either reverted to one thread or has all threads blocked in IO would be truly idle rather than leaving the gil_monitor polling.
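
    As a rough Python-level illustration of that suggestion (the real gil_monitor in dabeaz_gil.patch is C code inside the interpreter; all names below, such as MonitoredLock and drop_request, are invented for this sketch): a watchdog thread that blocks on a condition variable whenever nothing holds the lock, so an otherwise idle process never polls.

        # Hypothetical Python-level illustration only -- the real gil_monitor in
        # dabeaz_gil.patch is C code inside the interpreter; names here are made up.
        import threading
        import time

        class MonitoredLock:
            """A lock whose monitor thread polls only while the lock is held."""

            def __init__(self, interval=0.005):          # 5 ms, like the patch's tick
                self.interval = interval
                self._lock = threading.Lock()
                self._cond = threading.Condition()
                self._holder = None                      # thread currently holding the lock
                self.drop_request = False                # monitor asks the holder to yield
                threading.Thread(target=self._monitor, daemon=True).start()

            def acquire(self):
                self._lock.acquire()
                with self._cond:
                    self._holder = threading.current_thread()
                    self.drop_request = False
                    self._cond.notify()                  # wake the monitor: something to watch

            def release(self):
                with self._cond:
                    self._holder = None
                self._lock.release()

            def _monitor(self):
                while True:
                    with self._cond:
                        while self._holder is None:      # idle: block, no periodic wakeups
                            self._cond.wait()
                    time.sleep(self.interval)            # poll only while a holder exists
                    with self._cond:
                        if self._holder is not None:
                            self.drop_request = True     # a long-running holder should yield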

    pitrou commented 14 years ago

    Dave,

    In the current implementation, threads perform a timed-wait on a condition variable. If time expires and no thread switches have occurred, the currently running thread is forced to drop the GIL.

    A problem, as far as I can see, is that these timeout sleeps run periodically, regardless of the actual times at which thread switching takes place. I'm not sure it's really an issue but it's a bit of a departure from the "ideal" behaviour of the switching interval.

    A new attribute 'cpu_bound' is added to the PyThreadState structure. If a thread is ever forced to drop the GIL, this attribute is simply set True (1). If a thread gives up the GIL voluntarily, it is set back to False (0). This attribute is used to set up simple scheduling (described next).

    Ok, so it's not very different, at least in principle, from what gilinter.patch does, right? (and actually, the benchmark results look very similar)

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    Greg,

    I like the idea of the monitor suspending if no thread owns the GIL. Let me work on that. Good point on embedded systems.

    Antoine,

    Yes, the gil monitor is completely independent and simply ticks along every 5 ms. A worst case scenario is that an I/O bound thread is scheduled shortly after the 5ms tick and then becomes CPU-bound afterwards. In that case, the monitor might let it run up to about 10ms before switching it. Hard to say if it's a real problem though---the normal timeslice on many systems is 10 ms so it doesn't seem out of line.

    As for the priority part, this patch should have similar behavior to the gilinter patch except for very subtle differences in thread scheduling due to the use of the GIL monitor. For instance, since threads never time out on the condition variable anymore, they tend to cycle execution in a purely round-robin fashion.

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    I've updated the GIL patch to reflect concerns about the monitor thread running forever. This version has a suspension mechanism where the monitor goes to sleep if nothing is going on for a while. It gets resumed if threads try to acquire the GIL but time out for some reason.

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    I've also attached a new file schedtest.py that illustrates a subtle difference between having the GIL monitor thread and not having the monitor.

    Without the monitor, every thread is responsible for its own scheduling. If you have a lot of threads running, you may have a lot of threads all performing a timed wait and then waking up only to find that the GIL is locked and that they have to go back to waiting. One side effect is that certain threads have a tendency to starve.

    For example, if you run the schedtest.py with the original GIL, you get a trace where three CPU-bound threads run like this:

    Thread-3 16632
    Thread-2 16517
    Thread-1 31669
    Thread-2 16610
    Thread-1 16256
    Thread-2 16445
    Thread-1 16643
    Thread-2 16331
    Thread-1 16494
    Thread-3 16399
    Thread-1 17090
    Thread-1 20860
    Thread-3 16306
    Thread-1 19684
    Thread-3 16258
    Thread-1 16669
    Thread-3 16515
    Thread-1 16381
    Thread-3 16600
    Thread-1 16477
    Thread-3 16507
    Thread-1 16740
    Thread-3 16626
    Thread-1 16564
    Thread-3 15954
    Thread-2 16727
    ...

    You will observe that Threads 1 and 2 alternate, but Thread 3 starves. Then at some point, Threads 1 and 3 alternate, but Thread 2 starves.

    By having a separate GIL monitor, threads are no longer responsible for making scheduling decisions concerning timeouts. Instead, the monitor is what times out and yanks threads off the GIL. If you run the same test with the GIL monitor, you get scheduling like this:

    Thread-1 33278
    Thread-2 32278
    Thread-3 31981
    Thread-1 33760
    Thread-2 32385
    Thread-3 32019
    Thread-1 32700
    Thread-2 32085
    Thread-3 32248
    Thread-1 31630
    Thread-2 32200
    Thread-3 32054
    Thread-1 32721
    Thread-2 32659
    Thread-3 34150

    Threads nicely cycle round-robin. There also appears to be about half as much thread switching (for reasons I don't quite understand).
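
    To make the effect easy to reproduce without the attachment, here is a minimal script in the spirit of schedtest.py (a hedged sketch, not the attached file): a few pure-CPU threads record which thread completes each chunk of work, and consecutive chunks are collapsed into "turns" like the traces above.

        # Minimal sketch in the spirit of schedtest.py (not the actual attachment).
        import threading
        from itertools import groupby

        INNER = 50_000                   # iterations of pure-Python work per chunk
        trace = []                       # one entry per completed chunk, tagged with the thread name

        def cpu_task(name, chunks=200):
            for _ in range(chunks):
                x = 0
                for i in range(INNER):   # CPU-bound loop; the GIL decides who runs when
                    x += i
                trace.append(name)       # list.append is atomic under the GIL

        threads = [threading.Thread(target=cpu_task, args=(f"Thread-{i}",)) for i in (1, 2, 3)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Collapse consecutive chunks from the same thread into "turns", roughly
        # analogous to the Thread-N traces shown above.
        for name, run in groupby(trace):
            print(name, sum(1 for _ in run) * INNER)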

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    Dave, there seems to be a bug in your patch on Windows XP. It crashes in ccbench.py with the following output:

    python_d.exe y:\ccbench.py
    == CPython 3.2a0.0 (py3k) ==
    == x86 Windows on 'x86 Family 6 Model 23 Stepping 10, GenuineIntel' ==

    --- Throughput ---

    Pi calculation (Python)

    threads= 1:   840 iterations/s. balance
    Fatal Python error: ReleaseMutex(mon_mutex) failed
    threads= 2:   704 ( 83%)        0.8167
    threads= 3:   840 (100%)        1.6706
    threads= 4:   840 (100%)        2.0000

    and the following stack trace:

    ntdll.dll!7c90120e()    
    [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] 
    python32_d.dll!Py_FatalError(const char * msg)  Line 2033   C

    python32_d.dll!gil_monitor(void *arg)  Line 314 + 0x24 bytes   C
    python32_d.dll!bootstrap(void *call)  Line 122 + 0x7 bytes     C
    msvcr100d.dll!_callthreadstartex()  Line 314 + 0xf bytes       C
    msvcr100d.dll!_threadstartex(void * ptd)  Line 297             C
    kernel32.dll!7c80b729()

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    New version of patch that will probably fix Windows-XP problems. Was doing something stupid in the monitor (not sure how it worked on Unix).

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 14 years ago

    @dabeaz I'm getting random segfaults with your patch (even with the last one), pretty much everywhere malloc or free is called. After skimming through the code, I think the problem is due to gil_last_holder: in drop_gil and take_gil, you dereference gil_last_holder->cpu_bound, but it might very well happen that gil_last_holder points to a thread that has been deleted (through tstate_delete_common). Dereferencing is not as risky, because there's a high chance that the address is still valid, but in drop_gil, you do this:

        /* Mark the thread as CPU-bound or not depending on whether it was forced off */
        gil_last_holder->cpu_bound = gil_drop_request;

    Here, if the thread has been deleted in the meantime, you end up writing to a random location on the heap, and probably corrupting malloc administration data, which would explain why I get segfaults sometimes later on unrelated malloc() or free() calls. I looked at it really quickly though, so please forgive me if I missed something obvious ;-)

    @nirai I have some more remarks on your patch:

    I'm not sure I understand. You can have problems with multiple cores when reading the TSC register directly, but that doesn't affect gettimeofday. gettimeofday should be reliable and accurate (unless the OS is broken, of course); the only issue is that since it's wall-clock time, if a process like ntpd is running, then you'll run into problems.

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    Added extra pointer check to avoid possible segfault.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 14 years ago

    I don't see segfaults anymore, but there's still an unsafe dereference of gil_last_holder inside take_gil:

        /* Wait on the appropriate GIL depending on thread's classification */
        if (!tstate->cpu_bound) {
          /* We are I/O bound.  If the current thread is CPU-bound, force it off now! */
          if (gil_last_holder->cpu_bound) {
            SET_GIL_DROP_REQUEST();
          }

    You're still accessing a location that may have been free()'d previously: while it will work most of the time (that's why I said it's not as risky), if the page gets unmapped between the time the current thread is deleted and the next thread takes over, you'll get a segfault. And that's undefined behaviour anyway ;-)

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    That second access of gil_last_holder->cpu_bound is safe because that block of code is never entered unless some other thread currently holds the GIL. If a thread holds the GIL, then gil_last_holder is guaranteed to have a valid value.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 14 years ago

    Didn't have much sleep last night, so please forgive me if I say something stupid, but:

    Python/pystate.c:
    void
    PyThreadState_DeleteCurrent()
    {
        PyThreadState *tstate = _PyThreadState_Current;
        if (tstate == NULL)
            Py_FatalError(
                "PyThreadState_DeleteCurrent: no current tstate");
        _PyThreadState_Current = NULL;
        tstate_delete_common(tstate);
        if (autoTLSkey && PyThread_get_key_value(autoTLSkey) == tstate)
            PyThread_delete_key_value(autoTLSkey);
        PyEval_ReleaseLock();
    }

    the current tstate is deleted and freed before releasing the GIL, so if another thread calls take_gil after the current thread has called tstate_delete_common but before it calls PyEval_ReleaseLock (which calls drop_gil and sets gil_locked to 0), then it will enter this section and dereference gil_last_holder. I just checked with valgrind, and it also reports an illegal dereference at this precise line.

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    I stand corrected. However, I'm going to have to think of a completely different approach for carrying out that functionality as I don't know how the take_gil() function is able to determine whether gil_last_holder has been deleted or not. Will think about it and post an updated patch later.

    Do you have any examples or insight you can provide about how these segfaults have shown up in Python code? I'm not able to observe any such behavior on OS-X or Linux. Is this happening while running the ccbench program? Some other program?

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 14 years ago

    Do you have any examples or insight you can provide about how these segfaults have shown up in Python code? I'm not able to observe any such behavior on OS-X or Linux. Is this happening while running the ccbench program? Some other program?

    If you're talking about the first issue (segfaults due to writing to gil_last_holder->cpu_bound), it was occurring quite often during ccbench (pretty much anywhere malloc/free was called). I'm running a regular dual-core Linux box, nothing special.

    For the second one, I didn't observe any segfault; I just figured this out reading the code and confirmed it with valgrind, but it's much less likely because the race window is very short and it also requires that the page is unmapped in between.

    If someone really wanted to get segfaults, I guess a good start would be:

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    One more attempt at fixing tricky segfaults. Glad someone had some eagle eyes on this :-).

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    On Tue, Apr 27, 2010 at 12:23 PM, Charles-Francois Natali wrote:

    @nirai I have some more remarks on your patch:

    • /* Diff timestamp capping results to protect against clock differences between cores. */
      _LOCAL(long double) _bfs_diff_ts(long double ts1, long double ts0) {

    I'm not sure I understand. You can have problems with multiple cores when reading the TSC register directly, but that doesn't affect gettimeofday. gettimeofday should be reliable and accurate (unless the OS is broken, of course); the only issue is that since it's wall-clock time, if a process like ntpd is running, then you'll run into problems.

    I think gettimeofday() might return different results on different cores as a result of kernel/hardware problems or clock drift issues in VM environments: http://kbase.redhat.com/faq/docs/DOC-7864 https://bugzilla.redhat.com/show_bug.cgi?id=461640

    In Windows the high-precision counter might return different results on different cores in some hardware configurations (older multi-core processors). I attempted to alleviate these problems by using capping and by using a "python time" counter constructed from accumulated slices, with the assumption that IO bound threads are unlikely to get migrated often between cores while running. I will add references to the patch docs.

    • did you experiment with the time slice? I tried some higher values and got better results, without penalizing the latency. Maybe it could be interesting to look at it in more detail (and on various platforms).

    Can you post more details on your findings? It is possible that by using a bigger slice, you helped the OS classify CPU bound threads as such and improved "synchronization" between BFS and the OS scheduler.

    Notes on optimization of code taken, thanks.

    pitrou commented 14 years ago

    I stand corrected. However, I'm going to have to think of a completely different approach for carrying out that functionality as I don't know how the take_gil() function is able to determine whether gil_last_holder has been deleted or not.

    Please note take_gil() currently doesn't depend on the validity of the pointer. gil_last_holder is just used as an opaque value, equivalent to a thread id.

    larryhastings commented 14 years ago

    In Windows the high-precision counter might return different results on different cores in some hardware configurations (older multi-core processors).

    More specifically: some older multi-core processors where the HAL implements QueryPerformanceCounter using the TSC from the CPU, and the HAL doesn't keep the cores in sync and QPC doesn't otherwise account for it. This is rare; frequently QPC is implemented using another source of time.

    But it's true: QPC is not 100% reliable. QPC can unfortunately jump backwards (when using TSC and you switch cores), jump forwards (when using TSC and you switch cores, or when using the PCI bus timer on P3-era machines with a specific buggy PCI south bridge controller), speed up or slow down (when using TSC and not accounting for changing CPU speed via SpeedStep &c). The simple solution: give up QPC and use timeGetTime() with timeBeginPeriod(1), which is totally reliable but only has millisecond accuracy at best.

    http://www.virtualdub.org/blog/pivot/entry.php?id=106 http://support.microsoft.com/default.aspx?scid=KB;EN-US;Q274323&
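
    For completeness, a hedged Windows-only sketch of the timeGetTime()/timeBeginPeriod(1) approach described above, called from Python via ctypes (winmm.dll is a standard system DLL; the timed section is a placeholder):

        # Windows-only sketch; not part of any patch in this issue.
        import ctypes

        winmm = ctypes.WinDLL("winmm")
        winmm.timeGetTime.restype = ctypes.c_uint32      # milliseconds since boot (wraps after ~49 days)

        winmm.timeBeginPeriod(1)                         # request 1 ms timer resolution
        try:
            t0 = winmm.timeGetTime()
            # ... timed section ...
            t1 = winmm.timeGetTime()
            print("elapsed ms:", t1 - t0)
        finally:
            winmm.timeEndPeriod(1)                       # always undo the resolution request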

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    On Wed, Apr 28, 2010 at 12:41 AM, Larry Hastings wrote:

    The simple solution: give up QPC and use timeGetTime() with timeBeginPeriod(1), which is totally reliable but only has millisecond accuracy at best.

    It is preferable to use a high precision clock and I think the code addresses the multi-core time skew problem (pending testing).

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    Dave, there seems to be some problem with your patch on Windows:

    F:\dev>z:\dabeaz-wcg\PCbuild\python.exe y:\ccbench.py -b
    == CPython 3.2a0.0 (py3k) ==
    == x86 Windows on 'x86 Family 6 Model 23 Stepping 10, GenuineIntel' ==

    --- I/O bandwidth ---

    Background CPU task: Pi calculation (Python)

    CPU threads=0: 8551.2 packets/s.
    CPU threads=1: 26.1 ( 0 %)
    CPU threads=2: 26.0 ( 0 %)
    CPU threads=3: 37.2 ( 0 %)
    CPU threads=4: 33.2 ( 0 %)

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 14 years ago

    Wow, that is a *really* intriguing performance result with radically different behavior than Unix. Do you have any ideas of what might be causing it?

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    On Thu, Apr 29, 2010 at 2:03 AM, David Beazley wrote:

    Wow, that is a *really* intriguing performance result with radically different behavior than Unix. Do you have any ideas of what might be causing it?

    Instrument the code and I'll send you a trace.

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    Dave,

    The behavior of your patch on Windows XP/2003 (and earlier) might be related to the way Windows boosts thread priority when it is signaled.

    Try to increase the priority of the monitor thread and the slice size. Another thing to look at is how to prevent Python CPU-bound threads from starving or otherwise messing up the scheduling of threads of other processes. Maybe increasing the slice significantly can help with this too (50ms++ ?).

    XP/NT/CE scheduling and thread boosting affect all patches and the current GIL undesirably (in different ways). Maybe it is possible to make your patch work nicely on these systems: http://www.sriramkrishnan.com/blog/2006/08/tale-of-two-schedulers-win_115489794858863433.html

    Vista and Windows 7 involve CPU cycle counting which results in more sensible scheduling: http://technet.microsoft.com/en-us/magazine/2007.02.vistakernel.aspx

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    I updated bfs.patch with improvements on Windows XP.

    The update disables priority boosts associated with the scheduler condition on Windows for CPU bound threads.

    Here is a link to ccbench results: http://bugs.python.org/file17194/nir-ccbench-xp32.log

    Summary:

    Windows XP 32bit, q9400 2.6GHz, release build (no PGO optimizations). Test runs in background; ccbench modified to run both bz2 and sha1.

    bfs.patch - seems to behave.

    gilinter2.patch single core: high latency, low IO bandwidth.

    dabeaz_gil.patch single core: low IO bandwidth. 4 cores: throughput threads starvation (balance), some latency, low IO bandwidth.

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    Duck, here comes another update to bfs.patch.

    This one with some cleanups which simplify the code and improve behavior (on Windows XP), shutdown code, comments, and "experimental" use of TSC for timestamps, which eliminates timestamp reading overhead.

    TSC (http://en.wikipedia.org/wiki/Time_Stamp_Counter) is a fast way to get a high-precision timing reading. On some systems this is what gettimeofday() uses under the hood, while on other systems it will use HPET or another source which is slower, typically ~1usec, but can be higher (e.g. my core 2 duo laptop occasionally goes into a few hours of charging 3usec per HPET gettimeofday() call - god knows why)

    This overhead is incurred twice for every GIL release/acquire pair and can be eliminated by: 1) hacking the scheduler not to call gettimeofday() when no other threads are waiting to run, or 2) using TSC on platforms where it is available (the Linux BFS scheduler uses TSC).

    I took cycle.h, pointed to by the Wikipedia article on TSC, for a spin and it works well on my boxes. It is BSD, (un)maintained?, and includes implementations for a gazillion platforms (I did not yet modify configure.in as it recommends).

    If it breaks on your system please ping with details.
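
    As a rough way to see the clock-call overhead being discussed, the sketch below times Python's own clock functions; the numbers include Python call overhead, so they are only an upper bound on the underlying gettimeofday()/clock_gettime() cost, and the choice of clocks here is illustrative.

        # Rough upper bound on the per-call cost of Python's clock sources.
        # time.time() is gettimeofday-based on most Unixes; perf_counter() is
        # usually the fastest high-resolution source. Results are machine-dependent.
        import time

        def clock_cost(clock, n=1_000_000):
            t0 = time.perf_counter()
            for _ in range(n):
                clock()
            return (time.perf_counter() - t0) / n * 1e9  # nanoseconds per call

        for name in ("time", "monotonic", "perf_counter"):
            print(f"{name:>14}: {clock_cost(getattr(time, name)):.0f} ns/call")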

    Some benchmarks running (Florent's) writes.py on Core 2 Quad q9400 Ubuntu 64bit:

    bfs.patch - 35K context switches per second, threads balanced, runtime is 3 times that of running IO thread alone:

    ~/dev/python$ ~/build/python/bfs/python writes.py
    t1 1.60293507576 1
    t2 1.78533816338 1
    t1 2.88939499855 2
    t2 3.19518113136 2
    t1 4.38062310219 3
    t2 4.70725703239 3
    t1 6.26874804497 4
    t2 6.4078810215 4
    t1 7.83273100853 5
    t2 7.92976212502 5
    t1 9.4341750145 6
    t2 9.57891893387 6
    t1 11.077393055 7
    t2 11.164755106 7
    t2 12.8495900631 8
    t1 12.8979620934 8
    t1 14.577999115 9
    t2 14.5791089535 9
    t1 15.9246580601 10
    t2 16.1618289948 10
    t1 17.365830183 11
    t2 17.7345991135 11
    t1 18.9782481194 12
    t2 19.2790091038 12
    t1 20.4994370937 13
    t2 20.5710251331 13
    21.0179870129

    dabeaz_gil.patch - sometimes runs well but sometimes goes into high level of context switches (250K/s) and produces output such as this:

    ~/dev/python$ ~/build/python/dabeaz/python writes.py
    t1 0.742760896683 1
    t1 7.50052189827 2
    t2 8.63794493675 1
    t1 10.1924870014 3
    17.9419858456

    gilinter2.patch - 300K context switches per second, bg threads starved:

    ~/dev/python$ ~/build/python/gilinter/python writes.py
    t2 6.1153190136 1
    t2 11.7834780216 2
    14.5995650291

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    Updated bfs.patch so it patches cleanly against the updated py3k branch. Use: $ patch -p1 < bfs.patch

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    A link to ccbench results comparing old GIL, old GIL with long check interval, new GIL and BFS: http://bugs.python.org/file17370/nir-ccbench-linux.log

    Summary:

    Results for ccbench latency and bandwidth test run on Ubuntu Karmic 64bit, q9400 2.6GHz, all Python versions built with computed gotos optimization.

    Old GIL: High level of context switching and reduced performance. ~90ms IO latency with pure Python CPU-bound background threads and low IO bandwidth results.

    Old GIL with sys.setcheckinterval(2500) as done by Zope: Context switching level back to normal. IO latency shoots through the roof: ~950ms (avg) is the maximum recordable value in this test since CPU load duration is 2sec.

    New GIL: The expected 5ms wait related IO latency and low IO bandwidth.

    BFS patch: Behaves.
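
    For reference, the two knobs involved in these results are exposed at the Python level: the old GIL's check interval (the Zope setting mentioned above) and the new GIL's 5 ms switch interval. A minimal sketch follows; note that the old-GIL call was deprecated in 3.2 and removed in later 3.x releases.

        # The two tuning knobs compared above, as seen from Python.
        import sys

        # New GIL (3.2+): the forced-switch interval, 5 ms by default.
        print(sys.getswitchinterval())      # -> 0.005
        sys.setswitchinterval(0.0005)       # switch more often: better latency, worse throughput

        # Old GIL: the bytecode check interval -- the knob Zope raised to 2500.
        # Deprecated since 3.2 and removed in later 3.x releases, hence the guard.
        if hasattr(sys, "setcheckinterval"):
            sys.setcheckinterval(2500)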

    e9a95bbe-04c2-4f7d-8c01-d54b5777343d commented 14 years ago

    Attached ccbench-osx.log made today on OSX on latest svn checkout. Hope it helps

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 14 years ago

    Updated bfs.patch with BSD license and copyright notice.

    ! Current version patches cleanly and builds with Python revision svn r81201.

    bpo-7946 and proposed patches were put on hold indefinitely following this python-dev discussion: http://mail.python.org/pipermail/python-dev/2010-May/100115.html

    I would like to thank the Python developer community and in particular David and Antoine for a most interesting ride.

    Any party interested in sponsoring further development or porting patch to Python 2.x is welcome to contact me directly at nir@winpdb.org

    Nir

    gpshead commented 14 years ago

    Thanks for all your work Nir! I personally think the BFS approach is the best we've seen yet for this problem!

    Having read the thread you linked to in full (ignoring the tangents, bikeshedding, and mudslinging that went on there), it sounds like the general consensus is that we should take thread scheduling changes slowly and let the existing new implementation bake in the 3.2 release. That puts this issue as a possibility for 3.3 if users demonstrate real world application problems in 3.2.

    (personally I'd say it is already obvious that there are problems and we should go ahead with your BFS-based approach, but realistically we're still better off in 3.2 than we were in 3.1 and 2.x as is)

    vstinner commented 12 years ago

    gettimeofday returns you wall clock time: if a process that modifies time is running, e.g. ntpd, you're likely to run into trouble. The value returned is _not_ monotonic, ...

    The issue bpo-12822 asks to use monotonic clocks when available.
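
    For context, the monotonic clocks referred to there are now exposed at the Python level as time.monotonic() (added in 3.3); a minimal sketch of interval timing that ntpd-style wall-clock adjustments cannot disturb:

        # Measuring an interval with a clock that wall-clock adjustments can't disturb.
        import time

        start = time.monotonic()
        time.sleep(0.1)                            # stand-in for the work being timed
        elapsed = time.monotonic() - start         # never negative, even if the system clock is stepped
        print(f"elapsed: {elapsed:.3f}s")
        print(time.get_clock_info("monotonic"))    # shows the underlying OS clock source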

    1e3445ee-1357-417d-a204-0863cf864f91 commented 10 years ago

    What happened to this bug and patch?

    pitrou commented 10 years ago

    Not much :) The patch is complex and the issue hasn't proved to be significant in production code. Do you have a (real-world) workload where this shows up?

    On 15/07/2014 09:52, Dima Tisnek wrote:

    Dima Tisnek added the comment:

    What happened to this bug and patch?

    df848ffd-db83-462b-aaf9-9b77fd750c8b commented 5 years ago

    Celery 5 is going async, and in order to isolate the main event loop from task execution, the tasks are going to be executed in a different thread with its own event loop.

    This thread may or may not be CPU bound. The main thread is I/O bound.

    This patch should help a lot.

    I like Nir's approach a lot (although I haven't looked into the patch itself yet). It's pretty novel. David's patch is also very interesting.

    I'm willing to help.

    7aa6e20b-8983-474f-b2ae-de7eff1caa04 commented 5 years ago

    Note that PyPy has implemented a GIL which does not suffer from this problem, possibly using a simpler approach than the patches here do. The idea is described and implemented here:

    https://bitbucket.org/pypy/pypy/src/default/rpython/translator/c/src/thread_gil.c

    df848ffd-db83-462b-aaf9-9b77fd750c8b commented 5 years ago

    FYI I can verify that the original benchmark is still valid on Python 3.7.3. I'm running the client on an 8 core CPU. The result is 30.702 seconds (341534.322 bytes/sec).

    I'll need somebody to decide how we're going to fix this problem. I can do the legwork.
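
    For readers without the attachments handy, below is a hedged, single-process reconstruction of the kind of test udp-iotest.py/udp-ioclient.py perform (the real attachments differ, e.g. client and server run as separate processes): a UDP echo thread competing with a pure-Python CPU-bound thread while a request loop measures throughput.

        # Hedged reconstruction of the convoy-effect benchmark; not the actual
        # udp-iotest.py/udp-ioclient.py attachments, which use separate processes.
        import socket
        import threading
        import time

        ADDR = ("127.0.0.1", 50007)              # arbitrary local port for the example
        done = threading.Event()

        def echo_server(sock):
            while not done.is_set():             # I/O-bound: mostly blocked in recvfrom
                data, peer = sock.recvfrom(65536)
                sock.sendto(data, peer)

        def spin():
            while not done.is_set():             # CPU-bound: holds the GIL in long stretches
                sum(range(10000))

        server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        server.bind(ADDR)
        threading.Thread(target=echo_server, args=(server,), daemon=True).start()
        threading.Thread(target=spin, daemon=True).start()

        client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = b"x" * 1024
        n = 1000
        start = time.monotonic()
        for _ in range(n):                       # each round trip must wait for the GIL
            client.sendto(payload, ADDR)
            client.recvfrom(65536)
        elapsed = time.monotonic() - start
        done.set()
        print(f"{n * len(payload) / elapsed:.0f} bytes/sec over {elapsed:.3f}s")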

    gpshead commented 5 years ago

    I suggest: (1) turning one of the patches (probably the last BFS one?) into a PR against the github master branch (3.9) and, (2) if none of the existing pyperformance workloads already demonstrates the problems with the existing GIL implementation, adopt one of the benchmarks here into an additional pyperformance workload. (3) demonstrate pyperformance results (overall and targeted tests) before and after the change.

    Simultaneously, it'd also be interesting to see someone create an alternate PR using a PyPy inspired GIL implementation as that could prove to be a lot easier to maintain.

    Let's make a data-driven decision here.

    People lost interest in actually landing a fix to this issue in the past because it wasn't impacting their daily lives or applications (or if it was, they already adopted a workaround). Someone being interested enough to do the work to justify it going in is all it should take to move forward.

    gpshead commented 5 years ago

    (unassigning as it doesn't make sense to assign to anyone unless they're actually working on it)

    larryhastings commented 4 years ago

    FWIW: I think David's cited behavior proves that the GIL is de facto a scheduler. And, in case you missed it, scheduling is a hard problem, and not a solved problem. There are increasingly complicated schedulers with new approaches and heuristics. They're getting better and better... as well as more and more complex. BFS is an example of that trend from ten years ago. But the Linux kernel has been shy about merging it, not sure why (technical deficiency? licensing problem? personality conflict? the name?).

    I think Python's current thread scheduling approach is almost certainly too simple. My suspicion is that we should have a much more elaborate scheduler--which hopefully would fix most (but not all!) of these sorts of pathological performance regressions. But that's going to be a big patch, and it's going to need a champion, and that champion would have to be more educated about it than I am, so I don't think it's gonna be me.

    0d6a79a7-4e74-41c2-8eed-8d04ea475e57 commented 4 years ago

    About nine years ago, I stood in front of a room of Python developers, including many core developers, and gave a talk about the problem described in this issue. It included some live demos and discussion of a possible fix.

    https://www.youtube.com/watch?v=fwzPF2JLoeU

    Based on subsequent interest, I think it's safe to say that this issue will never be fixed. Probably best to close this issue.

    gpshead commented 4 years ago

    It's a known issue and has been outlined very well, and it still comes up from time to time in real world applications; people tend to see this issue and Dave's presentation, just work around it in any way possible for their system, and move on with life.

    Keeping it open even if nobody is actively working on it makes sense to me as it is still a known issue that could be resolved should someone have the motivation to complete the work.

    1e3445ee-1357-417d-a204-0863cf864f91 commented 4 years ago

    My 2c as a Python user:

    Back in 2010, I used multithreading extensively, both for concurrency and performance. Others used multiprocessing or just shelled out. People talked about using **the other** core, or sometimes the other socket on a server.

    Now in 2020, I'm using asyncio exclusively. Some colleagues occasionally still shell out 🙈. No one talks about using all cores on a single machine; rather, we'd spin up dozens of identical containers, which are randomly distributed across N machines, and the synchronisation is offloaded to some database (e.g. atomic ops in redis; transactions in sql).

    In my imagination, I see future Python as single-threaded (from the user's point of view, that is, without a multithreading API), featuring speculative out-of-order async task execution (using hardware threads, maybe pinned) that's invisible to the user.

    vstinner commented 4 years ago

    If someone wants to close this issue, I suggest writing a short section in the Python documentation to give some highlights on the available options and strategies to maximize performance and list the drawbacks of each method. Examples:

    These architectures are not exclusive. asyncio can use multiple threads and be distributed in multiple processes.

    It would be bad to go too deep into the technical details, but I think that we can describe some advantages and drawbacks which are common to all platforms.
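
    One strategy such a section could show (a sketch, not an official recipe): keep asyncio for the I/O-bound side and push CPU-bound work into a process pool, so the two never compete for the same GIL.

        # Sketch: CPU-bound work in worker processes, I/O in the asyncio event loop.
        import asyncio
        from concurrent.futures import ProcessPoolExecutor

        def crunch(n: int) -> int:               # CPU-bound: runs in a separate process
            return sum(i * i for i in range(n))

        async def main():
            loop = asyncio.get_running_loop()
            with ProcessPoolExecutor() as pool:
                cpu_job = loop.run_in_executor(pool, crunch, 10_000_000)
                io_job = asyncio.sleep(0.1, result="io done")   # stand-in for real I/O
                print(await asyncio.gather(cpu_job, io_job))    # both proceed concurrently

        if __name__ == "__main__":
            asyncio.run(main())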

    535b3e58-336c-4b74-b11b-be2bef7dadd4 commented 3 years ago

    Catching up on the comments on this, it seems like nobody has enough certainty to say it will work well enough.

    In Linux, the scheduler is pluggable, which lets other non-default schedulers be shipped and tried in the real world.

    Similarly, I think this needs more testing than it will get living here as a bug. If, like Linux, the scheduler were pluggable, it could be shipped and enabled by real users who are brave enough, and data could be collected.

    b292f0cc-58c7-4e57-b51c-d3d9c13f1f29 commented 3 years ago

    In case someone finds it useful, I've written a blog post on how to visualize the GIL: https://www.maartenbreddels.com/perf/jupyter/python/tracing/gil/2021/01/14/Tracing-the-Python-GIL.html

    In the comments (or at https://github.com/maartenbreddels/fastblog/issues/3#issuecomment-760891430 ) I looked specifically at the iotest.py example (no solutions though).

    215fdcd3-0ca2-4b1c-958d-eabd0229964f commented 2 years ago

    Please see also https://github.com/faster-cpython/ideas/discussions/328 for a proposal for a simple (much simpler than BFS) GIL scheduler, only allocating the GIL between runnable O/S threads waiting to have ownership of the GIL, and using the O/S scheduler for scheduling the threads.

    df848ffd-db83-462b-aaf9-9b77fd750c8b commented 2 years ago

    I think that we should focus our efforts on removing the GIL, now that we have a feasible solution for doing so without breaking anything (hopefully). However, the removal of the GIL is still far from complete and will need to be rebased onto the latest Python version before it can be merged.

    This issue would probably hurt Celery, since some users use it with a thread pool and it uses a few threads itself, but it seems like fixing it is too much effort. If we were to invest a lot of effort, I'd focus on removing the problem entirely.

    Instead, I suggest we document this with a warning in the relevant place so that people would know to avoid or workaround the problem entirely.

    On Mon, Mar 21, 2022, 20:32 Guido van Rossum <report@bugs.python.org> wrote:

    Change by Guido van Rossum <guido@python.org>:

    ---------- nosy: +gvanrossum


    Python tracker <report@bugs.python.org> <https://bugs.python.org/issue7946>


    215fdcd3-0ca2-4b1c-958d-eabd0229964f commented 2 years ago

    I think that we should focus our efforts on removing the GIL, now that we have a feasible solution for doing so without breaking anything

    Is this really a thing? Something that is definitely happening in a reasonable timescale?

    Or are there some big compatibility issues likely to rear up and at best create delays, and at worse derail it completely?

    Can someone give me some links about this please?

    gvanrossum commented 2 years ago

    Start here: https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

    AFAICT the SC hasn't made up their minds about this.

    215fdcd3-0ca2-4b1c-958d-eabd0229964f commented 2 years ago

    https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

    1. The steering committee hasn't given the go-ahead for this yet, and we have no idea when such a decision will be made nor whether the decision will be yes or no.

    2. Even after the decision is made "Removing the GIL will be a large, multi-year undertaking with increased risk for introducing bugs and regressions."

    3. The promised performance gains are actually small compared to what the existing project Guido is running hopes to achieve. It is unclear whether the no-gil implementation would impact those gains, and if so, whether a small reduction in the large planned performance gains would actually more than wipe out the modest performance gains promised by the no-gil project.

    4. Given that this "Removing the GIL will require making trade-offs and taking on substantial risk", it is easy to see that this project could come across an unacceptable risk or insurmountable technical problem at any point and thus fall by the wayside.

    Taking all of the above points together, I think that there is still merit in considering the pros and cons of a GIL scheduler.