python / cpython

The Python programming language
https://www.python.org

Deletion of large sets of strings is extra slow #77027

Closed: terryjreedy closed this 5 years ago

terryjreedy commented 6 years ago
BPO 32846
Nosy @malemburg, @tim-one, @Yhg1s, @rhettinger, @terryjreedy, @vstinner, @luispedro, @methane, @serhiy-storchaka
Files
  • bench_del.py
  • time-set.py

    terryjreedy commented 6 years ago

    https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/, a blog post on CPython speed, claims that "deleting a set of 1 billion strings takes >12 hours". (No other details provided.)

    I don't have the 100+ gigabytes of RAM needed to confirm this, but with 64-bit 3.7.0b1 installed on Win10 with 12 gigabytes, I confirmed that there is a pronounced super-linear growth in string-set deletion time (unlike with an integer set). At least half of RAM was available.

      Seconds to create and delete sets

    millions       integers           strings
    of items    create  delete    create  delete
        1         .08     .02       .36     .08
        2         .15     .03       .75     .17
        4         .30     .06      1.55     .36
        8         .61     .12      3.18     .76
       16        1.22     .24      6.48    1.80    < slightly more than double
       32        2.4      .50     13.6     5.56    < more than triple
       64        4.9     1.04     28      19       < nearly quadruple
      128       10.9     2.25     <too large>
      100                         56      80       < quadruple with 1.5 x size

    For 100 million strings, I got about the same 56 and 80 seconds when timing with a clock, without the timeit gc suppression. I interrupted the 128M string run after several minutes. Even if there is swapping to disk during creation, I would not expect it during deletion.

    The timeit code:

    import timeit
    
    for i in (1,2,4,8,16,32,64,128):
        print(i, 'int')
        print(timeit.Timer(f's = {{n for n in range({i}*1000000)}}')
              .timeit(number=1))
        print(timeit.Timer('del s', f's = {{n for n in range({i}*1000000)}}')
              .timeit(number=1))
    
    for i in (1,2,4,8,16,32,64,100):
        print(i, 'str')
        print(timeit.Timer(f's = {{str(n) for n in range({i}*1000000)}}')
              .timeit(number=1))
        print(timeit.Timer('del s', f's = {{str(n) for n in range({i}*1000000)}}')
              .timeit(number=1))

    Raymond, I believe you monitor the set implementation, and I know Victor is interested in timing and performance.

    serhiy-storchaka commented 6 years ago

    For lists and dicts the time of deletion is virtually linear.

      1 int    5.2    9.8
      2 int    5.8    9.7
      4 int    5.6    9.8
      8 int    5.8   10.0
     16 int    5.5    9.7
     32 int    5.4    9.6
     64 int    5.6    9.0
    128 int    5.6    8.7
      1 str    7.6   13.3
      2 str    8.0   13.8
      4 str    7.4   14.0
      8 str    7.5   14.3
     16 str    7.6   14.4
     32 str    8.0   14.7
     64 str    7.9   14.3
    100 str    8.1   14.0

      1 int   60.4   10.5
      2 int   61.0   11.1
      4 int   61.1   10.4
      8 int   60.1   11.1
     16 int   61.1   10.1
     32 int   61.5    9.9
     64 int   60.7    9.6
    128 int   60.8    9.4
      1 str  204.9   15.4
      2 str  247.4   15.8
      4 str  275.3   16.1
      8 str  299.3   14.8
     16 str  312.8   14.8
     32 str  329.2   14.2
     64 str  344.5   14.1
    100 str  386.2   14.4

    (The third and fourth columns are time in nanoseconds per item, for creation and deletion.)

    But when the input data is shuffled before creating the collection, the deletion time becomes superlinear (and much slower).

      1 int   17.4   38.9
      2 int   19.1   41.3
      4 int   24.4   46.3
      8 int   28.0   49.5
     16 int   28.1   53.4
     32 int   29.8   67.2
     64 int   35.2   90.6
    128 int   42.6  143.1
      1 str   21.1   56.3
      2 str   24.3   58.2
      4 str   27.1   63.6
      8 str   27.3   73.9
     16 str   29.4   99.2
     32 str   34.3  144.8
     64 str   41.8  229.5
    100 str   46.3  338.4

    In the former examples (list, ordered dict), objects are iterated and deleted in the order of their creation. But in a set of strings, items are deleted in an unpredictable, effectively random order. I suppose the non-linear effect is related to memory management.
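
    For reference, a rough sketch of the kind of ordered-versus-shuffled comparison described above (not the actual benchmark script; the sizes and the use of a plain list are illustrative assumptions):

        # Rough sketch, not the benchmark used above: time only the deletion of a
        # list of strings, with the pointers either in allocation order or shuffled.
        import random
        import time

        def deletion_time(n, shuffle):
            data = [str(i) for i in range(n)]   # strings allocated roughly in address order
            if shuffle:
                random.shuffle(data)            # pointer order no longer matches allocation order
            start = time.perf_counter()
            del data                            # last reference: each string is decref'd in list order
            return time.perf_counter() - start

        for n in (1_000_000, 4_000_000, 16_000_000):
            print(n, deletion_time(n, shuffle=False), deletion_time(n, shuffle=True))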

    rhettinger commented 6 years ago

    In the former examples (list, ordered dict), objects are iterated and deleted in the order of their creation. But in a set of strings, items are deleted in an unpredictable, effectively random order. I suppose the non-linear effect is related to memory management.

    That makes sense. We're likely seeing an artifact of a favorable correspondence between the container pointer order and the order that the referred-to objects were created in memory in this somewhat non-representative test case. That is why lists lose the favorable timing when the data is shuffled.

    If the test bench creates all the referred-to objects in consecutive memory locations, then any container accessing those objects consecutively will benefit from spatial cache locality. (i.e. If there is more than one datum per cache line, the second read is virtually free. Likewise, memory controller read-ahead makes a long series of consecutive accesses cheaper.)

    It would be nice if the timing were run on strings instead of integers. Ints are somewhat weird in that the hashes of consecutive integers are themselves consecutive, and you can fit two ints in one cache line.
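
    For example (a CPython detail: small non-negative ints hash to themselves, while string hashes are randomized):

        # hash(n) == n for small non-negative ints in CPython, so consecutive ints
        # land in consecutive set slots; string hashes are effectively random.
        print([hash(n) for n in range(8)])        # [0, 1, 2, 3, 4, 5, 6, 7]
        print([hash(str(n)) for n in range(3)])   # three unrelated, effectively random values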

    I'm not sure whether it applies here, but there may also be a branch-prediction effect as well: https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array

    FWIW, the deallocation routine for sets isn't doing anything special. It just calls Py_DECREF in a loop:

    static void
    set_dealloc(PySetObject *so)
    {
        setentry *entry;
        Py_ssize_t used = so->used;
    
        /* bpo-31095: UnTrack is needed before calling any callbacks */
        PyObject_GC_UnTrack(so);
        Py_TRASHCAN_SAFE_BEGIN(so)
        if (so->weakreflist != NULL)
            PyObject_ClearWeakRefs((PyObject *) so);
    
        for (entry = so->table; used > 0; entry++) {
            if (entry->key && entry->key != dummy) {
                    used--;
                    Py_DECREF(entry->key);
            }
        }
        if (so->table != so->smalltable)
            PyMem_DEL(so->table);
        Py_TYPE(so)->tp_free(so);
        Py_TRASHCAN_SAFE_END(so)
    }
    rhettinger commented 6 years ago

    The 4th graph in this article may be showing the reason for the growth in random access times as the size gets bigger:

    https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips

    rhettinger commented 6 years ago

    Random idea: For some range of array sizes (bigger than L3 cache), there might be a net benefit to sorting L1-sized clumps of pointers before making the Py_DECREF calls. Possibly, the cost of sorting would be offset by improved memory access patterns.
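
    A pure-Python caricature of that idea, just to make it concrete (the real change would live in set_dealloc() in C, and the clump size is a made-up knob):

        # Pure-Python caricature of the "sort L1-sized clumps" idea; the real change
        # would live in set_dealloc() in C, and CLUMP is a made-up tuning knob.
        CLUMP = 4096

        def release_in_clumped_address_order(s):
            items = list(s)      # grab the references in hash-table order
            s.clear()            # the set no longer owns them
            for i in range(0, len(items), CLUMP):
                # within each clump, arrange the pointers in address order before
                # the final decrefs happen
                items[i:i + CLUMP] = sorted(items[i:i + CLUMP], key=id)
            del items            # the list dealloc now drops each clump in address order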

    On the other hand, this might just optimize an artificial benchmark that isn't representative of real code where the data is actually being used. In that case, the program is already going to have to hop all over memory just to access the referred-to objects.

    d6f13593-65bd-4010-9539-41d26a2cccf7 commented 6 years ago

    Original poster here.

    The benchmark is artificial, but the problem setting is not. I did have a problem that is roughly:

        interesting = set(line.strip() for line in open(...))
        for line in open(...):
            key,rest = line.split('\t', 1)
            if key in interesting:
                 process(rest)

    Deleting the set (when it goes out of scope) was a significant chunk of the time. Surprisingly, deleting a very large set takes much longer than creating it.

    Here are my controlled measurements (created with the attached script, which itself uses Jug http://jug.rtfd.io and assumes a file input.txt is present).

             N   create (s)   delete (s)
             1         0.00         0.00
            10         0.00         0.00
           100         0.00         0.00
          1000         0.00         0.00
         10000         0.01         0.00
        100000         0.15         0.01
       1000000         1.14         0.12
      10000000        11.44         2.24
     100000000       126.41        70.34
     200000000       198.04       258.44
     300000000       341.27       646.81
     400000000       522.70      1044.97
     500000000       564.07      1744.54
     600000000      1380.04      3364.06
     700000000      1797.82      3300.20
     800000000      1294.20      4410.22
     900000000      3045.38      7646.59
    1000000000      3467.31     11042.97
    1100000000      5162.35     13750.22
    1200000000      6581.90     12544.67
    1300000000      1612.60     17204.67
    1400000000      1788.13     23772.84
    1500000000      6922.16     25068.49

    rhettinger commented 6 years ago

    Thanks for the info. Here is what I surmise from the data.

    If this reasoning is sound, all it proves is that random accesses to virtual memory are limited by the speed of the hardware for that virtual memory.

    terryjreedy commented 6 years ago

    Luís, what you really need for the problem you outline is an immutable set with one operation, 'in', aside from create and delete. If specialized to strings only, it could internally store only the character sequences, rather than Python objects. I suspect someone has written such a class (in C), but I did not find anything on pypi.python.org.
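
    As a rough illustration of what such a structure could look like, here is a pure-Python sketch (the class name and layout are invented, and a serious version would be written in C without the temporary objects built here):

        from array import array

        class PackedStringSet:
            """Membership-only 'set' of strings: all keys live in one bytes blob
            plus one offsets array, so teardown frees two big objects instead of
            millions of small ones. Sketch only."""

            def __init__(self, strings):
                keys = sorted({s.encode('utf-8') for s in strings})
                self._offsets = array('Q', [0])
                pos = 0
                for k in keys:
                    pos += len(k)
                    self._offsets.append(pos)
                self._blob = b''.join(keys)

            def _key(self, i):
                return self._blob[self._offsets[i]:self._offsets[i + 1]]

            def __contains__(self, s):
                needle = s.encode('utf-8')
                lo, hi = 0, len(self._offsets) - 1   # binary search over the sorted keys
                while lo < hi:
                    mid = (lo + hi) // 2
                    if self._key(mid) < needle:
                        lo = mid + 1
                    else:
                        hi = mid
                return lo < len(self._offsets) - 1 and self._key(lo) == needle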

    Lacking that, the following experiment suggests that you might be able to restore near O(n) time by partitioning the set of keys into a collection of sets. My 16M set took 6.48 and 1.80 seconds. Times 4 is 25.9 and 7.2. My 64M set took 28 and 19 seconds, but 4 16M sets take 26.3 and 7.5 seconds, only about 5% more than the x4 target.

    print(timeit.Timer(
        's = [{str(n) for n in range(i*1000000, (i+16)*1000000)}'
        ' for i in (0, 16, 32, 48)]')
        .timeit(number=1))
    print(timeit.Timer(
        'del s',
        's = [{str(n) for n in range(i*1000000, (i+16)*1000000)}'
        ' for i in (0, 16, 32, 48)]')
        .timeit(number=1))

    tim-one commented 6 years ago

    Surprisingly, deleting a very large set takes much longer than creating it.

    Luis, that's not surprising ;-) When you create it, it's mostly the case that there's a vast chunk of raw memory from which many pieces are passed out in address order (to hold all the newly created Python objects). Memory access is thus mostly sequential. But when you delete it, that vast chunk of once-raw memory is visited in essentially random order (string hashes impose a pseudo-random order on where (pointers to) string objects are stored within a set's vector), defeating all the hardware features that greatly benefit from sequential access.

    More precisely, the set's internal vector is visited sequentially during deletion, but the string objects the pointers point _at_ are all over the place. Even if nothing is swapped to disk, it's likely that visiting a string object during deletion will miss on all cache levels, falling back to (much slower) RAM. Note that all the string objects must be visited during set deletion, in order to decrement their reference counts.
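
    A tiny illustration of that ordering point (output varies across runs because string hashing is randomized):

        # The set's table order (which is also the order set_dealloc visits the
        # keys in) is determined by the string hashes, not by creation order.
        words = [str(n) for n in range(10)]
        print(words)              # creation order: '0' through '9'
        print(list(set(words)))   # hash-determined order, scrambled relative to creation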

    methane commented 6 years ago

    One possible piece of advice: try a dict instead of a set. A dict is (surprisingly) a little smaller than a set. And a dict releases its items in insertion order when the dict is deleted.
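
    Applied to Luis's earlier example, that suggestion looks roughly like this (a sketch; the file-handling details and the process() callback are placeholders):

        # Sketch of the suggestion applied to Luis's workload: a dict with ignored
        # values instead of a set. dict.fromkeys preserves insertion order, so the
        # keys are released in (roughly) allocation order when the dict is deleted.
        def filter_file(keys_path, data_path, process):
            with open(keys_path) as f:
                interesting = dict.fromkeys(line.strip() for line in f)  # values are all None
            with open(data_path) as f:
                for line in f:
                    key, rest = line.split('\t', 1)
                    if key in interesting:       # same O(1) membership test as a set
                        process(rest)
            # when `interesting` goes out of scope here, its keys are decref'd in
            # insertion order rather than in hash-table order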

    methane commented 6 years ago

    A dict is (surprisingly) a little smaller than a set.

    I'm sorry, I was wrong. A dict is smaller than a set only when len(d) is small (the typical case for a namespace dict).

    With a massive number of keys, a dict is larger than a set.

    d6f13593-65bd-4010-9539-41d26a2cccf7 commented 6 years ago

    I think some of this conversation is going off-topic, but there is no disk-swapping in my case.

    I realize ours is not a typical setup, but our normal machines have 256GB of RAM and the "big memory" compute nodes are >=1TB. Normally, swap is outright disabled.

    This really is an impressive case study on how much difference cache-locality can make.

    malemburg commented 6 years ago

    This reminds me of some experiments someone did a while ago as part of the GIL removal attempts, where the refcount integers are all kept in a separate array. The intent there was to be able to do locking on a single array rather than on individual refcount cells.

    This would solve the issue of having to jump around in memory to decref all objects, but I'm not sure whether the overall win would be large, since deallocation of the memory blocks typically requires accessing the block itself as well (to update the block chain list pointers), unless the memory allocator uses some smart cache-local block management too (I believe that pymalloc does, but I could be wrong).

    In any case, this sounds like a fun experiment for a GSoC student. Perhaps the PSF could donate an AWS EC2 instance with enough RAM to do the experiments.

    methane commented 6 years ago

    @Luis, would you try a dict instead of a set? It's a little larger than a set, but it deletes its elements in insertion order.

    But I don't think the builtin data structures can be optimized for such a workload. Maybe LMDB or some other KVS can help you.

    methane commented 5 years ago

    I thought a compact set implementation, similar to the compact dict used since Python 3.6, might fix this issue. But after a discussion with Raymond on the Python-Dev ML, I concluded that the merits of the compact implementation are not significant enough. I abandoned the compact set branch.

    For now I don't have any idea for fixing this issue in the foreseeable future. Please use a dict instead.

    tim-one commented 5 years ago

    It looks likely that the _major_ cause of the quadratic-time delete behavior was that obmalloc used a linear-time method to keep its linked list of usable arenas sorted by number of free pools. When a pool became unused, its arena's count of free pools increased by one, and then order was restored by moving the arena "to the right" in the linked list, one slot at a time.

    When there were only a few hundred arenas, nobody noticed. But programs with thousands of arenas could suffer very noticeable sloth, and programs with hundreds of thousands could appear to hang (could require days to complete deleting).

    This was fixed in issue bpo-37029, using a constant-time method to restore sorted order, scheduled for release in Python 3.8.
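
    To show the flavor of that fix, here is a toy Python model (not obmalloc's actual data structures): arenas are grouped by their free-pool count, so bumping a count by one is a constant-time move between neighboring groups rather than a linear walk along a sorted list.

        # Toy model of constant-time reordering (not obmalloc's real code): when a
        # pool becomes free, its arena moves from bucket k to bucket k+1 in O(1),
        # instead of being bubbled one slot at a time along a sorted list.
        from collections import defaultdict

        class ArenaOrder:
            def __init__(self):
                self.buckets = defaultdict(set)   # free-pool count -> arenas with that count
                self.count = {}                   # arena id -> current free-pool count

            def add_arena(self, arena, nfreepools=0):
                self.count[arena] = nfreepools
                self.buckets[nfreepools].add(arena)

            def pool_became_free(self, arena):
                k = self.count[arena]
                self.buckets[k].remove(arena)     # O(1)
                self.buckets[k + 1].add(arena)    # O(1)
                self.count[arena] = k + 1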

    rhettinger commented 5 years ago

    Can we close this now? ISTM the issue has less to do with sets and more to do with memory allocation quirks and with the fact that on modern CPUs random memory accesses are slower than sequential memory accesses. It is not a bug that sets are unordered collections that iterate over elements in a different order than they were inserted.

    tim-one commented 5 years ago

    Raymond, please read my very recent comment just above yours. An (overall) quadratic-time algorithm (O(A**2), where A is the number of arenas) in obmalloc.c is (to my eyes) probably the _primary_ cause of the sloth here. That's been fixed for 3.8, but I don't have enough RAM even to run Terry's test code to confirm it.

    I can confirm that there's no quadratic-time behavior anymore when deleting large sets of strings, but only up to the sizes my RAM allows. The behavior isn't linear either, but it's much closer to linear than quadratic now.

    If someone with more RAM can confirm this for larger sets, then fine by me if this is closed again.

    terryjreedy commented 5 years ago

    I reran the code in msg312188. Ints behave as before; string deletion is more or less linear up to 100 million items, much better than before.

    millions     old strings        new strings
    of items    create  delete    create  delete
        4        1.55     .36       1.7     .38
        8        3.18     .76       3.6     .78
       16        6.48    1.80       7.3    1.71
       32       13.6     5.56      14.8     3.8
       64       28      19         30       8.4
      100       56      80         50      14.3

    tim-one commented 5 years ago

    Thanks, Terry! Based on your latest results, "quadratic time" isn't plausible here anymore, so I'm closing this. Nasty cache effects certainly played a role, but they were just a flea on the dog ;-)