sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.48k stars 488 forks source link

Fix a non x86 (armv7l) bug in src/sage/cpython/dict_del_by_value.pyx #28941

Open jaapspies opened 4 years ago

jaapspies commented 4 years ago

Doctesting sage-9.0 on a Raspberry Pi 4B on Raspbian Buster

sage -t --long src/sage/cpython/dict_del_by_value.pyx
**********************************************************************
File "src/sage/cpython/dict_del_by_value.pyx", line 396, in sage.cpython.dict_del_by_value.test_del_dictitem_by_exact_value
Failed example:
    for i in range(100000):        # long time
        ki=L[floor(random()*B)]
        vi=L[floor(random()*B)]
        D1[ki]=vi
        D2[ki]=vi
        ko=L[floor(random()*B)]
        if ko in D1:
            vo=D1[ko]
            del D1[ko]
            test_del_dictitem_by_exact_value(D2,vo,hash(ko))
        assert D1 == D2
Exception raised:
    Traceback (most recent call last):
      File "/home/pi/sagemath/sage-9.0/local/lib/python3.7/site-packages/sage/doctest/forker.py", line 681, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/pi/sagemath/sage-9.0/local/lib/python3.7/site-packages/sage/doctest/forker.py", line 1123, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.cpython.dict_del_by_value.test_del_dictitem_by_exact_value[5]>", line 11, in <module>
        assert D1 == D2
    AssertionError
**********************************************************************
1 item had failures:
   1 of  12 in sage.cpython.dict_del_by_value.test_del_dictitem_by_exact_value
    [33 tests, 1 failure, 0.10 s]
----------------------------------------------------------------------
sage -t --long src/sage/cpython/dict_del_by_value.pyx  # 1 doctest failed
----------------------------------------------------------------------
Total time for all tests: 0.2 seconds
    cpu time: 0.1 seconds
    cumulative wall time: 0.1 seconds

CC: @nbruin @embray

Component: porting

Keywords: raspberry

Issue created by migration from https://trac.sagemath.org/ticket/28941

mkoeppe commented 4 years ago
comment:1

Moving tickets to milestone sage-9.2 based on a review of last modification date, branch status, and severity.

fchapoton commented 4 years ago

Changed keywords from none to raspberry

dimpase commented 4 years ago
comment:3

28941, #28940, #28939 are all related

dimpase commented 4 years ago

Changed author from jsp to none

dimpase commented 4 years ago
comment:4

here is a much easier example

from sage.cpython.dict_del_by_value import test_del_dictitem_by_exact_value
L=list(range(10))
A={L[0]:L[1]}                                                                                      
B={L[0]:L[1]}                                                                                      
del A[L[0]]                                                                                        
test_del_dictitem_by_exact_value(B,L[1],hash(L[0]))                                                
print(A, B) 

prints {} {0: 1} on ARM (Raspberry Pi, armv7l) and {} {} on x86_64.


This seems to be recent - I didn't check the exact history, but on Sage 8.6 it still works correctly. (Although perhaps it just never worked with Python 3 in the 1st place).

dimpase commented 4 years ago
comment:6

The OS in question is 32-bit, in particular is has SIZE_VOID_P=4 (whereas x86_64 has 8). In cpython's Objects/dictobject.c, where code in dict_del_by_value.pyx is said to be copied from, one has (in cpython 3.8.5)

#if SIZEOF_VOID_P > 4
#define DK_IXSIZE(dk)                          \
    (DK_SIZE(dk) <= 0xff ?                     \
        1 : DK_SIZE(dk) <= 0xffff ?            \
            2 : DK_SIZE(dk) <= 0xffffffff ?    \
                4 : sizeof(int64_t))
#else
#define DK_IXSIZE(dk)                          \
    (DK_SIZE(dk) <= 0xff ?                     \
        1 : DK_SIZE(dk) <= 0xffff ?            \
            2 : sizeof(int32_t))
#endif

whereas the corresponding code in dict_del_by_value.pyx is

cdef inline int DK_IXSIZE(MyPyDictKeysObject *keys):
    cdef Py_ssize_t s = keys.dk_size
    if s <= 0xff:
        return 1
    elif s <= 0xffff:
        return 2
    elif s <= 0xffffffff:
        return 4
    else:
        return 8

This looks like a bug to me (not sure ATM it's the same bug we deal in here, or not).

embray commented 4 years ago
comment:7

That looks like a good guess to me. This code was reworked pretty recently (I think as of Python 3.6) when adding the new dict implementation (https://bugs.python.org/issue27350) which is likely more recent than this code was last touched. It's bad enough that we're mucking with dict internals, but as long as we are this needs to be updated.

embray commented 4 years ago
comment:8

Replying to @embray:

That looks like a good guess to me. This code was reworked pretty recently (I think as of Python 3.6) when adding the new dict implementation (https://bugs.python.org/issue27350) which is likely more recent than this code was last touched. It's bad enough that we're mucking with dict internals, but as long as we are this needs to be updated.

Correction, it appears Jeroen did update this code a while back to be compatible with the new dict implementation (I was just reading the code and thinking "huh...this looks like it was designed to work with the new implementation already"). But probably didn't carry over this particular detail.

The set_dk_index function also lacks some SIZEOF_VOID_P checks that are present in the upstream code.

dimpase commented 4 years ago
comment:9

however, on 32-bit x86 this error does not show, and the example in comment 4 works well, too.

so it must be something else.

embray commented 4 years ago
comment:10

True, looking at the way Jeroen wrote it, it looks like it should work on 32-bit without requiring the extra macro.

embray commented 4 years ago
comment:11

This has me wondering if this shouldn't be proposed as a feature to upstream to Python. Maybe it already has been proposed and rejected, but I don't know. Otherwise I could bring it up. Since we already have a prototype implementation that might help give the proposition some weight (though an upstream version would want to implement it as a dict method I guess.

embray commented 4 years ago
comment:12

Replying to @embray:

True, looking at the way Jeroen wrote it, it looks like it should work on 32-bit without requiring the extra macro.

I'm mistaken, Nils wrote it and Jeroen made some updates later.

dimpase commented 4 years ago
comment:14

I can say that for comment:4 example the error, i.e. incorrect computation by del_dictitem_by_exact_value(), not finding key 0, occurs in dk_get_index(); it returns -1, which is DKIX_EMPTY.

I don't know why.

The potential issue in comment:6 does not occur here, as the dictionary is very small (I still don't get why on 32-bit x86 this works, what kind of arch difference is at play here).

nbruin commented 4 years ago
comment:15

No clue what's going on there. I'm not really an expert on CPython's dictionaries. I just read the CPython source and engineered the appropriate code for it (or at least what I though was appropriate). It could well be that the 32bit/64bit is a problem. I didn't see an easy way of testing and adjusting for it in cython (although I'm sure that's my limitation). I probably didn't test on that, trusting that the bots would catch problems (the doctests of this code are supposed to be pretty thorough). Another thing that comes to mind, due to the "<=" is a signed/unsigned problem?

There's the issue with split/nonsplit dicts in Py3. This code needs nonsplit, and it tries to force that by having a numerical key in the dict at some point (split dicts only work with string keys, if I understand it correctly). If somehow that were different on ARM, that could explain things. However, I do recall that when I read the CPython code it looked like pretty clean C to me, with few conditionals on architecture.

I think it'll take someone just sitting down and low-level tracing what's going on on ARM. Looking at some memory dumps of small dictionaries probably clears this up pretty quickly. I don't have time to do it.

dimpase commented 4 years ago
comment:16

Replying to @nbruin:

It could well be that the 32bit/64bit is a problem. I didn't see an easy way of testing and adjusting for it in cython (although I'm sure that's my limitation). I probably didn't test on that, trusting that the bots would catch problems (the doctests of this code are supposed to be pretty thorough). Another thing that comes to mind, due to the "<=" is a signed/unsigned problem?

It appears to be char (aka int8) is a problem - it is signed on x86, and unsigned on ARM. This leads to a weird distortion on how the dict keys (dk_indices) are interpreted.

That's a truly unpleasant design of dict() in CPython, where these int8, int16 etc are meant to be unsigned, but negative values used to indicate that it's e.g. empty.leading to complications of this sort.

moreover, in cpython these indices are held in an array of char, unions are not used. Probably copying C code from cpython rather than rewriting it in Cython might be more practical.

dimpase commented 4 years ago
comment:17

Questions :

  1. Can del_dictitem_by_exact_value({},foo,bar) throw an exception, or it must always return normally? (looks like a missing doctest)
  2. Why there is no simple doctest for removing an item, only one that is tagged #long ?
  3. Should such a doctest be using dicts from weakref ? From Sage's weak_dict?
  4. Do I understand correctly that the lookup procedure for the item to be removed does the same as a standard (as in CPython) dictionary lookup, plus checking that the item found by key is as needed, so this does not need (much) custom code - I understand that the removal is non-standard then, by a kind of buffered removal: moving the item to a temp dict with subsequent deletion there, right?
dimpase commented 4 years ago
comment:18

Replying to @embray:

This has me wondering if this shouldn't be proposed as a feature to upstream to Python. Maybe it already has been proposed and rejected, but I don't know. Otherwise I could bring it up. Since we already have a prototype implementation that might help give the proposition some weight (though an upstream version would want to implement it as a dict method I guess.

IMHO this method is in fact only needed for src/sage/misc/weak_dict.pyx - which is an alternative implementation of Python's dicts in weakref. I presume it won't work on the latter, anyway, and it's not needed for "normal" dicts.

Nils, is this right?

nbruin commented 4 years ago
comment:19

Replying to @dimpase:

Replying to @embray:

This has me wondering if this shouldn't be proposed as a feature to upstream to Python. Maybe it already has been proposed and rejected, but I don't know. Otherwise I could bring it up. Since we already have a prototype implementation that might help give the proposition some weight (though an upstream version would want to implement it as a dict method I guess.

IMHO this method is in fact only needed for src/sage/misc/weak_dict.pyx - which is an alternative implementation of Python's dicts in weakref.

Correct -- that was the reason for implementing it. However, there are good reasons for an alternative to weakref.WeakKeyDict, at least in Py2, since it is buggy! Perhaps this is changed in py3, but even then I expect that the full cython implementation we provide is much faster. If this is going to be proposed upstream, then I only think it has a chance of being accepted for that reason. Otherwise, the routine doesn't match the abstraction that dictionaries are supposed to supply.

I presume it won't work on the latter, anyway, and it's not needed for "normal" dicts.

It would as long as it's not a split dict, and one could make this work on a split dict too. Probably, a generic implementation should. Presently we force our WeakDict to be shared.

nbruin commented 4 years ago
comment:20

Replying to @dimpase:

  1. Can del_dictitem_by_exact_value({},foo,bar) throw an exception, or it must always return normally? (looks like a missing doctest)

No, it's almost always called in garbage collection, so exceptions would be suppressed. This routine should not raise exceptions.

  1. Why there is no simple doctest for removing an item, only one that is tagged #long ?

No reason I am aware of. I definitely wanted the "long" one. I think there's some more varied, not #long, testing in WeakDict, which should catch irregularities. The #long test was supposed to be a real stress test that catches any unexpected changes/difference, such as the one on this ticket.

  1. Should such a doctest be using dicts from weakref ? From Sage's weak_dict?

Those are already there. To avoid circularity, I figured it would be clearer if the tests here (for a building block of weak_dict!) does not rely on weak_dict.

  1. Do I understand correctly that the lookup procedure for the item to be removed does the same as a standard (as in CPython) dictionary lookup, plus checking that the item found by key is as needed, so this does not need (much) custom code

The idea is that the hash is given so that the probe sequence can be replicated and that the identity of the value is given, which can be used to match the find. So we're matching on identity of the value here, not on equality of the key (the intended application being the identity of a weakref object). That's quite different from standard lookup! The shared code is in generating the probe sequence and getting the actual key/value pairs (which is the complicated bit in modern Py3 dicts)

No, not really. There's a well-documented problem with recursive deletions: possible stack overflow for deep ones. The py3 weakdict is actually susceptible to it, and it's well-illustrated in sage's weakdict's doctests. In CPython's internals, there's a macro construct called the "Trashcan" that avoids the stackoverflow by flattening the stack if it gets too deep. I didn't know how to utilize that macro in cython (it's a tricky one that breaks through control flow, so I don't know if it's even possible to wrap it so that it can be used in cython) so I ended up placing the value in a temporary tuple/list and deleting THAT, because that's already participating in the trashcan.

Additionally, for WeakValueDict there is a mechanism where deletions can be stored if the dict is currently locked for iteration. This was a problem in Py2 and could actually lead to deletions: Iterate over a WeakDict and have items inadvertently disappear from it due to GCs. That only affects how this routine is used there; it doesn't factor into the design of this routine particularly. Py3 dicts also have better iteration protection now, so it may be that this mechanism can be redesigned now.

nbruin commented 4 years ago
comment:21

Clarification for terminology from: https://www.python.org/dev/peps/pep-0412/

There are split-table dictionaries and combined-table dictionaries. In split-table dictionaries, there is room for sharing the key table for several dictionaries. Hence, split-table dictionaries can be shared-key dictionaries. So "split=shared". I might not have been consistent with this terminology. For this ticket, only "combined-table" dicts are relevant.

mkoeppe commented 3 years ago
comment:23

Moving to 9.4, as 9.3 has been released.