Win32 micropython builds corrupt memory upon repeated execution of a particular function

ddiminnie commented 5 years ago

(I apologize in advance for not finding a simpler example/use case for this.) While running tests on our own fork of micropython, we encountered a bug that crashed the python interpreter (on Windows 32-bit builds only). This bug is also present in the 'stock' 1.10 release of Micropython. Rather than upload our entire test setup, the following (somewhat simpler) example should serve to illustrate the problem. Using the python module listed at the end of this comment, try the following (may need to repeat the last instruction a few times to see the issue):

MicroPython v1.10 on 2019-03-28; win32 version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> from sys import path
>>> path.append(r'<path where flt_hex.py is stored>')
>>> from flt_hex import flt_hex
>>> for idx in range(20000):  exec('flt_hex(float({:d}))'.format(idx), locals(), globals())

After one or more iterations of the final step, the following assertion is triggered (followed by shutdown of the micropython.exe process):

>>> for idx in range(20000):  exec('flt_hex(float({:d}))'.format(idx), locals(), globals())
Assertion failed: ATB_GET_KIND(block) == AT_HEAD, file <redacted>\micropython\py\gc.c, line 587

Listing for file 'flt_hex.py':

from array import array
from math import isinf, isnan
from sys import byteorder

_IS_DOUBLE = (1e-100 > 0)

def flt_hex(flt):
    """
    Mimics the behavior of the 'hex()' float instance method, for platforms where this method 
    is not implemented.

    :param flt: floating-point value to be converted.
    :return: hexadecimal string representation of flt.
    """
    if not isinstance(flt, float):
        raise TypeError('first argument must be of type "float"')

    if isnan(flt) or isinf(flt):
        result = str(flt)
    else:
        # Form the string 
        #   s0xc.mm...pe
        # where 
        # s(ign) = '-' if flt is negative else '',
        # c(haracteristic) = 1 if flt is normalized else 0,
        # each m represent one digit of the fractional part of the significand (the 'mantissa')
        # e(xponent) is the power of 2

        # Convert to a list of integers (bytes objects are not trivially reversible in 
        # MicroPython)
        bv = list(bytes(array('d' if _IS_DOUBLE else 'f', [flt])))
        if byteorder == 'little':
            bv = bv[::-1]

        bv_len = len(bv)  # 8 for double; 4 for single

        # From IEEE-754 (1985), float layouts (big endian) are
        # 0bseee eeee efff ffff ffff ffff ffff ffff  for single precision
        # 0bseee eeee eeee ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff for 
        # double precision
        s = '-' if (bv[0] & 0x80) else ''

        ee = ((((bv[0] & 0x7F) << 4) + ((bv[1] & 0xF0) >> 4)) if _IS_DOUBLE 
              else (((bv[0] & 0x7F) << 1) + ((bv[1] & 0x80) >> 7)))
        ff = (bv[1] & (0x0F if _IS_DOUBLE else 0x7F)) << (8*(bv_len - 2))
        ff += sum((val << (8*(bv_len - 3 - idx))) for idx, val in enumerate(bv[2:]))

        if ee == 0:
            # Zero or denormalized
            characteristic = '0'
            if ff:
                exponent = '-1022' if _IS_DOUBLE else '-126'
                # Since there are 23 bits after the decimal point for single precision, we 
                # need to shift left by one bit to fit in hex format (the last bit in the 
                # output should be ignored)
                m = '{:=013x}'.format(ff) if _IS_DOUBLE else '{:=06x}'.format(ff << 1)
            else:
                exponent = '+0'
                m = '0'
        else:
            # Normalized floats
            characteristic = '1'
            exponent = '{:+d}'.format(ee - (1023 if _IS_DOUBLE else 127))
            m = '{:=013x}'.format(ff) if _IS_DOUBLE else '{:=06x}'.format(ff << 1)

        result = '{s}0x{characteristic}.{m}p{exponent}'.format(s=s, 
                                                               characteristic=characteristic, 
                                                               m=m,
                                                               exponent=exponent)

    return result

ddiminnie commented 5 years ago

If it helps, this issue can also be reproduced on the 'win32' build of Micropython 1.9.4. I have not seen the issue in 64-bit builds, nor have I reproduced the problem on the Gemma M0 (CircuitPython) or our own Atmel-based distribution.

pfalcon commented 5 years ago

The first port to reproduce this issue would be "unix" port (under Linux).

ddiminnie commented 5 years ago

Sorry... I wasn't clear (enough) in my first comment. I cannot reproduce the problem on 64-bit Windows builds, 64-bit OS X builds, or the various embedded ports I have access to (and hardware for). I can try to compile a 32-bit OS X build (but that will have to wait until next week). I don't currently have a linux box to play with :(.

dpgeorge commented 5 years ago

Thanks for the report. I can't reproduce any issue with the code running under a 32-bit unix MicroPython executable.

It could be related to GC scanning of the machine registers (and possibly C stack). You'll need to determine if you use MICROPY_GCREGS_SETJMP or not, and try disabling/enabling it to see if that fixes it.

ddiminnie commented 5 years ago

Here's the process I used to compile the (Release) win32 port:

I obtained the code from GitHub via the 'Clone or Download' link.
I opened the 'micropython.vcxproj' project (under micropython\ports\windows) in Visual Studio 2015.
I set the solution configuration to 'Release' and the solution target to 'x86' and built the application.

That said, according to Visual Studio, MICROPY_GCREGS_SETJMP is indeed set (value (1)) for this particular port. I've rebuilt the solution with MICROPY_GCREGS_SETJMP set to (0), with the same result:

ddiminnie commented 5 years ago

We'll certainly try debugging this on our end - I was hoping that someone more familiar with the code (that doesn't raise the bar much in my case, unfortunately :( ) might have some insight into the problem.

dpgeorge commented 5 years ago

Thanks for checking. Since such a problem was never seen/reported before, and it's not (so far) reproducible on the unix port or Win 64-bit, it's most likely something to do with the specific configuration of the Win 32-bit build. And my guess is that the GC is freeing some memory that should not be freed, because the code crashes when it explicitly tries to free a GC region, only to find that it's already freed. Thus it's likely to do with scanning of root pointers in the Win 32 build.

ddiminnie commented 5 years ago

Well... we have a line number for our first breakpoint (gc.c, line 587), and can try to unwind the stack from there. Looks like my team has some fun ahead of it....

stinos commented 5 years ago

Hmm, can't reproduce this in x86 nor x64 builds. Using VS2017 though. And the last commit. I'd be happy to debug this but I tried a bunch of things like more iterations but no avail. Just from the looks of it I'd guess it has something to do with using exec, the code itself alone shouldn't cause this, at least I think.

ddiminnie commented 5 years ago

... and I can't reproduce the problem in a debug 'win32' build. Since the debug builds run more slowly, it's possible this is timing related (?! Oh, joy...).

To add further fuel to the 'timing-related' hypothesis, sometimes I have to repeat the last instruction several times before the crash:

stinos commented 5 years ago

Not very likely this has anything to do with timing (at least not in the sense of multithreaded/race condition problems where 'insert a sleep statement somwehere' seems to 'solve' the problem :P), there's only one thread which does everything. And there are quite some other differences in debug builds. I'll try again later with executing the loop more than once.

stinos commented 5 years ago

Tried this a thousand times, no avail. Any chance you can test this with VS2017? And/or check whether it also happens with the current version of the uPy code? Or come up with a more clear reproduction case?

ddiminnie commented 5 years ago

@stinos: First, thanks for taking the effort to try to reproduce the problem. It's greatly appreciated!

We use the same toolchain across all our Windows builds, so I'm afraid I'm stuck with VS 2015 right now (I could try to spin up a VM with VS 2017, but that will almost certainly have to wait awhile, unfortunately...).

The next step for me might be to try to run the original test suite that spawned the problem without the test harness (we weren't "exec"ing anything in that case - the steps in the description above were the simplest I could find in the limited time I had available to reliably reproduce the problem on my end).

I spent a bit of time running the steps above in the debugger - I haven't fully wrapped my head around the way the garbage collector works (there appears to be a memory allocation table, a space reserved for 'finalizers', and the actual memory pool itself, and for the example in this issue once the pool is exhausted, the code sweeps through the allocation table and marks blocks as free... the assertion is thrown during (or right after) one of these sweeps, but I still haven't figured out what the trigger is. It doesn't help that the code in question is called thousands of times before the problem occurs... but that's why we have computers manage the memory instead of humans ;-) )

dpgeorge commented 5 years ago

the assertion is thrown during (or right after) one of these sweeps, but I still haven't figured out what the trigger is.

From the assertion that fails, it happens in gc_free() which is never called during a garbage collection phase/sweep. gc_free() is only called by code when it explicitly knows that it doesn't need the memory anymore. This is usually in the compile phase, so probably happens when exec() is compiling its argument. It'd be good if you could see a backtrace at the point of the assert failure.

stinos commented 5 years ago

@ddiminnie tried again, this time building MicroPython with the VS2015 (v140) toolset (which should be what you are using, right?), still cannot reproduce it with x86 release builds, also not for v1.10 :(

ddiminnie commented 5 years ago

@stinos: Hmm... Maybe we need more configuration information (should have listed this in the first place, especially since the problem appears to be hard to reproduce outside of the systems we have here):

Our build and test systems are Windows 7 (x64)-based machines.

We've run into the cited problem on

the Windows port of CircuitPython 3.0.0 (this is the code we've forked for our project), which is pre-MicroPython 1.9.4
MicroPython 1.9.4, win32 release build. (Downloaded from GitHub Master branch on 19 November 2018)
MicroPython 1.10.0, win32 release build. (Downloaded from GitHub Master branch on 28 March 2019.)

Toolchain: IDE: Visual Studio Professional 2015, Version 14.0.25431.01 Update 3 Compiler/Linker: Microsoft Visual C++ 2015

Build Configuration:

@dpgeorge: I'm still having trouble getting a stack trace at the time the assertion failure occurs.

The 'assert' macro used in the project maps to the Microsoft CRT '_wassert' macro, which prints a message and then immediately aborts. (When running outside the debugger, there's a dialog box that allows us to 'Debug'... but that simply opens the debugger and then aborts... so no stack trace for me :-( )
I've been trying to set a conditional breakpoint on other variables active at the time - the prime candidate would be the 'block' variable in gc_free, but that variable is optimized away in the x86 release configuration, so any breakpoint set there can never be hit.
Setting an unconditional breakpoint on the line with the failing assertion means clicking 'Continue' thousands of times
I'm not seeing the problem at all in 'debug' configurations (I suspect it's still there, but has moved, and thus I don't have a procedure for reproducing the problem).

I'm going to try to switch to the '_ASSERT' macro from 'crtdbg.h' (and link in the corresponding debug runtime libraries) for the offending line (unlike 'assert', '_ASSERT' actually calls DebugBreak, so it should be possible to get a stack trace). (This is just part of the fun of debugging in Windows Land...)

ddiminnie commented 5 years ago

@dpgeorge: I lied: I can get some stack information at assert time from ProcessExplorer (doesn't have line numbers, but does at least indicate which functions were called... and I may be able to use the disassembly in VS to map the offsets to at least some of the code lines... later today...). Here's the stack from bottom to the _wassert call (everything above that is system level calls for the crash dialog, etc.):

ucrtbase.DLL!_wassert+0x16 micropython.exe!gc_free+0x83 micropython.exe!mp_parse+0xac2 micropython.exe!mp_parse_compile_execute+0x76 micropython.exe!eval_exec_helper+0x11b micropython.exe!mp_builtin_exec+0x10 micropython.exe!fun_builtin_var_call+0x80 micropython.exe!mp_execute_bytecode+0x1de8 micropython.exe!fun_bc_call+0xc9 micropython.exe!mp_call_function_0+0x40 micropython.exe!execute_from_lexer+0x18e micropython.exe!dorepl+0x3a3 micropython.exe!main+0x996 micropython.exe!main+0x17 micropython.exe!__scrt_common_main_seh+0xf9 kernel32.dll!BaseThreadInitThunk+0x12 ntdll.dll!RtlInitializeExceptionChain+0x63 ntdll.dll!RtlInitializeExceptionChain+0x36

ddiminnie commented 5 years ago

... and here is that stack trace in more human-readable form:

ucrtbase.DLL!_wassert+0x16 micropython.exe!gc_free(void ptr) Line 587 micropython.exe!mp_parse(_mp_lexer_t lex, mp_parse_input_kind_t input_kind) Line 1168 micropython.exe!mp_parse_compile_execute(_mp_lexer_t lex, mp_parse_input_kind_t parse_input_kind, _mp_obj_dict_t globals, _mp_obj_dict_t locals) Line 1420 micropython.exe!eval_exec_helper(unsigned int n_args, void const args, mp_parse_input_kind_t parse_input_kind) Line 146 micropython.exe!mp_builtin_exec(unsigned int n_args, void const args) Line 155 micropython.exe!fun_builtin_var_call(void self_in, unsigned int n_args, unsigned int n_kw, void const args) Line 127 micropython.exe!mp_execute_bytecode(_mp_code_state_t code_state, void volatile inject_exc) Line 922 micropython.exe!fun_bc_call(void self_in, unsigned int n_args, unsigned int n_kw, void const args) Line 288 micropython.exe!mp_call_function_0(void fun) Line 599 micropython.exe!execute_from_lexer(int source_kind, const void source, mp_parse_input_kind_t input_kind, bool is_repl) Line 147 micropython.exe!dorepl() Line 259 micropython.exe!main(int argc, char * argv) Line 638 micropython.exe!main(int argc, char** argv) Line 420 micropython.exe!__scrt_common_main_seh+0xf9 kernel32.dll!BaseThreadInitThunk+0x12 ntdll.dll!RtlInitializeExceptionChain+0x63 ntdll.dll!RtlInitializeExceptionChain+0x36

Of course this isn't the whole picture (I need to see if I can find a way to capture the gc heap at the time of the assertion failure, but even then the source of the problem may have occurred several steps earlier in the execution sequence).

ddiminnie commented 5 years ago

Welp... I'm an idiot (which should be apparent from following this thread). I added the following lines to gc_free just before the offending assertion (allowing me to set a breakpoint that is only hit when the following assertion will fail), and can now confirm that the stack trace above is correct.

        size_t block = BLOCK_FROM_PTR(ptr);
        // Debugging code FIXME!!!
        if (ATB_GET_KIND(block) != AT_HEAD) {
            printf("Foo!\n");
        }
        // End of debugging code FIXME!!!
        assert(ATB_GET_KIND(block) == AT_HEAD);

Moreover, I've been able to dump the 1Mb memory space allocated for the garbage collector: gcmem.zip

stinos commented 5 years ago

@ddiminnie I also tested this on a Windows 7 system with the specs you mention. But can you state the actual version of the compiler please (i.e. output of cl on a VS2015 commandline)? And while we're at it: what is your Windows SDK version (I highly doubt this matters, but still..). Thats the WindowsSDKLibVersion environment variable, or what is selected under Project properties->General->Target Platform Version.

The 'assert' macro used in the project maps to the Microsoft CRT '_wassert' macro, which prints a message and then immediately aborts.

I don't think _wassert is a macro, but a function, so all you needed to do was setting a breakpoint on _wassert. Or else, break on abort: as uasual assert is a macro which eventually calls C's abort() function.. You see, debugging in Windows really isn't that different or more difficult than on another OS :)

(When running outside the debugger, there's a dialog box that allows us to 'Debug'... but that simply opens the debugger and then aborts... so no stack trace for me :-( )

This is a bit more tricky but with some digging you could still have figured this out: you have a notion of what the CRT is, you know _ASSERT, you can see there's different ways of handling what the program does on asserts, because yes, it's strange that assert just aborts without the more typcial 'Assertion failed' dialog and ways to hook the debugger, right? The documentation of _ASSERT (for instance) has the clue: turns out there's a _CrtSetReportMode function which alters what happens on asserts exactly in the way witnessed here. And indeed, that gets used in MicroPython, see init.c. Just remove the offending lines and you will get a chance to hook the debugger when running the exe directly.

ddiminnie commented 5 years ago

@stinos: Just to level-set, I'm not really a software developer (its MUCH worse than that - I'm a mathematician... If you've ever worked in an academic environment, you'll know what that entails... :-) ). My day job consists of validation of algorithms (when possible), or at the very least, testing of implementations of mathematical software. So, I know my way around (some of) the basics of an IDE, can write (simple) programs in C, and basically know just enough to get myself in trouble. (I do the vast majority of my programming in python, for the record.)

I say this to make communications simpler as we proceed forward.

Now, as for your questions: The actual version of the compiler we're using is Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x86 The Target Platform Version (from the micropython project) is 8.1 (And yes, _wassert and abort are functions, but they live in the CRT, and I don't have the source for that. I could try to find them in the disassembly and break there, but I found a much kludgier way to achieve the same end - see my previous comment... though, looking back it occurs to me that I didn't say which block gc_free was attempting to free, so the information I gave wasn't overly useful. Your last suggestion to check out init.c was useful - I'll keep it in mind for the future!)

ddiminnie commented 5 years ago

@stinos: In response to your much earlier request for a simpler example, I'm still struggling to find one that triggers this issue (well... if it were simple, lots of users would have complained by now, and any underlying problem would have been fixed long ago).

We originally ran into the problem on our own customized Windows port of CircuitPython, which sets the heap size to 20480 bytes, and includes a few other tweaks that (hopefully) don't affect the behavior we're chasing here. We had backported a pull request (MicroPython issue #4133), and were running tests to verify we had merged the changes properly. The suite in question used a mix of decimal values from CPython's 'floating_points.txt' file, along with some additional values of interest to us. We ran the tests by writing to the stdin of our python executable and reading from stdout of the executable.

Thus, the original motivating example was somewhat simpler (but more complex to communicate):

>>> <Instruction to disable our custom serial handshaking code>
>>> from <larger module containing flt_hex> import flt_hex
>>> flt_hex(8.988465674311582e+307)
>>> flt_hex(1.0726246343954082e+155)
>>> flt_hex(274877906944.00024)
<<< 1031 other examples omitted >>>

At (or around) the 302nd test we consistently hit the ATB_GET_KIND(block) != AT_HEAD assertion on our custom win32 build (but NOT our custom win64 build).

(We use the 'flt_hex' function to allow us to reconstruct the exact floating-point value that MicroPython created for each example within our external test harness, to allow us to compare the value to reference IEEE-754 binary64 equivalents of the corresponding decimal values... just in case anyone was wondering at this point...)

Naturally, the same set of examples doesn't trigger the problem on any of the MicroPython builds we've created (even after modifying the heap size), so we had to come up with another example. The result of some tinkering was the (awful) example in the original description above (which has the added benefit that it also triggers the issue in our custom build).

Unfortunately, it appears that the cited problem has (so far) only been reproduced on builds created at our site. (At least we can reproduce the problem on executables built on our official build machine, on my laptop, and on the laptop of one of the (actual) developers assigned to this project...)

stinos commented 5 years ago

If you've ever worked in an academic environment, you'll know what that entails... :-) ).

I've been in academic environments more than anywhere else so yes I get your point :P

The actual version of the compiler we're using is Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x86 The Target Platform Version (from the micropython project) is 8.1

19.00.24234.1 here but for the rest it's the same.

And yes, _wassert and abort are functions, but they live in the CRT, and I don't have the source for that.

Sorry should have been more clear. You don't need the source, you can set breakpoints at any address, and VS helps you with translating functions to addresses: use Debug->New Breakpoint->Function Breakpoint... and enter 'abort'..

Unfortunately, it appears that the cited problem has (so far) only been reproduced on builds created at our site

Well, as I was typing this I thought "let's try it one more time to be sure" and for reasons I don't know I can now suddenly but reliably reproduce the issue (on a Windows 10 VM, both with VS2015 and VS2017 toolsets!, but still nothing on my main pc). Yay. So it's like really really random and maybe it's colder today and things were set loose :)

dpgeorge commented 5 years ago

I can now suddenly but reliably reproduce the issue (on a Windows 10 VM, both with VS2015 and VS2017 toolsets!, but still nothing on my main pc).

Ok, that's good news, that it's now more reproducible.

@stinos can you confirm which NLR setting the build is using, either MICROPY_NLR_SETJMP or MICROPY_NLR_X86? It might be that it needs to trace more of the registers in py/nlrx86.c, eg ecx and edx.

stinos commented 5 years ago

This is all with MICROPY_NLR_SETJMP: the assembly syntax of the other implementations is incompatible with cl.

ddiminnie commented 5 years ago

@stinos: Awesome! (And thanks for the pointers!) This is (hopefully) the last in a (seemingly) unending string of hard-to-reproduce errors I've had to deal with on the multiple simultaneous projects I've been working (the rest have been in our own code... nothing to do with MicroPython)...

stinos commented 5 years ago

@dpgeorge the assert happens when freeing the lexer's indent_level member in mp_lexer_free so I assume either something goes wrong in the previous call to gc_collect. I don't really know how to figure out what though, any pointers on how to debug this? I was thinking a 32bit unix build with setjmp might also exhibit the problem.

dpgeorge commented 5 years ago

@stinos I tried a 32-bit unix build with MICROPY_NLR_SETJMP enabled, but didn't see any issues with executing the flt_hex() function. Can you please confirm if the build you see the error on has !MICROPY_GCREGS_SETJMP enabled or not? And see if changing this option still has the problem (according to @ddiminnie it does).

If we assume it's failing to mark and/or trace the mp_lexer_t object during a GC, that would lead to the crash described here. And that is a possibility: for exec() calls the lexer is allocated on the heap, and passed as the first argument (likely tail-call optimised) to mp_parse_compile_execute(), which doesn't do any memory allocations and passes the lexer to mp_parse() as the first argument, and never uses this object again. So no GC should have occurred between allocating indent_level for the first time in mp_lexer_new() and getting to mp_parse(). Then, if the lexer object is not being traced, it's because in mp_parse() the stack and/or regs are not being correctly traced. Because most functions in parse.c are static, the compiler could heavily optimise mp_parse() to the point where the lexer object is contained only in a register (ie not on the stack). In such a case we should inspect what register the lexer is held in.

Note that the problem could also lie in mp_lexer_new(): it may trigger a GC just after allocating indent_level and before returning. And the lexer pointer in this function most likely lives in a register.

@stinos are you able to provide me with a complete disassembly of lexer.c and parse.c, for the binary that crashes?

If tracing as described above is not the problem then it's probably something like a memory corruption.

stinos commented 5 years ago

@dpgeorge This is all with MICROPY_GCREGS_SETJMP set to 1 (just like for nlr the assembly code for __x86_64__ isn't compatible, so even when setting it to 0 in mpconfigport.h for instance, MICROPY_GCREGS_SETJMP gets reset to 1 in gccollect.c so that is probably what @ddiminnie witnessed).

Thanks a lot for the explanation; when testing some more yesterday I was slowly coming to a similar conclusion as I saw gc_collect occurrences where indent_level wasn't getting marked (but the lexer struct itself was, if I remember correctly, but don't take my word for it) which would inevitably be followed by the assertion failure, but I didn't figure out yet why, nor where to go from there.

So now I simply checked for gc_collect() calls in between the allocation of indent_level and the first call to m_new() in mp_parse(), and those occur (like for instance in the vstr_init() call in mp_lexer_new() as you mention), and when they do they always lead to the assertion failing so your analysis so far is spot on :) I also got the assertion a couple of other times but that might be because I'm not checking for all gc_collect() calls, I'd rather place my bets on that instead of memory corruption. I haven't got time yet to see why indent_level isn't getting marked.

I'm attaching the compiler-generated assembly/source listings which might not be the exact same thing as disassembling the object files but should be close enough I think, and fairly readable.

parse.txt gc.txt gccollect.txt lexer.txt

ddiminnie commented 5 years ago

@dpgeorge: For the benefit of anyone browsing the bug list, should I change the title of this thread to something more appropriate (e.g. "Win32 micropython builds trigger assertion failure upon repeated..." etc.)?

dpgeorge commented 5 years ago

Thanks @stinos for the disassembly. Some points to note:

it looks like the compiler/linker is doing some form of LTO because memory functions (m_malloc, m_free) are inlined in the parser code
it's using fastcall conventions, so first two arguments are passed in ecx and edx, rest on the stack
in both mp_lexer_new() and mp_parse() the lexer pointer is indeed stored in a register during the course of the whole function, in edi
it looks like gc_collect() is working correctly

@stinos are you able to provide the disassembly of __setjmp3? That's the key part here, to see if it's saving edi

@ddiminnie the title is ok for now.

stinos commented 5 years ago

Indeed 'Link Time Code Generation' aka LTO is enabled (that's also one of the reasons I can't just disassemble the object files it seems).

Btw I enjoy debugging this but there's only some limited amount of time I can spend on this and wasted hours because adding a bunch of statements to make debugging easier often results in an executable which doesn't exhibit the problem. So by now I'm out of time for today which is a pitty because I'm getting closer. The problem I'm seeing now is:

the mp_lexer_t *lex allocated in mp_lexer_new() gets stored in edi
still in mp_lexer_new, indent_level is allocated by gc_alloc() which in turn pushes edi on the stack
there's no free memory and gc_collect() gets triggered

right before the call to __setjmp3() the stack looks like this:

0x0049F4E8  0049f510  -> pointer to `arr` in gc_collect_regs_and_stack, i.e. regs in gc_collect_regs_and_stack
0x0049F4EC  00000000  
0x0049F4F0  0049f558  
0x0049F4F4  0034d7a6  
0x0049F4F8  002cffc0  -> this is mp_lexer_t *lex from mp_lexer_new, pushed here by gc_alloc
0x0049F4FC  00000014  
0x0049F500  002cffa0  
0x0049F504  00000010  
0x0049F508  00000000  
0x0049F50C  00000000  
0x0049F510  00000000  -> regs allocated in gc_collect_regs_and_stack
....

__setjmp3() stores everything including edi but that's of not much use here unfortunately
gc_collect_root() gets called but is passed a pointer to regs which is later on the stack than the pointer to mp_lexer_t so the latter doesn't get marked
memory for indent_level gets allocated, everything continues and assertion fails in gc_free()

Disassembly for __setjmp3():

0FE53DA0 8B 54 24 04          mov         edx,dword ptr [esp+4]  
0FE53DA4 89 2A                mov         dword ptr [edx],ebp  
0FE53DA6 89 5A 04             mov         dword ptr [edx+4],ebx  
0FE53DA9 89 7A 08             mov         dword ptr [edx+8],edi  
0FE53DAC 89 72 0C             mov         dword ptr [edx+0Ch],esi  
0FE53DAF 89 62 10             mov         dword ptr [edx+10h],esp  
0FE53DB2 8B 04 24             mov         eax,dword ptr [esp]  
0FE53DB5 89 42 14             mov         dword ptr [edx+14h],eax  
0FE53DB8 C7 42 20 30 32 43 56 mov         dword ptr [edx+20h],56433230h  
0FE53DBF C7 42 24 00 00 00 00 mov         dword ptr [edx+24h],0  
0FE53DC6 64 A1 00 00 00 00    mov         eax,dword ptr fs:[00000000h]  
0FE53DCC 89 42 18             mov         dword ptr [edx+18h],eax  
0FE53DCF 83 F8 FF             cmp         eax,0FFFFFFFFh  
0FE53DD2 75 09                jne         0FE53DDD  
0FE53DD4 C7 42 1C FF FF FF FF mov         dword ptr [edx+1Ch],0FFFFFFFFh  
0FE53DDB EB 3B                jmp         0FE53E18  
0FE53DDD 8B 4C 24 08          mov         ecx,dword ptr [esp+8]  
0FE53DE1 0B C9                or          ecx,ecx  
0FE53DE3 74 0A                je          0FE53DEF  
0FE53DE5 8B 44 24 0C          mov         eax,dword ptr [esp+0Ch]  
0FE53DE9 89 42 24             mov         dword ptr [edx+24h],eax  
0FE53DEC 49                   dec         ecx  
0FE53DED 75 08                jne         0FE53DF7  
0FE53DEF 8B 40 0C             mov         eax,dword ptr [eax+0Ch]  
0FE53DF2 89 42 1C             mov         dword ptr [edx+1Ch],eax  
0FE53DF5 EB 21                jmp         0FE53E18  
0FE53DF7 8B 44 24 10          mov         eax,dword ptr [esp+10h]  
0FE53DFB 89 42 1C             mov         dword ptr [edx+1Ch],eax  
0FE53DFE 49                   dec         ecx  
0FE53DFF 74 17                je          0FE53E18  
0FE53E01 56                   push        esi  
0FE53E02 57                   push        edi  
0FE53E03 8D 74 24 1C          lea         esi,[esp+1Ch]  
0FE53E07 8D 7A 28             lea         edi,[edx+28h]  
0FE53E0A 83 F9 06             cmp         ecx,6  
0FE53E0D 76 05                jbe         0FE53E14  
0FE53E0F B9 06 00 00 00       mov         ecx,6  
0FE53E14 F3 A5                rep movs    dword ptr es:[edi],dword ptr [esi]  
0FE53E16 5F                   pop         edi  
0FE53E17 5E                   pop         esi  
0FE53E18 2B C0                sub         eax,eax  
0FE53E1A C3                   ret

Definition of __JUMP_BUFFER which is what I assume setjmp uses as internal representation:

typedef struct __JUMP_BUFFER
{
    unsigned long Ebp;
    unsigned long Ebx;
    unsigned long Edi;
    unsigned long Esi;
    unsigned long Esp;
    unsigned long Eip;
    unsigned long Registration;
    unsigned long TryLevel;
    unsigned long Cookie;
    unsigned long UnwindFunc;
    unsigned long UnwindData[6];
} _JUMP_BUFFER;

dpgeorge commented 5 years ago

Thanks @stinos, you were indeed very close. It seems that the LTO is inlining ports/unix/gccollect.c:{gc_collect,gc_collect_regs_and_stack} in py/gc.c:gc_collect and allocating the regs array too early on the stack, and registers with live root pointers are being pushed after it, and hence not traced (and not put into regs because the register is replaced with a different value by the point regs is populated).

The assumption made by uPy that is violated here by LTO is that gc_collect_regs_and_stack() is a "leaf" function that is actually called (not inlined) and hence would have its stack pointer at the very start of the stack.

Maybe the way to fix this is to mark this function as non-inlinable, or at least not touchable by LTO.

ddiminnie commented 5 years ago

@stinos , @dpgeorge: Thanks for taking the time to analyze this issue. In the short time I've worked with MicroPython and CircuitPython, I've been deeply impressed at both the quality of the implementation and the community around it. I'm grateful that the two of you have devoted so mcuh time to a rather obscure bug in a (seemingly) less active configuration.

As a (very) quick-and-dirty experiment, I've turned off 'whole program optimization' (/GL compiler switch) and 'link time code generation' (/LTGC:Incremental linker switch) in the Visual Studio 'micropython' project and rebuilt the win32 release configuration (this is akin to driving a tack with a pile driver, but...). After 11000 iterations of the loop from the original description I have not seen the cited failure. Obviously this is not a long-term fix (though it may be good enough for my team, as performance is not really an issue for us, and this change has the advantage of not affecting the other ports we're building).

Also, I've learned a few things from this exchange (which is always welcome!) - I look forward to seeing the proposed (official) fix (should you decide to pursue it)!

stinos commented 5 years ago

@ddiminnie indeed disabling those optimisations should prohibit the linker from inlining gc_collect_regs_and_stack() in other files and that might be the fix here. To be sure you'd probably have to look at the generated assembly code. I'll try to come up with a fix shortly.

ddiminnie commented 5 years ago

Sorry for the late comment - I thought I had replied, but it turns out I had not. With the /GL compiler switch and the /LTGC linker switch disabled, gc_collect() appears in the link map, but gc_collect_regs_and_stack() does not, which suggests (unless I'm completely mistaken... which is a distinct possibility) that the latter is still inlined.

Further, the disassembly of the gccollect module shows no call to gc_collect_regs_and_stack() (we can see the preceding call to gc_start() and the subsequent call to gc_end()), which serves as further evidence that gc_collect_regs_and_stack() has been inlined:

void gc_collect(void) {
00EEBF80  push        ebp  
00EEBF81  mov         ebp,esp  
00EEBF83  sub         esp,44h  
00EEBF86  mov         eax,dword ptr [__security_cookie (0F04004h)]  
00EEBF8B  xor         eax,ebp  
00EEBF8D  mov         dword ptr [ebp-4],eax  
    //gc_dump_info();

    gc_collect_start();
00EEBF90  call        gc_collect_start (0ECBA40h)  
    gc_collect_regs_and_stack();
00EEBF95  lea         eax,[ebp-44h]  
00EEBF98  push        eax  
00EEBF99  call        gc_helper_get_regs (0EEBFD0h)  
00EEBF9E  mov         eax,dword ptr [mp_state_ctx (0EF4000h)]  
00EEBFA3  lea         ecx,[ebp-44h]  
00EEBFA6  sub         eax,ecx  
00EEBFA8  shr         eax,2  
00EEBFAB  push        eax  
00EEBFAC  mov         eax,ecx  
00EEBFAE  push        eax  
00EEBFAF  call        gc_collect_root (0ECB9C0h)  
00EEBFB4  add         esp,0Ch  
    #if MICROPY_PY_THREAD
    mp_thread_gc_others();
    #endif
    #if MICROPY_EMIT_NATIVE
    mp_unix_mark_exec();
    #endif
    gc_collect_end();
00EEBFB7  call        gc_collect_end (0ECB990h)

Although disabling these switches does make the problem in the original description go away, I'm afraid the underlying problem has just moved around in "memory allocation/de-allocation space" (meaning another as yet undiscovered example will still cause the problem), assuming inlining of gc_collect_regs_and_stack() is the source of the issue.

dpgeorge commented 5 years ago

It's not a small patch (it's over 1000 lines) but #4723 is a comprehensive fix for the issue here. The alternative is do some trickery to ensure that all registers are indeed captured when doing a GC scan.

stinos commented 5 years ago

Marking functions with __declspec(noinline) should as far as I know always force a function to not be inlined so it's an easier and less drastic way to get the wanted results, for example here's a noinline gc_collect() with gc_collect_start() and gc_collect_regs_and_stack() also marked as noinline:

push        ecx  
call        gc_collect_start (0ABD3A0h)  
call        gc_collect_regs_and_stack (0AE1840h)  
call        gc_collect_end (0ABD490h)  
pop         ecx
ret

I verified this results in the 'correct' stack layout where lex is on the stack which gets scanned and hence gets marked and hence no problem. This should fix all similar cases (and maybe even all possible problematic cases?). Probably just making sure gc_collect_regs_and_stack() doesn't get inlined should be sufficient?

@dpgeorge ok to make a patch for this?

dpgeorge commented 5 years ago

ok to make a patch for this?

Yes, sounds good.

dpgeorge commented 5 years ago

This should be resolved by 34a7d7ebebc93bf9c4f166b0b523ceab844c7d91, so I'll close the issue.

ddiminnie commented 5 years ago

To confirm, I've applied the patch and run through 50000 iterations of the example in the original description without seeing the issue. Thanks again for all your hard work!

stinos commented 5 years ago

Thank you for discovering it as well, saves future headaches :)

micropython / micropython

Win32 micropython builds corrupt memory upon repeated execution of a particular function #4652