Closed ddiminnie closed 5 years ago
If it helps, this issue can also be reproduced on the 'win32' build of Micropython 1.9.4. I have not seen the issue in 64-bit builds, nor have I reproduced the problem on the Gemma M0 (CircuitPython) or our own Atmel-based distribution.
The first port to reproduce this issue would be "unix" port (under Linux).
Sorry... I wasn't clear (enough) in my first comment. I cannot reproduce the problem on 64-bit Windows builds, 64-bit OS X builds, or the various embedded ports I have access to (and hardware for). I can try to compile a 32-bit OS X build (but that will have to wait until next week). I don't currently have a linux box to play with :(.
Thanks for the report. I can't reproduce any issue with the code running under a 32-bit unix MicroPython executable.
It could be related to GC scanning of the machine registers (and possibly C stack). You'll need to determine if you use MICROPY_GCREGS_SETJMP
or not, and try disabling/enabling it to see if that fixes it.
Here's the process I used to compile the (Release) win32 port:
I obtained the code from GitHub via the 'Clone or Download' link.
I opened the 'micropython.vcxproj' project (under micropython\ports\windows) in Visual Studio 2015.
I set the solution configuration to 'Release' and the solution target to 'x86' and built the application.
That said, according to Visual Studio, MICROPY_GCREGS_SETJMP is indeed set (value (1)) for this particular port. I've rebuilt the solution with MICROPY_GCREGS_SETJMP set to (0), with the same result:
We'll certainly try debugging this on our end - I was hoping that someone more familiar with the code (that doesn't raise the bar much in my case, unfortunately :( ) might have some insight into the problem.
Thanks for checking. Since such a problem was never seen/reported before, and it's not (so far) reproducible on the unix port or Win 64-bit, it's most likely something to do with the specific configuration of the Win 32-bit build. And my guess is that the GC is freeing some memory that should not be freed, because the code crashes when it explicitly tries to free a GC region, only to find that it's already freed. Thus it's likely to do with scanning of root pointers in the Win 32 build.
Well... we have a line number for our first breakpoint (gc.c, line 587), and can try to unwind the stack from there. Looks like my team has some fun ahead of it....
Hmm, can't reproduce this in x86 nor x64 builds. Using VS2017 though. And the last commit. I'd be happy to debug this but I tried a bunch of things like more iterations but no avail. Just from the looks of it I'd guess it has something to do with using exec, the code itself alone shouldn't cause this, at least I think.
... and I can't reproduce the problem in a debug 'win32' build. Since the debug builds run more slowly, it's possible this is timing related (?! Oh, joy...).
To add further fuel to the 'timing-related' hypothesis, sometimes I have to repeat the last instruction several times before the crash:
Not very likely this has anything to do with timing (at least not in the sense of multithreaded/race condition problems where 'insert a sleep statement somwehere' seems to 'solve' the problem :P), there's only one thread which does everything. And there are quite some other differences in debug builds. I'll try again later with executing the loop more than once.
Tried this a thousand times, no avail. Any chance you can test this with VS2017? And/or check whether it also happens with the current version of the uPy code? Or come up with a more clear reproduction case?
@stinos: First, thanks for taking the effort to try to reproduce the problem. It's greatly appreciated!
We use the same toolchain across all our Windows builds, so I'm afraid I'm stuck with VS 2015 right now (I could try to spin up a VM with VS 2017, but that will almost certainly have to wait awhile, unfortunately...).
The next step for me might be to try to run the original test suite that spawned the problem without the test harness (we weren't "exec"ing anything in that case - the steps in the description above were the simplest I could find in the limited time I had available to reliably reproduce the problem on my end).
I spent a bit of time running the steps above in the debugger - I haven't fully wrapped my head around the way the garbage collector works (there appears to be a memory allocation table, a space reserved for 'finalizers', and the actual memory pool itself, and for the example in this issue once the pool is exhausted, the code sweeps through the allocation table and marks blocks as free... the assertion is thrown during (or right after) one of these sweeps, but I still haven't figured out what the trigger is. It doesn't help that the code in question is called thousands of times before the problem occurs... but that's why we have computers manage the memory instead of humans ;-) )
the assertion is thrown during (or right after) one of these sweeps, but I still haven't figured out what the trigger is.
From the assertion that fails, it happens in gc_free()
which is never called during a garbage collection phase/sweep. gc_free()
is only called by code when it explicitly knows that it doesn't need the memory anymore. This is usually in the compile phase, so probably happens when exec()
is compiling its argument. It'd be good if you could see a backtrace at the point of the assert failure.
@ddiminnie tried again, this time building MicroPython with the VS2015 (v140) toolset (which should be what you are using, right?), still cannot reproduce it with x86 release builds, also not for v1.10 :(
@stinos: Hmm... Maybe we need more configuration information (should have listed this in the first place, especially since the problem appears to be hard to reproduce outside of the systems we have here):
Our build and test systems are Windows 7 (x64)-based machines.
We've run into the cited problem on
Toolchain: IDE: Visual Studio Professional 2015, Version 14.0.25431.01 Update 3 Compiler/Linker: Microsoft Visual C++ 2015
Build Configuration:
@dpgeorge: I'm still having trouble getting a stack trace at the time the assertion failure occurs.
I'm going to try to switch to the '_ASSERT' macro from 'crtdbg.h' (and link in the corresponding debug runtime libraries) for the offending line (unlike 'assert', '_ASSERT' actually calls DebugBreak, so it should be possible to get a stack trace). (This is just part of the fun of debugging in Windows Land...)
@dpgeorge: I lied: I can get some stack information at assert time from ProcessExplorer (doesn't have line numbers, but does at least indicate which functions were called... and I may be able to use the disassembly in VS to map the offsets to at least some of the code lines... later today...). Here's the stack from bottom to the _wassert call (everything above that is system level calls for the crash dialog, etc.):
ucrtbase.DLL!_wassert+0x16 micropython.exe!gc_free+0x83 micropython.exe!mp_parse+0xac2 micropython.exe!mp_parse_compile_execute+0x76 micropython.exe!eval_exec_helper+0x11b micropython.exe!mp_builtin_exec+0x10 micropython.exe!fun_builtin_var_call+0x80 micropython.exe!mp_execute_bytecode+0x1de8 micropython.exe!fun_bc_call+0xc9 micropython.exe!mp_call_function_0+0x40 micropython.exe!execute_from_lexer+0x18e micropython.exe!dorepl+0x3a3 micropython.exe!main+0x996 micropython.exe!main+0x17 micropython.exe!__scrt_common_main_seh+0xf9 kernel32.dll!BaseThreadInitThunk+0x12 ntdll.dll!RtlInitializeExceptionChain+0x63 ntdll.dll!RtlInitializeExceptionChain+0x36
... and here is that stack trace in more human-readable form:
ucrtbase.DLL!_wassert+0x16 micropython.exe!gc_free(void ptr) Line 587 micropython.exe!mp_parse(_mp_lexer_t lex, mp_parse_input_kind_t input_kind) Line 1168 micropython.exe!mp_parse_compile_execute(_mp_lexer_t lex, mp_parse_input_kind_t parse_input_kind, _mp_obj_dict_t globals, _mp_obj_dict_t locals) Line 1420 micropython.exe!eval_exec_helper(unsigned int n_args, void const args, mp_parse_input_kind_t parse_input_kind) Line 146 micropython.exe!mp_builtin_exec(unsigned int n_args, void const args) Line 155 micropython.exe!fun_builtin_var_call(void self_in, unsigned int n_args, unsigned int n_kw, void const args) Line 127 micropython.exe!mp_execute_bytecode(_mp_code_state_t code_state, void volatile inject_exc) Line 922 micropython.exe!fun_bc_call(void self_in, unsigned int n_args, unsigned int n_kw, void const args) Line 288 micropython.exe!mp_call_function_0(void fun) Line 599 micropython.exe!execute_from_lexer(int source_kind, const void source, mp_parse_input_kind_t input_kind, bool is_repl) Line 147 micropython.exe!dorepl() Line 259 micropython.exe!main(int argc, char * argv) Line 638 micropython.exe!main(int argc, char** argv) Line 420 micropython.exe!__scrt_common_main_seh+0xf9 kernel32.dll!BaseThreadInitThunk+0x12 ntdll.dll!RtlInitializeExceptionChain+0x63 ntdll.dll!RtlInitializeExceptionChain+0x36
Of course this isn't the whole picture (I need to see if I can find a way to capture the gc heap at the time of the assertion failure, but even then the source of the problem may have occurred several steps earlier in the execution sequence).
Welp... I'm an idiot (which should be apparent from following this thread). I added the following lines to gc_free just before the offending assertion (allowing me to set a breakpoint that is only hit when the following assertion will fail), and can now confirm that the stack trace above is correct.
size_t block = BLOCK_FROM_PTR(ptr);
// Debugging code FIXME!!!
if (ATB_GET_KIND(block) != AT_HEAD) {
printf("Foo!\n");
}
// End of debugging code FIXME!!!
assert(ATB_GET_KIND(block) == AT_HEAD);
Moreover, I've been able to dump the 1Mb memory space allocated for the garbage collector: gcmem.zip
@ddiminnie I also tested this on a Windows 7 system with the specs you mention. But can you state the actual version of the compiler please (i.e. output of cl
on a VS2015 commandline)? And while we're at it: what is your Windows SDK version (I highly doubt this matters, but still..). Thats the WindowsSDKLibVersion environment variable, or what is selected under Project properties->General->Target Platform Version.
The 'assert' macro used in the project maps to the Microsoft CRT '_wassert' macro, which prints a message and then immediately aborts.
I don't think _wassert is a macro, but a function, so all you needed to do was setting a breakpoint on _wassert. Or else, break on abort: as uasual assert is a macro which eventually calls C's abort() function.. You see, debugging in Windows really isn't that different or more difficult than on another OS :)
(When running outside the debugger, there's a dialog box that allows us to 'Debug'... but that simply opens the debugger and then aborts... so no stack trace for me :-( )
This is a bit more tricky but with some digging you could still have figured this out: you have a notion of what the CRT is, you know _ASSERT, you can see there's different ways of handling what the program does on asserts, because yes, it's strange that assert just aborts without the more typcial 'Assertion failed' dialog and ways to hook the debugger, right? The documentation of _ASSERT (for instance) has the clue: turns out there's a _CrtSetReportMode function which alters what happens on asserts exactly in the way witnessed here. And indeed, that gets used in MicroPython, see init.c. Just remove the offending lines and you will get a chance to hook the debugger when running the exe directly.
@stinos: Just to level-set, I'm not really a software developer (its MUCH worse than that - I'm a mathematician... If you've ever worked in an academic environment, you'll know what that entails... :-) ). My day job consists of validation of algorithms (when possible), or at the very least, testing of implementations of mathematical software. So, I know my way around (some of) the basics of an IDE, can write (simple) programs in C, and basically know just enough to get myself in trouble. (I do the vast majority of my programming in python, for the record.)
I say this to make communications simpler as we proceed forward.
Now, as for your questions: The actual version of the compiler we're using is Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x86 The Target Platform Version (from the micropython project) is 8.1 (And yes, _wassert and abort are functions, but they live in the CRT, and I don't have the source for that. I could try to find them in the disassembly and break there, but I found a much kludgier way to achieve the same end - see my previous comment... though, looking back it occurs to me that I didn't say which block gc_free was attempting to free, so the information I gave wasn't overly useful. Your last suggestion to check out init.c was useful - I'll keep it in mind for the future!)
@stinos: In response to your much earlier request for a simpler example, I'm still struggling to find one that triggers this issue (well... if it were simple, lots of users would have complained by now, and any underlying problem would have been fixed long ago).
We originally ran into the problem on our own customized Windows port of CircuitPython, which sets the heap size to 20480 bytes, and includes a few other tweaks that (hopefully) don't affect the behavior we're chasing here. We had backported a pull request (MicroPython issue #4133), and were running tests to verify we had merged the changes properly. The suite in question used a mix of decimal values from CPython's 'floating_points.txt' file, along with some additional values of interest to us. We ran the tests by writing to the stdin of our python executable and reading from stdout of the executable.
Thus, the original motivating example was somewhat simpler (but more complex to communicate):
>>> <Instruction to disable our custom serial handshaking code>
>>> from <larger module containing flt_hex> import flt_hex
>>> flt_hex(8.988465674311582e+307)
>>> flt_hex(1.0726246343954082e+155)
>>> flt_hex(274877906944.00024)
<<< 1031 other examples omitted >>>
At (or around) the 302nd test we consistently hit the ATB_GET_KIND(block) != AT_HEAD assertion on our custom win32 build (but NOT our custom win64 build).
(We use the 'flt_hex' function to allow us to reconstruct the exact floating-point value that MicroPython created for each example within our external test harness, to allow us to compare the value to reference IEEE-754 binary64 equivalents of the corresponding decimal values... just in case anyone was wondering at this point...)
Naturally, the same set of examples doesn't trigger the problem on any of the MicroPython builds we've created (even after modifying the heap size), so we had to come up with another example. The result of some tinkering was the (awful) example in the original description above (which has the added benefit that it also triggers the issue in our custom build).
Unfortunately, it appears that the cited problem has (so far) only been reproduced on builds created at our site. (At least we can reproduce the problem on executables built on our official build machine, on my laptop, and on the laptop of one of the (actual) developers assigned to this project...)
If you've ever worked in an academic environment, you'll know what that entails... :-) ).
I've been in academic environments more than anywhere else so yes I get your point :P
The actual version of the compiler we're using is Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x86 The Target Platform Version (from the micropython project) is 8.1
19.00.24234.1 here but for the rest it's the same.
And yes, _wassert and abort are functions, but they live in the CRT, and I don't have the source for that.
Sorry should have been more clear. You don't need the source, you can set breakpoints at any address, and VS helps you with translating functions to addresses: use Debug->New Breakpoint->Function Breakpoint...
and enter 'abort'..
Unfortunately, it appears that the cited problem has (so far) only been reproduced on builds created at our site
Well, as I was typing this I thought "let's try it one more time to be sure" and for reasons I don't know I can now suddenly but reliably reproduce the issue (on a Windows 10 VM, both with VS2015 and VS2017 toolsets!, but still nothing on my main pc). Yay. So it's like really really random and maybe it's colder today and things were set loose :)
I can now suddenly but reliably reproduce the issue (on a Windows 10 VM, both with VS2015 and VS2017 toolsets!, but still nothing on my main pc).
Ok, that's good news, that it's now more reproducible.
@stinos can you confirm which NLR setting the build is using, either MICROPY_NLR_SETJMP or MICROPY_NLR_X86? It might be that it needs to trace more of the registers in py/nlrx86.c
, eg ecx
and edx
.
This is all with MICROPY_NLR_SETJMP: the assembly syntax of the other implementations is incompatible with cl.
@stinos: Awesome! (And thanks for the pointers!) This is (hopefully) the last in a (seemingly) unending string of hard-to-reproduce errors I've had to deal with on the multiple simultaneous projects I've been working (the rest have been in our own code... nothing to do with MicroPython)...
@dpgeorge the assert happens when freeing the lexer's indent_level
member in mp_lexer_free
so I assume either something goes wrong in the previous call to gc_collect. I don't really know how to figure out what though, any pointers on how to debug this? I was thinking a 32bit unix build with setjmp might also exhibit the problem.
@stinos I tried a 32-bit unix build with MICROPY_NLR_SETJMP enabled, but didn't see any issues with executing the flt_hex()
function. Can you please confirm if the build you see the error on has !MICROPY_GCREGS_SETJMP enabled or not? And see if changing this option still has the problem (according to @ddiminnie it does).
If we assume it's failing to mark and/or trace the mp_lexer_t
object during a GC, that would lead to the crash described here. And that is a possibility: for exec()
calls the lexer is allocated on the heap, and passed as the first argument (likely tail-call optimised) to mp_parse_compile_execute()
, which doesn't do any memory allocations and passes the lexer to mp_parse()
as the first argument, and never uses this object again. So no GC should have occurred between allocating indent_level
for the first time in mp_lexer_new()
and getting to mp_parse()
. Then, if the lexer object is not being traced, it's because in mp_parse()
the stack and/or regs are not being correctly traced. Because most functions in parse.c are static, the compiler could heavily optimise mp_parse()
to the point where the lexer object is contained only in a register (ie not on the stack). In such a case we should inspect what register the lexer is held in.
Note that the problem could also lie in mp_lexer_new()
: it may trigger a GC just after allocating indent_level
and before returning. And the lexer pointer in this function most likely lives in a register.
@stinos are you able to provide me with a complete disassembly of lexer.c and parse.c, for the binary that crashes?
If tracing as described above is not the problem then it's probably something like a memory corruption.
@dpgeorge This is all with MICROPY_GCREGS_SETJMP set to 1 (just like for nlr the assembly code for __x86_64__ isn't compatible, so even when setting it to 0 in mpconfigport.h for instance, MICROPY_GCREGS_SETJMP gets reset to 1 in gccollect.c so that is probably what @ddiminnie witnessed).
Thanks a lot for the explanation; when testing some more yesterday I was slowly coming to a similar conclusion as I saw gc_collect occurrences where indent_level
wasn't getting marked (but the lexer struct itself was, if I remember correctly, but don't take my word for it) which would inevitably be followed by the assertion failure, but I didn't figure out yet why, nor where to go from there.
So now I simply checked for gc_collect()
calls in between the allocation of indent_level
and the first call to m_new()
in mp_parse()
, and those occur (like for instance in the vstr_init()
call in mp_lexer_new()
as you mention), and when they do they always lead to the assertion failing so your analysis so far is spot on :) I also got the assertion a couple of other times but that might be because I'm not checking for all gc_collect()
calls, I'd rather place my bets on that instead of memory corruption. I haven't got time yet to see why indent_level
isn't getting marked.
I'm attaching the compiler-generated assembly/source listings which might not be the exact same thing as disassembling the object files but should be close enough I think, and fairly readable.
@dpgeorge: For the benefit of anyone browsing the bug list, should I change the title of this thread to something more appropriate (e.g. "Win32 micropython builds trigger assertion failure upon repeated..." etc.)?
Thanks @stinos for the disassembly. Some points to note:
m_malloc
, m_free
) are inlined in the parser codeecx
and edx
, rest on the stackmp_lexer_new()
and mp_parse()
the lexer pointer is indeed stored in a register during the course of the whole function, in edi
gc_collect()
is working correctly@stinos are you able to provide the disassembly of __setjmp3
? That's the key part here, to see if it's saving edi
@ddiminnie the title is ok for now.
Indeed 'Link Time Code Generation' aka LTO is enabled (that's also one of the reasons I can't just disassemble the object files it seems).
Btw I enjoy debugging this but there's only some limited amount of time I can spend on this and wasted hours because adding a bunch of statements to make debugging easier often results in an executable which doesn't exhibit the problem. So by now I'm out of time for today which is a pitty because I'm getting closer. The problem I'm seeing now is:
mp_lexer_t *lex
allocated in mp_lexer_new()
gets stored in edi
mp_lexer_new
, indent_level
is allocated by gc_alloc()
which in turn pushes edi
on the stackgc_collect()
gets triggered__setjmp3()
the stack looks like this:
0x0049F4E8 0049f510 -> pointer to `arr` in gc_collect_regs_and_stack, i.e. regs in gc_collect_regs_and_stack
0x0049F4EC 00000000
0x0049F4F0 0049f558
0x0049F4F4 0034d7a6
0x0049F4F8 002cffc0 -> this is mp_lexer_t *lex from mp_lexer_new, pushed here by gc_alloc
0x0049F4FC 00000014
0x0049F500 002cffa0
0x0049F504 00000010
0x0049F508 00000000
0x0049F50C 00000000
0x0049F510 00000000 -> regs allocated in gc_collect_regs_and_stack
....
__setjmp3()
stores everything including edi
but that's of not much use here unfortunatelygc_collect_root()
gets called but is passed a pointer to regs
which is later on the stack than the pointer to mp_lexer_t
so the latter doesn't get markedindent_level
gets allocated, everything continues and assertion fails in gc_free()
Disassembly for __setjmp3()
:
0FE53DA0 8B 54 24 04 mov edx,dword ptr [esp+4]
0FE53DA4 89 2A mov dword ptr [edx],ebp
0FE53DA6 89 5A 04 mov dword ptr [edx+4],ebx
0FE53DA9 89 7A 08 mov dword ptr [edx+8],edi
0FE53DAC 89 72 0C mov dword ptr [edx+0Ch],esi
0FE53DAF 89 62 10 mov dword ptr [edx+10h],esp
0FE53DB2 8B 04 24 mov eax,dword ptr [esp]
0FE53DB5 89 42 14 mov dword ptr [edx+14h],eax
0FE53DB8 C7 42 20 30 32 43 56 mov dword ptr [edx+20h],56433230h
0FE53DBF C7 42 24 00 00 00 00 mov dword ptr [edx+24h],0
0FE53DC6 64 A1 00 00 00 00 mov eax,dword ptr fs:[00000000h]
0FE53DCC 89 42 18 mov dword ptr [edx+18h],eax
0FE53DCF 83 F8 FF cmp eax,0FFFFFFFFh
0FE53DD2 75 09 jne 0FE53DDD
0FE53DD4 C7 42 1C FF FF FF FF mov dword ptr [edx+1Ch],0FFFFFFFFh
0FE53DDB EB 3B jmp 0FE53E18
0FE53DDD 8B 4C 24 08 mov ecx,dword ptr [esp+8]
0FE53DE1 0B C9 or ecx,ecx
0FE53DE3 74 0A je 0FE53DEF
0FE53DE5 8B 44 24 0C mov eax,dword ptr [esp+0Ch]
0FE53DE9 89 42 24 mov dword ptr [edx+24h],eax
0FE53DEC 49 dec ecx
0FE53DED 75 08 jne 0FE53DF7
0FE53DEF 8B 40 0C mov eax,dword ptr [eax+0Ch]
0FE53DF2 89 42 1C mov dword ptr [edx+1Ch],eax
0FE53DF5 EB 21 jmp 0FE53E18
0FE53DF7 8B 44 24 10 mov eax,dword ptr [esp+10h]
0FE53DFB 89 42 1C mov dword ptr [edx+1Ch],eax
0FE53DFE 49 dec ecx
0FE53DFF 74 17 je 0FE53E18
0FE53E01 56 push esi
0FE53E02 57 push edi
0FE53E03 8D 74 24 1C lea esi,[esp+1Ch]
0FE53E07 8D 7A 28 lea edi,[edx+28h]
0FE53E0A 83 F9 06 cmp ecx,6
0FE53E0D 76 05 jbe 0FE53E14
0FE53E0F B9 06 00 00 00 mov ecx,6
0FE53E14 F3 A5 rep movs dword ptr es:[edi],dword ptr [esi]
0FE53E16 5F pop edi
0FE53E17 5E pop esi
0FE53E18 2B C0 sub eax,eax
0FE53E1A C3 ret
Definition of __JUMP_BUFFER
which is what I assume setjmp
uses as internal representation:
typedef struct __JUMP_BUFFER
{
unsigned long Ebp;
unsigned long Ebx;
unsigned long Edi;
unsigned long Esi;
unsigned long Esp;
unsigned long Eip;
unsigned long Registration;
unsigned long TryLevel;
unsigned long Cookie;
unsigned long UnwindFunc;
unsigned long UnwindData[6];
} _JUMP_BUFFER;
Thanks @stinos, you were indeed very close. It seems that the LTO is inlining ports/unix/gccollect.c:{gc_collect,gc_collect_regs_and_stack}
in py/gc.c:gc_collect
and allocating the regs
array too early on the stack, and registers with live root pointers are being pushed after it, and hence not traced (and not put into regs
because the register is replaced with a different value by the point regs
is populated).
The assumption made by uPy that is violated here by LTO is that gc_collect_regs_and_stack()
is a "leaf" function that is actually called (not inlined) and hence would have its stack pointer at the very start of the stack.
Maybe the way to fix this is to mark this function as non-inlinable, or at least not touchable by LTO.
@stinos , @dpgeorge: Thanks for taking the time to analyze this issue. In the short time I've worked with MicroPython and CircuitPython, I've been deeply impressed at both the quality of the implementation and the community around it. I'm grateful that the two of you have devoted so mcuh time to a rather obscure bug in a (seemingly) less active configuration.
As a (very) quick-and-dirty experiment, I've turned off 'whole program optimization' (/GL compiler switch) and 'link time code generation' (/LTGC:Incremental linker switch) in the Visual Studio 'micropython' project and rebuilt the win32 release configuration (this is akin to driving a tack with a pile driver, but...). After 11000 iterations of the loop from the original description I have not seen the cited failure. Obviously this is not a long-term fix (though it may be good enough for my team, as performance is not really an issue for us, and this change has the advantage of not affecting the other ports we're building).
Also, I've learned a few things from this exchange (which is always welcome!) - I look forward to seeing the proposed (official) fix (should you decide to pursue it)!
@ddiminnie indeed disabling those optimisations should prohibit the linker from inlining gc_collect_regs_and_stack()
in other files and that might be the fix here. To be sure you'd probably have to look at the generated assembly code. I'll try to come up with a fix shortly.
Sorry for the late comment - I thought I had replied, but it turns out I had not.
With the /GL compiler switch and the /LTGC linker switch disabled, gc_collect()
appears in the link map, but gc_collect_regs_and_stack()
does not, which suggests (unless I'm completely mistaken... which is a distinct possibility) that the latter is still inlined.
Further, the disassembly of the gccollect module shows no call
to gc_collect_regs_and_stack()
(we can see the preceding call to gc_start()
and the subsequent call to gc_end()
), which serves as further evidence that gc_collect_regs_and_stack()
has been inlined:
void gc_collect(void) {
00EEBF80 push ebp
00EEBF81 mov ebp,esp
00EEBF83 sub esp,44h
00EEBF86 mov eax,dword ptr [__security_cookie (0F04004h)]
00EEBF8B xor eax,ebp
00EEBF8D mov dword ptr [ebp-4],eax
//gc_dump_info();
gc_collect_start();
00EEBF90 call gc_collect_start (0ECBA40h)
gc_collect_regs_and_stack();
00EEBF95 lea eax,[ebp-44h]
00EEBF98 push eax
00EEBF99 call gc_helper_get_regs (0EEBFD0h)
00EEBF9E mov eax,dword ptr [mp_state_ctx (0EF4000h)]
00EEBFA3 lea ecx,[ebp-44h]
00EEBFA6 sub eax,ecx
00EEBFA8 shr eax,2
00EEBFAB push eax
00EEBFAC mov eax,ecx
00EEBFAE push eax
00EEBFAF call gc_collect_root (0ECB9C0h)
00EEBFB4 add esp,0Ch
#if MICROPY_PY_THREAD
mp_thread_gc_others();
#endif
#if MICROPY_EMIT_NATIVE
mp_unix_mark_exec();
#endif
gc_collect_end();
00EEBFB7 call gc_collect_end (0ECB990h)
Although disabling these switches does make the problem in the original description go away, I'm afraid the underlying problem has just moved around in "memory allocation/de-allocation space" (meaning another as yet undiscovered example will still cause the problem), assuming inlining of gc_collect_regs_and_stack()
is the source of the issue.
It's not a small patch (it's over 1000 lines) but #4723 is a comprehensive fix for the issue here. The alternative is do some trickery to ensure that all registers are indeed captured when doing a GC scan.
Marking functions with __declspec(noinline)
should as far as I know always force a function to not be inlined so it's an easier and less drastic way to get the wanted results, for example here's a noinline gc_collect()
with gc_collect_start()
and gc_collect_regs_and_stack()
also marked as noinline:
push ecx
call gc_collect_start (0ABD3A0h)
call gc_collect_regs_and_stack (0AE1840h)
call gc_collect_end (0ABD490h)
pop ecx
ret
I verified this results in the 'correct' stack layout where lex
is on the stack which gets scanned and hence gets marked and hence no problem. This should fix all similar cases (and maybe even all possible problematic cases?). Probably just making sure gc_collect_regs_and_stack()
doesn't get inlined should be sufficient?
@dpgeorge ok to make a patch for this?
ok to make a patch for this?
Yes, sounds good.
This should be resolved by 34a7d7ebebc93bf9c4f166b0b523ceab844c7d91, so I'll close the issue.
To confirm, I've applied the patch and run through 50000 iterations of the example in the original description without seeing the issue. Thanks again for all your hard work!
Thank you for discovering it as well, saves future headaches :)
(I apologize in advance for not finding a simpler example/use case for this.) While running tests on our own fork of micropython, we encountered a bug that crashed the python interpreter (on Windows 32-bit builds only). This bug is also present in the 'stock' 1.10 release of Micropython. Rather than upload our entire test setup, the following (somewhat simpler) example should serve to illustrate the problem. Using the python module listed at the end of this comment, try the following (may need to repeat the last instruction a few times to see the issue):
After one or more iterations of the final step, the following assertion is triggered (followed by shutdown of the micropython.exe process):
Listing for file 'flt_hex.py':