Closed WillClinger closed 9 years ago
This segmentation fault happens during garbage collection. With the --stopcopy
option, the Fence/ARM port of Larceny runs the CLisp number tests successfully.
Using the --annoy-user-greatly
option with the default generational collector to see what happens with and without the test
symbol preceding the input's second bignum, the output is the same (modulo heap addresses) until after the first bignum has been printed. When test
is present within the input file, it is printed following the first bignum and then a garbage collection occurs before the second bignum is printed:
Generation 1: Size=5595136, Live=32896, Remset live=0.
Memory usage: heap 3196928, remset 114688, RTS 25600 words
Max heap usage: 3196928 words
-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150313939864593445583603568771077260174826348411367609521412133720180359748539721570562669201065857989876521301209899829037444385
test
urs_add_elems rs[]=0x00958738 bot=0xb686f000 top=0xb686f00c
urs_add_elems rs[]=0x00958738 bot=0xb685f000 top=0xb685f004
Dynamic area: garbage collection.
alloc=4194304 size=5595136 used=32896
Promoting into generation 1.
REMSET @0x0x958758: scan
REMSET @0x958758: removed 1 elements (total 1).
REMSET @0x0x958868: scan
REMSET @0x958868: removed 0 elements (total 0).
Collection finished.
REMSET @0x0x958758: clear
REMSET @0x0x9587e0: clear
Generation 1: Size=5595136, Live=39184, Remset live=0.
Memory usage: heap 3196928, remset 114688, RTS 25600 words
Max heap usage: 3196928 words
-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150
When the test
symbol is commented out, a segmentation fault occurs before that garbage collection is complete:
Generation 1: Size=5595136, Live=32896, Remset live=0.
Memory usage: heap 3196928, remset 114688, RTS 25600 words
Max heap usage: 3196928 words
-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150313939864593445583603568771077260174826348411367609521412133720180359748539721570562669201065857989876521301209899829037444385
Dynamic area: garbage collection.
alloc=4194312 size=5595136 used=32896
Promoting into generation 1.
REMSET @0x0x439758: scan
REMSET @0x439758: removed 0 elements (total 0).
REMSET @0x0x439868: scan
REMSET @0x439868: removed 0 elements (total 0).
Segmentation fault (core dumped)
This could be a bug in the generational collector, but it could also be a bug within millicode or generated code that leaves a corrupt bit pattern in some root register or fails to initialize some root register, stack slot, or object slot. With the latter possibilities, using the stop-and-copy collector instead of the generational collector would not eliminate the segmentation faults; it would just make them harder to reproduce.
Correction: With the --stopcopy
option, Larceny is able to run the CLisp number tests 3 out of 10 times. With --stopcopy --annoy-user-greatly
, it:
and so on. With --stopcopy --nocontract --annoy-user
, it either runs to completion or segfaults during a garbage collection. So it seems, anyway.
On about 10% of the collections, the RESULT
register (which is printed as GLOB 0
) contains an untagged pointer, which will be treated as a fixnum, when the collection commences. That's okay so long as RESULT
is dead. The code generated for make-cell
does leave an untagged pointer in RESULT
when the destination register is other than RESULT
, but make-cell
shouldn't be used much at all by this program. There might be a benign explanation for this, but it looks wrong.
Comparing --stopcopy --nocontract --annoy-user
outputs on the CLisp number tests, the ARM port performs 428 collections compared to only 61 by IAssassin. With the ARM port, 51 collections occur when RESULT
(GLOB 0
) contains a pointer without its proper tag bits. With IAssassin, only 9 collections occur when RESULT
(GLOB 10
) contains a pointer, and that pointer is always tagged.
Some variation is to be expected because the ARM port uses different millicode and generated code sequences, which might allocate more storage, but 428 collections versus 61 can't be right, and the untagged pointers in RESULT
look very wrong.
In my previous comment, I said "51 collections occur when RESULT
(GLOB 0
) contains a pointer without its proper tag bits." I should have ended that sentence after the word pointer.
On a subsequent run, which ran to a successful completion, there were 47 collections in which RESULT
held a pointer, and all were initiated by mc_stack_overflow
. On the first 24 of those 47 collections, RESULT
contained a tagless pointer, and REG5
(GLOB 9
) held that same pointer with a pair tag. For the final 23 of those 47, the pointer in RESULT
was tagged.
This could indicate some change of state within the runtime system, but I'm more inclined to suspect a change of phase within the program. From measurements with IAssassin (because it doesn't crash), roughly half of the collections occur while reading and compiling the source code.
A change of phase argues for this benign interpretation: Cons operations that create an argument list in the last register (currently REG5
) will leave the untagged pointer in RESULT
, which will still have that value on entry to a known local procedure. I can test that hypothesis by modifying the code for cons
so it leaves a distinctive value in the RESULT
register.
The untagged pointers in RESULT
do not come from cons
or make-cell
.
After modifying src/Rts/Sys/cheney.h
by defining CHECK_EVERY_WORD
as 1 instead of 0, running test/R7RS/Numbers/clisp-number-tests.sps
with --stopcopy --nocontract --annoy-user
reliably detects a problem such as:
Bytevector pointer 0xb45cd925 at offset 23 in object 0xb45cdc98 does not point to a bytevector header (0x00000000)!
The addresses vary depending on the memory Linux gives to Larceny, but 4 out of 5 runs reproduced the ...925
, offset 23
, ...c98
, and (0x00000000)
parts of that message during the 301st of 428 garbage collections. (The exceptional run produced a similar message during the 205th garbage collection.) Allowing the computation to proceed despite this violation of heap invariants often allows the program to complete normally, with no further errors reported, passing all 2473 tests.
Here is the full output produced by --annoy-user
for one of those 301st collections:
Stop-and-copy heap: Garbage collection.
Avail=-1952, Used=8656800
Initiated by mc_alloc ( 8 )
GLOB 0 = 00000008
GLOB 1 = b6545327
GLOB 2 = b6534497
GLOB 3 = 00000002
GLOB 4 = b6548e47
GLOB 5 = b3c2ae75
GLOB 6 = 00000008
GLOB 7 = 00000000
GLOB 8 = b654df4b
GLOB 9 = 0000000a
GLOB 36 = b6534007
GLOB 37 = b6534019
GLOB 38 = b3c2b1db
GLOB 39 = 00000008
GLOB 40 = 00000002
GLOB 41 = 0000000a
GLOB 42 = b3b6c0c1
GLOB 43 = 00000000
Bytevector pointer 0xb3c2ad65 at offset 18 in object 0xb3c2b3e0 does not point to a bytevector header (0x00000000)!
Abort (yes/no)?
Low-level allocator: Sliding page table by 1 entries.
New heap size by policy should be 12977232 bytes
Stop-and-copy heap: Collection finished; Live=1864104
Memory usage: heap 6484992, remset 0, RTS 33792 words
Max heap usage: 12450816 words
The Initiated by
line was produced by a temporary modification to sc-heap.c
.
I had modified emit-save0
to place a distinctive fixnum into the RESULT
register immediately before the trap that transfers control to mc_stack_overflow
. Prior to the reported violation of a heap invariant, all garbage collections initiated by mc_stack_overflow
showed that fixnum as the contents of the RESULT
register. Afterwards, however, 17 of the 29 subsequent garbage collections initiated by mc_stack_overflow
showed a bytevector in RESULT
, while 12 continued to show the distinctive fixnum.
Stack overflow handling can be initiated by stack underflows and by throws to continuations, but neither of those would have produced a message saying the collection was initiated by mc_stack_overflow
. (To be absolutely certain, I had added additional instrumentation not described here.) So far as I can tell, the only way to produce a collection initiated by mc_stack_overflow
is to execute a $save
instruction, which should always have produced the distinctive fixnum in RESULT
.
Lars has suspected we are not invalidating the instruction cache often enough. Another possibility is that some segments of machine code have been corrupted.
It looks as though a vector of length 28 is being allocated without initializing the elements at index 17 and following. Element 0 is the fixnum 83444, and element 1 is a vector that was allocated immediately following the problematic vector (so it was stored into the vector using vector-set!
). Element 2 is zero, and element 3 is the fixnum 4332. Element 4 is a procedure, element 5 is the fixnum 8, element 6 the fixnum 154, element 7 the fixnum 133. Element 8 is #f
. Element 9 is a bytevector, and elements 10 through 16 are zero.
Element 17 is a bytevector that points to a (headerless) pair whose car is a bytevector. I know this is a pair because the cdr of the pair that follows it in memory is a properly tagged pair pointer pointing to it; the car of that pair is the fixnum 60.
Element 18 is a pair that points to element 15 of the vector, which is a violation of heap invariants. Element 19 is a bignum header created to seal the heap, and element 20 is the distinctive bit pattern used by make_space_for
in sc-heap.c
and seal_heap
in cheney.c
when sealing the heap.
The situation described above is moderately reproducible. In one subsequent run, element 17 is the fixnum 2 instead of a bytevector, but element 19 was again a bignum header created to seal the heap.
The problematic vector described in my previous comment looks like a stack frame that wasn't fully initialized before it was flushed to the heap by a garbage collection. 0x517d0
(decimal 33776=4*83444) is a plausible return address in C or millicode (because element 2, the saved value of register 0, is zero), and element 1 is the dynamic link to the next heap-allocated stack frame. The distinctive bignum header used to seal the heap was originally near the end of the heap, where stack frames are allocated.
The frame size (28 words) matches the S2S_FRAMESIZE
defined in fence-millicode.c
for the special stack frame that points to the millicode procedure internal_scheme_return
. Element 3 (0x43b0
= decimal 17328=4*4332) is the return offset within the procedure found in element 4, which is the saved R0. Elements 5 through 9 are the saved contents of R1 through R5.
Now we have to speculate a bit, because the comment in fence-millicode.c
that describes the layout of this stack frame shows saved registers R0 through R31, followed by saved registers F0 through Fn, followed by the saved RESULT
register and ending with the saved RESULT
restore flag (which is #f
in the problematic vector). That can't be right, because the frame isn't that big. I believe the comment should say the Scheme return offset is followed by saved registers R0 through R5, followed by saved registers F0 through F3 (reserving four words for each), followed by the saved RESULT
and ending with the restore flag. If so, then vector elements 10 through 25 should contain saved registers F0 through F3.
The code for mc_scheme_callout
is responsible for initializing this special stack frame. It's supposed to chop the 64-bit contents of each floating point register into 16-bit pieces, convert those pieces into fixnums, and store them into the stack frame. The loop that's supposed to do that contains bugs that cause it to initialize stack slots 10 through 16 instead of 10 through 25.
Ouch :-)
BTW, This is epic
Fixed by changeset 630366c21901f6c5a075d9bd9872d5a6827c5fff
That commit also fixed an unrelated bug (in the Fence back end) that reliably crashed compiler tests in which local-optimization
was turned off.
I am still seeing intermittent segfaults. One occurred during the compiler tests at combination 109 of the default optimization switches. Another occurred during the test/R7RS/Lib
test of getenv
. In both cases, the tests passed when I tried to reproduce the segfault.
As of this writing, the ARM port segfaults during several tests, but the CLisp number tests are especially interesting because the segfaults occur while reading the bignum-heavy source code, making it easier to create this small test program that reproduces the segmentation faults:
The input file:
That
temp3.scm
file can be extended by repeating the first bignum, et cetera.To run
temp4.scm
:The program appears to behave the same if it is loaded as follows:
Opening the file interactively and calling the
read
procedure repeatedly does not always produce the same behavior as loading the program.Making small changes to the source code of the Fence/ARM port may also change the behavior of the small test program shown above, but reading the source code for
clisp-number-tests.sps
seems to be a reliable way to generate segmentation faults.