larcenists / larceny

Larceny Scheme implementation
Other
202 stars 32 forks source link

Fence/ARM port segfaults while reading CLisp number tests #723

Closed WillClinger closed 9 years ago

WillClinger commented 9 years ago

As of this writing, the ARM port segfaults during several tests, but the CLisp number tests are especially interesting because the segfaults occur while reading the bignum-heavy source code, making it easier to create this small test program that reproduces the segmentation faults:

% cat temp4.scm
(define p (open-input-file "temp3.scm"))

(define (loop)
  (let ((x (read p)))
    (if (not (eof-object? x))
    (begin (write x)
           (newline)
           (loop)))))

(loop)

(exit)

The input file:

% cat temp3.scm
;;; The Fence/ARM version of Larceny usually segfaults when it reads this file.
;;; Which of the following data trigger the segfault is sensitive to fairly
;;; small changes in the Fence/ARM source code.  We have also seen that the
;;; file may be read successfully when all of the following data are present,
;;; but a segfault occurs when one of the data is commented out.

;;; As of 24 April 2015, this entire file can be read without a segfault,
;;; but small changes, described below, will generate a segfault.

-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150313939864593445583603568771077260174826348411367609521412133720180359748539721570562669201065857989876521301209899829037444385

;;; If the following symbol is commented out, a segfault will occur.

test

;;; If, in addition to commenting out the symbol above, the last zero
;;; of the following bignum is commented out along with the four digits
;;; that follow it, then the entire file is read without a segfault.

-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150;3139

That temp3.scm file can be extended by repeating the first bignum, et cetera.

To run temp4.scm:

% larceny -- temp4.scm

The program appears to behave the same if it is loaded as follows:

% larceny
> (load "temp4.scm")

Opening the file interactively and calling the read procedure repeatedly does not always produce the same behavior as loading the program.

Making small changes to the source code of the Fence/ARM port may also change the behavior of the small test program shown above, but reading the source code for clisp-number-tests.sps seems to be a reliable way to generate segmentation faults.

WillClinger commented 9 years ago

This segmentation fault happens during garbage collection. With the --stopcopy option, the Fence/ARM port of Larceny runs the CLisp number tests successfully.

Using the --annoy-user-greatly option with the default generational collector to see what happens with and without the test symbol preceding the input's second bignum, the output is the same (modulo heap addresses) until after the first bignum has been printed. When test is present within the input file, it is printed following the first bignum and then a garbage collection occurs before the second bignum is printed:

  Generation 1: Size=5595136, Live=32896, Remset live=0.
  Memory usage: heap 3196928, remset 114688, RTS 25600 words
  Max heap usage: 3196928 words
-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150313939864593445583603568771077260174826348411367609521412133720180359748539721570562669201065857989876521301209899829037444385
test
urs_add_elems rs[]=0x00958738 bot=0xb686f000 top=0xb686f00c
urs_add_elems rs[]=0x00958738 bot=0xb685f000 top=0xb685f004
Dynamic area: garbage collection.
  alloc=4194304  size=5595136  used=32896
  Promoting into generation 1.
REMSET @0x0x958758: scan
REMSET @0x958758: removed 1 elements (total 1).
REMSET @0x0x958868: scan
REMSET @0x958868: removed 0 elements (total 0).
Collection finished.
REMSET @0x0x958758: clear
REMSET @0x0x9587e0: clear
  Generation 1: Size=5595136, Live=39184, Remset live=0.
  Memory usage: heap 3196928, remset 114688, RTS 25600 words
  Max heap usage: 3196928 words
-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150

When the test symbol is commented out, a segmentation fault occurs before that garbage collection is complete:

  Generation 1: Size=5595136, Live=32896, Remset live=0.
  Memory usage: heap 3196928, remset 114688, RTS 25600 words
  Max heap usage: 3196928 words
-30985722824355332972176356513316569304601382411274079243859710673739383446566598659878378034375348869471278415635671865753349734809209959160389615096293457362383744562507969316522225741589739150453090393424063226271167062127000223628785686999799282795143706407082119829140399988180879618548495395684946331608899565543458192773899200054228140747414544792128323269250618482622488195333106891323515989863192944848391405358725993695671970811097285270641251816244586360288952156538400321933146150313939864593445583603568771077260174826348411367609521412133720180359748539721570562669201065857989876521301209899829037444385
Dynamic area: garbage collection.
  alloc=4194312  size=5595136  used=32896
  Promoting into generation 1.
REMSET @0x0x439758: scan
REMSET @0x439758: removed 0 elements (total 0).
REMSET @0x0x439868: scan
REMSET @0x439868: removed 0 elements (total 0).
Segmentation fault (core dumped)

This could be a bug in the generational collector, but it could also be a bug within millicode or generated code that leaves a corrupt bit pattern in some root register or fails to initialize some root register, stack slot, or object slot. With the latter possibilities, using the stop-and-copy collector instead of the generational collector would not eliminate the segmentation faults; it would just make them harder to reproduce.

WillClinger commented 9 years ago

Correction: With the --stopcopy option, Larceny is able to run the CLisp number tests 3 out of 10 times. With --stopcopy --annoy-user-greatly, it:

and so on. With --stopcopy --nocontract --annoy-user, it either runs to completion or segfaults during a garbage collection. So it seems, anyway.

On about 10% of the collections, the RESULT register (which is printed as GLOB 0) contains an untagged pointer, which will be treated as a fixnum, when the collection commences. That's okay so long as RESULT is dead. The code generated for make-cell does leave an untagged pointer in RESULT when the destination register is other than RESULT, but make-cell shouldn't be used much at all by this program. There might be a benign explanation for this, but it looks wrong.

WillClinger commented 9 years ago

Comparing --stopcopy --nocontract --annoy-user outputs on the CLisp number tests, the ARM port performs 428 collections compared to only 61 by IAssassin. With the ARM port, 51 collections occur when RESULT (GLOB 0) contains a pointer without its proper tag bits. With IAssassin, only 9 collections occur when RESULT (GLOB 10) contains a pointer, and that pointer is always tagged.

Some variation is to be expected because the ARM port uses different millicode and generated code sequences, which might allocate more storage, but 428 collections versus 61 can't be right, and the untagged pointers in RESULT look very wrong.

WillClinger commented 9 years ago

In my previous comment, I said "51 collections occur when RESULT (GLOB 0) contains a pointer without its proper tag bits." I should have ended that sentence after the word pointer.

On a subsequent run, which ran to a successful completion, there were 47 collections in which RESULT held a pointer, and all were initiated by mc_stack_overflow. On the first 24 of those 47 collections, RESULT contained a tagless pointer, and REG5 (GLOB 9) held that same pointer with a pair tag. For the final 23 of those 47, the pointer in RESULT was tagged.

This could indicate some change of state within the runtime system, but I'm more inclined to suspect a change of phase within the program. From measurements with IAssassin (because it doesn't crash), roughly half of the collections occur while reading and compiling the source code.

A change of phase argues for this benign interpretation: Cons operations that create an argument list in the last register (currently REG5) will leave the untagged pointer in RESULT, which will still have that value on entry to a known local procedure. I can test that hypothesis by modifying the code for cons so it leaves a distinctive value in the RESULT register.

WillClinger commented 9 years ago

The untagged pointers in RESULT do not come from cons or make-cell.

WillClinger commented 9 years ago

After modifying src/Rts/Sys/cheney.h by defining CHECK_EVERY_WORD as 1 instead of 0, running test/R7RS/Numbers/clisp-number-tests.sps with --stopcopy --nocontract --annoy-user reliably detects a problem such as:

Bytevector pointer 0xb45cd925 at offset 23 in object 0xb45cdc98 does not point to a bytevector header (0x00000000)!

The addresses vary depending on the memory Linux gives to Larceny, but 4 out of 5 runs reproduced the ...925, offset 23, ...c98, and (0x00000000) parts of that message during the 301st of 428 garbage collections. (The exceptional run produced a similar message during the 205th garbage collection.) Allowing the computation to proceed despite this violation of heap invariants often allows the program to complete normally, with no further errors reported, passing all 2473 tests.

Here is the full output produced by --annoy-user for one of those 301st collections:

Stop-and-copy heap: Garbage collection.
  Avail=-1952,  Used=8656800
  Initiated by mc_alloc ( 8 )
GLOB 0 = 00000008
GLOB 1 = b6545327
GLOB 2 = b6534497
GLOB 3 = 00000002
GLOB 4 = b6548e47
GLOB 5 = b3c2ae75
GLOB 6 = 00000008
GLOB 7 = 00000000
GLOB 8 = b654df4b
GLOB 9 = 0000000a
GLOB 36 = b6534007
GLOB 37 = b6534019
GLOB 38 = b3c2b1db
GLOB 39 = 00000008
GLOB 40 = 00000002
GLOB 41 = 0000000a
GLOB 42 = b3b6c0c1
GLOB 43 = 00000000
Bytevector pointer 0xb3c2ad65 at offset 18 in object 0xb3c2b3e0 does not point to a bytevector header (0x00000000)!
Abort (yes/no)?
Low-level allocator: Sliding page table by 1 entries.
New heap size by policy should be 12977232 bytes
Stop-and-copy heap: Collection finished; Live=1864104
  Memory usage: heap 6484992, remset 0, RTS 33792 words
  Max heap usage: 12450816 words

The Initiated by line was produced by a temporary modification to sc-heap.c.

I had modified emit-save0 to place a distinctive fixnum into the RESULT register immediately before the trap that transfers control to mc_stack_overflow. Prior to the reported violation of a heap invariant, all garbage collections initiated by mc_stack_overflow showed that fixnum as the contents of the RESULT register. Afterwards, however, 17 of the 29 subsequent garbage collections initiated by mc_stack_overflow showed a bytevector in RESULT, while 12 continued to show the distinctive fixnum.

Stack overflow handling can be initiated by stack underflows and by throws to continuations, but neither of those would have produced a message saying the collection was initiated by mc_stack_overflow. (To be absolutely certain, I had added additional instrumentation not described here.) So far as I can tell, the only way to produce a collection initiated by mc_stack_overflow is to execute a $save instruction, which should always have produced the distinctive fixnum in RESULT.

Lars has suspected we are not invalidating the instruction cache often enough. Another possibility is that some segments of machine code have been corrupted.

WillClinger commented 9 years ago

It looks as though a vector of length 28 is being allocated without initializing the elements at index 17 and following. Element 0 is the fixnum 83444, and element 1 is a vector that was allocated immediately following the problematic vector (so it was stored into the vector using vector-set!). Element 2 is zero, and element 3 is the fixnum 4332. Element 4 is a procedure, element 5 is the fixnum 8, element 6 the fixnum 154, element 7 the fixnum 133. Element 8 is #f. Element 9 is a bytevector, and elements 10 through 16 are zero.

Element 17 is a bytevector that points to a (headerless) pair whose car is a bytevector. I know this is a pair because the cdr of the pair that follows it in memory is a properly tagged pair pointer pointing to it; the car of that pair is the fixnum 60.

Element 18 is a pair that points to element 15 of the vector, which is a violation of heap invariants. Element 19 is a bignum header created to seal the heap, and element 20 is the distinctive bit pattern used by make_space_for in sc-heap.c and seal_heap in cheney.c when sealing the heap.

The situation described above is moderately reproducible. In one subsequent run, element 17 is the fixnum 2 instead of a bytevector, but element 19 was again a bignum header created to seal the heap.

WillClinger commented 9 years ago

The problematic vector described in my previous comment looks like a stack frame that wasn't fully initialized before it was flushed to the heap by a garbage collection. 0x517d0 (decimal 33776=4*83444) is a plausible return address in C or millicode (because element 2, the saved value of register 0, is zero), and element 1 is the dynamic link to the next heap-allocated stack frame. The distinctive bignum header used to seal the heap was originally near the end of the heap, where stack frames are allocated.

The frame size (28 words) matches the S2S_FRAMESIZE defined in fence-millicode.c for the special stack frame that points to the millicode procedure internal_scheme_return. Element 3 (0x43b0 = decimal 17328=4*4332) is the return offset within the procedure found in element 4, which is the saved R0. Elements 5 through 9 are the saved contents of R1 through R5.

Now we have to speculate a bit, because the comment in fence-millicode.c that describes the layout of this stack frame shows saved registers R0 through R31, followed by saved registers F0 through Fn, followed by the saved RESULT register and ending with the saved RESULT restore flag (which is #f in the problematic vector). That can't be right, because the frame isn't that big. I believe the comment should say the Scheme return offset is followed by saved registers R0 through R5, followed by saved registers F0 through F3 (reserving four words for each), followed by the saved RESULT and ending with the restore flag. If so, then vector elements 10 through 25 should contain saved registers F0 through F3.

The code for mc_scheme_callout is responsible for initializing this special stack frame. It's supposed to chop the 64-bit contents of each floating point register into 16-bit pieces, convert those pieces into fixnums, and store them into the stack frame. The loop that's supposed to do that contains bugs that cause it to initialize stack slots 10 through 16 instead of 10 through 25.

lars-t-hansen commented 9 years ago

Ouch :-)

pnkfelix commented 9 years ago

BTW, This is epic

WillClinger commented 9 years ago

Fixed by changeset 630366c21901f6c5a075d9bd9872d5a6827c5fff

That commit also fixed an unrelated bug (in the Fence back end) that reliably crashed compiler tests in which local-optimization was turned off.

I am still seeing intermittent segfaults. One occurred during the compiler tests at combination 109 of the default optimization switches. Another occurred during the test/R7RS/Lib test of getenv. In both cases, the tests passed when I tried to reproduce the segfault.