stardot / b-em

An opensource BBC Micro emulator for Win32 and Linux
http://stardot.org.uk/forums/viewtopic.php?f=4&t=10823
GNU General Public License v2.0
119 stars 58 forks source link

Crash when repeatedly changing mode #82

Closed ZornsLemma closed 5 years ago

ZornsLemma commented 5 years ago

I don't think the mode changing is directly responsible but it seems a good way to provoke the issue. I think I've had these crashes occur occasionally in normal use, but I can't be sure.

I am using the current master branch (72e615114a9e240de3aadc3219314ec1b034bb82) in Master 128 mode on Ubuntu 18.04.2 x86-64. This is running on bare metal, not in a VM; it's an i5-2500K CPU and I'm using the integrated graphics. I've attached the versions of the allegro libraries in case that's helpful: allegro-libs.txt

If I type "REPEAT:MODE RND(8)-1:UNTIL FALSE", after a while - usually within about five minutes - b-em crashes. I've run it under the debugger and I get this backtrace, which doesn't mean a huge amount to me at the moment: `(gdb) run Starting program: /home/steven/src/b-em5/b-em [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". [New Thread 0x7fffeab3b700 (LWP 28929)] [New Thread 0x7fffe853d700 (LWP 28930)] [New Thread 0x7fffe7d3c700 (LWP 28931)] [New Thread 0x7fffe753b700 (LWP 28932)] [New Thread 0x7fffe68f3700 (LWP 28933)] linux: b-em found at /home/steven/.config/b-em/b-em.cfg [New Thread 0x7fffd6d9f700 (LWP 28934)] [New Thread 0x7fffd659e700 (LWP 28935)] [New Thread 0x7fffbf42d700 (LWP 28936)] [New Thread 0x7fffbec2c700 (LWP 28937)] [New Thread 0x7fffbe42b700 (LWP 28938)] [New Thread 0x7fffbdc2a700 (LWP 28939)] [New Thread 0x7fffbd429700 (LWP 28940)] [New Thread 0x7fffbcc28700 (LWP 28941)] [New Thread 0x7fff73fff700 (LWP 28942)] [New Thread 0x7fff737fe700 (LWP 28943)] [New Thread 0x7fff72ffd700 (LWP 28944)] [New Thread 0x7fff727fc700 (LWP 28945)] [New Thread 0x7fff71ffb700 (LWP 28946)] [Thread 0x7fffe68f3700 (LWP 28933) exited] double free or corruption (!prev)

Thread 1 "b-em" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 0x00007ffff5d2ee97 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0x00007ffff5d30801 in __GI_abort () at abort.c:79

2 0x00007ffff5d79897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff5ea6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181

3 0x00007ffff5d8090a in malloc_printerr (str=str@entry=0x7ffff5ea8890 "double free or corruption (!prev)") at malloc.c:5350

4 0x00007ffff5d87e84 in _int_free (have_lock=0, p=0x555557158d50, av=0x7ffff60dbc40 )

at malloc.c:4281

5 0x00007ffff5d87e84 in __GI___libc_free (mem=0x555557158d60) at malloc.c:3124

6 0x00007ffff6f2b560 in _al_ogl_unlock_region_new () at /usr/lib/x86_64-linux-gnu/liballegro.so.5.2

7 0x00007ffff6e9a4d3 in al_unlock_bitmap () at /usr/lib/x86_64-linux-gnu/liballegro.so.5.2

8 0x0000555555609931 in video_doblit (non_ttx=, vtotal=)

at vidalleg.c:279

9 0x000055555560cf18 in video_poll (clocks=0, clocks@entry=2, timer_enable=timer_enable@entry=1)

at video.c:834

10 0x000055555556e053 in polltime (c=2) at 6502.c:176

11 0x000055555556e053 in branchcycles (temp=3) at 6502.c:956

12 0x000055555557add1 in m65c02_exec () at 6502.c:5198

13 0x00005555555f0d1b in main_timer (event=0x7fffffffd7e0) at main.c:433

14 0x00005555555f0d1b in main_run () at main.c:491

15 0x000055555556d79e in main (argc=, argv=) at main.c:584

` I'm sorry to be vague but I do wonder if it might be necessary to get bored and let the emulator window get covered up by other windows as you use the PC to do other stuff while waiting for it to crash; if I sit there with the emulator in the foreground it seems to be more reliable, but I wouldn't swear this is accurate.

Let me know if you need any more information or if there's anything I can do to help with this...

Cheers.

Steve

SteveFosdick commented 5 years ago

This is an interesting one. Working through the backtrace entries 15 down to 8 are inside B-Em and #7 is in allegro.

B-Em is not doing any memory allocation around changing mode so a double free should not be the issue. Given the crash happening in al_unlock_bitmap it is possible the video emulation is writing outside the bitmap it has allocated. If this is only provoked when changing mode frequently I wonder if it relates to one of the 16bit registers in the 6845 being written as two writes of 8bits each and the video emulation using a half-written value in between.

SteveFosdick commented 5 years ago

In video.c I temporarily changes the put_pixel and put_pixels functions to check their input x and y values. This isn't usually there for performance reasons. I found x was being exceeded from a couple of places, once in mode7_render and a couple of other places.

Try branch sf/issue#82 which should fix this.

SteveFosdick commented 5 years ago

The next thing to do with this is to test it still runs some games/demo programs without obvious screen corruption.

ZornsLemma commented 5 years ago

Thanks for the prompt response on this! Unfortunately the fix hasn't worked for me, although it may be a bit more reliable. Here's the (almost identical) backtrace from a crash I had just now:

51  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) 
(gdb) bt
#0  0x00007ffff5d2ee97 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff5d30801 in __GI_abort () at abort.c:79
#2  0x00007ffff5d79897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff5ea6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff5d8090a in malloc_printerr (str=str@entry=0x7ffff5ea8890 "double free or corruption (!prev)") at malloc.c:5350
#4  0x00007ffff5d87e84 in _int_free (have_lock=0, p=0x555557158be0, av=0x7ffff60dbc40 <main_arena>)
    at malloc.c:4281
#5  0x00007ffff5d87e84 in __GI___libc_free (mem=0x555557158bf0) at malloc.c:3124
#6  0x00007ffff6f2b560 in _al_ogl_unlock_region_new () at /usr/lib/x86_64-linux-gnu/liballegro.so.5.2
#7  0x00007ffff6e9a4d3 in al_unlock_bitmap () at /usr/lib/x86_64-linux-gnu/liballegro.so.5.2
#8  0x0000555555609931 in video_doblit (non_ttx=<optimised out>, vtotal=<optimised out>)
    at vidalleg.c:279
#9  0x000055555560cf5e in video_poll (clocks=1, clocks@entry=2, timer_enable=timer_enable@entry=1)
    at video.c:836
#10 0x000055555556d931 in polltime (c=2) at 6502.c:176
#11 0x000055555557b098 in m65c02_exec () at 6502.c:5127
#12 0x00005555555f0d1b in main_timer (event=0x7fffffffd7e0) at main.c:433
#13 0x00005555555f0d1b in main_run () at main.c:491
#14 0x000055555556d79e in main (argc=<optimised out>, argv=<optimised out>) at main.c:584

I've tried modifying put_pixel()/put_pixels() to add a bounds check on x (I can't see how to determine the valid bounds on y) and I'm doing another run now to see what happens.

ZornsLemma commented 5 years ago

OK, I don't really know what I'm talking about, but is this code a bit iffy? In video_poll() we have:

            } else if (scrx < (1280-16)) {                                       
                put_pixels(region, scrx, scry, (ula_ctrl & 0x10) ? 8 : 16, colblack);
                if (vid_linedbl)                                                 
                    put_pixels(region, scrx, scry+1, (ula_ctrl & 0x10) ? 8 : 16, colblack);
                if (!crtc_mode) {                                                
                    put_pixels(region, scrx + 16, scry, 16, colblack);           
                    if (vid_linedbl)                                             
                        put_pixels(region, scrx + 16, scry+1, 16, colblack);     
                }                                                                
            }                                                                    

In the '!crtc_mode' block, we do 16 pixels starting at x=scrx+16, so we could go as high as x=scrx+16+15 but we only checked scrx<1280-16.

Edited to add: When this triggers my bounds check, region->pitch is -5120 and region->pixel_size is 4, so I think we should have 0<=x<1280 at this point; I may have that wrong though, I'm winging it here!

SteveFosdick commented 5 years ago

That does indeed look dodgy. When I thought I had solved it, it was because of an absence of messages about the bounds check rather than just that it did not crash but in put_pixels (with the count) I am not sure I was checking the count.

I have pushed a commit that adds optional bounds checking. It is off by default but if you add: #define PIXEL_BOUNDS_CHECK 1 towards the top of the file it will enable the bounds check and this will report which line had exceeded the bound.

SteveFosdick commented 5 years ago

Ok, doing that I get the message for x=1272 on lines 1046 and 1048 so it's a case that x itself is fine, it's once the count is added that it is too high. Changing 1280-16 to 1280-32 seems to solve the problem. I have pushed that change also to the branch.

On the values of pitch and pixel size, the negative pitch means the line in the bitmap are stored upside down, though the pixels still go left to right. We don't need to adjust our concept of x and y as the put_pixel function should do the right thing.

ZornsLemma commented 5 years ago

Cheers, I've had that latest code running the random-mode-switch test for an hour now without any crashes. (I hacked the pixel bounds check to simply abort() if anything went out of range so I couldn't miss it, and it hasn't triggered.) So I think this has fixed it; as you say above, perhaps the big question is whether it's broken any games/demos... I don't have any of those handy myself, unfortunately.

SteveFosdick commented 5 years ago

OK, I am not much of a game player but I will see what I can find.

SteveFosdick commented 5 years ago

This fix is now merged to master along with Issue #83.