tkchia / gcc-ia16

Fork of Lambertsen & Jenner (& al.)'s IA-16 (Intel 16-bit x86) port of GNU compilers ― added far pointers & more • use https://github.com/tkchia/build-ia16 to build • Ubuntu binaries at https://launchpad.net/%7Etkchia/+archive/ubuntu/build-ia16/ • DJGPP/MS-DOS binaries at https://gitlab.com/tkchia/build-ia16/-/releases • mirror of https://gitlab.com/tkchia/gcc-ia16
GNU General Public License v2.0

ICE when building libgcc with -mno-callee-assume-ss-data-segment #102

Closed asiekierka closed 2 years ago

asiekierka commented 2 years ago
$ make clean-target-libgcc
# ...
$ make all-target-libgcc CFLAGS_FOR_TARGET="-mno-callee-assume-ss-data-segment"
# ...
../../../gcc-ia16/libgcc/libgcc2.c: In function ‘__popcounthi2’:
../../../gcc-ia16/libgcc/libgcc2.c:858:1: internal compiler error: in elimination_costs_in_insn, at reload1.c:3624
 }
 ^

I'm starting to feel like it might be better to just add support for the Compact memory model... I'm dealing with a DS != SS platform, but -mno-callee-assume-ss-data-segment causes a lot of caveats (for example, va_list is currently expected to live in DS).

tkchia commented 2 years ago

Hello @asiekierka,

Yes, unfortunately support for %ss ≠ .data code is still extremely hacky and involves some not-quite-reliable deep black magic. (GCC was never quite designed in the first place to work with non-flat address spaces. Plus, the libgcc library was definitely not written with a %ss ≠ .data environment in mind.)

I am not so sure you really need to recompile an entire runtime library to work with %ss ≠ .data though.

Depending on what you want to do, you might be able to get by, with marking the relevant routines (e.g. event handlers) in your code as __attribute__ ((no_assume_ss_data)) or even __attribute__((interrupt)), while leaving the rest of your program working with %ss = .data.


(That said, a compiler crash is always a bug. I see that gcc-ia16 is having trouble with this RTL pattern at reload1.c line 3624:

(insn 25 24 2 2 (use (reg:PHI 24)) -1
     (nil))
(insn 27 26 9 2 (set (subreg:QI (reg:HI 38) 0)
        (mem/u/j:QI (plus:HI (plus:HI (zero_extend:HI (subreg:QI (reg:HI 38) 0))
                    (reg/f:HI 35))
                (unspec:HI [
                        (reg:PHI 24)
                    ] 1)) [0 __popcount_tab S1 A8])) ../../../../gcc-ia16/libgcc/libgcc2.c:854 -1
     (nil))

Let me see what I can find out about this crash.)

Thank you!

asiekierka commented 2 years ago

I'm developing a toolchain for the WonderSwan handheld with two targets:

It may be possible that, in the second case, I could pay a performance penalty (16-bit -> 8-bit bus) and allocate stack space in the data segment instead. However, (a) this could prevent using the toolchain to easily recompile more complex existing WonderWitch software, (b) I am not yet sure how this interacts with the abstraction layer ("FreyaOS") provided by the device. The development kit's assumption of DS != SS has caused problems even in the early 00s, after all.

EDIT: I've done some more debugging. The reason is that the SRAM is banked: "BIOS/OS" code uses a different bank from user code. As such, if one enters a FreyaBIOS call with SS = 0x1000, the stack can be trashed mid-interrupt, leading to undefined behavior. This might be worked around by changing SS/SP before an interrupt call to point to internal console RAM, then changing it back after the interrupt call. However, there's still the issue of console-emitted interrupts, and I wouldn't really want to wrap every single IVT entry on top of an abstraction layer. So that won't be great either.

I am aware that -mno-callee-assume-ss-data-segment is a hack, but I've noticed you say in past issues that a compiler crash is always a bug, so I wanted to report it from my observations nonetheless - especially as the reproduction was so easy.

tkchia commented 2 years ago

Hello @asiekierka,

I found the cause of the crash. The insn pattern that confused the compiler was for an xlat instruction: the ia16.md knew about an al := [bx + al] pattern (so to speak) where there was no explicit segment override, but did not know about an al := seg:[bx + al] pattern with an explicit segment override term (the (unspec: ...) portion). So I have now fixed this by adding the relevant insn pattern to ia16.md.

I have also added a test case for this to the regression test suite.

Thank you!
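
For readers unfamiliar with xlat, the two patterns described above can be modeled in portable C. This is only a sketch of the instruction's semantics on a simulated real-mode memory image, not the ia16.md pattern itself; the names mem, linear and xlat_model are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 1 MiB real-mode memory image, for illustration only. */
enum { MEM_SIZE = 1 << 20 };
static uint8_t mem[MEM_SIZE];

/* Real-mode address translation: segment * 16 + offset, wrapped to 20 bits. */
static uint32_t linear(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & (MEM_SIZE - 1);
}

/*
 * Model of "al := seg:[bx + al]".
 * Without a prefix, seg is DS; with a segment override prefix, seg is
 * whichever segment register the prefix names. The offset bx + al wraps
 * at 16 bits before the segment base is added.
 */
static uint8_t xlat_model(uint16_t seg, uint16_t bx, uint8_t al)
{
    uint16_t off = (uint16_t)(bx + al);
    uint32_t addr = linear(seg, off);
    assert(addr < MEM_SIZE);
    return mem[addr];
}
```

With DS = 0x1000 and SS = 0x0000, calling xlat_model with the same bx and al but a different seg reads two unrelated bytes, which is why the compiler needs a separate insn pattern once the segment is explicit.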

asiekierka commented 2 years ago

Thank you very much!

ecm-pushbx commented 2 years ago

Hi @tkchia, do you know that xlatb may not support segment overrides on some pre-386 machines? I documented this in https://hg.pushbx.org/ecm/insref/rev/9360e3b5ca1c

tkchia commented 2 years ago

Hello @ecm-pushbx,

Hi @tkchia, do you know that xlatb may not support segment overrides on some pre-386 machines? I documented this in https://hg.pushbx.org/ecm/insref/rev/9360e3b5ca1c

Argh! I admit that I have not heard of such a "bug" before. Also, I just looked up Harald Feldmann's 86bugs.lst (part of RBIL), and did not find any mention of this. This is not very nice, if indeed it is true.

Edit: well, yes, according to benrg on the Retrocomputing Stack Exchange link you mentioned (https://retrocomputing.stackexchange.com/questions/20031/undocumented-instructions-in-x86-cpu-prior-to-80386), MAME's NEC CPU emulators "all don't support a segment override for XLAT (the prefix is allowed but ignored)". But after reading the actual code that he refers to (https://github.com/mamedev/mame/blob/master/src/devices/cpu/v30mz/v30mz.cpp#L3013), I am actually not so sure about that. The implementation for the 0xd7 (xlat) opcode calls a default_base(.) function... which seems to implement a segment override?

Thank you!

ecm-pushbx commented 2 years ago

The implementation for the 0xd7 (xlat) opcode calls a default_base(.) function... which seems to implement a segment override?

Thank you!

You're right, good catch.

tkchia commented 2 years ago

Hello @ecm-pushbx,

I came across this account from Chuck(G) from vcfed.org, who apparently did some tests with a real NEC V20: https://forum.vcfed.org/index.php?threads/the-d6-instruction-on-the-nec-v-series-v20-v30-chips.29471/ . I guess we can safely say that, yes, xlat did indeed support segment overrides even on that CPU. :slightly_smiling_face:

Bottom line is that the D6 takes the low order byte in AL and uses it to fetch one byte from memory at that address into AL. Flags are not affected. So if AL=1, the contents of location 0001 will appear in AL. AH has no part in this and is not modified. Fetches are performed relative to DS, but segment overrides will work. [...]

You beat me to the punch! The D6 is in fact nothing more than an xlat [bx]. Just incomplete instruction decoding in the V-series CPU. D6 == D7.

Thank you!

asiekierka commented 2 years ago

But after reading the actual code that he refers to (https://github.com/mamedev/mame/blob/master/src/devices/cpu/v30mz/v30mz.cpp#L3013), I am actually not so sure about that. The implementation for the 0xd7 (xlat) opcode calls a default_base(.) function... which seems to implement a segment override?

This was changed in MAME in December 2020, which might explain it.

ecm-pushbx commented 2 years ago

@asiekierka Can you link the change you're referring to?

@tkchia I knew that either the V20 or the V30 worked with segment overrides because Robert tested my LZMA-lzip depacker on one once, and at the time it depended on cs xlatb and would crash if the override got ignored. However, this does not rule out that some other machines might lack the support.

tkchia commented 2 years ago

Hello @ecm-pushbx,

@asiekierka Can you link the change you're referring to?

I was able to find a Git mega-commit for the particular line, via its Git blame:

Git blame v30mz.cpp line 3013

The mega-commit apparently came from a pull request (https://github.com/mamedev/mame/pull/7428) which comprised lots of smaller commits.

Thank you!

tkchia commented 2 years ago

Hello @asiekierka,

It may be possible that, in the second case, I could pay a performance penalty (16-bit -> 8-bit bus) and allocate stack space in the data segment instead. However, (a) this could prevent using the toolchain to easily recompile more complex existing WonderWitch software, (b) I am not yet sure how this interacts with the abstraction layer ("FreyaOS") provided by the device. The development kit's assumption of DS != SS has caused problems even in the early 00s, after all.

Weird. In that case, how on earth did the official SDK tell apart addresses which might come from either the fast RAM (starting at 0:0) or the slow SRAM (starting at 0x1000:0) or even ROM areas? I would imagine that, for a routine like

void foo(const char *p)
{
  ...
}

a caller could jolly well pass either an address of a stack variable, or an address of a compile-time string, in p:

void bar(int x)
{
  auto char a[] = "qux";
  static char b[] = "quux";
  static const char c[] = "quuux";
  a[1] += x;
  foo(a);
  b[2] += x;
  foo(b);
  foo(c);
  ...
}

Thank you!

ecm-pushbx commented 2 years ago

Hello @ecm-pushbx,

@asiekierka Can you link the change you're referring to?

I was able to find a Git mega-commit for the particular line, via its Git blame:

Git blame v30mz.cpp line 3013

The mega-commit apparently came from a pull request (mamedev/mame#7428) which comprised lots of smaller commits.

Thank you!

That PR changed here https://github.com/mamedev/mame/pull/7428/files#diff-850b68452f99c23ef07a7f58bbb16d1f86c6be756f7c9b05c8433a90cdedb528L3191 from

            case 0xd7: // i_trans
                m_regs.b[AL] = GetMemB( DS, m_regs.w[BW] + m_regs.b[AL] );
                CLK(5);
                break;

to

            case 0xd7: // i_trans
                m_regs.b[AL] = read_byte(default_base(DS0), m_regs.w[BW] + m_regs.b[AL]);
                clk(5);
                break;

However, GetMemB with DS as first parameter also appears to allow overrides:

inline uint8_t v30mz_cpu_device::GetMemB(int seg, uint16_t offset)
{
    return read_byte( default_base(seg) + offset );
}

asiekierka commented 2 years ago

Hello @asiekierka,

Weird. In that case, how on earth did the official SDK tell apart addresses which might come from either the fast RAM (starting at 0:0) or the slow SRAM (starting at 0x1000:0) or even ROM areas?

Point of clarification first - this is the licensed "personal" SDK, not the toolchain used for making commercial games by publishers. The WonderSwan itself does not pose any limitations on how you use its memory map, as you work on top of bare metal. It's the WonderWitch which introduces an abstraction layer. So, I will be discussing the WonderWitch here, not the WonderSwan in general.

For ROM areas, it didn't really do anything at all. From what I can tell, most software compiled for WonderWitch stores constants in the data segment. On top of that, CS is not constant - software is uploaded onto a rudimentary filesystem, where a contiguous block is allocated to store the program (somewhere in 0x80000 to 0xDFFFF) and no relocation is performed.

For the SRAM/IRAM issue, let's look at this - a contemporary (December 2000) Japanese document describing the "DS != SS" problem. It compares how different compilers reacted to it:

The tl;dr is: Turbo C++ 1.x is what the community seems to have converged on due to its free availability and compiler warnings, while the WonderWitch's development team just expected you to not make errors.

I wonder if/how Digital Mars's 16-bit C/C++ compilers handle it - they were also brought up on the WonderWitch mailing lists as far as I can see, and they are actually open source nowadays!

(For the sake of context, for Wonderful, when targeting the bare metal WonderSwan, I just decided to encourage using __far for ROM/SRAM data, and set DS = SS = 0x0000. This should be good enough for the time being.)

asiekierka commented 2 years ago

Just to finish the discussion on this peculiar DS != SS platform, here's the official "tips & tricks" page discussing the situation (for the first two compilers), translated with DeepL with small edits from myself in parentheses:

According to FreyaOS specifications, the placement and size of each segment during user program execution is typically as follows:

Segment    | Placement                                              | Size (Effective Value)
Code (CS)  | (Stored on cartridge flash memory)                     | (Up to) 64KB (0x8000 ~ 0xDFFF)
Data (DS)  | SRAM (may be banked in Freya interrupt calls/handlers) | (Up to) 64KB (0x1000)
Stack (SS) | (Internal console RAM)                                 | Varies (0x0000)

In terms of memory models, this segment configuration corresponds roughly to the compact model. However, when compiling with the compact model (using, for example, Turbo C) there are aspects of inefficiency, such as the fact that all default pointers are handled as "far". For this reason, code is usually (compiled) with the small model, and those parts of the code that do not work properly as a result of (the compiler assuming) DS == SS are (manually adjusted) to work properly with DS != SS.

For example, if a function or block defines a non-static local variable and references its address, storage for this variable is allocated on the stack. Therefore, the address of this variable must be accessed (with an SS selector), but (as) DS == SS is assumed, code is generated to access it (with a DS selector). In this case, the problem can be avoided by (manually) using (a far pointer) to refer to it.

void foo()
{
  char on_stack[10];      /* allocated on the stack (SS) */
  static char on_data[10];    /* allocated in the data segment (DS) */

  char near *nearp;
  char far  *farp;

  nearp = on_stack;       /* incorrect: points to DS:on_stack */
  farp  = on_stack;       /* correct: points to SS:on_stack */

  nearp = on_data;        /* correct: points to DS:on_data */
  farp  = on_data;        /* correct: points to DS:on_data */
  ...
}

In this context, I think the above assessment that "the WonderWitch's development team just expected you to not make errors" is not entirely hyperbolic.
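
The failure mode in the translated example can be reproduced on a simulated memory image. The sketch below is portable C, not Turbo C: near_ptr, far_ptr, load_near and load_far are hypothetical stand-ins for what the compiler does with near and far pointers, and the segment values 0x1000 (DS) and 0x0000 (SS) are taken from the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 1 MiB real-mode memory image, for illustration only. */
enum { MEM_SIZE = 1 << 20 };
static uint8_t mem[MEM_SIZE];

typedef uint16_t near_ptr;                      /* offset only; segment is implied */
typedef struct { uint16_t seg, off; } far_ptr;  /* explicit segment and offset */

/* Real-mode address translation: segment * 16 + offset, wrapped to 20 bits. */
static uint32_t linear(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & (MEM_SIZE - 1);
}

/* A small-model compiler that assumes DS == SS resolves near pointers via DS. */
static uint8_t load_near(uint16_t ds, near_ptr p)
{
    uint32_t addr = linear(ds, p);
    assert(addr < MEM_SIZE);
    return mem[addr];
}

/* A far pointer carries its own segment, so it stays correct when DS != SS. */
static uint8_t load_far(far_ptr p)
{
    uint32_t addr = linear(p.seg, p.off);
    assert(addr < MEM_SIZE);
    return mem[addr];
}
```

If a stack variable lives at SS:0x0100, then a near pointer holding 0x0100 is read back through DS and returns whatever happens to sit at DS:0x0100, while the far pointer {SS, 0x0100} returns the intended byte.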

ghaerr commented 2 years ago

Hello @asiekierka,

Thank you for your general insights, as well as research and information on the earlier compilers, I've found it quite interesting :)

On the subject of DS != SS, and your last example of the correctness of sometimes using a near pointer to a DS: addressable location, such as:

    static char on_data[10];    /* allocated in the data segment (DS) */
    nearp = on_data;            /* correct: points to DS:on_data */

I wanted to point out that, in some of my experiences, this use nonetheless can result in incorrect behavior. This will occur in ia16-elf-gcc when the near pointer is passed to another function, which loses any assumptions of how the pointer might be accessed given the code generated:

void foo()
{
    static char on_data[10];    /* allocated in the data segment (DS) */
    extern void bar(char *);
    char *nearp;

    nearp = on_data;            /* correct: points to DS:on_data */
    bar(nearp);                 /* incorrect: may cause errors depending on how param is accessed in bar() */
}

The ia16-elf-gcc compiler will sometimes emit code to access the nearp parameter in bar() using (something analogous to) mov offset(%bx),%ax (uses DS), and other times emit mov offset(%bp),%ax (which uses SS), thus delivering the wrong value. This use of BX vs BP seems to be dependent on whether a local stack frame is created for the function, which is dependent on whether local variables are declared.

The entire call tree of functions called from pointer data accessed in this manner would have to be manually inspected to ensure correctness.

Thank you!
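
The BX versus BP difference comes from the default segment rule of 8086 addressing: a memory operand based on BP defaults to SS, while one based on BX defaults to DS. A minimal model in portable C, assuming a simulated memory image; struct cpu, default_segment and load_byte are hypothetical names for this illustration and do not correspond to anything in gcc-ia16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 1 MiB real-mode memory image, for illustration only. */
enum { MEM_SIZE = 1 << 20 };
static uint8_t mem[MEM_SIZE];

enum base_reg { BASE_BX, BASE_BP };

struct cpu { uint16_t ds, ss; };

/* 8086 rule: BP-based operands default to SS, BX-based operands to DS. */
static uint16_t default_segment(const struct cpu *c, enum base_reg base)
{
    return base == BASE_BP ? c->ss : c->ds;
}

/* Load a byte through the given base register with no segment override. */
static uint8_t load_byte(const struct cpu *c, enum base_reg base, uint16_t off)
{
    uint32_t addr = (((uint32_t)default_segment(c, base) << 4) + off)
                    & (MEM_SIZE - 1);
    assert(addr < MEM_SIZE);
    return mem[addr];
}
```

With ds = 0x1000 and ss = 0x0000, the same 16-bit offset yields different bytes depending on which base register the compiler happened to choose, which matches the symptom described above.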

tkchia commented 2 years ago

Hello @ghaerr,

The entire call tree of functions called from pointer data accessed in this manner would have to be manually inspected to ensure correctness.

I think these nearp/farp hacks suggested by the Qute folks were needed because Turbo C did not allow coders to specify that a function might be called with %ss ≠ .data.

Under (recent versions of) gcc-ia16, it is of course much, much better to directly state this fact to the compiler — using __attribute__ ((no_assume_ss_data)) (e.g.). The compiler will then treat stack variables as belonging to a wholly separate address space (__seg_ss) from static storage variables.

Open Watcom also has a similar feature in the form of a __declspec (farss) modifier, and a -zu command line switch.

Thank you!

tkchia commented 2 years ago

Hello @ghaerr, hello @asiekierka,

Anyway, I am starting to be convinced that the execution environment of WonderSwan + WonderWitch is designed to drive people crazy. :neutral_face:

According to "Judge and Dox" (http://jbkun.free.fr/download/wstech24.txt), the console provides a small amount — 16 KiB — of fast RAM at the entire system's disposal. (OK, the WonderSwan Colour provides 64 KiB, which is a bit better.) A large part of this 16/64 KiB is used for graphics tiles (?), and doubtless some of the memory is used for interrupt vectors (at least, exceptions and IRQs). And I am guessing that the "FreyaOS" thing also wants to use some of this fast RAM (?). This leaves less than 8/24 KiB of fast RAM that an app can actually use.

ghaerr commented 2 years ago

Hello @tkchia,

The compiler will then treat stack variables as belonging to a wholly separate address space (__seg_ss) from static storage variables.

Thank you. To clarify my previous example, then, when __attribute__ ((no_assume_ss_data)) is used: passing a near pointer to a DS: addressable variable will still not work, as the near pointer will be assumed to be an offset within a separate SS: address space; but passing a near pointer to an SS: addressable value (such as nearp = on_stack) will work. Correct?

I am starting to be convinced that the execution environment of WonderSwan + WonderWitch is designed to drive people crazy.

Well, certainly one has to be very careful about the nuances of running DS != SS in C: I personally limit its use to a single source file only, having been bitten too many times before by unintended consequences!

Thank you!

tkchia commented 2 years ago

Hello @ghaerr,

Thank you. To clarify my previous example, then, when __attribute__ ((no_assume_ss_data)) is used: passing a near pointer to a DS: addressable variable will still not work, as the near pointer will be assumed to be an offset within a separate SS: address space; but passing a near pointer to an SS: addressable value (such as nearp = on_stack) will work. Correct?

That is so. Basically, on_stack will be considered as having a type of, say, char __seg_ss [10]. If you declare a pointer nearp of type char __seg_ss *, then you can indeed say nearp = on_stack;.

So this compiles without errors:

__attribute__ ((no_assume_ss_data)) void foo()
{
    static char on_data[10];
    extern void bar(char *);
    char *nearp = on_data;
    bar(nearp);
}

as does this:

__attribute__ ((no_assume_ss_data)) void foo()
{
    char on_stack[10];
    extern void bar(char __seg_ss *);
    char __seg_ss *nearp = on_stack;
    bar(nearp);
}

But this will tell you "error: initialization from pointer to non-enclosed address space":

__attribute__ ((no_assume_ss_data)) void foo()
{
    char on_stack[10];
    extern void bar(char *);
    char *nearp = on_stack;
    bar(nearp);
}

Thank you!

tkchia commented 2 years ago

Hello @asiekierka,

(For the sake of context, for Wonderful, when targeting the bare metal WonderSwan, I just decided to encourage using __far for ROM/SRAM data, and set DS = SS = 0x0000. This should be good enough for the time being.)

Something I am thinking about: any idea what happens when you invoke a "FreyaOS" syscall with %ss = 0 but %ds and %es pointing at some places other than 0x1000:0? Thank you!

asiekierka commented 2 years ago

I think these nearp/farp hacks suggested by the Qute folks were needed because Turbo C did not allow coders to specify that a function might be called with %ss ≠ .data.

This is correct. What is more baffling is that "LSI-C for WW", a licensed commercial compiler supposedly adapted specifically for use with the WonderWitch development kit, did not allow this either. Turbo C++ 1.01, the community's choice, allows a mode in which a warning, but not an error, is emitted.

According to "Judge and Dox" (http://jbkun.free.fr/download/wstech24.txt), the console provides a small amount — 16 KiB — of fast RAM at the entire system's disposal. (OK, the WonderSwan Colour provides 64 KiB, which is a bit better.)

This is fine for the context of writing a bare metal video game. The GameBoy, another console with a C-based homebrew toolchain (based on SDCC, a compiler supporting the 8080-esque Sharp SM83 8-bit CPU), has only 8KB RAM after all, plus 8KB VRAM.

In general, from the WonderSwan's 16KB (and the Color's 64KB), on bare metal, you need to subtract:

Typically speaking, software which does not utilize all tiles, all screen tilemap entries, all sprite entries, all color palette entries, will use the remaining memory for something else. All console internal RAM is also always accessible; this is preferable to the GameBoy situation, where the VRAM is only accessible when it is not in use for video drawing (this means it is accessible during vertical blank, horizontal blank, etc).

Something I am thinking about: any idea what happens when you invoke a "FreyaOS" syscall with %ss = 0 but %ds and %es pointing at some places other than 0x1000:0? Thank you!

I suspect this would work fine. The problem is that if %ss = 0x1000:0, it's in SRAM - and the WonderWitch cartridge's SRAM has four banks which can be changed by I/O register writes. "FreyaOS" uses bank 3, while applications use the remaining banks.

Either way, -mno-callee-assume-ss-data-segment is perhaps the closest to Turbo C++'s -ms! behavior, which became the preferred mode of operation of the community back in the day, and is what most applications will expect. The only truly unworkaroundable problem I've run into so far is that va_list functions expect a near pointer - while typically, va_lists are allocated on the stack (SS), not in the data segment (DS).

But don't worry about the WonderWitch too much. Attempting to build a modern toolchain capable of outputting binaries for it is a hobby exercise, and I don't necessarily expect homebrew developers to continue using that route, especially given that modern flashcart hardware prefers to allow uploading complete cartridges (= bare metal development) and is available for far less (about $100 to $150) than a second-hand, more limited WonderWitch (about $250+).
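
To make the banking hazard described above concrete, here is a toy model in portable C. It assumes four SRAM banks of 64 KiB each mapped into the same window, with a variable standing in for the bank-select I/O register; the bank size, the names sram, select_bank, sram_write and sram_read, and the single-register interface are all assumptions for illustration and are not taken from WonderWitch documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: four banks, each filling the 64 KiB window at 0x1000:0. */
enum { BANKS = 4, BANK_SIZE = 0x10000 };
static uint8_t sram[BANKS][BANK_SIZE];

/* Stands in for the bank-select I/O register (hypothetical). */
static unsigned current_bank;

static void select_bank(unsigned b)
{
    assert(b < BANKS);
    current_bank = b;
}

/* Accesses through the SRAM window always hit the currently selected bank. */
static void sram_write(uint16_t off, uint8_t value)
{
    sram[current_bank][off] = value;
}

static uint8_t sram_read(uint16_t off)
{
    return sram[current_bank][off];
}
```

If an application pushes a value at some stack offset while bank 0 is selected and the OS then selects bank 3, a read at the same offset no longer returns that value, and anything the OS pushes lands in bank 3 instead. A stack kept in internal console RAM is not affected by the switch.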

ghaerr commented 2 years ago

Hello @asiekierka,

The only truly unworkaroundable problem I've run into so far is that va_list functions expect a near pointer - while typically, va_lists are allocated on the stack (SS), not in the data segment (DS).

Is this primarily a problem for user-written functions using va_list, or the provided stdio printf/sprintf/vfprintf family of functions? If the former, it would seem that a solution providing a stdarg.h replacement forcing a static (DS) va_list allocation could be used (with no threads). If the latter, a stdarg.h replacement that used far pointer access to parameters could be made, with a couple of the lower-level functions' arguments in the C library's nano-vprintf.c expanded to char __far *, as described in https://github.com/tkchia/gcc-ia16/issues/84#issuecomment-945026249.

Thank you!

tkchia commented 2 years ago

Hello @ghaerr, hello @asiekierka,

The only truly unworkaroundable problem I've run into so far is that va_list functions expect a near pointer - while typically, va_lists are allocated on the stack (SS), not in the data segment (DS).

Is this primarily a problem for user-written functions using va_list, or the provided stdio printf/sprintf/vfprintf family of functions? If the former, it would seem that a solution providing a stdarg.h replacement forcing a static (DS) va_list allocation could be used (with no threads). If the latter, a stdarg.h replacement that used far pointer access to parameters could be made, with a couple of the lower-level functions' arguments in the C library's nano-vprintf.c expanded to char __far *, as described in #84 (comment).

Let me see if I can tweak the definitions of __builtin_va_list, __builtin_va_start, and __builtin_va_end in the compiler, to work with %ss ≠ .data functions. Some sort of transparent_union type might hopefully do the trick. Thank you!

tkchia commented 2 years ago

Hello @asiekierka, hello @ghaerr,

va_list under an %ss ≠ .data situation should work now (https://github.com/tkchia/gcc-ia16/issues/104).

(I am still kind of surprised that there is an actual C runtime environment which practically demands that an entire app run with %ss ≠ .data. There is a first time for everything, I suppose.)

Thank you!

asiekierka commented 2 years ago

I am still kind of surprised that there is an actual C runtime environment which practically demands that an entire app run with %ss ≠ .data.

Win16 DLLs! See Digital Mars's documentation for more information.

It's just that, in most environments, when programming with %ss != .data, you're just expected to use Compact or Large memory models instead. The WonderWitch team, due to the low performance of the handheld, insisted on "just not making bugs", as alluded to in a different issue.

ecm-pushbx commented 2 years ago

Hello @tkchia, I found another data point on the xlatb segment override question. The (GNU GPL) sources of VGAPride by @foone contain an LZ4 depacker (limited to 64 KiB chunks). In its source you will find this comment:

; NOTE:  I can't explain it, but with no extraneous background interrupts,
; timings are taking longer than normal on my IBM 5160.  So, we have to
; reset our timing numbers here:

However, the code does contain two instances of cs xlatb, here and here. Unless cs equaled ds in their tests, this anecdote indicates that the segment override worked on their XT.

ecm-pushbx commented 2 years ago

This depacker appears to be from http://www.oldskool.org/pc/lz4_8088