tkchia / gcc-ia16

Fork of Lambertsen & Jenner (& al.)'s IA-16 (Intel 16-bit x86) port of GNU compilers ― added far pointers & more • use https://github.com/tkchia/build-ia16 to build • Ubuntu binaries at https://launchpad.net/%7Etkchia/+archive/ubuntu/build-ia16/ • DJGPP/MS-DOS binaries at https://gitlab.com/tkchia/build-ia16/-/releases • mirror of https://gitlab.com/tkchia/gcc-ia16
GNU General Public License v2.0
173 stars 13 forks source link

Custom memory layout for tiny model TSR #81

Open andrewbird opened 2 years ago

andrewbird commented 2 years ago

I'm looking at getting gcc-ia16 to cross compile the share.com TSR for FreeDOS. Currently the only compiler capable of building it is Turbo C 2.01. I've had some initial success with GCC using a local keep() function, see https://github.com/andrewbird/share/commit/b94094af4b5bed6efbafb6ee10a38106f7cadfb8 however the original code does some tricks to throw away the initialisation code and shrink the resident memory size. I'm wondering if I could achieve something similar or better using a custom linker script? From what I see in the memory map file, gcc-ia16 lays out the startup code, init and fini before the text of the program and consequently would still be included in the resident memory at TSR. I think that ideally I'd want something like this for tiny model .COM

(jump to real startup code)
(resident text section)
(resident data section)
        < some way of determining this point to use as the size to retain in the keep() function >
(normal startup code)
(usual init code)
(usual fini code)
(transient text section)
(transient data section)

I've never had cause to write a linker script before, so I am a complete novice. Is this sort of control possible? Is there a better solution?

Also I'm using this command line, how would I replicate that as separate compile / link ops?

ia16-elf-gcc -Wall -fpack-struct -fno-toplevel-reorder -mcmodel=tiny -o $@ $< -li86 -Wl,-Map=share.map
andrewbird commented 2 years ago

I had some success with the linker script and was able to reduce the resident size, however in order to place the temporary code at the end of the output I needed to move my table data out from the heap into a new data segment. This meant that my on disk size grew significantly. The only way I can see around this is copy the transient code higher at install time, jump to it and then overwrite the original location with my data tables. Here's my latest attempt https://github.com/andrewbird/share/commit/c4271c2e2a90da25f748a94e629e862d858798f3

One thing I do notice though, there are lots of these that end up in the resident code __libi86_intr_call_0133. They are only 3 bytes each, but when there are 377 of them they become quite significant.

tkchia commented 2 years ago

Hello @andrewbird,

Sorry for the late reply! I will see in the meantime if I can add better support for TSRs to the gcc-ia16 toolchain and libi86...

One thing I do notice though, there are lots of these that end up in the resident code __libi86_intr_call_0133.

Actually, under gcc-ia16 + libi86, you can try compiling your code with optimization (e.g. -Os or -O2) — this should help optimize away most of these __libi86_intr_call... routines. These little routines are used to implement the calls to the 256 possible interrupts, for use with int86 etc. E.g.

__libi86_intr_call_0041:
        int     $0x21
        ret
__libi86_intr_call_0042:
        int     $0x22
        ret

If you enable optimization, and your interrupt numbers are compile-time constants, then <i86.h> will use some magic incantations to link in only the needed int invocations.

Thank you!

tkchia commented 2 years ago

Hello @andrewbird,

From what I see in the memory map file, gcc-ia16 lays out the startup code, init and fini before the text of the program and consequently would still be included in the resident memory at TSR.

In case this is useful: the .init and .fini sections are customarily used to insert snippets (not entire routines) of code to call out to initialization routines before main starts, and to termination routines after main exits. In the default arrangement in gcc-ia16 though, the .init section simply contains a call to a routine that goes through functions pointers in .ctors.* sections. (The gccint info page has a bit of discussion about this stuff.)

To mark an entire routine as a "startup" routine, the customary way — or at least the GCC-approved way — is to place it in a .text.startup or .text.startup.* section. In fact, by default GCC will automatically place main and any __attribute__((constructor)) routines inside .text.startup.

However, using a .non_resident_text section name should work well too.

Thank you!

andrewbird commented 2 years ago

@tkchia yes that -Os really helped! At load of share.com the symbols looked like this

dosdebug> usermap load-gnu ../fdos/share.git/share.map 02c3
dosdebug> usermap list
  02c3:0120     _dos_freemem
  02c3:0132     _dos_getvect
  02c3:0144     _dos_setvect
  02c3:0158     __libi86_ret_really_set_errno
  02c3:0163     __libi86_int86_do
  02c3:01be     __libi86_intr_call_0041
  02c3:01c1     __libi86_intr_call_0057
  02c3:0235     _atoi_r
  02c3:0262     _atol_r
  02c3:027a     __errno
  02c3:02a3     memset
  02c3:02c5     strchr
  02c3:02e0     strlen
  02c3:02fc     _strtol_r
  02c3:052f     strtol
  02c3:0547     write
  02c3:055f     _write_r
  02c3:058b     __call_exitprocs
  02c3:06e3     _reclaim_reent
  02c3:07ec     _free_r
  02c3:08b5     _malloc_r
  02c3:09fd     _sbrk_r
  02c3:0a23     __udivsi3
  02c3:0a34     __umodsi3
  02c3:0a4a     __ia16_ldivmodu
  02c3:0a99     _sbrk
  02c3:0ae5     _write
  02c3:0bae     __ia16_abort_impl
  02c3:0c00     __DTOR_END__
  02c3:0c02     _global_impure_ptr
  02c3:0c04     _ctype_
  02c3:1116     __dso_handle
  02c3:111a     _impure_ptr
  02c3:114c     __ctype_ptr__
  02c3:1150     _heaplen
  02c3:1166     _psp
  02c3:116e     _global_atexit
  02c3:1170     __malloc_sbrk_start
  02c3:1172     __malloc_free_list
  02c3:11f6     table_pool
  02c3:51f6     _start
  02c3:5327     _exit
  02c3:532c     __msdos_crt_exit
  02c3:536a     init_tables
  02c3:53c8     main

And the code size was

02b0:0000 0x0011 [SHARE - Environment]
02c2:0000 0x9d3c [SHARE] (END)

After running main

02c2:0000 0x01ae [SHARE]

Which compares very favourably with Turbo C at 0x0237. Obviously I still have a significant part of the interrupt handler commented out, have you any more thoughts about register access within the handler? Would replicating the Watcom method of access via union INTPACK r be a viable method?

I had previously tried to mark _dos_freemem(), _dos_getvect() and _dos_setvect() with .non_resident_text section name as I know they are only called from installation code, but they still ended up in the .text section. Is there any way of marking their section from their caller's section (but it could get complicated if a function was called from both .text and .non_resident_text I guess)?

However, using a .non_resident_text section name should work well too.

Yes I'm making it up as I go along!

Thank you!

andrewbird commented 2 years ago

I pushed another version now that reduces both the on disk size and the resident memory to better than Turbo C. It does this by placing the non resident text over .bss and heap sections. At startup it copies that text up to just before 0xf000, then proceeds to the normal startup routines that initialize .bss and sets up the heap. Using this method I now only have one special section and malloc can now function as usual.

Type Turbo C Gcc ia16
On disk size 6300 5224
Resident memory size 0x0240 0x01b5

Obviously the new startup code would be better off in assembly rather than fixed bytes and not being very sure about what registers I need to preserve I probably saved too much. https://github.com/andrewbird/share/commit/b21cd7dcea727f625adc3cd1402b55f5a097b6bf

tkchia commented 2 years ago

Hello @andrewbird,

I had previously tried to mark _dos_freemem(), _dos_getvect() and _dos_setvect() with .non_resident_text section name as I know they are only called from installation code, but they still ended up in the .text section. Is there any way of marking their section from their caller's section (but it could get complicated if a function was called from both .text and .non_resident_text I guess)?

The short answer is, no unfortunately. Basically, the input sections where _dos_getvect (.) etc. come from are taken from the modules in the library i.e. libi86.a — and the output sections they end up in are determined by the linker script.

There is no easy way (for now) to automatically place _dos_getvect (.) in a special transient section if it will only be called from transient code.

Thank you!

andrewbird commented 2 years ago

Yes that aligns with my various experiments. I guess for now it's best to recreate the simple ones locally with int86(), but in any case I'm very happy with the resident and file sizes achieved, I just need to figure out a fairly clean way of accessing/modifying the registers from within the interrupt handler and to implement a chain interrupt function.

Thank you!

andrewbird commented 2 years ago

Interestingly switching to a local function for getvect() based on int86x() increased the resident size by 6 paragraphs! I see your implementation of _dos_getvect() uses gcc inline assembler, so that would be the way to go if I wasn't trying to avoid sprinkling compiler specific assembly language all over the C source. So for now I'll stick with the i86 library code ending up in the resident portion, and it's good to know your library is lean!

Thank you!

andrewbird commented 2 years ago

So I'm still looking to shave some bytes off the share tsr. I'm looking at the map file and I see this

 .rodata        0x000000000000125d      0x281 /usr/ia16-elf/lib/libc.a(lib_a-ctype_.o)
                0x000000000000125d                _ctype_                       

What is it, and can I exclude it somehow?

Thank you!

andrewbird commented 2 years ago

I see it's used by strtol(), which is presumably included to provide atol() which I do use. So I guess the answer is that I'm using it. Oh well!

Thank you!

tkchia commented 2 years ago

Hello @andrewbird,

What is it, and can I exclude it somehow?

_ctype_ is the array of flags that is used to implement <ctype.h> 's isalpha (.), isdigit (.), etc. macros, for the default locale ("C"). I think many C runtime libraries employ a similar method to implement <ctype.h>. Thank you!

ghaerr commented 2 years ago

Hello @andrewbird,

I see it's used by strtol(), which is presumably included to provide atol() which I do use. So I guess the answer is that I'm using it.

Well, there are other options if you're looking for tiny size: you could write your own atol() and use direct comparisons for digits 0-9, rather than the isdigit() macros which drag in the table defined in <ctype.h>. With a small, efficient implementation of atol like the one below included in your source, this should end up being quite a bit smaller that what you have, provided it meets with your program requirements. This would prevent the standard library version from being pulled in to your program.

long atol(char *s) /* simple, somewhat hacked version of atoi that returns long for smaller size*/
{
    int n = 0; /* declare long if requires decoding of numbers > 64K*/
    while (*s >= '0' && *s <= '9')
        n = n*10 + *s++ - '0';
    return (long)n;
}

Thank you!

andrewbird commented 2 years ago

Hello @ghaerr Thanks for the advice, I'd just written but not yet tested this

+/* Naive implementation of atol(), only decimal digits allowed, no signs */
+long int atol(const char *s) NON_RESIDENT;
+long int atol(const char *s) {
+       long int val;
+       const char *p;
+
+       for (val = 0, p = s; *p; p++) {
+               if (*p == ' ')
+                       continue;
+               if (*p < '0' || *p > '9')
+                       break;
+               val *= 10;
+               val += *p - '0';
+       }
+
+       return val;
+}

It saved 1.2k on the file size, plus allowed me to steer it into my new non resident text section which means I save around the same on the resident size.

Thank you again!

andrewbird commented 2 years ago

Hello @tkchia , @ghaerr , I just wanted to say thank you for all your help to me on this gnu linker voyage. My PR has now been merged into FDOS/share so it's certainly been a worthwhile effort, and I've learnt a little bit on the way. A most interesting experience for me.

Thank you!

andrewbird commented 2 years ago

Hi @tkchia, I'm just trying out your new TSR support that just landed in the PPA. I'm having a little difficulty getting it to link fully, see https://github.com/andrewbird/share/commit/062910471e291d1e130a779c29d1364007c45982

So here's the build attempt and output message

ia16-elf-gcc -Wall -fpack-struct -mcmodel=tiny -mtsr -c share.c -o share.obj -Os
nasm -f elf gcc_help.asm -o gcc_help.obj
ia16-elf-ld share.obj gcc_help.obj -o share.com -L/usr/ia16-elf/lib -li86 --script=dtr-mts.ld -Map=share.map
ia16-elf-ld: cannot find -l:crtbegin.o
ia16-elf-ld: cannot find -l:crtend.o
make: *** [Makefile:2: share.com] Error 1

So some questions:

Thank you!

andrewbird commented 2 years ago

After a little look at my previous linker script is seems I need to specify the -L path to gcc

diff --git a/build.sh b/build.sh
index 04cd88c..eaad3d0 100755
--- a/build.sh
+++ b/build.sh
@@ -5,7 +5,7 @@ if [ x"${COMPILER}" = "xgcc" ] ; then
   export COPT="-Wall -fpack-struct -mcmodel=tiny -mtsr -c share.c -o share.obj -Os"
   export XOBJS="gcc_help.obj"
   export LD="ia16-elf-ld"
-  export LOPT="share.obj ${XOBJS} -o share.com -L/usr/ia16-elf/lib -li86 --script=dtr-mts.ld -Map=share.map"
+  export LOPT="share.obj ${XOBJS} -o share.com -L/usr/lib/x86_64-linux-gnu/gcc/ia16-elf/6.3.0 -L/usr/ia16-elf/lib -li86 --script=dtr-mtsl.ld -Map=share.map"
   make

So as the link now completes, only one question remains, is it expected to need to specify the -L paths to lib86 and gcc?

andrewbird commented 2 years ago
A little comparison between the custom linker script & loading code and your new TSR support. Type Custom New TSR support
On disk size (bytes) 5416 5424
Resident size (paragraphs) 0x01c0 0x0205

I'm a little surprised by the resident size growth?

andrewbird commented 2 years ago

Using ia16-elf-gcc to do the link stage means I don't have to specify the paths or the script

ia16-elf-gcc -Wall -fpack-struct -mcmodel=tiny -mtsr -c share.c -o share.obj -Os
nasm -f elf gcc_help.asm -o gcc_help.obj
ia16-elf-gcc -mtsr -Wl,-Map=share.map share.obj gcc_help.obj -o share.com -li86
tkchia commented 2 years ago

Hello @andrewbird,

ia16-elf-gcc -mtsr -Wl,-Map=share.map share.obj gcc_help.obj -o share.com -li86

Yes, this is the way.

You probably now need to put the transient code in .text.startup, if you have not done so already. I.e.

#define NON_RESIDENT __attribute__((section(".text.startup")))

I hope to further tweak the gcc-ia16 toolchain — maybe add new function and variable attributes — so that there is a documented and flexible way to specify variables and functions as being transient. Meanwhile, the above should work.

Thank you!

andrewbird commented 2 years ago

Hello @tkchia,

#define NON_RESIDENT __attribute__((section(".text.startup")))

Yes I had already done this and it seemed to be steering those functions into the non-resident area just fine. I've attached the map file so you can see, but it seems there must be something else swelling the resident sections. share.zip

So since you now use .text.startup as the section to discard in the tsr script, does that mean you can mark certain library functions that you know are startup / end code like crtbegin, crtend, exit, __call_exitprocs as section .text.startup and in the normal linker script they will be linked into the .text section by wildcard rule .text.*?

I hope to further tweak the gcc-ia16 toolchain

Sure thing, I'll test as they arrive into the PPA

Thank you!

andrewbird commented 2 years ago

Hello @tkchia , Am I correct in saying that malloc() allocations start at heap_end_minimum? If so that may be the problem, as the share loader figures out where the end of what's supposed to be resident is by mallocing a final byte after all other allocations are done and then measuring the distance from the PSP to it. On the custom ld script script the heap_end_minimum is 0x12a0, whereas with the new tsr support it's 0x16f0. Other than that the .text, .data, bss section sizes look very similar to those produced by the custom script.

Thank you!

tkchia commented 2 years ago

Hello @andrewbird,

Ah — I think I see the problem. Your custom script placed the heap start right after the resident portion, while my script was placing it at the end of the transient portion. (I had considered treating the heap as part of the resident portion, but this is tricky to get right in the general case.)

Let me try to come up with a good way for the C runtime to report the end of the resident portion. I will let you know.

Thank you!

andrewbird commented 2 years ago

Hello @tkchia ,

while my script was placing it at the end of the transient portion.

But doesn't that mean if we want to release the transient portion we can never use malloc'd memory afterwards? In the share program malloc'd memory is used to allocate the share and lock tables that will be used by the resident code. Their sizes are not known at compile time as they are determined by command line parameters.

Let me try to come up with a good way for the C runtime to report the end of the resident portion.

I'm not sure if that's too helpful on its own, but if we could redefine the location of the heap at runtime it could be?

I hope I'm not asking too many silly questions, this level of control on placing memory in a C program is both new and interesting to me.

Thank you!

andrewbird commented 2 years ago

Can this work?

Entry stage

. high
data.startup
text.startup
data
text
. low

Setup stage before main()

. high
bss.startup
data.startup
text.startup
stack.startup
.
heap
bss
data
text
. low

Run main()

. high
stack
.
heap
bss
data
text
. low
tkchia commented 2 years ago

Hello @andrewbird,

In the share program malloc'd memory is used to allocate the share and lock tables that will be used by the resident code. Their sizes are not known at compile time as they are determined by command line parameters.

Hmm — this is a problem. 😐

One problem with moving .text.startup etc. above the heap, is that the maximum heap size, whether it is 32 KiB, or 48 KiB, or 60 KiB, etc. will probably need to be hard-coded in advance, so that we know where exactly to position .text.startup etc.

It will be nice if we can automatically position the transient code and data according to their size — if the transient portion is large we probably want it to start lower in memory. But it seems the GNU linker scripts do not allow one to position sections "backwards" in memory — e.g. I cannot easily say "allocate a .bss.startup output section in such a way that the section ends at offset 0xff00 or thereabouts".

Thank you!

ghaerr commented 2 years ago

Hello @andrewbird and @tkchia,

(I had considered treating the heap as part of the resident portion, but this is tricky to get right in the general case.)

May I ask why is this tricky, if the heap is of fixed size? A fixed heap directly after the resident portion, then followed by the transient .text.start/.data.start sections should solve this problem and eliminate the requirement for a custom linker script, right?

Let me try to come up with a good way for the C runtime to report the end of the resident portion.

Wouldn't malloc(1), after all other allocations are completed, return the end of the resident data section, which, along with some manipulation with DS and CS values, allow the PSP offset to be known?

Thank you!

tkchia commented 2 years ago

Hello @ghaerr, hello @andrewbird,

I am experimenting with some changes to gcc-ia16 and newlib-ia16 that will allow the heap to appear before the transient code and data, but in a more rigorous way (https://github.com/tkchia/newlib-ia16/commit/5c10cbfe9baffadeb31a6a81f09de4f40067b54d etc.). Thank you!

andrewbird commented 2 years ago

But it seems the GNU linker scripts do not allow one to position sections "backwards" in memory — e.g. I cannot easily say "allocate a .bss.startup output section in such a way that the section ends at offset 0xff00 or thereabouts".

I'm not sure you need to worry about that as such since it would be NOLOAD, hence not allocated, and the AT final address is computable.

So wouldn't this work, it's similar to my gcc_help.ld but extended to handle the extra .data.startup and .bss.startup?

section .text {
...
}
section .data {
...
edata - .
}

section .bss NOLOAD {
/*  NOLOAD means it can overlap with text.startup and maybe .data.startup */
}

/* this is where heap ends up, its location and size are not important */

section .text.startup AT (0xff00 - sizeof(.bss.startup) - sizeof(.data.startup) - sizeof(.text.startup)) {
/* load address starts at .edata */
...
}

section .data.startup AT (0xff00 - sizeof(.bss.startup) - sizeof(.data.startup)) {
...
}

section .bss.startup (NOLOAD) AT (0xff00 - sizeof(.bss.startup)) {
/* again it's not allocated, only cleared */
}

section .stack.startup (NOLOAD) AT (0xff00) {
/* again it's not allocated, and only shown here for reference 0xff00 -> 0xffff */
}

Or did I miss something obvious?

andrewbird commented 2 years ago

I just tried this and it seems to give me what I expect (although SIZEOF didn't seem to work, end of section - start of section did)

    .text.startup (0xfd00 - __xx_bss_startup_length - __xx_data_startup_length - __xx_text_startup_length) :
                        AT (__sbss_keep) {                                      
...
}
__xx_text_startup_length = (__etext_startup - __stext_startup);

    .data.startup (0xfd00 - __xx_bss_startup_length - __xx_data_startup_length) :
                        AT (__sbss_keep + SIZEOF (.text.startup)) {             
...
}
__xx_data_startup_length = (__edata_startup - __sdata_startup);

    .bss.startup (0xfd00 - __xx_bss_startup_length) (NOLOAD) : {                
...
}
__xx_bss_startup_length = (__ebss_startup - __sbss_startup);

And the .MAP file looked like this

.bss            0x00000000000011c4       0xb8                                   
                0x00000000000011c4                __sbss_keep = .               

.text.startup   0x000000000000f88a      0x461 load address 0x00000000000011c4   
                0x00000000000011c4                __stext_startup_load = LOADADDR (.text.startup)

.data.startup   0x000000000000fceb        0xd load address 0x0000000000001625   
                0x0000000000001625                __sdata_startup_load = LOADADDR (.data.startup)

.bss.startup    0x000000000000fcf7        0x9                                   
                0x000000000000fcf7                __sbss_startup = .            

Thank you!

tkchia commented 2 years ago

Hello @andrewbird,

This is very cool, thanks! I will update the Newlib package in the PPA to use this.

tkchia commented 2 years ago

Hello @andrewbird, hello @ghaerr,

By the way...

Wouldn't malloc(1), after all other allocations are completed, return the end of the resident data section, which, along with some manipulation with DS and CS values, allow the PSP offset to be known?

Actually sbrk (0) might be simpler and work better. If there are unallocated "holes" in the heap — perhaps left over by earlier <stdio.h> operations etc. — then malloc (1) may allocate from these holes, and will not return the true end of allocated heap memory. So I think sbrk (0) would be better.

Thank you!

andrewbird commented 2 years ago

Hello @tkchia, That's an interesting observation about malloc(1) and holes. So sbrk(0) might be a way of reducing code size slightly as well, if it doesn't cost too much to include another library call into the resident text.

I just got the PPA update so I'll check out your latest work.

Thank you!

ghaerr commented 2 years ago

Hello @tkchia and @andrewbird,

If there are unallocated "holes" in the heap — perhaps left over by earlier operations etc. — then malloc (1) may allocate from these holes

Thank you. I hadn't considered the possibility of stdio's allocation of buffers for stdout/stdin in this case. Depending on the implementation, many will dynamically allocate upon first use of I/O, although some have pre-allocated static buffers to alleviate this. Yet others allow static FILE definitions to override dynamic allocation. In any case, sbrk(0) is superior.

So sbrk(0) might be a way of reducing code size slightly as well

If your resident program only requires allocating memory, never to be freed, using sbrk could save dragging in the malloc code.

However, many libc routines use malloc, including printf as described above, depending on the particular libc implementation.

For instance, in the ELKS operating system the init program is always resident, so the following code included in init.c works to reduce the stdin buffer from 1024 to 1 byte, and the stdout and stderr buffer sizes from 1024 to 64 bytes, and have them pre-allocated. In this case, malloc is still dragged in, but not called. Eliminating printf entirely would eliminate the need for this, but wasn't an good option.

static unsigned char bufin[1];
FILE  stdin[1] =
{
   {
    bufin,
    bufin,
    bufin,
    bufin,
    bufin + sizeof(bufin),
    0,
    _IOFBF | __MODE_READ | __MODE_IOTRAN
   }
};

static unsigned char bufout[64];
FILE  stdout[1] =
{
   {
    bufout,
    bufout,
    bufout,
    bufout,
    bufout + sizeof(bufout),
    1,
    _IOFBF | __MODE_WRITE | __MODE_IOTRAN
   }
};

static unsigned char buferr[64];
FILE  stderr[1] =
{
   {
    buferr,
    buferr,
    buferr,
    buferr,
    buferr + sizeof(buferr),
    2,
    _IONBF | __MODE_WRITE | __MODE_IOTRAN
   }
};

Thank you!

andrewbird commented 2 years ago

Hello @tkchia, @ghaerr,

It's looking really good now https://github.com/andrewbird/share/tree/new-ld-tsr-support-02

Type Turbo C Custom New TSR support
On disk size (bytes) 6416 5416 5400
Resident size (paragraphs) 0x246 0x01c0 0x1aa

So as you might tell, I'm very happy with the results. Here's the .MAP file share.zip, looking down the list of allocations in .text, .data and .bss the abort stuff stands out. Would that be a candidate to be marked as part of the *.exit sections?

Thank you!

ghaerr commented 2 years ago

Hello @andrewbird,

I looked at the share.c source and now see what's going on. Since printf is already removed, and malloc called only for the file and lock tables, it seems using sbrk for those allocations would allow for malloc (actually nano-malloc and nano-free) to disappear, which should save 0x24+0x148+0xc9 bytes according to your mapfile. This could be done for the TurboC case as well, and could eliminate the #ifdefs and unneeded free call.

Thank you!

andrewbird commented 2 years ago

Hello @ghaerr,

I'm not familiar with sbrk() (I never heard of it before @tkchia mentioned it), but I did a quick try with it for gcc only, see https://github.com/andrewbird/share/tree/new-ld-tsr-support-02. I may have done something stupid as I still see free/malloc used in the .MAP. The filesize was reduced (5338) but the resident size change was minimal (0x1a7).

Thank you!

tkchia commented 2 years ago

Hello @andrewbird,

I may have done something stupid as I still see free/malloc used in the .MAP.

I checked the .map file — it turns out that there are some infelicities in the Newlib library that is causing _free_r etc. to be needlessly roped in. I will try to fix these; you should not need to do anything at your end.

(If you are curious: the top part of the .map file gives some indication of which particular library modules were being included into the final .exe to resolve which particular symbols. According to it,

So I guess one way to avoid linking in the malloc stuff, would be to modify Newlib to separate out the internal errno into a separate module. Alternatively, I might be able to get rid of the internal errno variable altogether, so that there is always only a thread-local errno to deal with.)

Thank you!

ghaerr commented 2 years ago

Hello @tkchia,

Nice noticing how the system-dependent errno dragged in lib_a-reent.o, causing free_r to get roped in. I notice that sbrk_r also uses errno, so it's not just write_r bringing it in.

What I haven't figured out yet is why malloc is getting dragged in. The nano-mallocr.c source is compiled into separate modules (malloc/free/realloc, etc), and the reent cleanup code doesn't call it nor does free ... where's it coming from?

Thank you!

tkchia commented 2 years ago

Hello @ghaerr,

The nano-mallocr.c source is compiled into separate modules (malloc/free/realloc, etc), and the reent cleanup code doesn't call it nor does free ... where's it coming from?

Apparently it got linked in because _free_r was referencing __malloc_free_list, which was defined in the same module as _malloc_r (argh). Thank you!

andrewbird commented 2 years ago

Hello @tkchia, hello @ghaerr,

I checked the .map file — it turns out that there are some infelicities in the Newlib library that is causing _free_r etc. to be needlessly roped in. I will try to fix these; you should not need to do anything at your end.

Well you both seem to have a good grip on what's going on, I do appreciate the effort you are putting into my little quest.

I'll look out for the updated PPA package, and post my results shortly afterwards.

Thank you!

andrewbird commented 2 years ago

Hello @tkchia, i just got your latest update from the PPA, it's looking really good!

Type Turbo C Custom Latest TSR support
On disk size (bytes) 6416 5416 3934
Resident size (paragraphs) 0x246 0x1c0 0x14a

So as you can see the filesize decreased significantly and the resident size reduced by almost 4k bytes over the Turbo C compiled version! As I already mentioned, I'm very happy with the results. Do you think there's any more scope for improvement? I've attached the new .map file, but the only thing that looks out of place now in the .text section is __call_exitprocs() etc.

Thank you!

ghaerr commented 2 years ago

Hello @andrewbird and @tkchia,

I'm very happy with the results. Do you think there's any more scope for improvement?

Looking at the most recent .map file, I'd say that, thanks to @tkchia's work untangling parts of newlib from itself and your work with the linker script and rewriting share.c using sbrk, that there's not much more reasonably obtainable gains to be had. The exit-processing is usually pretty tied into proper libc and process termination handling which share.com sometimes needs, and there's only around 300 bytes in that code.

I've enjoyed learning more about newlib from this thread, and its nice to see gcc-ia16 and its C library outdoing TurboC by quite a bit :)

Thank you!

andrewbird commented 2 years ago

Hello @ghaerr and @tkchia,

The exit-processing is usually pretty tied into proper libc and process termination handling which share.com sometimes needs, and there's only around 300 bytes in that code.

Yes, I had assumed that we probably need that before _dos_keep() gets called, but if it gets run somehow afterwards I'm not sure what's would happen except that it's probably very bad. For that reason I was wondering if it could move its section from .text to .text.exitand in that way it would be present for any legitimate calls, but be non-resident after _dos_keep()? That being said, I agree that releasing another 300 bytes is probably not worth the effort and what you've both done here has far exceeded my aspirations!

I did just try adding the following to the linker script

NOCROSSREFS_TO(".text.startup", ".text")
NOCROSSREFS_TO(".data.startup", ".text")
NOCROSSREFS_TO(".bss.startup", ".text")

And I got one error, but I'm not sure if it's anything to worry about?

ia16-elf-gcc -mtsr -Wall -fpack-struct -mcmodel=tiny -c share.c -o share.obj -Os
nasm -f elf gcc_help.asm -o gcc_help.obj
ia16-elf-gcc -mtsr -o share.com share.obj gcc_help.obj -Wl,-Map=share.map -li86
/usr/lib/x86_64-linux-gnu/gcc/ia16-elf/6.3.0/../../../../../ia16-elf/bin/ld: /usr/lib/x86_64-linux-gnu/gcc/ia16-elf/6.3.0/crtend.o: in function `__do_global_ctors_aux':
(.text+0x1): prohibited cross reference from .text to `.ctors' in .data.startup
/usr/lib/x86_64-linux-gnu/gcc/ia16-elf/6.3.0/../../../../../ia16-elf/bin/ld: (.text+0xa): prohibited cross reference from .text to `.ctors' in .data.startup
collect2: error: ld returned 1 exit status
make: *** [Makefile:2: share.com] Error 1

I've enjoyed learning more about newlib from this thread, and its nice to see gcc-ia16 and its C library outdoing TurboC by quite a bit :)

These are my sentiments exactly, a most enjoyable experience! I'll see about tidying up my branch and raising a PR to FD/share as it is now.

Thank you!

tkchia commented 2 years ago

Hello @andrewbird,

I've attached the new .map file, but the only thing that looks out of place now in the .text section is __call_exitprocs() etc.

Yes, I hope to find a neat way to move __call_exitprocs (), abort (), etc. to .text.exit, at some point in time. Thank you!

andrewbird commented 2 years ago

Hello @tkchia,

Yes, I hope to find a neat way to move __call_exitprocs (), abort (), etc. to .text.exit, at some point in time.

That's very good news. I'm already very happy with the result.

Thank you!

andrewbird commented 2 years ago

Hello @tkchia, I'm thinking about adding localisation support to FreeDOS share. I want to put the kitten library into the *.startup sections. I have it working, but is there a better way than this?

 kitten.obj: ../kitten/kitten.c
+ifeq "$(COMPILER)" "gcc"
+       $(CC) $(_CFLAGS)$@.1 $^
+       objcopy --rename-section .text=.text.startup \
+               --rename-section .data=.data.startup \
+               --rename-section .rodata=.rodata.startup \
+               --rename-section .bss=.bss.startup $@.1 $@
+else
        $(CC) $(_CFLAGS)$@ $^
+endif

Thank you!

andrewbird commented 2 years ago

Hello @tkchia, It seems that my section renaming, although steering things well to the transient portion, messes up file global data. I'm not sure if it's the initialisation (zero) of the data or when it's accessed that's failing, If I add the proper attribute then the problem goes away.

The problem definition is this

-static nl_catd _kitten_catalog = 0;   /* _kitten_catalog descriptor or 0 */
+static nl_catd _kitten_catalog __attribute__((section(".data.startup"))) = 0;  /* _kitten_catalog descriptor or 0 */

I guess the objcopy hack is just that.

Thank you!

andrewbird commented 1 year ago

Hello @tkchia, I'm still trying to track this problem down with my use of section renaming. I made a short test case t1.zip, I wonder if you would be so kind as to just have a quick look at the map file and see if it's what you expect to see, as the whole load address thing somewhat confuses me? Of course the test case works fine, but in the real program the variable that's in the bss.startup has a non zero value at startup.

Thank you!

tkchia commented 1 year ago

Hello @andrewbird,

Hmm... this is weird. The length of the transient BSS section (in __lbss_nokeep) is calculated as 0. This means that for some reason the mybss variable is not allocated space — or it is, but somehow it is made to overlap with something else.

I am not sure why the linker is doing this, and in any case this is probably a linker script (or linker) bug.

Thank you!

tkchia commented 1 year ago

Hello @andrewbird,

I wonder if you would be so kind as to just have a quick look at the map file and see if it's what you expect to see, as the whole load address thing somewhat confuses me?

Hmm... this is weird. The length of the transient BSS section (in __lbss_nokeep) is calculated as 0.

This particular issue should be fixed now (https://github.com/tkchia/newlib-ia16/commit/03a53fa3090b98941a5d2cf88c77e8ce4110cbef).

Anyway, for flat .com files, load addresses are used for their "original" purpose — to say that a section is placed at program startup at a particular address, but will have to be moved to another address before its contents are actually used.

Thank you!