ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086

Swap file improvements #1729

Closed toncho11 closed 9 months ago

toncho11 commented 11 months ago

So this is the old discussion about the swap file: https://github.com/ghaerr/elks/issues/246

There have been improvements recently

ghaerr commented 11 months ago

@toncho11,

Even though four years ago it was said in #246 that ELKS somewhat supported swapping using the CONFIG_MEM_TABLE allocator, that allocator was removed years ago when the three different memory allocators were replaced by one rewritten by mfld-fr. Also, there seems to be no trace of CONFIG_SWAP. So at this time, there is no swapping of any kind supported by ELKS.

Can a swap file use UMB memory, based on the UMB memory support PR?

There would be no benefit at all using UMB memory for swap, since the UMB memory is now able to be used for both system buffers as well as regular program allocation. In other words, any program may be allocated into UMB memory to run if it fits, thus allowing more memory to run applications programs without even supporting swapping.

would the new Direct Floppy/Direct HDD drivers allow for better utilization of the swap file? If the system is not blocked while swapping a process, then it will be more responsive.

Yes for direct floppy, and currently no for direct HD, since the direct HD driver is not yet interrupt driven. If swapping were implemented, it could help speed things up since the floppy driver wouldn't have to wait. However, in general, if a process has to be swapped out in order for another to be swapped in and run, interrupt-driven or async I/O really doesn't help unless more than two processes are being waited on by the user. One would essentially have to be running three concurrent programs to see the benefit. But, as @jbruchon commented, swapping to floppy will be unbelievably slow.

Even on the early systems such as UNIX v6 which supported swapping, this was done using a non-filesystem area of the hard disk, which allowed writing sectors freely without regard for filesystem tables, as otherwise the speed would have been excessively slow. IMO, the reason swapping was implemented in UNIX v6 was because of the always-multi-user nature of UNIX (meaning they had to), something that ELKS isn't really designed for, running on a single PC.

An idea that could work well would be swapping to XMS memory, rather than disk. This could be made quite fast and greatly extend the available memory. Of course, none of that would work on systems without a 386 CPU.

On a technical note, there are some pretty big issues to overcome to get swapping to work quickly. For instance, when a user types ^C at a shell prompt, all processes in the TTY process group (that means anything started from the shell) might have to be swapped in just to process the SIGINT signal, then swapped out again. IMO, one wouldn't even want to think about doing such a thing on a floppy-based system. There is also the problem of reliably finding a large enough area in the 640k RAM memory list to read in any swapped-out programs, without also having to swap out whatever else might be running.

One might think swapping would be a good idea for those folks running busy network stacks, but every arriving network packet could cause the daemon to be swapped in, then back out. In order to make this efficient, we'd need a rewritten scheduler, as the current scheduler is very dumb and uses a strict round-robin approach without any regard to how much of its time slice a program typically uses.

Lots of work for little benefit? Let's hear what others have to say about what they would use swapping for that they can't accomplish now.

Having said all that, @Mellvik has stated that implementing swapping is on his list of things to implement on TLVC.

Thank you!

jbruchon commented 11 months ago

I personally would recommend implementing in-memory compressed swap. There are several algorithms that perform decently on ancient hardware and don't require too much code or execution time.

Mellvik commented 11 months ago

Wow, this is an interesting discussion. Swapping is indeed on my agenda for TLVC although quite a bit down the list, and to be considered an experiment as to the usefulness in this environment.

I personally would recommend implementing in-memory compressed swap. There are several algorithms that perform decently on ancient hardware and don't require too much code or execution time.

Thanks @jbruchon - this is an interesting idea, I need to look at that. The problem is of course to find memory to swap to and find the balance between swap space and process space, as @ghaerr points out in his comment about UMBs. LIM memory (aka expanded) is possible though, even on 8bit ISA bus machines like the PC and XT.


On the other hand, there is the fact that we more often than not use CF cards or the like as hard disks for the old klunkers these days. It's fast - and becoming even faster (saving cycles) as we eliminate block sorting etc., which is no longer useful for solid-state storage. Swapping to a CF card partition may become quite usable - even on an XT.

BTW - I just re-read the chapter on swapping in Maurice Bach's book, it's a really great intro - or refresh.

toncho11 commented 11 months ago

Here is another example: https://monotech.fwscart.com/product/microram---640k---umb-ram---8-bit-isa It even has a screenshot showing 699KB free memory.

jbruchon commented 11 months ago

For the 286+, swap code can also violate the rules of a stock 8086 if you want. The swap code alone can switch to protected mode, copy memory, and switch back; the entire rest of the kernel can be oblivious.

Mellvik commented 11 months ago

Thank you @jbruchon -

Indeed, ELKS (and TLVC) are already doing that with @ghaerr's XMS buffers, via the INT19 call. This can be replicated into a very useful swapping scheme. Of course going all the way - native via LOADALL - is also possible if needed. In both cases, though, there is the penalty of a full processor reset on return.

Mellvik commented 11 months ago

Sorry - INT15, not INT19, and of course it's a lot easier on the 386 and up. Your point is - however - well taken. The current XMS implementation in ELKS may quite easily be extended to support swapping.

jbruchon commented 11 months ago

True, I always forget about the 286's PM permanence. 386 then! It occurs to me that a 386+ swap module could also use 32-bit instructions to move the memory around much more quickly, and if compression was added then it could use 32-bit operations during that compression (assuming they're worth it).

Back to the compressed swap proposal. I've done a little bit of playing around with compression, and Apple has an interesting memory compressor in Darwin called WKdm; both might provide some inspiration. For my algorithm I found that LZ compression can be done more quickly if you limit the block size of the compressor (in my case, values 0-4095 fit in 12 bits, leaving 4 bits for some flags) and keep a little reserved room to index the bytes. You should set a limit, since indexing too many of one byte basically degrades into slower linear scanning. Use the index instead of linear scanning when below the limit.

It's also possible to do it with a table (array) of 256 byte counters instead of lists of bytes: use the byte value you're examining as an offset into the table; increment if you're doing the pre-scan counting, decrement otherwise; don't scan at all if the count is 0; stop and use linear scanning if that byte's count hits 255. This uses a lot less memory and CPU time, but can also slow things down if the block size is big enough and the data has high entropy. LZ compression works better as the dictionary gets bigger, so you'd want to compress as many sequential chunks of data into one compressed unit as you can.
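
The counter-table idea above could be sketched roughly like this (a hypothetical illustration, not code from lzjody or ELKS):

```c
#include <stddef.h>

/* Hypothetical sketch of the 256-entry counter table described above:
 * a pre-scan counts occurrences of each byte value in the block, and
 * during match search a zero count means the byte cannot occur ahead,
 * so the scan can be skipped entirely. Counters saturate at 255, which
 * signals "too common - fall back to plain linear scanning". */
static unsigned char byte_count[256];

void prescan(const unsigned char *block, size_t len)
{
    size_t i;
    for (i = 0; i < 256; i++)
        byte_count[i] = 0;
    for (i = 0; i < len; i++)
        if (byte_count[block[i]] < 255)   /* saturate at 255 */
            byte_count[block[i]]++;
}

/* Returns 1 if scanning for byte c could find a match, 0 if absent. */
int worth_scanning(unsigned char c)
{
    return byte_count[c] != 0;
}
```

During compression, the counter for each consumed byte would be decremented so the table always reflects the unscanned remainder of the block.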

The way I encode my compressed data in lzjody is using a prefix byte for each data group, packing where possible. There is a prefix byte flag for literals P_LIT, a flag for LZ with long lengths P_LZL, and a flag for LZ with short lengths P_LZ. Both variants pack the LZ length into the unused bits of the prefix byte but P_LZL shifts and adds the next byte for a 12-bit length word instead of 4-bit. I'd highly recommend doing something like this, especially since your lack of additional compression methods would give you more bits. You could even throw RLE compression up to 64 bytes long in there since 2 bits yields 4 values (LZ, LZ long, RLE, literal) and RLE is absurdly simple and fast. That's what I did.
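
A minimal sketch of such a 2-bit-type prefix byte follows. The exact bit layout here is an assumption for illustration, not lzjody's actual format: the top two bits select the group type, the low six bits hold a length with a +1 bias (1..64, which is why RLE runs of up to 64 bytes fit), and the long-LZ form extends the length with the following byte.

```c
/* Hypothetical prefix-byte layout: 2-bit type in the high bits,
 * 6-bit biased length (1..64) in the low bits. Not lzjody's real
 * encoding - an illustration of the packing idea described above. */
enum { T_LIT = 0, T_LZ = 1, T_LZL = 2, T_RLE = 3 };

unsigned char make_prefix(int type, unsigned len)   /* len in 1..64 */
{
    return (unsigned char)((type << 6) | ((len - 1) & 0x3F));
}

int prefix_type(unsigned char p)      { return p >> 6; }
unsigned prefix_len(unsigned char p)  { return (p & 0x3F) + 1; }

/* Long form: the low six bits become the high bits of a 14-bit
 * length, with the next input byte supplying the low eight bits. */
unsigned prefix_len_long(unsigned char p, unsigned char next)
{
    return (((unsigned)(p & 0x3F) << 8) | next) + 1;
}
```

A decoder would switch on `prefix_type()` and consume the extra length byte only for `T_LZL` groups.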

With in-memory objects you'll likely have a lot of unused (zeroes) sections so trying RLE before LZ will also speed things up. You may even be able to speed things up further by assuming everything is 16 bits instead of 8 bits, thus a 64-byte length becomes 128, and you can use 16-bit operations for everything instead of 8-bit, at the expense of some (probably minor) compression reduction.

ghaerr commented 11 months ago

It occurs to me that a 386+ swap module could also use 32-bit instructions to move the memory around much more quickly

We already do just that, using the cached segment limit registers in 386's "unreal mode" to move directly from main memory to XMS and vice versa - xms_fmemcpy.

Back to the compressed swap proposal.

What kind of final compression percentages are you seeing with some of your faster algorithms?

We implemented compressed executables some time ago which achieved 30-38% compression, quite good IMO. However, there were complaints that decompressing the executable was too slow on early 8086 machines, and no one uses it. (The decompressor was fast and written in ASM). It seems everyone wanted speed, especially on top of the slow floppy access. I fear swapping could easily turn into one of those things that everyone talks about wanting until they try it, then reject it because of its speed.

if compression was added then it could use 32-bit operations during that compression

It would seem to me that if the machine has XMS memory there'd likely be no need for compression, as XMS memory even on early machines is 3-5x the amount of base RAM, usually starting at 2MB.

I'm still wondering about the answer to a previous question: what real world usability things can be done on an ELKS PC with swap that can't effectively be done now? To me, this helps answer what kind of technology might want to be used for such a thing.

jbruchon commented 11 months ago

I don't know. I'm just throwing ideas out. When it comes to ancient computers I'm most used to doing assembly on the 8-bit Commodore 64 which is drastically slower than any 8086-based machine. In-memory compressed swap will always be faster than floppy-based swap, even if it's a fairly dumb algorithm for speed reasons. As you've said, access to XMS or other "huge" amounts of extra memory makes any notion of compressed swap pointless too.

Mellvik commented 11 months ago

I'm still wondering about the answer to a previous question: what real world usability things can be done on an ELKS PC with swap that can't effectively be done now? To me, this helps answer what kind of technology might want to be used for such a thing.

Agreed, @ghaerr - that is the ultimate question. I also agree that in a situation in which swapping is an issue, we'll likely have significantly more 'extra' memory than main memory, including the swap-to-disk alternative. So compression may not be important. That said, compression is always technically fascinating - so your writeup triggered significant curiosity, @jbruchon.

Back to the why question, which hit me since I flagged the intent to add swapping in the first place. There is one single primary driver: increased process space. I'd like to be able to have more daemons around - cheaply. Most of them are idle just about all the time, so the issue is much less performance than availability. An update daemon, for example - although the bootopts sync= directive helps a lot. I'd like to have at, cron, http or netcat for scripting, to do automatic software updates across the net, etc. - of course along with what we're running now, like 4 gettys (3 serial) and the networking stuff. Having the ability to swap out processes that haven't run for 2 seconds provides more RAM for the processes actually running - when needed.

At least that's a start. Then admittedly - it's the challenge...

ghaerr commented 11 months ago

Having the ability to swap out processes that haven't run for 2 seconds

I like the idea of a two second timer or so. I can see the case for swap with lots of network daemons also.

Which brings up another idea - did UNIX V6 just swap the data segment and re-read the code segment from filesystem disk? It is possible the kernel could temporarily discard an "unused" code segment from main memory without swapping, then just re-read it when required. This would need to make sure there are no CS: pointers from the data segment, but that can usually be assured if no __far pointers are used.

ghaerr commented 11 months ago

This would need to make sure there are no CS: pointers

Thinking about the technical aspects of this a bit more: since 8086 code isn't naturally relocatable even when a single code segment is used - unlike UNIX v6 systems, which had an MMU and virtual address space - many ELKS programs that use __far CS pointers would have to be swapped back in to their original locations, something that would be very hard to do.

For instance, any program that calls signal to set a signal handler gives the kernel a far CS pointer, which would have to be adjusted before being swapped in. In addition, the saved kernel stack, which may contain an interrupted signal handler in progress, would have the saved CS value, which would become incorrect. So there are some very serious technical hurdles that need to be enumerated to ensure we can ever get swapping to work without requiring swapping back to the original location.

If the original location is occupied by a now-running program, it also would have to be swapped out, thus possibly introducing a thrashing effect between just two processes that happened to start using the same CS range. Complicated for sure!

jbruchon commented 11 months ago

The code swapping does seem like a pain, but I think that compressed swapping of data areas makes a lot of sense, especially if memory is zeroed before being handed to the program. Since ELKS can't do lazy allocation due to no memory protection, compressed swapping of data segments and allocated memory for idle programs would let ELKS "claw back" that unused memory once a program goes idle. Zeroing memory before allocation means any area that hasn't been written to yet will easily get crushed down via RLE compression rather than more complex methods like LZ, which also means extremely fast decompression; it's literally this simple loop (for RLE encoded as { value, number_of_repeats }):

void rle_decompress(const unsigned char *src, const unsigned char *src_end,
                    unsigned char *dst) {
    while (src < src_end) {
        unsigned char c = *src++;        /* run value */
        unsigned char count = *src++;    /* number of repeats */
        while (count--)
            *dst++ = c;
    }
}

For pre-zeroed memory with this example, every 255 bytes of zeroes crushes to 2 bytes plus whatever header bytes are needed to tell the decompressor it's an RLE segment.
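
For completeness, a matching compressor for the { value, count } pair format might look like this (a hypothetical sketch, not existing ELKS or lzjody code):

```c
#include <stddef.h>

/* Hypothetical RLE compressor for the { value, count } pairs used by
 * the decompressor above. Runs are capped at 255 so the count fits in
 * one byte. Returns the number of bytes written to dst; worst case
 * output is 2 bytes per input byte, so dst needs room for 2 * len. */
size_t rle_compress(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0, i = 0;
    while (i < len) {
        unsigned char c = src[i];
        unsigned char count = 0;
        while (i < len && src[i] == c && count < 255) {
            count++;
            i++;
        }
        dst[out++] = c;       /* run value */
        dst[out++] = count;   /* repeat count */
    }
    return out;
}
```

With this encoding, 255 zeroed bytes do indeed crush down to the 2 bytes mentioned above.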

Sorry if I'm getting annoying! I keep getting carried away.

ghaerr commented 11 months ago

Sorry if I'm getting annoying! I keep getting carried away.

Haha, heck no, it's fun to talk about all these kinds of things, and the more technical the better! :)

rle_decompress(char *compressed_data_start, char *decompressed_data_start)

Ah, the old problem of whether or not compress/decompress can operate "in place": it would seem to me that both would have to, which narrows the algorithms down significantly. This is because there might not be extra memory available to operate in (which could be as much as 90-100% of the original amount, or more with some worst-case data). The data segment also suffers like the code segment in that it isn't relocatable if any far pointers are used or received from the operating system, so it would be wise to just compress to the same location (and hope there's enough memory to decompress when needed). The trick with in-place compress/decompress is guaranteeing that more space won't be required than was originally used - rarely achievable when both compression and decompression must run in place.

ELKS' existing compressed executables decompress in place, which is required for the loader to work, since the loader needs to know how much space is required and one doesn't want even more space needed just to load a program. This allows the a.out header to be used by the loader for the initial memory allocation. However, the compressor itself is host-based, quite a bit larger, and uses lots of memory, so it won't work on ELKS itself.

especially if memory is zeroed before being handed to the program

The .bss section is always zeroed before execution. I don't have information on what typical .data sections look like, I would assume they're mostly non-zero since otherwise the linker would stick them in the .bss section, unless initialized to zero in C.

Frankly, the relocation issues, memory fragmentation and extra algorithms required sound like a lot of work, and at the end, one can't be guaranteed to actually be able to swap a program in or out when needed. I'm not sure what the user should expect when they don't/can't get an executable to run when expected. I still don't know what kind of compression % can be expected, but even the 30-40% range doesn't really release much space when memory is full, and possibly just further fragments the small 640k main memory arena.

Are we talking about getting swapping running on an 8086? Or a 386? If the latter, just using XMS, which is almost always present, would IMO be a far better solution, with the caveat that swapping would be unavailable on ancient 8086s, which frankly barely have the power to do much simultaneously anyway. (Witness the recent discussion of the sl train taking 2+ minutes to run on an 8088, which runs on 386 machines in 10 seconds.)

ghaerr commented 11 months ago

https://monotech.fwscart.com/product/microram---640k---umb-ram---8-bit-isa

Yes - I would say that this $43 board, as well as @Mellvik's "Lo Tech 1MB" memory board even set up as a single 64K UMB block, adds quite a bit of capability for daemons or other programs that is fully supported right now.

Mellvik commented 11 months ago

Again, this turned into a very interesting and useful discussion.

It seems there are two different approaches to swapping - please correct me if I'm wrong: in-memory compressed swap, and traditional swap to secondary storage. The former is technically challenging but may accommodate the case of non-relocatable processes. The latter is technically easy - moving data back and forth and managing memory segments - but possibly out of reach because some or all processes have bindings to physical locations. A party killer.

Admittedly I haven't completely understood the first (compression) variant - how to manage/release/reclaim the segments freed by compression. So - for now - I'm most interested in the traditional approach, which - ignoring the relocation issues - is fairly straightforward on all PC hardware using either mass storage or EMS/XMS memory options. Which leaves us with the process relocation issues. Again I'm just beginning to wrap my head around the anatomy of an ELKS process with its peculiarities - some of which @ghaerr has mentioned (thank you, that sort of landed the ambitions quite a bit).

Assuming that traditional swapping is impractical or impossible with today's executable format, the question becomes whether it can be changed to accommodate swapping - and if yes, at what cost? Also, if the price is deemed too high for a general change, would several executable formats be feasible - as in 'linking for swappable' vs. 'linking for resident'? After all, many processes, daemons in particular, are very simple programs with predictable resource requirements.

toncho11 commented 11 months ago

I was thinking: what about connecting an Arduino-class device on the parallel port with a lot of internal memory? I quickly checked that there is an STM32 with 192KB of SRAM (for code and data, not flash memory). So in theory one could connect to the STM32 and use these 192KB. OK, we are not restricted to these 192KB. In theory we can put much more (maybe a bit slower).

It could be used directly by the kernel as "slow memory" or swap file memory. Actually you have "slow memory" on some architectures such as Amiga.

Question is how slow the communication will be. Speed is said to be 150 kB/s for the old parallel ports. So a 20KB process would be written in about 133ms ...
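
The arithmetic above can be checked with a one-liner (integer milliseconds, taking the quoted 150 kB/s figure as given, not measured):

```c
/* Rough transfer-time estimate: milliseconds to move a process image
 * of `kbytes` KB at `rate_kbytes_per_sec` kB/s. The 150 kB/s figure
 * is the one quoted in the discussion, not a measurement. */
unsigned transfer_ms(unsigned kbytes, unsigned rate_kbytes_per_sec)
{
    return kbytes * 1000u / rate_kbytes_per_sec;
}
```

For a 20KB process at 150 kB/s this gives 133ms, matching the estimate above - and a swap-out plus later swap-in doubles it.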

At least it can be a fun project. It will require support in ELKS, Arduino type device and parallel port ISA card on older XT machines.

ghaerr commented 11 months ago

It seems there are two different approaches to swapping - please correct me if I'm wrong

Correct, and good summary. And the second (traditional) approach gets even easier if only XMS memory moves are considered, because that would likely eliminate the problem of waiting for I/O to complete and other associated issues when using a swap disk.

I took a look at Bach's book as well (thanks @Mellvik), and learned there's an additional big issue - that a much more sophisticated scheduler is required in order to actually have a system work well with swapping. Round-robin alone doesn't cut it; a good algorithm needs to be implemented to prevent both thrashing and deadlocks, something I hadn't considered. ELKS doesn't have that and uses strict round-robin with no priorities.

would several executable formats be feasible - as in 'linking for swappable' vs. 'linking for resident'?

That line of thought leads to good ideas - the C library could be looked over to find all far code/pointer references, the header could be updated to flag such, as well as medium model restrictions, disabling or flagging signal code, etc. It's all pretty complicated.

Taking this to the next step, any real swapping implementation probably needs to take a back seat to dynamic task allocation, as we're only able to run 14 user-controlled tasks currently, and while adding dynamic tasks as a next step is close to reality, we still have the bigger problem of 1K+ kernel data segment requirement for every running, sleeping or swapped task. This will quickly become a problem if one wants to have a large number of daemons sleeping, waiting for requests that wake them. Unlike UNIX v6+ and the "u" structure which was able to be swapped out along with the data segment, ELKS's kernel stacks have to remain inside the kernel data segment. My proposed solution to that is designing a variable kernel stack size to try to keep kernel data size down (likely specified in the a.out header).

Given all this, swapping is very low on the list of things I'm likely to implement, but contributions are definitely welcome! :)

there is one single primary driver: Increased process space. I'd like to be able to have more daemons around - cheaply. Most of them are idle just about all the time, so the issue is much less performance than availability

Perhaps an entirely new design is needed to accommodate this - like the ability to register applications to run when some event occurs, kind of like good old inetd but also work for clock timeouts, etc. The inetd portions of this could be handled with a ktcp extension that fired off the specified daemon (and how long to have it stay running, etc) from a specification file, and the rest could possibly be handled by extensions to /bin/init using /etc/inittab or another file for cron, at, sync, etc.

Mellvik commented 11 months ago

Taking this to the next step, any real swapping implementation probably needs to take a back seat to dynamic task allocation, as we're only able to run 14 user-controlled tasks currently, and while adding dynamic tasks as a next step is close to reality, we still have the bigger problem of 1K+ kernel data segment requirement for every running, sleeping or swapped task. This will quickly become a problem if one wants to have a large number of daemons sleeping, waiting for requests that wake them.

That obviously depends on the definition of 'large number of processes'. In my mind going from 16 to 25 is a large step, 30 is huge.

there is one single primary driver: Increased process space. I'd like to be able to have more daemons around - cheaply. Most of them are idle just about all the time, so the issue is much less performance than availability

Perhaps an entirely new design is needed to accommodate this - like the ability to register applications to run when some event occurs, kind of like good old inetd but also work for clock timeouts, etc. The inetd portions of this could be handled with a ktcp extension that fired off the specified daemon (and how long to have it stay running, etc) from a specification file, and the rest could possibly be handled by extensions to /bin/init using /etc/inittab or another file for cron, at, sync, etc.

Thanks for bringing inetd back into the discussion. I seem to remember it was declared persona non grata some years back. It would take care of some of the requirements mentioned above. Also, putting that kind of functionality into ktcp is a smart way to avoid another process slot locked up.

That said, and going back to the swapping issue, there is another source of inspiration I haven't mentioned. Venix 2.1, which is running on one of my ancient machines, is a complete Unix V7 implementation for the PC/XT - impressively efficient given the resources at hand, and complete in the sense that it has everything you'd expect from a Unix system, including swapping, multiuser, 25 processes, virtual consoles, serial logins, cc/ld/nm/make, vi, csh, sh etc.

In this context, the swapping capability is in focus, and it's on my list to look into the anatomy of a Venix process, including the memory models supported. Unfortunately, documentation for the x86 version of Venix 2.1 seems to be gone from the face of the earth, but the LSI11/Rainbow documentation is available and it's fair to assume they have a lot in common. BTW, I found this one quite interesting: while Venix uses partitions the regular Unix way - for root, tmp and usr - there is no regular partition for swapping. Instead, the root file system is created significantly smaller than the root partition, and the rest is used as a logical swap area. Pretty smart - giving the user a chance to run something else on the machine too (this was before extended partitions).

ghaerr commented 11 months ago

the definition of 'large number of processes'. In my mind going from 16 to 25 is a large step

Looking at the latest boot, the kernel reports a 22K heap before L1 buffers, which usually take 8K, so 14K remains. Currently, 16 tasks are already allocated in the data segment at around ~16K. Adding another 9 processes would bring the free heap down to 5K - IMO extremely tight, given that the TTY buffers and system buffer headers also need allocation. So something also needs to be done about swapping kernel stacks out of the kernel data segment in order for all this to work well. Variable-size stacks would help, but probably not enough.
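
The heap arithmetic above, spelled out (all figures are the approximate ones quoted in the comment - 22K heap, 8K of L1 buffers, ~1K of kernel data per added task - not measurements):

```c
/* Approximate kernel-heap headroom after reserving L1 buffers and
 * per-task kernel data. All inputs are the rough figures quoted in
 * the discussion, not measured values. */
int free_heap_k(int heap_k, int l1_buffers_k, int extra_tasks, int task_k)
{
    return heap_k - l1_buffers_k - extra_tasks * task_k;
}
```

With the quoted figures, 22 - 8 - 9*1 leaves about 5K free.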

bringing inetd back into the discussion. I seem to remember it was declared persona non grata some years back.

I can remember some discussion on that, I think the reason was that the daemons would require rewriting to support that design. If ktcp were enhanced to start a daemon on demand, they would likely also have to be slightly rewritten, since the accept would have already occurred in ktcp.

Venix 2.1 ..., is a complete Unix V7 implementation for the PC/XT including swapping, multiuser, 25 processes, virtual consoles, serial logins, cc/ld/nm/make etc, vi, csh, sh etc

Wow, what a system! Having a full development system on an 8086 platform today would be so very nice. I wonder whether they used the same compiler to compile Venix or not. Amazing.

the root file system is created significantly smaller than the root partition, the rest is a logical swap partition.

Neat idea.

Mellvik commented 11 months ago

Wow, what a system! Having a full development system on an 8086 platform today would be so very nice. I wonder whether they used the same compiler to compile Venix or not. Amazing.

Yes, it's actually self-compiling. I have something called the 'system reconfiguration kit' which includes BSD-style config files and some sources to enable developers to add drivers and such, like sys/include, plus some instructions on how to link new kernels. Possibly even some a.out documentation - I'd have to look for that. Some details about low-level disk access don't match, so it won't boot on qemu (yet). Fun stuff indeed.

Mellvik commented 11 months ago

bringing inetd back into the discussion. I seem to remember it was declared persona non grata some years back.

I can remember some discussion on that, I think the reason was that the daemons would require rewriting to support that design. If ktcp were enhanced to start a daemon on demand, they would likely also have to be slightly rewritten, since the accept would have already occurred in ktcp.

Quite possibly worth the effort - ifdef'ing one or the other configuration.

ghaerr commented 11 months ago

I've been looking hard again at how (and why) swapping could/should be implemented. I initially thought along the lines of "Why not, people seem to want it? How hard would it really be?"

Starting from the question of why ELKS really needs swapping, it sounds like people are saying that we either don't have enough tasks (current limit 16), and/or that we don't have enough memory (current limit typically 640K + ~64K UMB) to run everything "that's wanted". I'll get into each of these issues below, but frankly, with the exception of one user (@Mellvik, who actually uses ELKS and constantly pushes its limits), I haven't heard of anyone that actually needs swapping. So my take is that swapping would be "nice to have", but few users really need it, or would ever use it.

The issues involved with adding swapping, in more detail below, essentially would take away from a cleanly designed system, and require design hacks that IMO don't buy much and detract from system design, unless the feature is truly needed.

Initial swapping support requires two main enhancements to the system: 1) The ability to swap code and data segments to/from main memory to XMS or disk. 2) The ability to swap kernel stacks for each process to/from the kernel data segment to EXT or XMS memory or disk.

Swapping program segments in and out differs in complexity depending on whether we're talking code or data. With the real-mode segment architecture ELKS is built on, we have the big problems described in my earlier comments: non-relocatable code, far CS: pointers handed to the kernel (e.g. signal handlers), and saved CS values on kernel stacks.

The issue of swapping kernel stacks arises because there must be kernel data segment space available for any process that is swapped in. This means that, effectively, the kernel needs to separate the kernel stack from the task structure, and likely copy stacks back and forth. Should there for some reason be enough main memory to run all tasks, all kernel stacks would have to be present, which is a problem since each stack is about 1K and we're already running very low on kernel data free space. This kind of eliminates the advantage of having swappable stacks in the first place. Dynamically allocating task structures doesn't solve the problem when all tasks are ready to run. Kernel stack sizes could be decreased, but stack requirements increase automatically should the application use the MSDOS filesystem.

I was ready to implement a dynamic stack allocation system in ELKS, which could quickly save 1K kernel data for the idle task, but otherwise ran into issues with how the kernel would know how much stack to allocate, which would require adding yet another header structure in the a.out file, with dubious results if a filesystem read were requested on MSDOS FS. In addition, dynamically allocating a large (1K) structure risks running out of space when the data segment is tight, which is exactly the wrong time to have that happen. The jury is still out as to whether dynamic task allocation is a win or not, even without swapping. See below:

In my mind going from 16 to 25 is a large step, 30 is huge.
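The per-program stack-size hint mentioned above could be sketched as an extra header field. This is purely illustrative: the field names are invented, and the real ELKS (minix-style) a.out header differs.

```c
/* Hypothetical a.out-style header carrying a kernel stack hint, as
 * discussed above.  Field names are invented for the example; this is
 * not the actual ELKS executable format. */
#include <assert.h>

struct exec_hdr {
    unsigned long a_text;       /* code size */
    unsigned long a_data;       /* initialized data size */
    unsigned long a_bss;        /* zeroed data size */
    unsigned long a_total;      /* total memory incl. heap/stack */
    unsigned short a_kstack;    /* hypothetical: kernel stack bytes needed */
};

/* The kernel would clamp the hint to a safe minimum, e.g. large enough
 * for the MSDOS filesystem code paths mentioned above. */
static unsigned short kstack_size(const struct exec_hdr *h, unsigned short minsz)
{
    return h->a_kstack > minsz ? h->a_kstack : minsz;
}
```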

With all the kernel memory that's been saved with various drivers and recent enhancements, we could probably increase max tasks to 20, which requires one line of source change. Lots of benefit, even though all the daemons would remain in memory.

My conclusion is that implementing swapping is loads of work, degrades the overall design, and will not be used much, if at all. IMO a far better solution, if one is even needed at all, would be to work on decreasing the sizes of internal objects (to allow for more kernel data memory) and decreasing the size of application programs, to allow for more usability in a small space. That's really what older systems are about - being small. vi doesn't need to use 64K RAM, we can find a substitute. We don't need large network daemons, we can work on making them smaller. The C library and its routines can be made smaller. All of this will definitely help increase the speed and usability of the system.

Mellvik commented 11 months ago

Hi @ghaerr, that's a thorough and very useful technical and practical run-through, thanks.

Admittedly, I had decided to put the swapping issue on the back burner for a while, but the discussion is just becoming too interesting - and useful. The fact that we have different priorities - TLVC vs. ELKS - doesn't change that, more likely the opposite: Expanding the 'horizon' - we don't have to land on the same page.

I'm approaching this from a different angle than 'need'. Agreed, our systems don't 'need' swapping, they work well as is and there are lots of buttons that can be tweaked to improve many things even more. To me, 'need' is not a useful metric, however. There are lots of functions in our systems that were added for reasons other than need. Functions that after the fact ended up being enablers for taking the systems in new directions. Which is my angle towards swapping:

Of course it's possible. The jury is still out on the other two, which is what makes these discussions so valuable. Admittedly, to me there is also a fun factor involved: The challenge of creating a system that does unbelievable things with small resources. Like (fantasy) 50 processes swapping to XMS or SSD? No problem. The bigger the challenge, the more tempting.

The other interesting thing about adding major functionality that isn't strictly needed is that it may open new doors: The process will typically discover that parts of the current system - old designs or shortcuts, whatever - stand in the way and need to be revisited, quite possibly revised, rewritten. Not because they were wrong but because they don't fit in a new setting. The recent revision of the buffer system is a great case in point. A couple of bug fixes started a process that turned many stones and revealed plenty of opportunities which were explored and taken advantage of. With great results.

The issues involved with adding swapping, in more detail below, essentially would take away from a cleanly designed system, and require design hacks that IMO don't buy much and detract from system design, unless the feature is truly needed.

Initial swapping support requires two main enhancements to the system:

  • The ability to swap code and data segments to/from main memory to XMS or disk.
  • The ability to swap kernel stacks for each process to/from the kernel data segment to EXT or XMS memory or disk.

To me these two are the same. It seems obvious from your discussion (and from reviewing other OSes) that swapping a process must entail swapping the entire process including the 'metadata structures' and stack. Also - again from my point of view - a process must be able to run anywhere in main memory after a swapin. (I seem to remember having read an article (long time ago) about swapping and how the segmented architecture (for once) may be advantageous. I haven't been able to find my way back to the source yet.)

So what if the anatomy of a process needs to change? Is the design so clean that it cannot be made cleaner while providing more flexibility? What if the mechanisms to allocate and handle stacks and signals get a revision? What if the use of far pointers needs to be restricted or more disciplined? What if the scheduler is redone to handle an entirely new set of challenges? Etc. At the end of the day we'll either end up with a better, more flexible, more expandable/capable system. Or decide that the price is too high. In both cases there will be lots of learning and most likely, lots of improvements along the way.

I'm nowhere near understanding the detailed anatomy of a TLVC/ELKS process yet (these discussions help), but do intend to get there at some point - because it's interesting and because the process of understanding/applying/testing/mastering is rewarding in itself. And it's a real privilege to have this 'software lab' with all kinds of sophisticated tools that TLVC and ELKS and the surrounding toolkits represent.

Also, it doesn't have to be a 'big bang' type of change. Having determined the technical goals/requirements, the process can start with the ability to swap small model programs, a revised scheduler and a swapper. And go from there. If we end up with restrictions on program sizes that don't exist today, it may be worth it - forcing tool chain (library) optimizations like you allude to.

@ghaerr - you mention a number of attractive areas for optimization. I believe these are important enhancements regardless - it's not an either/or situation. Adding an inetd, like we've briefly touched on, is attractive. Upping the process limit to 20 is good if it works. I tried that (changing the constant that sets the limit) about 6 months ago, and ended up with a non-bootable system. Dynamic task allocation may be very beneficial, in particular if it can be made to play in a swapping environment.

All this said, I'm not anywhere close to start such a project with TLVC. There are too many real needs and real bugs to be covered. Another year - if decided to be worth the effort. For sure, there's a lot of energy in challenges like this.

ghaerr commented 11 months ago

@Mellvik, thank you for your comments, and it's quite true that priorities, time and energy play a large part in a project like swapping. I appreciate and admire your energy and enthusiasm, and it's certainly showing in your TLVC project!

a process must be able to run anywhere in main memory after a swapin. (I seem to remember having read an article (long time ago) about swapping and how the segmented architecture (for once) may be advantageous.

The segmented architecture is fantastic for relocation, as long as there are no segment register references inside the module. That is, strictly small code or data model programs are extremely easily relocated - just change CS and DS (and SS) and it's done. The problem comes when a saved far pointer to code or data is used (which is the case for the C library, medium model programs, and the ELKS kernel): the embedded segment references are absolute, and thus not relocatable with a CS/DS change.
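The distinction can be shown with a few lines modeling 8086 segment:offset addressing (illustrative only; `linear` and `farptr` are names invented for the example):

```c
/* Why small-model programs relocate trivially on the 8086 while saved
 * far pointers do not.  linear() models real-mode address formation. */
#include <assert.h>

typedef unsigned short u16;
typedef unsigned long u32;

/* 8086 real mode: physical address = segment * 16 + offset */
static u32 linear(u16 seg, u16 off)
{
    return ((u32)seg << 4) + off;
}

/* A saved far pointer embeds an absolute segment value. */
struct farptr {
    u16 seg;
    u16 off;
};

static u32 deref_far(struct farptr p)
{
    return linear(p.seg, p.off);
}
```

A small-model reference is an offset only, so after moving the program the same offset resolves correctly through the new DS. A far pointer saved before the move still carries the old segment value and resolves to the old, now-invalid location, which is exactly the relocation problem described above.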

So - an ELKS/TLVC that only ran small model programs linked with a separate, simplified C library and rewritten signal handling (along with the heavy lifting for task stacks, kernel scheduler and swapper) would likely work to allow those processes to be swapped in to a new address. Is having a special C library and dual kernel signal handling worth it? I suppose it's a matter of priorities. Given the amount of time it would take all that to get implemented, a question to ask is: is the project primarily a research project, or do users expect the system to work for their needs and requests in the meantime? (As you can probably tell, I'm not really appreciative of long outstanding development branches, especially when it comes time to merge their changes back into mainline - but that's just me).

Also, it doesn't have to be a 'big bang' type of change. Having determined the technical goals/requirements, the process can start with the ability to swap small model programs, a revised scheduler and a swapper. And go from there.

Fully agree there. I haven't been able to see a way to incrementally implement what is needed, though. My last serious thought on the task stack structure changes ended up with a conclusion that it would necessarily disrupt the core task switching implementation and possibly introduce longstanding vulnerabilities. Not as big a deal for a research project, but users will complain if the system crashes and want it fixed ASAP.

Upping the process limit to 20 is good if it works. I tried that (changing the constant that sets the limit) about 6 months ago, and ended up with a non-bootable system.

We should probably try that again, if you think 20 tasks would help instead of swapping. There's been quite a bit of kernel data segment recovery and there might just be enough space now, especially if L1 buffers are decreased to 6. I've been thinking of how to greatly decrease the idle task's stack, but since the task structure is a constant array, haven't figured out a good way yet. Another very simple option would be to dynamically allocate all the task structures at once, but with a /bootopts option for how many; that would allow for experimentation between the number of tasks and L1 buffers, the two major users of kernel data space. It is possible an individual system could be dynamically configured to do what is desired without having swapping at all.
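The dynamic-allocation-with-/bootopts idea might look roughly like this. Everything here is a sketch under stated assumptions: `parse_bootopts_tasks`, `task_init`, the `tasks=` option name, and `calloc` standing in for a one-time kernel heap allocation are all invented for illustration.

```c
/* Sketch of allocating the whole task table once at boot, with the
 * count taken from a /bootopts-style option string instead of a
 * compile-time constant.  Names and option syntax are hypothetical. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_TASKS 16        /* current compile-time limit */

struct task {
    int pid;
    int state;
};

static struct task *task_table;
static int max_tasks;

/* e.g. "console=ttyS0 tasks=20"; falls back to the default if absent */
static int parse_bootopts_tasks(const char *opts)
{
    const char *p = opts ? strstr(opts, "tasks=") : NULL;
    return p ? atoi(p + 6) : DEFAULT_TASKS;
}

static int task_init(const char *bootopts)
{
    max_tasks = parse_bootopts_tasks(bootopts);
    task_table = calloc((size_t)max_tasks, sizeof(struct task));
    return task_table ? 0 : -1;
}
```

Since the table is allocated once at boot and never resized, this keeps the "constant array" simplicity while letting each system trade tasks against L1 buffers, as suggested above.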

There are lots of functions in our systems that were added for other reasons than need.

There are too many real needs and real bugs to be covered. Another year - if decided to be worth the effort.

In a nutshell - Agreed.

I'm happy to help anyone who'd like to jump into this with more technical discussions. Otherwise, I'll likely close this as not planned, until the time comes when it is worth the effort.

Mellvik commented 11 months ago

Again, great discussion @ghaerr!

So - an ELKS/TLVC that only ran small model programs linked with a separate, simplified C library and rewritten signal handling (along with the heavy lifting for task stacks, kernel scheduler and swapper) would likely work to allow those processes to be swapped in to a new address. Is having a special C library and dual kernel signal handling worth it?

It depends. If 'special' means improved and is a stepping stone for improvements that eventually will benefit the entire system, then my take is yes, it's probably worth it.

Given the amount of time it would take all that to get implemented, a question to ask is: is the project primarily a research project, or do users expect the system to work for their needs and requests in the meantime? (As you can probably tell, I'm not really appreciative of long outstanding development branches, especially when it comes time to merge back their changes into mainline - but that's just me).

Well, that makes it two of us. A working and stable system is always the key priority.

Also, it doesn't have to be a 'big bang' type of change. Having determined the technical goals/requirements, the process can start with the ability to swap small model programs, a revised scheduler and a swapper. And go from there.

Fully agree there. I haven't been able to see a way to incrementally implement what is needed, though. My last serious thought on the task stack structure changes ended up with a conclusion that it would necessarily disrupt the core task switching implementation and possibly introduce longstanding vulnerabilities. Not as big a deal for a research project, but users will complain if the system crashes and want it fixed ASAP.

Some serious research into what has to change, and subsequent planning are obvious requirements. I certainly don't have all (or even many) of the answers yet, and I may end up with the same conclusion, that it's not worth it. Regardless of conclusion, there will have been an enormous amount of learning behind it.

Upping the process limit to 20 is good if it works. I tried that (changing the constant that sets the limit) about 6 months ago, and ended up with a non-bootable system.

We should probably try that again, if you think 20 tasks would help instead of swapping.

To me this is not an either/or. Testing the systems with the highest possible # of processes is useful in order to know/verify the limits and learn the practical consequences if any.

There's been quite a bit of kernel data segment recovery and there might just be enough space now, especially if L1 buffers are decreased to 6. I've been thinking of how to greatly decrease the idle task's stack, but since the task structure is a constant array, haven't figured out a good way yet. Another very simple option would be to dynamically allocate all the task structures at once, but with a /bootopts option for how many; that would allow for experimentation between the number of tasks and L1 buffers, the two major users of kernel data space. It is possible an individual system could be dynamically configured to do what is desired without having swapping at all.

Well, that's an interesting thought. I'm not sure about it killing the swapping idea (possibly self-inflicted blindness), but that kind of dynamism must be valuable regardless. Like, if not using FAT at all, a much smaller stack is great - if more can be allocated when a FAT filesystem happens to be mounted (which sounds like your previously killed dynamic stack allocation, btw). I have a number of pending experiments with various L1/L2 buffer sizes on physical systems. Getting by with 6xL1 while keeping performance decent is indeed intriguing.

Anyway, except for the last (buffer) issue, I'm resting this one for now. If/when a swapping project is initiated, I will surely be interested in discussions and viewpoints.

Thanks.