otterkit / otterkit-cobol

A free and open source Standard COBOL compiler for 64-bit environments
https://otterkit.com
Apache License 2.0

[Core]: Rethinking how our COBOL backend should be structured (or, what went wrong with our backend). #40

Closed KTSnowy closed 9 months ago

KTSnowy commented 1 year ago

As you might have noticed from our last commit being two months old, we've been on a bit of a hiatus. The reason is that our C# backend most likely won't work the way I originally thought it would, so I spent these past two months studying assembly more closely and looking for alternative solutions that could help us continue working on the backend. I have a couple of ideas, but they're not fully refined yet. I'll document them once I have the full picture of the new backend architecture.

While we still want to support interoperability with C#, our backend will most likely not generate C# anymore, due to several mapping issues that I'll explain below.

TL;DR COBOL doesn't map particularly well to current Algol-like languages.


After staring at both COBOL and assembly code for the past two months, I've noticed that COBOL seems to map extremely well to assembly, though. In fact, COBOL appears to map directly to assembly much better than C ever did (the PDP-11 might be the only exception).

If you stare at both side by side long enough, you start noticing that the separation of the data and procedure divisions is not just for syntactic purposes; it's actually a clear separation of the data and text segments in machine code. In virtual memory, one has read/write permissions and the other has read/execute permissions. Because of this, they can never be placed together (overlap) in memory, and this separation is made explicit in COBOL's syntax.

This also carries over to COBOL's object-oriented syntax, though in this case there's a clear syntactic separation between the stack, the heap, and the code segment. An object method cannot contain a working-storage section (the heap), the object itself cannot contain a local-storage section (the stack), and the code itself is separate from both. Even though it supports OOP concepts, it still maps surprisingly well to assembly, certainly much better than other object-oriented languages.

The statement-based syntax also appears to be somewhat similar to assembly mnemonics in the way statements are written (similar to: verb operand operand ...), so it can be argued that COBOL statements form a sort of high-level assembly language. This becomes clear when you realize that user-defined paragraphs and sections in the procedure division are just assembly labels. In fact, they behave like a label with a conditional jump at the end that determines whether to fall through to the next label or return to the perform statement that called it. This "fall through" behavior, combined with the ability to call a label as if it were a parameterless function, is not easily mapped to other high-level languages, but it maps extremely well to any modern assembly language.
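To make the label analogy concrete, here is a minimal C sketch of that end-of-paragraph check. It is only an illustration under stated assumptions: the paragraphs A, B and C, the `Trace` and `Frame` types, and `run_paragraphs` are all hypothetical names, and a real backend would emit jumps rather than interpret a `switch`. The COBOL being modelled is a main paragraph that does `PERFORM B` and then falls through A, B and C in source order.

```c
#include <assert.h>
#include <stddef.h>

/* Program points in source order. Paragraph names are hypothetical. */
enum { MAIN_START, MAIN_AFTER_PERFORM, PARA_A, PARA_B, PARA_C, DONE };

/* Records which paragraphs ran, one letter per execution. */
typedef struct {
    char   text[32];
    size_t len;
} Trace;

/* One active PERFORM: where it ends and where control resumes afterwards. */
typedef struct {
    int exit_para;
    int resume_pc;
} Frame;

static void record(Trace *t, char c) {
    assert(t->len + 1 < sizeof t->text);
    t->text[t->len++] = c;
    t->text[t->len] = '\0';
}

void run_paragraphs(Trace *t) {
    Frame  stack[8];
    size_t depth = 0;
    int    pc = MAIN_START;

    t->len = 0;
    t->text[0] = '\0';

    while (pc != DONE) {
        int finished = -1; /* paragraph whose end was just reached */

        switch (pc) {
        case MAIN_START: /* PERFORM B */
            assert(depth < sizeof stack / sizeof stack[0]);
            stack[depth++] = (Frame){ PARA_B, MAIN_AFTER_PERFORM };
            pc = PARA_B;
            continue;
        case MAIN_AFTER_PERFORM: /* end of main paragraph: fall into A */
            pc = PARA_A;
            continue;
        case PARA_A: record(t, 'A'); finished = PARA_A; break;
        case PARA_B: record(t, 'B'); finished = PARA_B; break;
        case PARA_C: record(t, 'C'); pc = DONE; continue; /* STOP RUN */
        }

        /* The conditional jump at the end of every paragraph. */
        if (depth > 0 && stack[depth - 1].exit_para == finished) {
            pc = stack[--depth].resume_pc; /* return to the PERFORM */
        } else {
            pc = finished + 1;             /* fall through to the next label */
        }
    }
}
```

Calling `run_paragraphs` leaves "BABC" in the trace: B runs once as a performed paragraph and returns, then A, B and C run by plain fall through. The same paragraph B behaves as a subroutine in one case and as straight-line code in the other, which is the part that has no direct counterpart in a C# method.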

Realizing this was an absolutely enlightening moment. The whole reason why everything is so neatly separated in COBOL, from the divisions to the statements, in both procedural and object oriented code, appears to be so that it can still map well enough to some abstract assembly language. The same can't be said for Algol-like languages. They don't map that well directly to assembly anymore, on any modern architecture.

That's the problem right there. The reason we've been having so much trouble with the backend trying to map COBOL to C# or C is that we're essentially trying to do the job of a disassembler. We're taking an assembly-like language and trying to turn it into an Algol-like language, which is the opposite of what we should be doing. COBOL is essentially a portable low-level assembly-like language. It's not Algol-like, it's not Pascal-like, it's not ML-like; there's nothing else like it. The English-like syntax just makes it much easier to read: remove all of the extra words in most of the statements, and the result looks much like an assembly mnemonic.

It also gives users additional features that are arguably lower level than even what C allows you to do without extra library support. You can override how a symbol is exported to the object file. You can align a bit field to the first bit of the first available byte boundary (aligned clause). You can align a struct so that all items are synchronized to the left or to the right of a natural boundary, changing where the padding bytes will be (synchronized clause). You can change the justification of a bit field, effectively switching its entire in-memory bit ordering (justified clause). You can even change the in-memory endianness of... IEEE 754 binary floating-point variables??? There are more of these useful low-level features, and that includes raw pointers as well!

These are neither the features nor the syntax of a high-level language nowadays. COBOL gives extreme flexibility over how the bits and bytes are stored, both in memory and in the object file. We probably shouldn't be lifting a low-level (arguably extremely low-level) assembly-like language like COBOL to a higher-level Algol-like language. It just doesn't map as cleanly as I thought it would when I started the project; I messed up in astronomical proportions. I'll be looking for alternative solutions that don't require depending on the behavior of other languages.
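For comparison, this is roughly what it takes in C to pick the in-memory byte order of a binary floating-point value by hand. It is a sketch, not anything from Otterkit: `store_f64_be` and `load_f64_be` are hypothetical helper names, and the code assumes `double` is an IEEE 754 binary64 value whose byte order matches the host's integer byte order, which holds on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double is a 64-bit IEEE 754 binary64 value. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Store `value` into `out` in big-endian order, whatever the host order is. */
void store_f64_be(unsigned char out[8], double value) {
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits); /* copy the bits without aliasing tricks */
    for (int i = 0; i < 8; i++) {
        out[i] = (unsigned char)(bits >> (56 - 8 * i));
    }
}

/* Read back a double that was stored in big-endian order. */
double load_f64_be(const unsigned char in[8]) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; i++) {
        bits = (bits << 8) | in[i];
    }
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Storing `1.0` yields the bytes `3F F0 00 00 00 00 00 00` regardless of host byte order, and loading them returns `1.0` again. Every access to such a field has to go through helpers like these in C, whereas in COBOL the choice is part of the data description.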

Some immediate problems with trying to map COBOL to some Algol-like languages:

GitMensch commented 1 year ago

Just to note: you get around the strict aliasing and padding issues by handling all of the relevant parts in a called function that fiddles the data around, possibly in a system- and C-runtime-specific way (GC, i.e. GnuCOBOL, partially does this by checking endianness, alignment and aliasing, either with given defines or with defines created at configure time by inspecting the behaviour). I can say that this does not work "out of the box", and especially not portably, but when handled correctly (which is complex) it works quite well.

Something similar applies to most of the other issues, apart from the externalized names, which you have to hack around (casing normally means: if not a literal, then uppercase; OpenCOBOL just failed to implement it that way, so adjusting it is harder). That possibly needs multiple entry points as well as lookup functions to implement correctly (note: the standard does not say how the modules are found on disk, so regressions there, even generated names to start with, are an implementor decision; it likely is required to be documented for a conforming implementation).

For the topic at hand: would generating IL be a better fit than generating C#?

KTSnowy commented 1 year ago

Hey @GitMensch.

> apart from the externalized names which you have to hack around

I'm planning on using the external repository for this. Whoever designed it must have intended for it to be used as a "linking table", similar to a C header file. Everything needed for any kind of COBOL-related linking is already in there, so we should probably use it for that purpose.

The standard requires us to "provide a mechanism that allows the user to specify whether to update the external repository when a compilation unit is compiled", which means that it's a physical thing that can be updated and seen by the user, much like an auto-generated header file. So we should either embed it into the object file, or maybe generate it as a separate binary file that can be read by the compiler.

The external repository is absolutely needed for overloaded methods though, because as far as I've checked, the standard requires overloaded methods to be externalized with the exact same name. So we can't do any name mangling like C++ usually does for its methods, which means that any dynamic linking or binding of overloaded methods has to go through the external repository to find the correct method. It has to exist at runtime, in a place the program can find.
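A rough C sketch of what such a lookup could look like when two overloads share one external name. Everything here is an assumption made for illustration: the `RepoEntry` layout, the signature strings, the `ADD` method and `repo_resolve` are hypothetical, and the standard does not prescribe any of this representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic entry point; callers cast it back to the real function type. */
typedef void (*EntryPoint)(void);

typedef struct {
    const char *external_name; /* identical for every overload */
    const char *signature;     /* hypothetical encoding of the parameter types */
    EntryPoint  entry;
} RepoEntry;

/* Two overloads that would be externalized under the same name. */
static int    add_int(int a, int b)          { return a + b; }
static double add_double(double a, double b) { return a + b; }

/* Stand-in for the external repository contents. */
static const RepoEntry repository[] = {
    { "ADD", "int,int",       (EntryPoint)add_int    },
    { "ADD", "double,double", (EntryPoint)add_double },
};

/* Find the overload matching both the name and the signature, or NULL. */
EntryPoint repo_resolve(const char *name, const char *signature) {
    assert(name != NULL && signature != NULL);
    for (size_t i = 0; i < sizeof repository / sizeof repository[0]; i++) {
        if (strcmp(repository[i].external_name, name) == 0 &&
            strcmp(repository[i].signature, signature) == 0) {
            return repository[i].entry;
        }
    }
    return NULL;
}
```

A caller that wants the integer overload would resolve `("ADD", "int,int")` and cast the result back to `int (*)(int, int)` before calling it; an unknown signature resolves to `NULL`. Since the name alone is ambiguous, the signature data has to be available wherever binding happens, which is the role the external repository would play.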

Interestingly though, I haven't seen any other COBOL compiler actually use the external repository as a physical thing like this, or at least I couldn't find any documentation on it. (I hate that it's all proprietary)

> For the topic at hand: would generating IL be a better fit than generating C#?

It would be much better. As pointed out in the issue above, COBOL (as defined in the standard) seems to map amazingly well to assembly, and that would include IL instructions as well. COBOL seems to be lower level than C in some parts, and it's certainly more flexible than C for low-level tasks, so trying to map it to anything other than some assembly language would require "lifting" certain features. We'd be doing the work of a disassembler.

The specification for the CIL Instruction Set is freely available and standardized. I'll look into it and see if it would work for us.

GitMensch commented 1 year ago

Concerning the external repository: I've always thought about that like an sqlite db containing all the internal prototypes (and it either is updated on compile or not)...

I also don't know of a single compiler actually implementing it.

So far, cobc only keeps that external repository in memory: as soon as the process vanishes, it is gone. It will likely be kept later, using a default name of repo.db in a system-central place readable by all users, with the option to put or name it differently (so multiple projects can use multiple ones even if the names are identical). A nice benefit: this can also support the build system with "oh, you've changed a prototype for a program which is statically called from these 50 programs - consider recompiling them".

If the CIL hint did help, then that would be nice and that way you still would have a "modern and portable" compiler, and may even be able to use the matching debugging info (seems to be standardized there, too).

kant2002 commented 1 year ago

I would like to learn a bit more about the practical problems, since a friend and I are working on a C -> IL compiler and so far we think that most of C can be expressed in IL without a problem. I have not thought too closely about the very low-level aspects of C and what we can do with them, so maybe I'm a bit optimistic, but I would like to think (and probably contribute) that COBOL -> IL is possible.

KTSnowy commented 1 year ago

Hey @kant2002, both COBOL and C to IL should be possible, but you'll soon run into the same issues that we did with dotnet's relocatable heap. Any pointers you have that point into the managed heap will become invalid whenever the GC runs and compacts the heap.

The solution we came up with was to write our own native memory allocator so that we can bypass the managed heap and create pointers to our own objects that don't suddenly become invalid.

You'll likely want to do the same, or use one of the open source memory allocators like rpmalloc or mimalloc.
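As a rough idea of what allocating outside the managed heap means, here is a minimal bump allocator in C over a single `malloc` block. It is not Otterkit's allocator, and `Arena`, `arena_init`, `arena_alloc` and `arena_release` are hypothetical names. The point is only that addresses handed out from native memory stay fixed until the arena is released, since no collector moves them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;     /* start of the native block */
    size_t         capacity; /* total bytes in the block */
    size_t         used;     /* bytes handed out so far */
} Arena;

/* Reserve one native block. Returns 0 on success, -1 on failure. */
int arena_init(Arena *a, size_t capacity) {
    a->base = malloc(capacity);
    a->capacity = a->base != NULL ? capacity : 0;
    a->used = 0;
    return a->base != NULL ? 0 : -1;
}

/* Hand out `size` bytes aligned to `align`, which must be a power of two
   no larger than the alignment malloc guarantees. NULL when full. */
void *arena_alloc(Arena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->capacity || size > a->capacity - start) {
        return NULL;
    }
    a->used = start + size;
    return a->base + start;
}

/* Give the whole block back; every pointer from this arena dies here. */
void arena_release(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

Individual allocations are never freed or moved in this sketch, so a pointer returned by `arena_alloc` can be stored in COBOL data items and dereferenced later without pinning. General-purpose allocators such as the ones mentioned above add per-object freeing on top of the same basic property.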

kant2002 commented 1 year ago

Yes. We don't use managed code anywhere. When we emit code, we use only pointers and arithmetic, and calls on static classes. We also have a C runtime, which is written in regular C#. So garbage-collected memory was never a problem for us, since we never allocate. We also attempt to store constant data in CIL metadata.

Regarding malloc/free: yes, this is part of the C runtime, and memory is allocated using unmanaged code. I think this is fine and I see no problem. Yes, the generated IL is very much C-like.

So I'm really wondering what kind of limitations cannot be expressed in the current approach. I'm definitely not yet familiar with Otterkit.

sgorozco commented 1 year ago

> I would like to learn a bit more about the practical problems, since a friend and I are working on a C -> IL compiler and so far we think that most of C can be expressed in IL without a problem. I have not thought too closely about the very low-level aspects of C and what we can do with them, so maybe I'm a bit optimistic, but I would like to think (and probably contribute) that COBOL -> IL is possible.

The C++/CLI compiler has the capability to compile pure C or C++ code (sans managed 'ref class' elements) entirely into IL, making your optimism quite grounded ;) The generated IL may be termed "esoteric", but a disassembly reveals how it emits IL code to represent a language devoid of garbage collection like C. Delving into the disassembled code could aid in reengaging post-hiatus. @KTSnowy: If you can articulate the logic causing the impedance mismatch between COBOL and C# in pure C or C++, then C++/CLI becomes a valuable ally. This way, you or anyone on the team might avoid the necessity of emitting pure IL, and merging both is quite easy thanks to the built-in interop layer: an intermediary ref class that can be easily called from C#, naturally bridging with the unmanaged C or C++ code (unmanaged BUT compiled to IL), with zero DLLImports required.

I have just been a passive observer of this quite interesting project. I have written my share of C++/CLI code so if I can be of any help, please reach out! =)

GitMensch commented 9 months ago

Closed as "not planned" means what now? Note that the pull requests were postponed because of the likely change of the backend.