As a not-so-helpful but reasonably true answer to the title question: the AM ends wherever you define it to end. The Implementation defines some manner in which the Concrete Machine is used in order to emulate the AM. For each step of the AM, the Implementation maps that AM step into whatever CM operations need to happen, and then maps whatever CM state changes back into AM state changes as established by the manner in which the CM emulates the AM.
When communicating with some process on the CM which is not known to this AM, all state sharing is done at the level of the shared CM semantics. (For LTO, this would be LLVM-IR, and not the target physical machine!) The manner in which CM effects from another process are translated into AM effects is again defined by the Implementation's manner of lowering the AM to the CM semantics.
Things of course get more interesting when concurrency is involved, since operations which are atomic on the AM may not be atomic on the CM (e.g. atomic access lowering to nonatomic access plus appropriate fencing). But this basic framework is the shape by which all effects that don't originate within the AM get translated into the proper AM effects; it's the Implementation which chooses and defines what means what.
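A minimal sketch of that AM/CM split (the FLAG static and publish function are just illustrative names, not anything from the discussion): the AM-level operation below is a single atomic release store, and how an Implementation lowers it to CM operations is its own choice, as long as the observable AM semantics are preserved.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

static FLAG: AtomicU32 = AtomicU32::new(0);

fn publish() {
    // On the AM: one atomic store with release ordering.
    // On the CM: whatever the Implementation picked for this target,
    // e.g. a single hardware instruction, or a plain store preceded
    // by an appropriate fence.
    FLAG.store(1, Ordering::Release);
}
```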
Of course we do want to do better than just saying "it's implementation-defined behavior" where we can do so, so I think this is still a valid question to ask.
At a minimum, Rust code using the same instance of std needs to be in the same AM world, and code using different instances of std exists in different AM worlds. (Rule of thumb: are std's global places shared or distinct?) Whether this would be consistently initialized correctly in the different proposed scenarios is a different question.
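As a sketch of that rule of thumb (the COUNTER static and bump function are hypothetical, standing in for std's real internal globals such as allocator state or stdio locks):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for one of std's global places.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn bump() -> usize {
    // Two bodies of Rust code that observe the same counter sequence share
    // this global place and, by the rule of thumb above, live in the same
    // AM world; two copies of the static mean two worlds.
    COUNTER.fetch_add(1, Ordering::Relaxed) + 1
}
```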
I think it might be... well, "simpler" than all those scenarios above. To me, the key element is the Ralf quote that was linked: A given AM needs to have a single address space. So, any time there's more than one address space, for any reason, that's more than one AM. This seems like a helpful point to focus on, because then we don't have to ask about two user processes, or a user process and the kernel, or two Rust processes sharing memory, or ten other edge cases like that. If it's separate address spaces, it's separate abstract machines.
It might also be the case that two things sharing the same address space are still separate Rust AM instances somehow, but at least it narrows how many potential cases we need to investigate.
At a minimum, Rust code using the same instance of std needs to be in the same AM world, and code using different instances of std exists in different AM worlds. (Rule of thumb: are std's global places shared or distinct?) Whether this would be consistently initialized correctly in the different proposed scenarios is a different question.
That makes sense, but is not as helpful as it could be for no-std code (which is one of the things I'm really interested in).
A given AM needs to have a single address space. So, any time there's more than one address space, for any reason, that's more than one AM.
This view also makes sense, but there are some systems where this creates issues:
This shows that there is a wide grey area between "share everything" and "share nothing" out there in the wild. We could draw the line at any point in between, and the two lines you proposed both seem reasonable. But what are the side effects / fallout of picking any one of those lines?
In particular, if they become separate AMs as soon as there is any memory that isn't shared between the threads of execution, what happens if you still share std? Does that become UB?
That would render some of those embedded systems, such as the RP2040, problematic, as you cannot change the memory map: you get what you get. It is a no-std platform though, so maybe that is its saving grace.
It's not about sharing std specifically, it's about the assumptions that Rust is built on at the language level. We kinda assume, for example, that shared references can be sent to another thread, and that your threads can be executed on any core. As soon as cores become limited in what they can do, things get dicey. Not that it can't work, but you've gotta be very careful and mindful of what you're doing. This isn't entirely unknown to the Rust ecosystem, because some OS APIs are restricted in what they can do on which thread, but it does mean a lot of unsafe code is needed if a safe API can't be wrapped around the situation.
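A small sketch of the assumption being described (the sum_elsewhere function is just illustrative; nothing here is platform-specific): the type system lets a shared reference cross into another thread, and the AM assumes whatever core runs that thread can see the same memory.

```rust
// &[u64] is Send + Sync, so handing it to another thread is fine as far as
// the AM is concerned; an Implementation on hardware where not every core
// can reach this memory has to make this still hold (or cannot offer such
// threads at all).
fn sum_elsewhere(data: &[u64]) -> u64 {
    std::thread::scope(|s| s.spawn(|| data.iter().sum()).join().unwrap())
}
```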
Historically there have been many systems with asymmetric multiprocessing, where not all cores have access to all memory. The most well known example is probably the Cell processor in the PlayStation 3, where the SPE vector cores had some of their own local fast RAM.
The SPE cores run a different instruction set from the main PPE core. As such, the best way to model this is probably different processes (and thus different AMs) on different devices which happen to share a part of their address space, the same way a CPU and GPU share memory. Modeling it as threads would imply that you can safely migrate threads between SPEs as well as between an SPE and the PPE, which is not the case due to some memory being private to an SPE and due to the different instruction sets. The different instruction sets also imply that code for the SPE and PPE has to be compiled separately with different copies of the standard library, while everything in a single AM can be compiled together and has to share a single copy of the standard library.
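As a hedged sketch of what crossing such a boundary tends to look like (SHARED_BASE and read_mailbox are made up for illustration; a real address and access protocol come from the platform): the shared region is treated as foreign memory and accessed through raw pointers with volatile or atomic operations, rather than through ordinary references owned by either AM.

```rust
// Hypothetical base address of a region that another AM instance
// (a different core, possibly a different ISA) also reads and writes.
const SHARED_BASE: usize = 0x2000_0000;

/// Read a 32-bit "mailbox" word from the shared region.
///
/// # Safety
/// The caller must ensure the region is actually mapped at this address and
/// that the platform's rules for concurrent access to it are followed.
unsafe fn read_mailbox() -> u32 {
    core::ptr::read_volatile(SHARED_BASE as *const u32)
}
```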
If you have two separate Rust programs that are both running under an OS, it seems pretty clear they are two separate instances of the Rust AM (Abstract Machine).
If you have two threads in the same program, my understanding is that there is a single instance of the AM.
However, what about two separate programs, but with shared memory between them (e.g. using mmap for IPC)? It has to be two separate instances to not be UB, according to this reply by Ralf Jung on IRLO. The shared memory will usually not be at the same address in both processes. Since we presumably want shared memory to NOT be UB, they must be separate instances of the AM. So the "border" is somewhere between "threads sharing all of memory" and "memory mapped between processes". But where exactly?
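A minimal sketch of the mmap-for-IPC setup being described, assuming a POSIX system and the libc crate, with error handling reduced to an assert (map_shared is a made-up helper name): the OS picks the address, so each process generally sees the region at a different location, and only offsets into it are meaningful to share.

```rust
use std::os::fd::RawFd;

/// Map `len` bytes of an already-created shared memory object (e.g. from
/// shm_open) into this process. The returned address is generally different
/// in every process that maps the same object.
unsafe fn map_shared(fd: RawFd, len: usize) -> *mut u8 {
    let p = libc::mmap(
        std::ptr::null_mut(), // let the OS choose the address
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_SHARED,
        fd,
        0,
    );
    assert_ne!(p, libc::MAP_FAILED);
    p.cast::<u8>()
}
```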
Let's consider some hypothetical scenarios to see where they would land:
There are many other edge cases that you could come up with. I would ideally like a clear definition of what exactly constitutes an instance of the Rust Abstract Machine vs several, as this heavily impacts what sort of virtual memory tricks are valid in Rust, in particular around mmap-accelerated ring buffers, embedded systems, and kernels. Or is this still an open question, and if so what parts are decided?
I looked through the Rust Reference but wasn't able to find any definition.