Very interesting. Having consistent figures would be perfect for deterministically determining preparation memory.
I am not sure if we can safely disregard the allocator overhead, though -- it would mean that, potentially, preparation can pass on one validator and error out (e.g. with OOM) on a different validator with a different allocator. The enforced memory limit for preparation would have to be low enough that we would have some certainty that no reasonable allocator could trigger OOM.
I do like this idea -- do you know of any prior art for this technique @koute?
it would mean that, potentially, preparation can pass on one validator and error out (e.g. with OOM) on a different validator
In theory an OOM on only one set of validators can already happen due to factors outside of our control. We don't control how much RAM the validator machines have; we don't control what other processes they're running; we don't control the system libraries and how much memory they take, etc.
Still, this is a valid attack vector. If the attacker can directly allocate and deallocate memory and write to that memory then it's 100% possible and very easy to trigger an OOM which a global allocator wrapper would be unable to detect. But for metering non-malicious actors it should probably be fine, and technically we could sidestep the problem by also having a second absolute global limit - either a cgroups-based one (which would be ideal since it's OS enforced) or an RSS-based one (worse and potentially unreliable, but would work on more OSes). The deterministic** limit would work well for normal payloads, while this alternative one would be high enough to never trigger on normal payloads but would be a failsafe against potential bad actors.
There's also an option of writing our own allocator which could work the same across OSes, allow us to track memory usage more precisely and would be in pure Rust. If we have our own database then maybe we could also have our own allocator? :P (I could probably write one with jemalloc-like performance in less than a month, but, honestly, it's probably not a good use of our resources?)
** - although even this is probably not going to be entirely deterministic due to e.g. multithreading where different amounts of memory will get allocated depending on how the OS schedules everything etc., but it should still be roughly in the same ballpark.
I do like this idea -- do you know of any prior art for this technique @koute?
This is used relatively commonly for debugging purposes, but probably not for metering.
Thanks, great info! And agreed with everything.
There's also an option of writing our own allocator which could work the same across OSes, allow us to track memory usage more precisely and would be in pure Rust. If we have our own database then maybe we could also have our own allocator? :P (I could probably write one with jemalloc-like performance in less than a month, but, honestly, it's probably not a good use of our resources?)
Haha, not sure. Maybe a good candidate for a grant from W3F? 🙂
Potentially also interesting are low-fragmentation allocators like e.g. https://github.com/plasma-umass/Mesh. Would it have any benefits in avoiding OOM? Would we get more deterministic memory measurements?
There is for sure some performance overhead in keeping the fragmentation low. But I'd be interested in seeing how it performs in practice -- by using less memory the application could utilize cache more effectively, who knows? Worth exploring further?
There's also an option of writing our own allocator which could work the same across OSes
I believe it's not the time to put some significant effort into developing a deterministic cross-platform allocator yet. We are still so far away from cross-platform determinism in many other aspects that it would make little difference. From my perspective, what we really need is to develop a wrapper allocator which would do some bookkeeping for us and then forward allocation/deallocation requests to a real allocator. That would be enough for now; enforcing those limits somewhat differently on different platforms is bad but not worse than non-deterministic native stack behavior and non-deterministic timeouts on different platforms.
We should also not worry about multithreading right now, as Wasmtime does all the PVF preparation in single-threaded mode, and I don't see any clear pathway to changing that any time soon.
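For illustration, a minimal sketch of such a bookkeeping wrapper (all names here, like TrackingAllocator, are made up; a real version would wrap whatever allocator the node actually uses, e.g. jemalloc, rather than the system allocator):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical bookkeeping wrapper: counts the bytes handed out and forwards
/// every request to the real allocator underneath (`System` here, but it could
/// wrap anything implementing `GlobalAlloc`).
struct TrackingAllocator<A>(A);

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackingAllocator<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.0.alloc(layout) };
        if !ptr.is_null() {
            // Peak tracking is approximate under concurrency, which is fine
            // for a rough, allocator-independent figure.
            let now = ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.0.dealloc(ptr, layout) };
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// The binary (not the library crate) would install the wrapper:
#[global_allocator]
static ALLOC: TrackingAllocator<System> = TrackingAllocator(System);

fn main() {
    let v: Vec<u8> = vec![0; 1024 * 1024];
    drop(v);
    println!("peak allocated: {} bytes", PEAK.load(Ordering::Relaxed));
}
```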
Potentially also interesting are low-fragmentation allocators
As I see it, it adds one more headache (to ensure not only deterministic measurement but also deterministic defragmentation logic) with no sensible positive outcome. If our wrapper allocator enforces OOM, it doesn't matter if the allocated memory is fragmented or not, we'll enforce OOM after the allocation of X bytes anyway. If the kernel enforces OOM, it doesn't give us any leverage. If one host has 1 GB of available memory, another host has 8 GB, and our limit is 1.5 GB, then trying to allocate 1.2 GB will result in an OS OOM condition on the first host and not on the second, regardless of whether the memory is fragmented or not.
Agreed about the wrapper. Seems like it would help us track allocation metrics and enforce limits with minimal effort.
As I see it, it adds one more headache (to ensure not only deterministic measurement but also deterministic defragmentation logic) with no sensible positive outcome.
Why is it a headache to try a new allocator and see how it performs? If we write the wrapper properly we should be able to swap allocators in and out pretty easily. If we try the low-fragmentation allocator and it turns out to be too slow (which I suspect it probably will), then we just don't use it. But if it turns out to have acceptable performance, why not use it? It would help us avoid OOM in practice, I think.
If our wrapper allocator enforces OOM, it doesn't matter if the allocated memory is fragmented or not, we'll enforce OOM after the allocation of X bytes anyway.
Yes, but we could maybe set the limit X lower in practice. From what I've seen, there can be a huge amount of waste (gaps/fragments on the heap) with traditional allocators. Any limit X implicitly accounts for this wasted space.
Yes, but we could maybe set the limit X lower in practice. From what I've seen, there can be a huge amount of waste (gaps/fragments on the heap) with traditional allocators.
Ah, I see what you mean. The profitability of defragmentation highly depends on what we're limiting, and we have yet to agree on that.
I agree it's not a problem to try another allocator, but we should understand what we're trying to achieve with it to be able to evaluate the outcome :)
The profitability of defragmentation highly depends on what we're limiting, and we have yet to agree on that.
Someone should probably look at the metrics we added and see what they say. 😂 So far I would agree that RSS is not very useful since the measurements I have seen were very high. Here is a post summarizing the advantages/disadvantages of each approach.
Edit: concerning the issues with RSS and swap, @eskimor there argues that most validators will not have swap enabled or significant enough memory pressure for it to be an issue.
Tracking and limiting allocations seems sensible and simple to me. Additionally taking into consideration fragmentation instead of just the raw allocation stat would probably actually be less deterministic. Was just a fun idea though.
If we're limiting data segment size, defragmentation can significantly help and even make the whole measurement much more deterministic.
What do you mean? Wikipedia: "The size of this segment is determined by the size of the values in the program's source code, and does not change at run time."
concerning the issues with RSS and swap, @eskimor there argues that most validators will not have swap enabled or significant enough memory pressure for it to be an issue.
I agree, with one amendment: "under current conditions". @bkchr always reminds me that it's not us who are in control of the network. We cannot predict what validators, in what hardware configurations, and even with which concrete implementations of the Polkadot spec will participate in the network in the future. I wouldn't like the determinism of the measurements to be ruined by honest actors who decided to enable swap space on their hosts (btw, some of them, hosted on shared platforms, don't have a choice). From the point of view of an OS process, we don't have any memory beyond the virtual one, so I'd say that's exactly what has to be measured (or its volatile part, at least, to be more precise).
What do you mean? Wikipedia: "The size of this segment is determined by the size of the values in the program's source code, and does not change at run time."
Well, yes, there's a lot of ambiguity between abstract concepts and real life :) That's what is called the "heap segment" in the article, but in a real Linux process there's no separate heap segment; it's a continuation of the data segment, which grows towards the stack when you allocate more memory, and it represents the heap. Look at the brk() manpage to get a rough idea of what's going on there. Measuring the data segment size shouldn't be a challenging task; it's exposed through /proc/<pid>/maps, so I suppose it should be possible to acquire that information from program code easily (though I've never done that before), and it's a really good metric that can tell how much memory the process needs (not uses) to run. Still, I believe that measuring real peak allocation can be accomplished with less effort. Also, I'm not sure how MacOS's MMU behaves, whether they have a separate heap segment or what.
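For reference, a rough Linux-only sketch of reading that figure from procfs; it uses the VmData field of /proc/self/status (reported in kB), and the same approach would work for fields like VmRSS or VmSwap:

```rust
use std::fs;

/// Rough sketch: read the `VmData` field (data segment size, in kB) from
/// /proc/self/status. Linux-only; other fields such as `VmRSS` or `VmSwap`
/// can be read the same way.
fn vm_data_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmData:"))
        // The remainder looks like "  123456 kB".
        .and_then(|rest| rest.trim().trim_end_matches("kB").trim().parse().ok())
}

fn main() {
    println!("data segment size: {:?} kB", vm_data_kb());
}
```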
If we write the wrapper properly we should be able to swap allocators in and out pretty easily.
Yeah, we have raw accesses to jemalloc-ctl sprinkled around the codebase. It'd probably make sense to create e.g. an sc-host-allocator crate (or something like that) in Substrate which would abstract over the allocator. (Kind of like the old parity-util-mem, but without it actually setting the global allocator - that'd be done in the binary.)
Look at the brk() manpage
FYI, AFAIK only glibc's crappy allocator uses sbrk; good allocators in general don't. (e.g. jemalloc by default only uses it to allocate extents if it can't do so through mmap, so in practice that should pretty much never happen)
A perfectly defragmented page may be swapped out and not included in the RSS at all.
This reminds me, when tracking RSS it'd probably make sense to track RSS and swap. In general if there's swapping going on then that does open an entirely new DoS vector. (e.g. imagine the PVF process triggering enough memory allocations to not OOM anything yet, but to trigger excessive swapping in the host validator process; we could rectify this by running both in separate cgroups which gives us tools to control this)
FYI, AFAIK only glibc's crappy allocator uses sbrk; good allocators in general don't.
That means I'm somewhat rusty on modern allocators. If it's all mmapped nowadays, measuring the data segment size doesn't make sense, and we're left with either RSS+swap or real allocation measurement (which is still enough for us, I believe).
Cgroups would be an excellent opportunity to take control over everything we need, but are they available anywhere beyond Linux?
Cgroups would be an excellent opportunity to take control over everything we need, but are they available anywhere beyond Linux?
They aren't, but do we care? We could just require people to run Linux to securely run a validator (and 99% of nodes are running Linux according to our telemetry) and for other OSes have minimal/no sandboxing and require e.g. a command line parameter --i-know-what-i-am-doing-running-a-validator-on-unsupported-os or something like that for those who want to ignore this requirement.
Also, this is totally unrelated, but another benefit of running PVF stuff under cgroups is that we could give the cgroup e.g. only 2 cores and then safely enable parallel compilation in wasmtime. (There are other ways to do it, but this one would be the most bulletproof.)
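For illustration, a rough sketch of what the cgroup v2 side could look like (the group name, limits and core count are placeholders; it assumes cgroup v2 is mounted at /sys/fs/cgroup with the needed controllers delegated to the node):

```rust
use std::fs;
use std::path::Path;

/// Rough sketch: put an already-spawned PVF worker into its own cgroup v2
/// with a hard memory ceiling and only two CPU cores. Assumes cgroup v2 is
/// mounted at /sys/fs/cgroup, the memory and cpuset controllers are enabled
/// in the parent's cgroup.subtree_control, and the node has permission to
/// manage this subtree (in practice probably delegated via systemd). All
/// names and figures here are made up.
fn put_worker_in_cgroup(worker_pid: u32) -> std::io::Result<()> {
    let cg = Path::new("/sys/fs/cgroup/pvf-worker");
    fs::create_dir_all(cg)?;

    // Absolute, kernel-enforced memory limit: the group gets OOM-killed
    // once it exceeds this figure.
    fs::write(cg.join("memory.max"), "2147483648")?; // 2 GiB, placeholder
    // Keep the limit about RAM, not RAM + swap.
    fs::write(cg.join("memory.swap.max"), "0")?;
    // Restrict the group to two cores, e.g. so wasmtime's parallel
    // compilation can be enabled without starving the validator itself.
    fs::write(cg.join("cpuset.cpus"), "0-1")?;
    // Finally, move the worker process into the group.
    fs::write(cg.join("cgroup.procs"), worker_pid.to_string())?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Placeholder: in the node this would be the spawned worker's PID.
    put_worker_in_cgroup(std::process::id())
}
```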
They aren't, but do we care? We could just require people to run Linux to securely run a validator
That's a valid point for now, but doesn't it look like we're trying just to postpone the problem? For example, we're talking a lot about cross-platform PVF execution determinism, like between x64 and arm64, when we could just require validators to run on x64 and forget about that, but for how long? I mean, a global platform revolution wiping out Linux servers on Intel in favor of, say, AIX servers on ARM is unlikely to happen tomorrow, and the question is: how far ahead should we look?
@eskimor @rphmeier do you have some roadmap in mind?
That's a valid point for now, but doesn't it look like we're trying just to postpone the problem? For example, we're talking a lot about cross-platform PVF execution determinism, like between x64 and arm64, when we could just require validators to run on x64 and forget about that, but for how long?
We can not and we should not require anything. However, we can put out recommendations and tell people what the current implementation does not support. If we only support cgroups, then people not having cgroups can not use the binary or should be aware of the security implications, aka not ensuring that the validation is not eating terabytes of memory. You should also not forget that people can just take the code and remove all the security-related machinery we have implemented. We can not prove in any way how much memory the validation has used etc. We are relying on the fact that 2/3 are honest and don't run any hacked binary. Theoretically there could be some kind of offchain market for selling higher validation times to parachains, and then all validators just give a particular block more time to validate it :P
If we only support cgroups, then people not having cgroups can not use the binary or should be aware of the security implications
But telling people "we only support cgroups" is the same as telling them "we only support Linux systems running on Intel hardware", isn't it?
Anyway, as we discussed before, the preparation memory limit is not something we can apply deterministically across platforms and not something that can be included in the spec for that reason. I remember you agreed it's a hacky thing, and we only need it to mitigate possible malicious behavior until we find some better solution.
So if nobody has any objections against sticking to a single platform for now, I will vote for cgroups then, that looks like the most straightforward solution.
Came here from another issue. Just my 2 cents. Apologies if it doesn't make sense.
If we only support cgroups, then people not having cgroups can not use the binary or should be aware of the security implications, aka not ensuring that the validation is not eating terabytes of memory.
But isn't it more than that? If your node can't limit the memory, the binary can crash with OOM and the validator will need some time to restart. An attacker would then craft a parathread that crashes all non-Linux validators. And that could affect the overall network, right?
But telling people "we only support cgroups" is the same as telling them "we only support Linux systems running on Intel hardware", isn't it?
I think it is. Probably no one would run a validator and risk the binary being crashed by a malicious parathread. But as you said: It is already a pretty bad idea to run on a non-Linux system for other reasons. I don't get the Intel hardware part, though. Cgroups should be supported on different hardware architectures as long as it is Linux.
So if nobody has any objections against sticking to a single platform for now, I will vote for cgroups then, that looks like the most straightforward solution.
I don't think it causes any harm, except that it adds another thing to the list of things that need to be reworked for cross-platform support in the future. It seems unlikely, but I remember some talks about leveraging Apple hardware for its insane memory bandwidth. But it seems much more likely that Intel will just adopt the feature and we continue using it on Linux.
I don't get the Intel hardware part, though. Cgroups should be supported on different hardware architectures as long as it is Linux.
I believe it was in another discussion where I insisted that any memory limits cannot be applied deterministically across platforms. That is, we can enforce them with cgroups, a wrapper allocator, or whatever, but the Wasmtime binary itself, compiled for different platforms, wouldn't use precisely the same amount of memory on each of them. In some places due to alignment rules, in others due to different code generation, but they will differ. And if we set the memory limit to 1024 bytes, and the Intel binary uses 1023 while the ARM binary uses 1025, it is still not deterministic at all. The same goes for the stack overflow condition, for example.
In general, when I said "we" I meant "Parity", aka the Parity Polkadot node implementation. We need to face reality, and the reality currently is that most validators run on Linux. Take instance reuse, which was slow on MacOs because we had only optimized it for Linux, until you @athei came and optimized it for lovely MacOs ;) Here we can go the same way again: we will only start with cgroups and print some big fat warning if cgroups isn't supported. Later, when demand increases, we can also implement it for whatever OS using whatever that OS provides.
It seems unlikely, but I remember some talks about leveraging Apple hardware for its insane memory bandwidth.
That is only for research, to find out whether we may want this once it's commodity hardware and not Apple-exclusive.
In some places due to alignment rules, in others due to different code generation, but they will differ. And if we set the memory limit to 1024 bytes, and the Intel binary uses 1023 while the ARM binary uses 1025, it is still not deterministic at all. The same goes for the stack overflow condition, for example.
We can continue discussing all these small edge cases until the end of the universe :P We will need some framework to find out these problems and then come up with some smart ways to solve them. I already proposed some RFP to work on a conformance testing framework: https://github.com/w3f/Grants-Program/blob/master/docs/RFPs/Open/parachain_validation_conformance_testing.md
Or we write our own wasm compiler that is built around all our requirements from the beginning.
Or we write our own wasm compiler that is built around all our requirements from the beginning.
Upvoted with both hands. I'm currently trying to come up with a proof of concept for my very simplistic approach to that problem; we'll see if it pays off. But regardless, having deterministic code generation is the only straightforward way to deterministic code execution.
Or we write our own wasm compiler that is built around all our requirements from the beginning.
Writing a linear-time compiler for wasm that also produces fast code is hard. There were many failed attempts at making this work for contracts. Granted, the requirements for contracts are stricter, and you might come up with something decent when it only needs to be linear in memory usage but not in time.
But it seems like a colossal task if you try to compete with the performance of wasmtime. I think cranelift (the wasmtime backend) is already pretty memory-conscious, so vetting it for linear memory complexity and disabling non-linear (in terms of memory) optimizations might be the way to go. They might be open to merging a "linear memory complexity mode". Performance will probably suffer, but it will with a custom solution, too. Some optimizations just don't have linear algorithms. But the upshot is that you don't need to worry about limiting memory.
For example, in contracts our current thinking is that it might be easier to design tailor made instruction set that can be trivially compiled and has other useful properties for our use case.
My 2 cents: Stick with wasmtime and memory metering / cgroups. It seems to be good enough. As long as you don't need linear-time compilation, that is (it seems that you don't).
Or we write our own wasm compiler that is built around all our requirements from the beginning.
I should have made it clear that this was a JOKE :D For sure I don't want to encourage us to write our own wasm compiler; we have more important things to fix for now :P And yeah, this would be an endless task that would take years to finish.
Leaving this here in case it's useful for the allocator wrapper (just skimmed it so far): https://rustmagazine.org/issue-3/task-stats-alloc/ Except, we would need per-thread metering as opposed to per-task. Could help us meter memory usage of the PVF job.
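For illustration, a minimal sketch of what per-thread metering with a global allocator wrapper could look like (all names are hypothetical; a real implementation, like the crate linked above, also has to handle the allocator being re-entered while thread-local storage is set up or torn down):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Net bytes (allocations minus deallocations) attributed to this thread.
    static THREAD_ALLOCATED: Cell<isize> = const { Cell::new(0) };
}

/// Hypothetical per-thread metering wrapper around the system allocator.
struct PerThreadMeter;

unsafe impl GlobalAlloc for PerThreadMeter {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let _ = THREAD_ALLOCATED.try_with(|c| c.set(c.get() + layout.size() as isize));
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        // If the memory was allocated on another thread, this thread's counter
        // goes negative -- one reason per-thread figures are hard to interpret
        // when the runtime spreads work across its own thread pool.
        let _ = THREAD_ALLOCATED.try_with(|c| c.set(c.get() - layout.size() as isize));
    }
}

#[global_allocator]
static ALLOC: PerThreadMeter = PerThreadMeter;

fn main() {
    let v = vec![0u8; 1024 * 1024];
    drop(v);
    THREAD_ALLOCATED.with(|c| println!("net bytes on this thread: {}", c.get()));
}
```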
Except, we would need per-thread metering as opposed to per-task.
That was my first thought, and that's what I was experimenting with for the last couple of days, and now I doubt it's the way to go.
The problem is that Wasmtime is multi-threaded by itself. So in the end we get a bunch of statistics counters from different threads, identified only by their ids, and we don't know what should be summed up and what shouldn't.
I'm evaluating other solutions. As the preparation worker is a separate process, it seems that per-process measurement, if accompanied by adjustable allocator checkpoints and limits, should give a very good approximation, and when worker binaries are separated it will be close to perfect.
Any thoughts and ideas are highly appreciated.
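For illustration, a rough sketch of the checkpoint-and-limit idea on top of a process-wide counter (all names and figures are hypothetical):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide bookkeeping; all names are hypothetical.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);
static CHECKPOINT: AtomicUsize = AtomicUsize::new(0);
static LIMIT: AtomicUsize = AtomicUsize::new(usize::MAX);

/// Record the current figure and arm a limit relative to it, e.g. right
/// before the preparation worker starts compiling a PVF.
fn checkpoint_with_limit(limit_bytes: usize) {
    CHECKPOINT.store(ALLOCATED.load(Ordering::Relaxed), Ordering::Relaxed);
    LIMIT.store(limit_bytes, Ordering::Relaxed);
}

struct LimitedAlloc;

unsafe impl GlobalAlloc for LimitedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let used = ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
        if used.saturating_sub(CHECKPOINT.load(Ordering::Relaxed)) > LIMIT.load(Ordering::Relaxed) {
            // Refuse the allocation. Rust's infallible allocation paths then
            // abort the worker process, which the host can report as
            // "preparation exceeded its memory limit".
            ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
            return std::ptr::null_mut();
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: LimitedAlloc = LimitedAlloc;

fn main() {
    checkpoint_with_limit(256 * 1024 * 1024); // 256 MiB, placeholder figure
    // ... run the preparation job here ...
}
```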
I'm evaluating other solutions. As the preparation worker is a separate process, it seems that per-process measurement, if accompanied by adjustable allocator checkpoints and limits, should give a very good approximation, and when worker binaries are separated it will be close to perfect.
Sounds reasonable, even with both workers being part of the same process it should work? As each process is always only doing one task?
Sounds reasonable, even with both workers being part of the same process it should work?
Currently, they are different OS processes (spawned by different calls to tokio::process::Command::spawn()), they're just spawned from the same binary. So they do not share the address space, and it should perfectly work for each of them separately.
Are we going to change that at some point?
Are we going to change that at some point?
Change what? I mean each validation/preparation always needs to be done in its own process. We need to be able to kill the workers if something is taking more time than allowed.
Change what? I mean each validation/preparation always needs to be done in its own process. We need to be able to kill the workers if something is taking more time than allowed.
Yeah, agreed, makes perfect sense, I just didn't understand that "even with both workers being part of the same process" part of your question :)
We could perhaps emulate this by having a global allocator wrapper and just tracking the sizes ourselves. It won't be as accurate since it wouldn't include the allocator's overhead/metadata, but on the other hand that'd make it more consistent (the figures reported would be the same on different allocators), so maybe it's not a bad thing.
Originally posted by @koute in https://github.com/paritytech/polkadot/issues/6675#issuecomment-1420729273
Related issue
https://github.com/paritytech/polkadot-sdk/issues/719