hgroover opened this issue 2 years ago (status: Open)
Interesting specs. What values did you try for e.g. _OSTREE_MAX_OUTSTANDING_FETCHER_REQUESTS? Relatedly, I think it could make sense to make that into a config knob. Also, to confirm, is the remote using static deltas? If so, try with --disable-static-deltas too.
The three values I've been changing are _OSTREE_MAX_OUTSTANDING_FETCHER_REQUESTS (3), _OSTREE_MAX_OUTSTANDING_DELTAPART_REQUESTS (1), and _OSTREE_MAX_OUTSTANDING_WRITE_REQUESTS (3).
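For context, these are compile-time constants in src/libostree/ostree-repo-private.h, so each change means rebuilding ostree; my local edit is simply the following (only the values differ from the stock header):

/* src/libostree/ostree-repo-private.h - locally adjusted limits */
#define _OSTREE_MAX_OUTSTANDING_FETCHER_REQUESTS 3
#define _OSTREE_MAX_OUTSTANDING_DELTAPART_REQUESTS 1
#define _OSTREE_MAX_OUTSTANDING_WRITE_REQUESTS 3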
The write requests limit was changed to 3 in a release soon after 2020.3. (I've gone through everything up to and including 2022.2; I think it was 2022.3 that moved to a newer GLib, and since I'm running a very old uclibc build, updating GLib is going to be a long process.) I've also used 4 for fetcher requests. What I've tried to do is limit threads, because part of my problem seems to be that on the kernel version I have, the oom-reaper dumps about one candidate thread per second regardless of the vm setting for the oom-killer dump, and if there are enough threads it SOMETIMES creates the conditions for an I/O thread to run out of memory before any process can be reaped - in short, a kind of oom-killer double-fault situation.
I'll try --disable-static-deltas as well. I have lots of logs, including the /var/log/messages output from the oom reaper. Thanks!
I actually ran into problems with the FETCHER_REQUESTS 3 setting; 4 or 5 seems to work better, while 3 will sometimes cause errors and early termination. I think my problem with oom-killer deaths is simply that ostree crosses about 10 MB of RSS, so I just need to adjust the vm settings. There's actually quite a lot of memory available (I've killed all nonessential processes). However, --disable-static-deltas makes a measurable difference as well; thanks for that. I will follow up once I've done more testing with modified vm settings.
Instrumenting the code has given me some insight into the problem. It's actually not a problem endemic to ostree, but in this case ostree definitely exacerbates it. In all cases I'm running with --disable-static-deltas (because I have none). When I get to around 1100 to 1300 completed fetch requests (with a total of over 1500 in the queue) I start running into the oom-killer death situation.

Interestingly, when the oom-killer targets ostree, I need to wait before retrying in my script. If I wait at least 5 seconds before running ostree pull again, I am likely to succeed; otherwise I'll get an oom-killer death straight away.

I've experimentally patched in an operation-ceiling option that lets me terminate after a certain number of completed fetches, and an anti-flood option that optionally injects usleep() calls between fetches. Ideally there would be a way to sleep when the number of fetches in flight exceeds a certain amount, but I'm approaching these changes with some caution; a sketch of that idea is below.

Attached is --verbose output with some of my modified debug messages showing the number of completed and queued fetches. The exit from the operation-ceiling option currently triggers an assert, which is not intentional, but this is just to show the volume and rate. I expect this is fairly normal for large change sets on systems with more memory, and it's quite possible this is exacerbating a kernel issue that was addressed after the 4.9.217 kernel I'm running. Interestingly, running without --verbose I am still likely to run afoul of the oom-killer, which suggests that the rate of I/O requests may also be to blame - hence the anti-flood option.

fwupdate-20221016-155642-1-output-dd6ea0b8b3f80a493d2142aa5e6bb8ad9f101054e3eb08b85276d91a5811c51c.log
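The anti-flood idea boils down to something like this (an illustrative sketch only; these names are mine and not real libostree symbols, and the actual patch hard-codes where the delay is injected):

#include <unistd.h>   /* usleep */
#include <glib.h>

/* Hypothetical throttle, called before queuing another fetch: if too
 * many fetches are already in flight, back off briefly so pending
 * writes can drain instead of letting the heap keep growing. */
static void
maybe_throttle_fetches (guint outstanding_fetches,
                        guint in_flight_threshold,
                        guint anti_flood_delay_ms)
{
  if (anti_flood_delay_ms == 0 || outstanding_fetches <= in_flight_threshold)
    return;
  usleep ((useconds_t) anti_flood_delay_ms * 1000);
}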
I've finished my investigation and concluded that there are a number of conditions required to manifest this particular problem:
The oom-killer death is not directly related to anything ostree is doing; rather, because fetch requests pile up quickly in the queue while larger fetches are taking place, we seem to get enough heap fragmentation that uncommitted pages get pulled in, and the I/O activity exhausts the available memory very quickly. The attached patch applies to 2020.3 and works around the problem in such extreme cases by implementing three options:

--large-file-size=<threshold-size-in-bytes> treats files with size >= the threshold specially: between completion of fetch and completion of write, the large file blocks all other fetch queuing and servicing.

--operation-ceiling=<count> limits the number of entries that are queued to the specified count. Hitting it currently triggers an assert when exiting from the main entry point, which results in an exit code of 134 (there is certainly a better way to do this, but the approach worked for my purposes); that exit code indicates another pull attempt is needed (for use within a script wrapper).

--anti-flood=<delay-ms> injects a usleep delay at critical junctures; without it we still tend to get oom-killer deaths when not running with --verbose.
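To make the first two options concrete, the decision logic amounts to roughly the following (a sketch only; the struct and function names are mine, not what the patch literally adds, and the anti-flood delay was sketched in my earlier comment):

#include <glib.h>

/* Sketch of the new knobs (names illustrative, not the patch's own). */
typedef struct {
  guint64  large_file_size;          /* --large-file-size: bytes, 0 = disabled */
  guint    operation_ceiling;        /* --operation-ceiling: max queued fetches, 0 = unlimited */
  gboolean large_write_in_progress;  /* set between fetch- and write-completion of a large object */
  guint    fetches_queued;
} ExtremeLowMemOpts;

/* Called when a fetch completes: decide whether this object should
 * block the queue until its write completes. */
static gboolean
is_large_object (const ExtremeLowMemOpts *opts, guint64 object_size)
{
  return opts->large_file_size > 0 && object_size >= opts->large_file_size;
}

/* Return TRUE if we should hold off queuing or servicing another fetch. */
static gboolean
should_block_fetch_queue (const ExtremeLowMemOpts *opts)
{
  /* A >= large-file-size object is between fetch and write: block everything else. */
  if (opts->large_write_in_progress)
    return TRUE;
  /* Ceiling reached: stop here and let the wrapper script re-run the pull. */
  if (opts->operation_ceiling > 0 && opts->fetches_queued >= opts->operation_ceiling)
    return TRUE;
  return FALSE;
}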
It's really a horrendous kluge, but it allows me to run to completion without playing oom-killer roulette. I doubt anyone else has run into this sort of extreme case, and there's probably a more elegant way to handle blocking on large requests. I am limited to using 2020.3 due to the signing methods available, but I'm open to suggestions on the best way to make this available. In any case, here is the patch.
Yeah, we clearly need backpressure here. Your patch is AFAICS implementing this by blocking the main thread, which probably works well enough for unattended device updates but anything linking libostree and doing anything else on that thread may be surprised at blocking. (Specifically e.g. on a tty we'll stop rendering fetch progress)
Hmm. This issue is really an extension of the problems with enqueuing too many http requests that caused us to add queuing for the HTTP fetching in c18628ecb8b2ef591db0444fa57052cba60807a8
Do you have a handy reproduction scenario for this? Does
diff --git a/src/libostree/ostree-repo-pull.c b/src/libostree/ostree-repo-pull.c
index 86b4358a..05fecdb8 100644
--- a/src/libostree/ostree-repo-pull.c
+++ b/src/libostree/ostree-repo-pull.c
@@ -150,6 +150,7 @@ static void enqueue_one_static_delta_superblock_request_s (OtPullData *
                                                             FetchDeltaSuperData *fetch_data);
 static void enqueue_one_static_delta_part_request_s (OtPullData *pull_data,
                                                      FetchStaticDeltaData *fetch_data);
+static void ensure_idle_queued (OtPullData *pull_data);
 
 static gboolean scan_one_metadata_object (OtPullData *pull_data,
                                           const char *checksum,
@@ -385,6 +386,8 @@ check_outstanding_requests_handle_error (OtPullData *pull_data,
           g_free (checksum);
         }
 
+      /* Finally, if we still have capacity, scan more metadata objects */
+      ensure_idle_queued (pull_data);
     }
 }
 
@@ -461,6 +464,9 @@ ensure_idle_queued (OtPullData *pull_data)
   if (pull_data->idle_src)
     return;
 
+  if (fetcher_queue_is_full (pull_data))
+    return;
+
   idle_src = g_idle_source_new ();
   g_source_set_callback (idle_src, idle_worker, pull_data, NULL);
   g_source_attach (idle_src, pull_data->main_context);
help?
OK filed as https://github.com/ostreedev/ostree/pull/2766
Thanks for the feedback; I wanted to follow up to say that I'm going to have a look at the suggested changes. As for my patch, it's really a hack, and of course when it comes back to the main thread it exits with an assert because I haven't set an error condition when exiting early. I still get oom-killer deaths occasionally, but far less often.

As for a reproduction scenario: this is a MIPS device running a 4.9 kernel with a total of 128 MB DRAM. A significant amount is reserved at boot time for CMA, so what I actually have available at runtime, after unloading video drivers etc., is around 70 MB. When I've observed the oom-killer death taking place, we basically had a large number of reads queued up, and at least one or two of them were single filesystem objects over 8 MB in size (for example, the kernel image). We would very quickly queue up more reads with signature verification and writes pending, then run out of memory in pending I/O operations, which would lead to filesystem corruption (not with every oom-killer death, just the ones that leave /ostree in an unusable state). By the time the app gets killed, we already have a corrupted filesystem. Since this happens on the pull, the current deployment is usable, but I have no way to recover /ostree and do a successful pull.

I'm currently stuck on 2020.3 due to build issues on buildroot, but I'll try backporting the commit to see if that makes a difference. I can reproduce this with the right combination of updates.
There is an article online that suggests ostree memory usage can be quite high:
https://witekio.com/de/blog-de/ostree-tutorial-system-updates/
Do we have any hard data, e.g. in a blog post, describing how much memory is typically required for upgrades on an el9 ostree-based system? If not, should we write one?
We don't have any hard data. In https://github.com/ostreedev/ostree/pull/2766 I found what I believe is one source of memory use proportional to O(files).
I think we can repurpose this issue for gathering data and metrics, and ideally scaling our queues roughly to available resources (CPU, network speed, I/O bandwidth, memory). We definitely don't want any unbounded queues.
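As a strawman for the "scale to available memory" part, something like the following could pick a fetcher queue depth at the start of a pull (purely illustrative and Linux-specific, not an existing libostree API; the per-request memory budget would itself need measuring):

#include <sys/sysinfo.h>  /* sysinfo(), Linux-specific */
#include <glib.h>

/* Strawman: derive a fetcher queue depth from currently free RAM,
 * assuming a rough per-request memory budget, clamped to sane bounds. */
static guint
pick_max_outstanding_fetches (guint64 per_request_budget_bytes)
{
  struct sysinfo si;
  if (sysinfo (&si) != 0 || per_request_budget_bytes == 0)
    return 8;  /* fall back to a fixed value if we can't tell */

  guint64 avail = (guint64) si.freeram * si.mem_unit;
  /* Only let the fetch queue account for a quarter of free memory. */
  guint64 depth = avail / (4 * per_request_budget_bytes);
  return (guint) CLAMP (depth, 2, 16);
}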
First unusual event: ostree pull (using libcurl, https) is reaped by the oom-killer on a very large delta pull. Kernel version: 4.9.217 (MIPS). Total available physical memory: 85 MB (physical DRAM is 128 MB, minus a large CMA chunk, minus the kernel and wpa_supplicant).
For the most part, I am able to run ostree pull successfully. However, if the total quantity of deltas is high, ostree is likely to use up all available memory and get reaped by the oom-killer. This happens less frequently if the thread limits in ostree-repo-private.h are set to lower values, but it is still guaranteed to happen on a large set of deltas. I am currently investigating running under ulimit as a workaround, but I have to ask whether there might be a better way to avoid this.
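The ulimit idea amounts to capping the address space before exec'ing ostree, so allocations fail with ENOMEM instead of the whole box going into oom-killer territory; roughly this, run in a forked child (a sketch only; the 48 MiB cap and the remote/branch names are just example values):

#include <sys/resource.h>
#include <unistd.h>

/* Sketch: in a forked child, cap the address space so a runaway pull
 * gets ENOMEM from malloc rather than dragging the system into the
 * oom-killer. 48 MiB is only an example value for this 85 MB system. */
static void
exec_pull_with_memory_cap (void)
{
  struct rlimit lim = { .rlim_cur = 48u << 20, .rlim_max = 48u << 20 };
  if (setrlimit (RLIMIT_AS, &lim) != 0)
    _exit (127);
  execlp ("ostree", "ostree", "pull", "myremote", "mybranch", (char *) NULL);
  _exit (127);  /* exec failed */
}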
In the usual case I get exit status 137 in a script and can retry the ostree pull repeatedly until it succeeds. In the worst case I get an oom-killer murder spree that starts terminating every process on the system; while it is looking for victims, an I/O thread fails and causes ubifs corruption, leaving /ostree in a read-only state. In short, I need to avoid oom-killer deaths. I have very little memory on this system and enough NAND to pull the update (but not enough for a second rootfs).
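For completeness, the retry logic in my wrapper is conceptually just this (expressed here as a C sketch; the command string and attempt count are example values, not my exact invocation):

#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Sketch of the wrapper logic: keep re-running the pull while it comes
 * back with exit status 137 (128 + SIGKILL, i.e. reaped by the
 * oom-killer), sleeping a few seconds between attempts. */
static int
pull_with_retry (int max_attempts)
{
  const char *cmd = "ostree pull --disable-static-deltas myremote mybranch";
  for (int i = 0; i < max_attempts; i++)
    {
      int status = system (cmd);
      if (status == -1)
        return -1;                              /* couldn't run the shell */
      if (WIFEXITED (status) && WEXITSTATUS (status) == 0)
        return 0;                               /* pull succeeded */
      if (!(WIFEXITED (status) && WEXITSTATUS (status) == 137))
        return -1;                              /* some other failure: give up */
      sleep (5);                                /* wait out the oom-killer, then retry */
    }
  return -1;
}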