Open seungjin opened 3 months ago
My laptop that runs spin 3.3.0-pre0 (bde256d2 2025-04-10)
does not have that 100% cpu usage problem.
Tested released spin 3.2.0 (c9be6d8 2025-03-24) with my other VPS server, but I don't see that 100% cpu load issue. Seems my that specific VPS issue. Or do you know any case trigger-cron can consume 100% cpu load?
Seems it is related to libc.so.6: version `GLIBC_2.38' (Debian 12).
Thanks for flagging this. I've transferred it to the spin-trigger-cron
repo.
So I tried the same environment under a Fedora container, which comes with fairly new libraries.
$ spin up --runtime-config-file runtime-config.toml
Using runtime config [key_value_store.mem: spin], [sqlite_database.mem: spin], [sqlite_database.lachuoi: libsql], [sqlite_database.geoname: libsql], [sqlite_database.feedpub: libsql] from "runtime-config.toml"
Logging component stdio to ".spin/logs/"
Using runtime config [key_value_store.mem: spin], [sqlite_database.mem: spin], [sqlite_database.lachuoi: libsql], [sqlite_database.geoname: libsql], [sqlite_database.feedpub: libsql] from "runtime-config.toml"
Logging component stdio to ".spin/logs/"
Serving http://127.0.0.1:3000
Available Routes:
geoip: http://127.0.0.1:3000/geoip (wildcard)
webfinger: http://127.0.0.1:3000/.wellknown/webfinger
root: http://127.0.0.1:3000
image-description: http://127.0.0.1:3000/image/description (wildcard)
assets: http://127.0.0.1:3000/static (wildcard)
random-place: http://127.0.0.1:3000/random-place (wildcard)
ifconfig: http://127.0.0.1:3000/ifconfig (wildcard)
Preparing Wasm modules is taking a few seconds...
It seems all HTTP triggers are up and running, but the trigger-cron is stuck and not starting, and it’s using 100% CPU. I can use the HTTP-triggered app and get the proper returns
That only happens my staging(a sort of) server. On my laptop or other my dev vms do not have that issue. Does trigger-cron have some dependency? (I am digging now).
I have one VM running Debian 12, which comes with an old version of GLIBC. Spin works there, but trigger-cron does not.
I then launched a Fedora Toolbox (https://containertoolbx.org/), but the same issue occurred: Spin works, but trigger-cron does not start.
Next, I created a full OCI container image and ran it on the same machine. That works fine.
Most users are using OCI containers and their images, so they won’t encounter this issue.
However, we still have the case where Spin works, but trigger-cron doesn’t start.
I’m not sure exactly what caused this (even with Fedora Toolbox—this is probably a separate issue, as it’s related to Toolbox).
~~I won’t investigate further because it seems to be a rare case, and I can simply use an OCI image-based instance.
Just recording this for future reference.~~
You may close this issue or I will close in 24 hours.
Thanks!
Oh.. probably,, we can give some nice logging message or timeout when trigger-cron takes too long to start.
Create an OCI image. The image works on my laptop, but it doesn't work (only spin-trigger-cron) on my staging VM. Currently, I can say that there is a VM that does not run spin-trigger-cron even with a container. It’s possibly not an issue with trigger-cron, but with the VM; I can't be sure for now.
@karthik2804 I think you know the cron trigger better than others - are you tracking this? Do you have any ideas on what might be happening or what @seungjin could try in order to diagnose it? Thanks!
@karthik2804 Is there any flag (e.g., -v) that can show logs or verbose messages? It seems to be stuck on the message 'Preparing Wasm modules is taking a few seconds...' and I have no idea what's going on inside. There are no logs or messages.
@seungjin You can set the RUST_LOG
environment variable e.g. RUST_LOG=trace spin up
. This will likely produce a lot of output though. I think you can filter it to Spin crates via RUST_LOG=spin=trace
, but you may need to also trace into the wasmtime
engine e.g. RUST_LOG=spin=trace,wasmtime=debug
.
(https://docs.rs/env_logger/latest/env_logger/#enabling-logging)
RUST_LOG=trace
How come I completely forgot about this! Thank you!!! 🙇🙇🙇
RUST_LOG=spin=trace,wasmtime=debug
~Found when it happens:~
~I use SQLite and a key-value store in memory for caching purposes. This module is triggered by an HTTP request.~ ~https://github.com/seungjin/lachuoi/blob/57a44fad1faf8c990d6fee63a010b0709703f49c/runtime-config.toml#L1~ ~https://github.com/seungjin/lachuoi/blob/9c71a6fe35d8692ccc5fb82963819c7ad6b4458f/spin.dev.toml#L135~
~This issue occurs only when the following conditions are met:~ ~(1) The HTTP trigger module is using SQLite/key-value storage in memory.~ ~(2) spin-trigger-cron is running alongside (1).~ ~(3) This specific VM (VPS) is being used—seems this VM has some peculiar behavior as well.~
~When all three conditions (1), (2), and (3) are met simultaneously, spin-trigger-cron gets stuck. ~ ~It seems that this VM is also experiencing some unusual behavior with its memory management.~
~Is there any chance that the HTTP trigger's storage in memory could freeze spin-trigger-cron?~
Give me some time.. Feeling I am getting closer :-)
Ok. I made very simple point. Forget about all config and system thing I wrote above for now. Simply let's talk with code.
This code not working in my staging VM -This is the one I have had issue. This app is http triggered one. Nothing to do with my cron triggered app. https://github.com/seungjin/lachuoi/blob/a3a28f5b3990b882b27ed40b617807c8ba293a49/apps/random-place/src/lib.rs#L16 I use async router in this module/wasm/microservice. It blocked my cron trigger starting.
I changed to https://github.com/seungjin/lachuoi/blob/4977d87e139a2402d15e7adbcf8c569396096f59/apps/random-place/src/lib.rs#L16 Non async router. Now this module works well with cron trigger.
I don't know what caused this internally. Probably it is your turn @karthik2804 telling me why this happens ;-)
Also do you have any idea why my laptop and other VMs were working fine but only this specific VM is stuck with this?
Thanks in advance.
Thanks for the update @seungjin. Karthik will hopefully be on line in a few hours.
In the meantime, may I check what Spin environment your staging VM is running - Spin CLI or SpinKube. The reason I ask is that it's startling for a change in your HTTP component to affect loading in a separate process (the cron trigger process). Which made me want to double check if you were using SpinKube, where the two triggers are cohosted in the same process.
Thanks for all the investigation - hopefully we can get to the bottom of this.
It was Spin CLI
. Not SpinKube that I wlll play with for production later.
seungjin@orange-jellyfish-13406:~$ ps aux | grep spin
seungjin 8474 0.0 0.2 568496 24292 ? Ssl 04:38 0:00 spin up --runtime-config-file runtime-config.toml
seungjin 8497 0.6 0.9 4231601772 76068 ? Sl 04:38 0:04 /usr/local/bin/spin trigger http --runtime-config-file runtime-config.toml
seungjin 8502 0.0 0.2 433176 20320 ? Sl 04:38 0:00 /usr/local/bin/spin trigger-cron --runtime-config-file runtime-config.toml
seungjin 8511 0.6 0.6 4231362388 54648 ? Sl 04:38 0:04 /home/seungjin/.local/share/spin/plugins/trigger-cron/trigger-cron --runtime-config-file runtime-config.toml
As you see, I have two different processes - spin trigger http
, spin trigger-cron
One thing you should consider is this happens only in this VM.
~Oh, one thing I can mention is the microservice architecture. Basically the services triggered by cron are calling the Spin HTTP triggers (http request to localhost:3000).~
I updated my random-place source to fix a bug—nothing system-related, just a logic bug. After that, the service not starting properly on my staging box with cron-trigger. I didn’t touch anything related to async this time. New code here: https://github.com/seungjin/lachuoi/blob/a87b2e81f40c1abce6d638e4465d47d0db125b5b/apps/random-place/src/lib.rs
This is a very interesting case: (1) Even with Docker/Containers, this code doesn’t work only on my staging box but works fine on my laptop (both X86). This shouldn’t happen given the nature of containerization. (2) It also doesn’t work with Wasm, which is another sandboxed/isolated environment. It should follow the same principles as containers.
So, this case is very puzzling. How can a simple code change affect these sandboxed/isolated environments?
I have to admit I'm at a loss. I know @karthik2804 looked at the issue last week, but I don't believe he was able to reproduce or identify any possible cause. Your other findings make this even more mystifying.
Taking stock
Let me recap to check my understanding (sorry I have been away and my head is spinning trying to catch up):
And a question
I notice in your latest code you have this line:
Connection::open("geoname").expect("geoname libsql connection error")
But the spin.toml
component does not have access to the geoname
label:
[component.random-place]
source = "target/wasm32-wasip1/release/random_place.wasm"
allowed_outbound_hosts = ["https://geoname-seungjin.turso.io"]
[component.random-place.build]
command = "cargo build --target wasm32-wasip1 --release"
watch = ["src/**/*.rs", "Cargo.toml"]
What happens if you change the expect
to a normal error return (so that it doesn't panic)? What happens if you give the component permission to geoname
(sqlite_databases = ["geoname"]
) (and map it in your runtime-config file)?
(Although this shouldn't affect startup, nor vary from machine to machine.)
I applied following changes: (1) Use .unwrap() instead .expect() : https://github.com/seungjin/lachuoi/commit/60898d5b325a45094ce0930688462d89f8d3528d (2) Set sqlite_databases = ["geoname"] to place-random app: https://github.com/seungjin/lachuoi/blob/8e2675f31ac255810890a9e78b21c77f44947191/spin.toml#L138 -> Result: The same. Seems that does not help much.
App's name changed somewhat (random-place to place-random) since the first port, but probably you can follow its content without big issue.
Thank you!
Screenshot from Htop attached.
Thanks for the update. Are you able to deploy just the cron bits to the staging VM? If so, does it still exhibit the 100% CPU behaviour?
If cron alone hits 100%: if you deploy a single cron component to the staging VM, does it still exhibit the 100% CPU behaviour?
Also are there any differences between the staging VM and other VMs where it works, e.g. different operating system?
I will do some generalization with the code and test it with different cases.
The VM is my concern. It’s a machine with 8 GB of RAM and 4 cores, as specified in the contract. It is KVM and Debian 12 on it. I use cheap VMs. So this machine is also not from a major vendor—it's actually one of the cheapest I have. It's a 'lifetime' server, which means I paid once upfront and the VPS provider offers the VM for my lifetime (under certain limitations and rules). This also means the provider might be doing various things to reduce resource usage and running costs. So, there could be some hidden tricks they're using to limit its resources. I’ve checked all the possible ways they might be limiting my VM but haven’t found anything yet. My other cheap VPS vendors' VMs work fine with the code. Only this hosting company—by the way, it's zap-hosting.com, a German provider typically known for hosting game servers—has this issue.
Regardless of this technical disadvantage, the reason WebAssembly (WASM) is gaining popularity is its isolation. So even if the host provider is using tricks, I believe it should still run—or at least give us some clue or message about what’s going on.
Thank you!
BTW, the minimal environment would consist of two services: (1) place-random – an HTTP-triggered function, URL: /places/random (2) cron-task (e.g., mstd-random-cafe in my app) – a cron-triggered function that makes an HTTP request to place-random at /places/random.
The place-random service works fine on its own. However, the cron-task app is triggered on a schedule and sends an HTTP request to place-random (e.g., http://localhost:3000/places/random).
So, the minimal working environment would require both apps running together.
Could you just confirm for us whether or not the cron task running on its own does sets off the 100% usage please? Thanks! I realise it will fail without the HTTP component - I just want to get a handle on if it pegs to 100% CPU while doing so.
If you change the HTTP component called by the cron task to be a no-op (return "hello world" without doing anything), does that make any difference?
Disabled all my http triggered apps in Spin.toml (commented) and leave only 5 cron triggered apps. Yes. It still get 100% cpu load. So seems it is trigger alone issue.
Beside that, I am typically using spin-rust-sdk with most recent main branch but my apps that use rust-sdk's main branch do not work with spin-trigger-cron (some import error).
~~Wait.. Disabled all crons but left only one cron app. Then I don't see 100% load. I have total 5 croned apps now.~~
Correction!
It(100% cpu load) does not happen only with Cron triggered apps. Only happens when I run with my that place-random http triggered app.
Spin trigger-cron is reserving 100% CPU usage. The fastest cron job I'm running is every hour, but it keeps using 100% of the CPU
From
htop
:I use the released spin version (downloaded from github) and installed trigger-cron with
spin plugins install --url https://github.com/fermyon/spin-trigger-cron/releases/download/canary/trigger-cron.json
It is Debian GNU/Linux 12 (bookworm)