spinframework / spin-trigger-cron

Apache License 2.0
7 stars 8 forks source link

Spin trigger-cron is reserving 100% CPU usage. #31

Open seungjin opened 3 months ago

seungjin commented 3 months ago
$ spin --version
spin 3.2.0 (c9be6d8 2025-03-24)
$ spin plugins list | grep cron
trigger-cron 0.2.0 [installed]

Spin trigger-cron is reserving 100% CPU usage. The fastest cron job I'm running is every hour, but it keeps using 100% of the CPU

From htop:

    PID USER       PRI  NI  VIRT   RES   SHR S  CPU%▽MEM%   TIME+  Command
 454838 seungjin    20   0 4035G  131M 20740 S 100.5  1.7  0:41.63 /home/seungjin/.local/share/spin/plugins/trigger-cron/trigger-cron --runtime-config-file runtime-config.toml
 454848 seungjin    20   0 4035G  131M 20740 R 100.5  1.7  0:40.46 /home/seungjin/.local/share/spin/plugins/trigger-cron/trigger-cron --runtime-config-file runtime-config.toml

I use the released spin version (downloaded from github) and installed trigger-cron with spin plugins install --url https://github.com/fermyon/spin-trigger-cron/releases/download/canary/trigger-cron.json

It is Debian GNU/Linux 12 (bookworm)

seungjin commented 3 months ago

My laptop that runs spin 3.3.0-pre0 (bde256d2 2025-04-10) does not have that 100% cpu usage problem.

seungjin commented 3 months ago

Tested released spin 3.2.0 (c9be6d8 2025-03-24) with my other VPS server, but I don't see that 100% cpu load issue. Seems my that specific VPS issue. Or do you know any case trigger-cron can consume 100% cpu load?

seungjin commented 3 months ago

Seems it is related to libc.so.6: version `GLIBC_2.38' (Debian 12).

itowlson commented 3 months ago

Thanks for flagging this. I've transferred it to the spin-trigger-cron repo.

seungjin commented 3 months ago

So I tried the same environment under a Fedora container, which comes with fairly new libraries.

$ spin up --runtime-config-file runtime-config.toml
Using runtime config [key_value_store.mem: spin], [sqlite_database.mem: spin], [sqlite_database.lachuoi: libsql], [sqlite_database.geoname: libsql], [sqlite_database.feedpub: libsql] from "runtime-config.toml"
Logging component stdio to ".spin/logs/"
Using runtime config [key_value_store.mem: spin], [sqlite_database.mem: spin], [sqlite_database.lachuoi: libsql], [sqlite_database.geoname: libsql], [sqlite_database.feedpub: libsql] from "runtime-config.toml"
Logging component stdio to ".spin/logs/"

Serving http://127.0.0.1:3000
Available Routes:
  geoip: http://127.0.0.1:3000/geoip (wildcard)
  webfinger: http://127.0.0.1:3000/.wellknown/webfinger
  root: http://127.0.0.1:3000
  image-description: http://127.0.0.1:3000/image/description (wildcard)
  assets: http://127.0.0.1:3000/static (wildcard)
  random-place: http://127.0.0.1:3000/random-place (wildcard)
  ifconfig: http://127.0.0.1:3000/ifconfig (wildcard)
Preparing Wasm modules is taking a few seconds...

It seems all HTTP triggers are up and running, but the trigger-cron is stuck and not starting, and it’s using 100% CPU. I can use the HTTP-triggered app and get the proper returns

seungjin commented 3 months ago

That only happens my staging(a sort of) server. On my laptop or other my dev vms do not have that issue. Does trigger-cron have some dependency? (I am digging now).

seungjin commented 3 months ago

I have one VM running Debian 12, which comes with an old version of GLIBC. Spin works there, but trigger-cron does not. I then launched a Fedora Toolbox (https://containertoolbx.org/), but the same issue occurred: Spin works, but trigger-cron does not start. Next, I created a full OCI container image and ran it on the same machine. That works fine. Most users are using OCI containers and their images, so they won’t encounter this issue. However, we still have the case where Spin works, but trigger-cron doesn’t start. I’m not sure exactly what caused this (even with Fedora Toolbox—this is probably a separate issue, as it’s related to Toolbox). ~~I won’t investigate further because it seems to be a rare case, and I can simply use an OCI image-based instance. Just recording this for future reference.~~

You may close this issue or I will close in 24 hours. Thanks!

seungjin commented 3 months ago

Oh.. probably,, we can give some nice logging message or timeout when trigger-cron takes too long to start.

seungjin commented 3 months ago

Create an OCI image. The image works on my laptop, but it doesn't work (only spin-trigger-cron) on my staging VM. Currently, I can say that there is a VM that does not run spin-trigger-cron even with a container. It’s possibly not an issue with trigger-cron, but with the VM; I can't be sure for now.

itowlson commented 3 months ago

@karthik2804 I think you know the cron trigger better than others - are you tracking this? Do you have any ideas on what might be happening or what @seungjin could try in order to diagnose it? Thanks!

seungjin commented 3 months ago

@karthik2804 Is there any flag (e.g., -v) that can show logs or verbose messages? It seems to be stuck on the message 'Preparing Wasm modules is taking a few seconds...' and I have no idea what's going on inside. There are no logs or messages.

itowlson commented 3 months ago

@seungjin You can set the RUST_LOG environment variable e.g. RUST_LOG=trace spin up. This will likely produce a lot of output though. I think you can filter it to Spin crates via RUST_LOG=spin=trace, but you may need to also trace into the wasmtime engine e.g. RUST_LOG=spin=trace,wasmtime=debug.

(https://docs.rs/env_logger/latest/env_logger/#enabling-logging)

seungjin commented 3 months ago

RUST_LOG=trace How come I completely forgot about this! Thank you!!! 🙇🙇🙇

RUST_LOG=spin=trace,wasmtime=debug

seungjin commented 3 months ago

~Found when it happens:~

~I use SQLite and a key-value store in memory for caching purposes. This module is triggered by an HTTP request.~ ~https://github.com/seungjin/lachuoi/blob/57a44fad1faf8c990d6fee63a010b0709703f49c/runtime-config.toml#L1~ ~https://github.com/seungjin/lachuoi/blob/9c71a6fe35d8692ccc5fb82963819c7ad6b4458f/spin.dev.toml#L135~

~This issue occurs only when the following conditions are met:~ ~(1) The HTTP trigger module is using SQLite/key-value storage in memory.~ ~(2) spin-trigger-cron is running alongside (1).~ ~(3) This specific VM (VPS) is being used—seems this VM has some peculiar behavior as well.~

~When all three conditions (1), (2), and (3) are met simultaneously, spin-trigger-cron gets stuck. ~ ~It seems that this VM is also experiencing some unusual behavior with its memory management.~

~Is there any chance that the HTTP trigger's storage in memory could freeze spin-trigger-cron?~

seungjin commented 3 months ago

Give me some time.. Feeling I am getting closer :-)

seungjin commented 3 months ago

Ok. I made very simple point. Forget about all config and system thing I wrote above for now. Simply let's talk with code.

This code not working in my staging VM -This is the one I have had issue. This app is http triggered one. Nothing to do with my cron triggered app. https://github.com/seungjin/lachuoi/blob/a3a28f5b3990b882b27ed40b617807c8ba293a49/apps/random-place/src/lib.rs#L16 I use async router in this module/wasm/microservice. It blocked my cron trigger starting.

I changed to https://github.com/seungjin/lachuoi/blob/4977d87e139a2402d15e7adbcf8c569396096f59/apps/random-place/src/lib.rs#L16 Non async router. Now this module works well with cron trigger.

I don't know what caused this internally. Probably it is your turn @karthik2804 telling me why this happens ;-)

Also do you have any idea why my laptop and other VMs were working fine but only this specific VM is stuck with this?

Thanks in advance.

itowlson commented 3 months ago

Thanks for the update @seungjin. Karthik will hopefully be on line in a few hours.

In the meantime, may I check what Spin environment your staging VM is running - Spin CLI or SpinKube. The reason I ask is that it's startling for a change in your HTTP component to affect loading in a separate process (the cron trigger process). Which made me want to double check if you were using SpinKube, where the two triggers are cohosted in the same process.

Thanks for all the investigation - hopefully we can get to the bottom of this.

seungjin commented 3 months ago

It was Spin CLI. Not SpinKube that I wlll play with for production later.

seungjin@orange-jellyfish-13406:~$ ps aux | grep spin
seungjin    8474  0.0  0.2 568496 24292 ?        Ssl  04:38   0:00 spin up --runtime-config-file runtime-config.toml
seungjin    8497  0.6  0.9 4231601772 76068 ?    Sl   04:38   0:04 /usr/local/bin/spin trigger http --runtime-config-file runtime-config.toml
seungjin    8502  0.0  0.2 433176 20320 ?        Sl   04:38   0:00 /usr/local/bin/spin trigger-cron --runtime-config-file runtime-config.toml
seungjin    8511  0.6  0.6 4231362388 54648 ?    Sl   04:38   0:04 /home/seungjin/.local/share/spin/plugins/trigger-cron/trigger-cron --runtime-config-file runtime-config.toml

As you see, I have two different processes - spin trigger http, spin trigger-cron

One thing you should consider is this happens only in this VM.

seungjin commented 3 months ago

~Oh, one thing I can mention is the microservice architecture. Basically the services triggered by cron are calling the Spin HTTP triggers (http request to localhost:3000).~

seungjin commented 2 months ago

I updated my random-place source to fix a bug—nothing system-related, just a logic bug. After that, the service not starting properly on my staging box with cron-trigger. I didn’t touch anything related to async this time. New code here: https://github.com/seungjin/lachuoi/blob/a87b2e81f40c1abce6d638e4465d47d0db125b5b/apps/random-place/src/lib.rs

This is a very interesting case: (1) Even with Docker/Containers, this code doesn’t work only on my staging box but works fine on my laptop (both X86). This shouldn’t happen given the nature of containerization. (2) It also doesn’t work with Wasm, which is another sandboxed/isolated environment. It should follow the same principles as containers.

So, this case is very puzzling. How can a simple code change affect these sandboxed/isolated environments?

itowlson commented 2 months ago

I have to admit I'm at a loss. I know @karthik2804 looked at the issue last week, but I don't believe he was able to reproduce or identify any possible cause. Your other findings make this even more mystifying.

Taking stock

Let me recap to check my understanding (sorry I have been away and my head is spinning trying to catch up):

And a question

I notice in your latest code you have this line:

Connection::open("geoname").expect("geoname libsql connection error")

https://github.com/seungjin/lachuoi/blob/a87b2e81f40c1abce6d638e4465d47d0db125b5b/apps/random-place/src/lib.rs#L30

But the spin.toml component does not have access to the geoname label:

[component.random-place]
source = "target/wasm32-wasip1/release/random_place.wasm"
allowed_outbound_hosts = ["https://geoname-seungjin.turso.io"]

[component.random-place.build]
command = "cargo build --target wasm32-wasip1 --release"
watch = ["src/**/*.rs", "Cargo.toml"]

https://github.com/seungjin/lachuoi/blob/a87b2e81f40c1abce6d638e4465d47d0db125b5b/apps/random-place/spin.toml#L13

What happens if you change the expect to a normal error return (so that it doesn't panic)? What happens if you give the component permission to geoname (sqlite_databases = ["geoname"]) (and map it in your runtime-config file)?

(Although this shouldn't affect startup, nor vary from machine to machine.)

seungjin commented 2 months ago

Image

I applied following changes: (1) Use .unwrap() instead .expect() : https://github.com/seungjin/lachuoi/commit/60898d5b325a45094ce0930688462d89f8d3528d (2) Set sqlite_databases = ["geoname"] to place-random app: https://github.com/seungjin/lachuoi/blob/8e2675f31ac255810890a9e78b21c77f44947191/spin.toml#L138 -> Result: The same. Seems that does not help much.

App's name changed somewhat (random-place to place-random) since the first port, but probably you can follow its content without big issue.

Thank you!

Screenshot from Htop attached.

itowlson commented 2 months ago

Thanks for the update. Are you able to deploy just the cron bits to the staging VM? If so, does it still exhibit the 100% CPU behaviour?

If cron alone hits 100%: if you deploy a single cron component to the staging VM, does it still exhibit the 100% CPU behaviour?

itowlson commented 2 months ago

Also are there any differences between the staging VM and other VMs where it works, e.g. different operating system?

seungjin commented 2 months ago

I will do some generalization with the code and test it with different cases.

The VM is my concern. It’s a machine with 8 GB of RAM and 4 cores, as specified in the contract. It is KVM and Debian 12 on it. I use cheap VMs. So this machine is also not from a major vendor—it's actually one of the cheapest I have. It's a 'lifetime' server, which means I paid once upfront and the VPS provider offers the VM for my lifetime (under certain limitations and rules). This also means the provider might be doing various things to reduce resource usage and running costs. So, there could be some hidden tricks they're using to limit its resources. I’ve checked all the possible ways they might be limiting my VM but haven’t found anything yet. My other cheap VPS vendors' VMs work fine with the code. Only this hosting company—by the way, it's zap-hosting.com, a German provider typically known for hosting game servers—has this issue.

Regardless of this technical disadvantage, the reason WebAssembly (WASM) is gaining popularity is its isolation. So even if the host provider is using tricks, I believe it should still run—or at least give us some clue or message about what’s going on.

Thank you!

seungjin commented 2 months ago

BTW, the minimal environment would consist of two services: (1) place-random – an HTTP-triggered function, URL: /places/random (2) cron-task (e.g., mstd-random-cafe in my app) – a cron-triggered function that makes an HTTP request to place-random at /places/random.

The place-random service works fine on its own. However, the cron-task app is triggered on a schedule and sends an HTTP request to place-random (e.g., http://localhost:3000/places/random).

So, the minimal working environment would require both apps running together.

itowlson commented 2 months ago

Could you just confirm for us whether or not the cron task running on its own does sets off the 100% usage please? Thanks! I realise it will fail without the HTTP component - I just want to get a handle on if it pegs to 100% CPU while doing so.

If you change the HTTP component called by the cron task to be a no-op (return "hello world" without doing anything), does that make any difference?

seungjin commented 2 months ago

Disabled all my http triggered apps in Spin.toml (commented) and leave only 5 cron triggered apps. Yes. It still get 100% cpu load. So seems it is trigger alone issue.

Beside that, I am typically using spin-rust-sdk with most recent main branch but my apps that use rust-sdk's main branch do not work with spin-trigger-cron (some import error).

seungjin commented 2 months ago

~~Wait.. Disabled all crons but left only one cron app. Then I don't see 100% load. I have total 5 croned apps now.~~

seungjin commented 2 months ago

Correction!

It(100% cpu load) does not happen only with Cron triggered apps. Only happens when I run with my that place-random http triggered app.