jhuckaby / Cronicle

A simple, distributed task scheduler and runner with a web based UI.
http://cronicle.net

Cronicle server crashed suddenly #570

Open harshgpt816 opened 1 year ago

harshgpt816 commented 1 year ago

Summary

Hello again,

Today I noticed my Cronicle cluster crashed after the changes I made yesterday; it was all working correctly the last time I tested it (yesterday I added MinIO as storage).

Now, when I checked my cluster, which had been running for 5 days (since creation), it appeared offline. On digging further I found an earlier issue that suggested using export statements to print the logs, since in my case there is no crash log file (ref: https://github.com/jhuckaby/Cronicle/issues/529).

Steps to reproduce the problem

I issued the following commands:

export CRONICLE_echo=1
export CRONICLE_foreground=1
/opt/cronicle/bin/control.sh restart

and here is the output:

[1674287478.064][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][2][Cronicle v0.9.19 Starting Up][{"pid":3129090,"ppid":3129085,"node":"v16.19.0","arch":"x64","platform":"linux","argv":["/usr/bin/node","/opt/cronicle/lib/main.js"],"execArgv":[]}]
[1674287478.065][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][1][WARNING: An old PID File was found: logs/cronicled.pid: 3129060][]
[1674287478.065][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][2][Old process 3129060 is apparently dead, so the PID file will be replaced: logs/cronicled.pid][]
[1674287478.066][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][9][Writing PID File: logs/cronicled.pid: 3129090][]
[1674287478.066][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][9][Confirmed PID File contents: logs/cronicled.pid: 3129090][]
[1674287478.066][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][2][Server IP: 192.168.5.34, Daemon PID: 3129090][]
[1674287478.067][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][3][Starting component: Storage][]
[1674287478.067][2023-01-21 13:21:18][vm-cron-01][3129090][Storage][debug][2][Setting up storage system v3.1.3][]
[1674287478.068][2023-01-21 13:21:18][vm-cron-01][3129090][Filesystem][debug][2][Setting up filesystem storage][]
[1674287478.068][2023-01-21 13:21:18][vm-cron-01][3129090][Filesystem][debug][3][Base directory: data][]
[1674287478.068][2023-01-21 13:21:18][vm-cron-01][3129090][Cronicle][debug][3][Starting component: WebServer][]
node:internal/fs/utils:347
    throw err;
    ^

Error: ENOSPC: no space left on device, open 'logs/crash.log'
    at Object.openSync (node:fs:590:3)
    at Object.writeFileSync (node:fs:2202:35)
    at Object.appendFileSync (node:fs:2264:6)
    at EventEmitter.<anonymous> (/opt/cronicle/node_modules/pixl-server/server.js:179:9)
    at EventEmitter.emit (node:events:513:28)
    at process.<anonymous> (/opt/cronicle/node_modules/uncatch/uncatch.js:20:11)
    at process.emit (node:events:513:28)
    at process._fatalException (node:internal/process/execution:149:25) {
  errno: -28,
  syscall: 'open',
  code: 'ENOSPC',
  path: 'logs/crash.log'
}
/opt/cronicle/bin/control.sh start: Cronicle Daemon could not be started

Your Setup

As I checked further, the server has plenty of free storage and the MinIO backend is also working fine; I also verified this by uploading a file manually to the MinIO bucket and faced no issue.

Here is further diagnosis I did to check the cron vm:

root@vm-cron-01:/home/pi# df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/pve-vm--198--disk--0   49G   17G   30G  36% /
none                              492K  4.0K  488K   1% /dev
tmpfs                              32G     0   32G   0% /dev/shm
tmpfs                              13G  100K   13G   1% /run
tmpfs                             5.0M     0  5.0M   0% /run/lock
tmpfs                             6.3G     0  6.3G   0% /run/user/1000
root@vm-cron-01:/home/pi# ls /opt/cronicle/logs/
Cronicle.log  Filesystem.log  Storage.log  Transaction.log  archives  cronicled.pid  jobs

Operating system and version?

Ubuntu 22.04 container in Proxmox

Node.js version?

Node.js 16

Cronicle software version?

Are you using a multi-server setup, or just a single server?

Are you using the filesystem as back-end storage, or S3/Couchbase?

As I checked, the server had been running fine for 5 days since creation while it was using the filesystem as the storage backend. Last evening I changed it to the MinIO backend and it was working fine for some time (per my checking after restarting it).

Can you reproduce the crash consistently?

Log Excerpts

harshgpt816 commented 1 year ago

Looking at the Cronicle.log file I can see it seems to have crashed at 8:55 AM today (I applied the storage change to MinIO last night, ~7 PM).

[1674271504.151][2023-01-21 08:55:04][vm-cron-01][2314463][Cronicle][debug][9][Chose server: vm-cron-01 via algo: random][]
[1674271504.151][2023-01-21 08:55:04][vm-cron-01][2314463][Cronicle][debug][6][Launching local job][{"category":"clcytoutm4n","plugin":"testplug","target":"allgrp","retries":0,"retry_delay":30,"timeout":65,"timezone":"Asia/Calcutta","params":{"duration":"30","progress":1,"burn":0,"action":"Success","secret":"Will not be shown in Event UI"},"now":1674271500,"id":"jld5e0wlzg2","time_start":1674271504.151,"hostname":"vm-cron-01","event":"elcyv76tvgh","event_title":"test-event-003","plugin_title":"Test Plugin","category_title":"CAT-002-SAMPLE","nice_target":"All Servers","command":"bin/test-plugin.js","log_file":"/opt/cronicle/logs/jobs/jld5e0wlzg2.log"}]
[1674271504.151][2023-01-21 08:55:04][vm-cron-01][2314463][Cronicle][debug][9][Child spawn options:][{"cwd":"/opt/cronicle","uid":0,"gid":0,"env":{"SUDO_GID":"1000","MAIL":"/var/mail/root","USER":"root","HOME":"/root","OLDPWD":"/home/pi","SUDO_UID":"1000","LOGNAME":"root","TERM":"unknown","PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin","LANG":"C","SUDO_COMMAND":"/opt/cronicle/bin/control.sh start","SHELL":"/bin/bash","SUDO_USER":"pi","PWD":"/opt/cronicle","__daemon":"true","CRONICLE":"0.9.19","JOB_ID":"jld5e0wlzg2","JOB_LOG":"/opt/cronicle/logs/jobs/jld5e0wlzg2.log","JOB_NOW":"1674271500","JOB_CATEGORY":"clcytoutm4n","JOB_PLUGIN":"testplug","JOB_TARGET":"allgrp","JOB_RETRIES":"0","JOB_RETRY_DELAY":"30","JOB_TIMEOUT":"65","JOB_TIMEZONE":"Asia/Calcutta","JOB_TIME_START":"1674271504.151","JOB_HOSTNAME":"vm-cron-01","JOB_EVENT":"elcyv76tvgh","JOB_EVENT_TITLE":"test-event-003","JOB_PLUGIN_TITLE":"Test Plugin","JOB_CATEGORY_TITLE":"CAT-002-SAMPLE","JOB_NICE_TARGET":"All Servers","JOB_COMMAND":"bin/test-plugin.js","JOB_LOG_FILE":"/opt/cronicle/logs/jobs/jld5e0wlzg2.log","USERNAME":"root","DURATION":"30","PROGRESS":"1","BURN":"0","ACTION":"Success","SECRET":"Will not be shown in Event UI"}}]
[1674287003.557][2023-01-21 13:13:23][vm-cron-01][3128817][Cronicle][debug][2][Spawning background daemon process (PID 3128817 will exit)][["/usr/bin/node","/opt/cronicle/lib/main.js"]]
[1674287003.706][2023-01-21 13:13:23][vm-cron-01][3128825][Cronicle][debug][2][Cronicle v0.9.19 Starting Up][{"pid":3128825,"ppid":1,"node":"v16.19.0","arch":"x64","platform":"linux","argv":["/usr/bin/node","/opt/cronicle/lib/main.js"],"execArgv":[]}]
harshgpt816 commented 1 year ago

LOL, for some unknown, unforeseen reason my container is out of inodes.

sudo df -i
Filesystem                       Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/pve-vm--198--disk--0 3276800 3276800       0  100% /
none                             8217107      25 8217082    1% /dev
tmpfs                            8217107       1 8217106    1% /dev/shm
tmpfs                             819200     151  819049    1% /run
tmpfs                            8217107       2 8217105    1% /run/lock
tmpfs                            1643421      17 1643404    1% /run/user/1000

To anyone who sees this bug in the future: make sure to check the inode usage in your VM/container.

To the author: sorry for the trouble; it seems this is not a Cronicle issue. But the point is that all my other containers are not out of inodes; only the Cronicle one is, and I have no idea why. No idea what to do next either.
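If it helps anyone else, here is a rough sketch of how one might track down where the inodes are actually going (assuming GNU coreutils; /opt/cronicle/data is only the default location of the Filesystem plugin's data, so adjust paths to your own setup):

# Count inodes (files + directories) per top-level directory, largest last
sudo du --inodes -x -d 1 /opt/cronicle | sort -n

# Or simply count the files under the Filesystem plugin's data directory
sudo find /opt/cronicle/data -type f | wc -l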

jhuckaby commented 1 year ago

When using the Filesystem plugin and storing your data on local disk, INODES can become a problem with very large schedules (i.e. 10,000+ jobs per day).

Setting job_data_expire_days to a reasonably small number of days should greatly reduce the INODE usage with the Filesystem Plugin. The default is 180 days, but setting this to 30 or 15 days is probably better for larger schedules.
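As a rough sketch of what that change might look like (assuming the stock conf layout, where job_data_expire_days is a top-level key in /opt/cronicle/conf/config.json; the value is in days):

    "job_data_expire_days": 30,

Then restart Cronicle, since configuration changes generally require a restart to take effect.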

This of course depends on the filesystem and mount (ext4, zfs, etc.). Some filesystems have a huge amount of available INODEs, and you can increase them in some cases, or configure them when formatting.
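For ext4 specifically, a rough sketch of how one might inspect and provision inodes (standard e2fsprogs tools; /dev/sdX1 below is a placeholder device, and reformatting destroys data, so treat this only as an illustration):

# Inspect the inode counts on an existing ext4 volume (if the root volume is ext4)
sudo tune2fs -l /dev/mapper/pve-vm--198--disk--0 | grep -i inode

# When formatting a new volume, lower the bytes-per-inode ratio to get more inodes
sudo mkfs.ext4 -i 8192 /dev/sdX1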