sexibytes / sexigraf

SexiGraf is a vSphere centric Graphite appliance with a Grafana frontend.
http://www.sexigraf.fr
MIT License

Sexigraf stopped running pull scripts #396

Open Redicious opened 6 months ago

Redicious commented 6 months ago

Hey,

my Sexigraf instance stopped pulling data on 2 May at 4:00 CEST from all VBR servers and unmanaged ESXi hosts (there is nothing else in the inventory). (I've been on vacation, and Sexigraf is not used yet, still in trial - that's why I'm late to the party.)

So grafana only shows its own metrics.

In /var/log/sexigraf/ there are no new logs for VbrPullStatistics etc.

The only logs that still get updated are:

/var/log/sexigraf/carbon/carbon-cache-a/console.log 145964 5/14/2024 1:21:30 PM
/var/log/sexigraf/carbon/carbon-cache-b/console.log 145964 5/14/2024 1:21:30 PM
/var/log/sexigraf/carbon/carbon-cache-b/query.log 18180 5/14/2024 1:19:39 PM
/var/log/sexigraf/graphite/info.log 1733424 5/14/2024 1:17:07 PM

The log patterns also do not change around the time it stopped pulling data, so there is no error message hinting at what went wrong.

I also checked the standard stuff... There is enough disk space, inodes left, ram is good, cpu is good, etc.

If I run ViPullStatistics.ps1 manually it also works, and I have a set of data points for the ESXi it is run for.

Since ViPullStatistics.ps1 basically works, and there are no transcripts in /var/log/sexigraf: what invokes it? Where should I look next? There is nothing in crontab...

Cheers Red

rschitz commented 6 months ago

Hi and thank you for your feedback. No VI and no VBR is strange. Do you still have something in /etc/cron.d/? And are your entries still activated for collection in the credential stores?

Redicious commented 6 months ago

Thx for your response! Files for vi and vbr are present in /etc/cron.d, and they are enabled in the credential store. Yesterday I disabled and re-enabled a few, which is reflected in the creation dates of the files in /etc/cron.d. Good to know how this works then. I totally forgot about cron.d and relied on crontab -l.
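For anyone retracing this step: `crontab -l` only prints the per-user crontab, so jobs dropped into /etc/cron.d never show up there. A minimal sketch for searching that directory instead, assuming a Debian-style layout; the function name `find_cron_jobs` is made up for illustration and is not part of SexiGraf.

```shell
#!/bin/bash
# find_cron_jobs PATTERN [DIR]
# Hypothetical helper: prints "file:line" for every cron entry matching PATTERN.
# DIR defaults to /etc/cron.d, which `crontab -l` never shows.
find_cron_jobs() {
    local pattern="$1"
    local dir="${2:-/etc/cron.d}"
    # -r: recurse into the directory
    # -s: stay quiet about unreadable or missing files
    # -H: always prefix matches with the file name
    grep -rsH -- "$pattern" "$dir"
}

# Example: find_cron_jobs PullStatistics
```

Passing a second argument such as /etc/crontab or /var/spool/cron would cover the other usual locations.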

However I can see the crons in syslog, like this one:

May 14 14:30:01 sexigraf CRON[219558]: (root) CMD ( /usr/bin/pwsh -NonInteractive -NoProfile -f /opt/sexigraf/ViPullStatistics.ps1 -credstore /mnt/wfs/inventory/vipscredentials.xml -server -sessionfile /tmp/vmw_.key >/dev/null 2>&1)

So I took the command and ran it manually. Starting the script the same way cron does results in:

Fatal error. Internal CLR error. (0x80131506)
Aborted

Even if I strip it down to the bare minimum:

root@sexigraf:/opt/sexigraf# /usr/bin/pwsh -f "/opt/sexigraf/ViPullStatistics.ps1"
Fatal error. Internal CLR error. (0x80131506)
Aborted

However, if I start pwsh and then run the script from within pwsh, it works:

PS /opt/sexigraf> /opt/sexigraf/ViPullStatistics.ps1 -credstore /mnt/wfs/inventory/vipscredentials.xml -server host -sessionfile /tmp/vmw_host.key
Transcript started, output file is /var/log/sexigraf/ViPullStatistics..log
2024-05-14T15:05:44.7208885+00:00 [INFO] ViPullStatistics v0.9.1037
....

Same for a simple script:

root@sexigraf:/opt/sexigraf# echo 'return "hello world!"' >> helloWorld.ps1
root@sexigraf:/opt/sexigraf# chmod 755 helloWorld.ps1
root@sexigraf:/opt/sexigraf# /usr/bin/pwsh -f "/opt/sexigraf/helloWorld.ps1"
Fatal error. Internal CLR error. (0x80131506)
Aborted
root@sexigraf:/opt/sexigraf# pwsh
PowerShell 7.2.17
Copyright (c) Microsoft Corporation.

https://aka.ms/powershell
Type 'help' to get help.

PS /opt/sexigraf> ./helloWorld.ps1
hello world!

Looks like there is something wrong with .NET and not sexigraf. I'll keep you posted....
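Because the cron entry discards all output, the only trace of such a crash is the exit status. A small sketch of a probe that runs a command the way cron does and reports whether it exited normally or was killed by a signal; `classify_exit` and `probe` are hypothetical helper names, and the convention that a shell reports 128 plus the signal number is standard shell behaviour, not something specific to pwsh.

```shell
#!/bin/bash
# classify_exit STATUS
# Hypothetical helper: turns a shell exit status into a short description.
# Statuses above 128 mean the process was killed by signal (STATUS - 128).
classify_exit() {
    local status="$1"
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "failed with exit code $status"
    fi
}

# probe CMD [ARGS...]
# Runs the command with all output discarded (as the cron entry does)
# and reports how it ended.
probe() {
    "$@" >/dev/null 2>&1
    classify_exit "$?"
}

# Example (paths as used in this thread):
# probe /usr/bin/pwsh -NonInteractive -NoProfile -f /opt/sexigraf/helloWorld.ps1
```

On Linux a status of 134 corresponds to SIGABRT (signal 6) and 139 to SIGSEGV (signal 11).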

edit: formatting

rschitz commented 6 months ago

Crazy stuff! Did you update the appliance at some point?

Redicious commented 6 months ago

Didn't update it before now - apt history and the ssh log also show that no one touched it.

I couldn't figure out what exactly caused the issue. strace looked ok'ish - it just aborts. I spent hours googling and chatgpt'ing (is that the right word?) and grepping through logs... So I gave up on finding out what happened and just wanted it fixed. I made a snapshot, reinstalled pwsh, and now it works. It is now 7.4.1 (it was 7.2.17), although I doubt the version change is what fixed it. I think the reinstall itself fixed it, since it broke without any intentional or logged changes.

apt remove powershell-lts
wget https://raw.githubusercontent.com/PowerShell/PowerShell/master/tools/install-powershell.sh
wget https://raw.githubusercontent.com/PowerShell/PowerShell/master/tools/installpsh-debian.sh
bash install-powershell.sh

I assume pwsh itself must have kicked the bucket. Today's bofh-excuse card says: "global warming". That must be it.

Thanks for your help!

rschitz commented 6 months ago

Thanks a lot for your feedback. I also spent some time googling (didn't think about chatgpting it), but by the looks of it, it does sound related to pwsh. FYI, I always use the latest LTS version as long as everything works fine. Really, really strange issue; I hope it won't affect your SexiGraf experience overall :D Cheers

Redicious commented 6 months ago

Hi,

just wanted to let you know: the issue has somewhat come back, but with a "segmentation fault" error instead. I think it's just a different flavor due to the upgrade, since the conditions leading to it are the same.

It can be fixed (maybe only temporarily) with:

rm ~/.cache/powershell/StartupProfileData-NonInteractive

I found this in a PowerShell issue describing a case where running pwsh -c or pwsh -f leads to the CLR dying due to some optimization data being stored in the above file. Details can be found here: https://github.com/PowerShell/PowerShell/issues/18998
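If deleting the file outright feels too blunt, one option is to move it aside so the suspect copy stays available for a bug report. A minimal sketch; the function name `quarantine_profile_cache` and the `.broken-<timestamp>` suffix are assumptions for illustration, and only the file name and directory come from this thread.

```shell
#!/bin/bash
# quarantine_profile_cache [CACHE_DIR]
# Hypothetical helper: renames the non-interactive startup profile cache
# so that pwsh starts without it. CACHE_DIR defaults to the location
# mentioned in this thread.
quarantine_profile_cache() {
    local dir="${1:-$HOME/.cache/powershell}"
    local file="$dir/StartupProfileData-NonInteractive"
    if [ -f "$file" ]; then
        # Keep the old file under a timestamped name instead of deleting it
        mv -- "$file" "$file.broken-$(date +%Y%m%d%H%M%S)"
        echo "moved"
    else
        echo "nothing to do"
    fi
}

# Example: quarantine_profile_cache
```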

So I came up with this:

#!/bin/bash

# Define the log file path
log_file="/var/log/fixpwsh.log"

# Run the PowerShell script
result=$(/usr/bin/pwsh -f /opt/sexigraf/helloWorld.ps1)
ok_string="hello world!"

# Get the current timestamp
timestamp=$(date +"%Y-%m-%d %H:%M:%S")

# Check whether the result matches the expected string
if [[ "$result" == "$ok_string" ]]; then
    echo "[$timestamp] Script returned '$ok_string', quitting." | tee -a "$log_file"
else
    # Remove the profile data file; -f avoids an error if it is already gone
    rm -f "$HOME/.cache/powershell/StartupProfileData-NonInteractive"
    echo "[$timestamp] Script returned something other than '$ok_string', removed file: StartupProfileData-NonInteractive" | tee -a "$log_file"
fi

And now I run it as a cron job every hour...
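Since SexiGraf's own jobs live in /etc/cron.d, the hourly check could be registered the same way. A sketch only: it assumes the script above has been saved as /usr/local/bin/fixpwsh.sh (a hypothetical path not given in the thread), and `write_fixpwsh_cron` is an illustrative helper name.

```shell
#!/bin/bash
# write_fixpwsh_cron TARGET_FILE [SCRIPT_PATH]
# Hypothetical helper: writes an hourly cron.d-style entry for the check script.
# /usr/local/bin/fixpwsh.sh is an assumed location, not taken from the thread.
write_fixpwsh_cron() {
    local target="$1"
    local script="${2:-/usr/local/bin/fixpwsh.sh}"
    # cron.d format: minute hour day-of-month month day-of-week user command
    printf '0 * * * * root %s >/dev/null 2>&1\n' "$script" > "$target"
}

# Example: write_fixpwsh_cron /etc/cron.d/fixpwsh
```

On Debian-based systems, files in /etc/cron.d whose names contain a dot are skipped, which is why the example target has no extension.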

rschitz commented 6 months ago

what kind of CPU are you running?

Redicious commented 6 months ago

2x Intel Xeon Silver 4210

rschitz commented 6 months ago

can you try to upgrade the vHardware on the sexigraf vm just to test?

rschitz commented 6 months ago

also, did you install any security tool in the appliance?

rschitz commented 6 months ago

also, does it have access to the internet?

rschitz commented 4 months ago

any updates?