Closed · L8X closed this issue 1 year ago
Please use our bug report issue template and complete all relevant fields. What is your exact configuration? What version of Pathvector are you running? How can we reproduce your issue? What is the error message?
Is that good enough? I'm not very good at writing long things like this so forgive any mistakes in my wording, etc.
I'll still need your Pathvector config to reproduce. Feel free to use the "pathvector config --sanitize" command to redact sensitive data if you like.
I'll send it to you via Discord so as to avoid posting IPs and/or peer information publicly.
I've sent it over; it's under the name "pathvector-sanitized.yml".
I've been trying out Pathvector recently and have been experiencing the same issue. Can we reopen?
Here's my config:
/etc/pathvector.yml
asn: aaa
router-id: bbb
source4: bbb
prefixes:
  - ccc/24
peeringdb-api-key: ddd
optimizer:
  probe-udp: true
  exit-on-cache-full: true
  probe-interval: 1
  cache-size: 3
  targets:
    - 192.0.2.2
peers:
  provider:
    asn: eee
    template: peer
    multihop: true
    neighbors:
      - "fff"
    password: "ggg"
    optimize-inbound: true
    probe-sources: [ "bbb" ]
templates:
  peer:
    filter-irr: false
    filter-rpki: true
    filter-bogon-asns: true
    filter-bogon-routes: true
    auto-import-limits: false
    auto-as-set: false
    as-set: eee
This is for my Vultr VPS. I'm pretty much just copying from probe-simple.yml and setting probe-sources to the public IP this is running on. I'm not even sure that's the way to go 😅
Can a n00b like myself get some clarification on what this should look like?
For clarity, this is the output I'm seeing:
# pathvector optimizer
INFO[0000] Starting optimizer
FATA[0001] reading peer file: open /var/run/pathvector/cache/MY_CONFIG.conf: no such file or directory
Edit: turns out providing it a valid IP as a target helps...
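For anyone else starting from probe-simple.yml: the only change that mattered for me was pointing targets at an address that actually answers probes. A minimal sketch of what I mean (the address below is just an illustrative reachable host, not a recommendation; 192.0.2.2 sits in the TEST-NET-1 documentation range and will never reply):
optimizer:
  targets:
    - 1.1.1.1   # example only: pick a reachable address you actually want to steer towards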
Describe the bug
I am using the default optimizer settings with PM2 set up as a daemon. When a peer times out I get error messages galore, and PM2 restarts over and over without ever redoing the optimizer run. The errors relate to a cache file not existing, but I believe the root cause is that when a peer doesn't respond to the optimizer's probes, no cache file is created for it, which produces the fatal error I have become far too familiar with.
I would appreciate a hotfix for this, as my notifications get spammed galore whenever even a single peer dares to go down.
The optimizer also doesn't seem to respect disabled sessions, so to stop the spam you have to manually comment out the probe-sources line to keep it from probing the downed peer...
Environment
OS: Latest Debian
PV version:
Config file:
I cannot provide my config file, as my config will not work for you, and as I said, I am using the default optimizer settings with PM2 as the daemon.
However, what you need to reproduce it is quite simple:
1. A transit session with route exports to your test instance disabled (or a peering session that doesn't have a route for whatever is in the optimizer's target setting).
2. A valid probe-sources entry in the peering session you'll be using to test a route.
3. A valid PBR setup for the route (such as your own source IP, etc., just as is normally required for the optimizer to run).
4. As mentioned in point 1, a route in your target field that won't respond.
5. PM2 with autorestart enabled, running a bash script with "pathvector optimizer -v" inside (see the wrapper sketch after this list).
6. Optionally, set up the readily available "discord.sh" script from GitHub to send notifications to a Discord webhook from a separate alert script.
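To make steps 5 and 6 concrete, here is a rough sketch of the wrapper I mean. The script name, webhook URL, and discord.sh flags are placeholders from my setup, not anything Pathvector ships, and you'd start it with something like "pm2 start ./optimizer.sh --name pathvector-optimizer" (autorestart is PM2's default):
#!/usr/bin/env bash
# optimizer.sh - wrapper kept alive by PM2's autorestart. Everything here
# is a sketch: the webhook URL is a placeholder and discord.sh's flags may
# differ depending on which discord.sh you grabbed from GitHub.
set -u

WEBHOOK_URL="https://discord.com/api/webhooks/XXXX/YYYY"  # placeholder

alert() {
    # the "separate alert script" from step 6, folded in here for brevity
    ./discord.sh --webhook-url="$WEBHOOK_URL" --text "$1"
}

# Run the optimizer verbosely and forward anything that looks like an
# optimization alert or a fatal error to the webhook.
pathvector optimizer -v 2>&1 | while read -r line; do
    echo "$line"
    case "$line" in
        *"Optimization Alert"*|*"level=fatal"*) alert "$line" ;;
    esac
done
# When pathvector dies with the fatal error below, the pipe closes, this
# script exits, and PM2 immediately restarts it - hence the alert spam.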
Finally, let the optimizer run and you'll notice an error saying something like:
0|pathvector | time="2023-06-20T22:53:40+01:00" level=debug msg="[Optimizer] Peer AS000000 EXAMPLE met or exceeded maximum allowable packet loss: 100.000000 >= 0.500000"
0|pathvector | Optimization Alert: Peer AS000000 EXAMPLE met or exceeded maximum allowable packet loss: 100.000000 >= 0.500000
0|pathvector | time="2023-06-20T22:53:43+01:00" level=fatal msg="reading peer file: open /var/run/pathvector/cache/AS000000_EXAMPLE.conf: no such file or directory"
Then your alert script will start spamming the notification webhook / echoed output, because the fatal error (which is the bug with the cache file) makes PM2 autorestart every time it is thrown.
Expected behaviour
If a peer does not respond to the optimizer, I expect error checking to handle the missing cache file gracefully instead of treating it as fatal, since that is what is causing the problem here.
Because the cache file was never populated (unlike for peers that do respond), the previously mentioned fatal error is thrown, which knocks the optimizer out of its standard operation and turns it into a webhook / log spammer due to how PM2's autorestart works.
The correct way to handle this, as mentioned, is to either check for the missing file or not throw a fatal error at all, letting the optimizer continue even when a peer didn't respond and has 100% packet loss.