Errors from optimizer caused by lack of cache file

L8X commented 1 year ago

Describe the bug

I am using the default optimizer settings with Pathvector running as a daemon under PM2. When a peer times out, I get a flood of error messages and PM2 restarts the process over and over, and the optimizer run is not redone. The errors say that a cache file does not exist. I believe the cause is that when a peer doesn't respond to the optimizer's probes, no cache file is created for it, and the later attempt to read that file produces the fatal error I have become far too familiar with.

I would appreciate a hotfix for this, as it is causing my notifications to be spammed galore when even a single peer dares to go down.

The optimizer also doesn't seem to respect disabled sessions, so to not get spammed, you have to manually comment out the probe-sources line to stop it trying to use the downed peer...
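Skipping disabled sessions could be sketched roughly like this in Go. The `peer` struct, its fields, and `probeCandidates` are hypothetical names made up for illustration. They are not Pathvector's real types, and the addresses are documentation placeholders.

```go
package main

import "fmt"

// peer is a hypothetical, simplified stand-in for a configured session.
type peer struct {
	Name         string
	Disabled     bool
	ProbeSources []string
}

// probeCandidates returns only the peers worth probing:
// sessions that are enabled and have at least one probe source.
func probeCandidates(peers []peer) []peer {
	var out []peer
	for _, p := range peers {
		if p.Disabled || len(p.ProbeSources) == 0 {
			continue // never probe through a disabled or unconfigured session
		}
		out = append(out, p)
	}
	return out
}

func main() {
	peers := []peer{
		{Name: "AS65001_UP", ProbeSources: []string{"192.0.2.10"}},
		{Name: "AS65002_DOWN", Disabled: true, ProbeSources: []string{"192.0.2.20"}},
	}
	for _, p := range probeCandidates(peers) {
		fmt.Println("probing via", p.Name)
	}
}
```

With a filter like this, disabling the session would be enough, and the probe-sources line would not need to be commented out by hand.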

Environment

OS: Latest Debian

PV version:

Pathvector 6.3.0
Built 24d9dc1d82b81b6386c4635bd86a9683931964ec on 2023-05-15T05:10:26Z
No plugins
BIRD: ready.

Config file:

I cannot provide my config file, as my config will not work for you, and as I said, I am using the default optimizer settings with PM2 as the daemon.

However what you need to reproduce it is quite simple:

1. A transit session with route exports to your test instance disabled (or a peering session that doesn't have a route for what is in the optimizer's target setting).
2. A valid probe-sources entry in the peering session you'll be using to test a route.
3. A valid PBR setup for the route (such as your own source IP, etc., just as is normally required for the optimizer to run).
4. As mentioned in point 1, you need a route in your target field that won't respond.
5. PM2 with autorestart enabled, running a bash script with "pathvector optimizer -v" inside.
6. Optionally, set up the readily available "discord.sh" script from GitHub to send notifications to a Discord webhook from a separate alert script.
7. Finally, let the optimizer run and you'll notice an error saying something like:

0|pathvector | time="2023-06-20T22:53:40+01:00" level=debug msg="[Optimizer] Peer AS000000 EXAMPLE met or exceeded maximum allowable packet loss: 100.000000 >= 0.500000"
0|pathvector | Optimization Alert: Peer AS000000 EXAMPLE met or exceeded maximum allowable packet loss: 100.000000 >= 0.500000
0|pathvector | time="2023-06-20T22:53:43+01:00" level=fatal msg="reading peer file: open /var/run/pathvector/cache/AS000000_EXAMPLE.conf: no such file or directory

Then your alert script will start spamming the notification webhook / echo output, because the fatal error (which is the bug, caused by the missing cache file) makes PM2 autorestart the process every time it is thrown.

Expected behaviour

If a peer does not respond to the optimizer, I expect error checking to be used in order to skip the check for a cache file, which is what is causing the problem here.

As the cache file wasn't populated (unlike with peers that do respond), it causes the previously mentioned fatal error, which throws the optimizer out of its standard operation and turns it into a webhook / log spammer due to how PM2's autorestart works.

The correct way to handle this, as mentioned, is either to error check, or to not throw a fatal error and let the optimizer continue even if a peer didn't respond and has 100% packet loss.

natesales commented 1 year ago

Please use our bug report issue template and complete all relevant fields. What is your exact configuration? What version of Pathvector are you running? How can we reproduce your issue? What is the error message?

L8X commented 1 year ago

> Please use our bug report issue template and complete all relevant fields. What is your exact configuration? What version of Pathvector are you running? How can we reproduce your issue? What is the error message?

Is that good enough? I'm not very good at writing long things like this so forgive any mistakes in my wording, etc.

natesales commented 1 year ago

I'll still need your Pathvector config to reproduce. Feel free to use the "pathvector config --sanitize" command to redact sensitive data if you like.

L8X commented 1 year ago

I'll send it to you via Discord so as to avoid posting IPs and/or peer information publicly.

L8X commented 1 year ago

> I'll send it to you via Discord so as to avoid posting IPs and/or peer information publicly.

I've now sent it to you. It's under the name "pathvector-sanitized.yml".

jamesalbert commented 7 months ago

I've been trying out Pathvector recently and have been experiencing this same issue. Can we reopen?

Here's my config:

/etc/pathvector.yml

asn: aaa
router-id: bbb
source4: bbb
prefixes:
  - ccc/24
peeringdb-api-key: ddd

optimizer:
  probe-udp: true
  exit-on-cache-full: true
  probe-interval: 1
  cache-size: 3
  targets:
    - 192.0.2.2

peers:
  provider:
    asn: eee
    template: peer
    multihop: true
    neighbors:
      - "fff"
    password: "ggg"
    optimize-inbound: true
    probe-sources: [ "bbb" ]

templates:
  peer:
    filter-irr: false
    filter-rpki: true
    filter-bogon-asns: true
    filter-bogon-routes: true
    auto-import-limits: false
    auto-as-set: false
    as-set: eee

This is for my Vultr VPS. I'm pretty much just grabbing from probe-simple.yml and setting the probe-sources to the public IP this is running on. I'm not even sure if that's the way to go 😅

Can a n00b like myself get some clarification in terms of what this should look like?

For clarity, this is the output I'm seeing:

# pathvector optimizer
INFO[0000] Starting optimizer
FATA[0001] reading peer file: open /var/run/pathvector/cache/MY_CONFIG.conf: no such file or directory

edit: turns out providing it a valid ip as a target helps...
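For anyone else hitting this: assuming the 192.0.2.2 target in the config above was the real value and not a redaction, it sits in 192.0.2.0/24, which RFC 5737 reserves for documentation, so probes to it are never answered. Below is a rough Go sketch of a pre-flight check for that mistake. `checkTarget` and `documentationPrefixes` are hypothetical names, not part of Pathvector, and 1.1.1.1 is only an example of a publicly routed address.

```go
package main

import (
	"fmt"
	"net/netip"
)

// documentationPrefixes lists ranges reserved for examples
// (RFC 5737 for IPv4, RFC 3849 for IPv6). They are not routed publicly.
var documentationPrefixes = []netip.Prefix{
	netip.MustParsePrefix("192.0.2.0/24"),
	netip.MustParsePrefix("198.51.100.0/24"),
	netip.MustParsePrefix("203.0.113.0/24"),
	netip.MustParsePrefix("2001:db8::/32"),
}

// checkTarget is a hypothetical validation helper. It rejects strings that
// are not IP addresses and addresses that fall in a documentation range.
func checkTarget(s string) error {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return fmt.Errorf("target %q is not an IP address: %w", s, err)
	}
	for _, p := range documentationPrefixes {
		if p.Contains(addr) {
			return fmt.Errorf("target %s is in documentation range %s and will not answer probes", addr, p)
		}
	}
	return nil
}

func main() {
	for _, t := range []string{"192.0.2.2", "1.1.1.1", "not-an-ip"} {
		if err := checkTarget(t); err != nil {
			fmt.Println("bad target:", err)
			continue
		}
		fmt.Println("target looks usable:", t)
	}
}
```

This only catches placeholder addresses copied from example configs. A target can pass this check and still be unreachable, in which case the missing-cache-file error from the original report still applies.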

natesales / pathvector

Errors from optimizer caused by lack of cache file #182