nautobot / nautobot-app-device-onboarding

Device Onboarding Plugin for Nautobot
https://docs.nautobot.com/projects/device-onboarding/en/latest/

Can't open job results + jobs fail due to multiple VLANs being returned #215

Closed mdeng10 closed 2 months ago

mdeng10 commented 3 months ago

Environment

Expected Behavior

When clicking on the job result, I expect it to open the job results page with a log of how the job went.

Observed Behavior

It loads this page instead: [image]

I've attempted to run the "Runs Commands on a Device to simulate SSoT Command Getter" job, however there's a large backlog of onboarding jobs that have to run before this one. Is there a way to cancel a job in Nautobot? I can only delete the job result.

Steps to Reproduce

Unsure of how to repro yet, but I suspect that if the devices are configured with the same VLAN ID and the same VLAN name in the same location, it'll most likely throw the error I've been seeing:

<class 'nautobot_ssot.models.Sync.MultipleObjectsReturned'>

get() returned more than one Sync -- it returned 2!

But it should still allow me to see the job result page so I can try to pinpoint which devices are causing this.

housepbass commented 2 months ago

@mdeng10 I think I've seen this scenario when running multiple Sync Network Data Jobs simultaneously and there are Devices in both Jobs in the same Location. To test, try deleting duplicate VLANs from the relevant Location and then re-discover with one Job at a time.
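
To find candidates for deletion first, the duplicate check can be sketched in plain Python. This is only a sketch under assumptions: the record layout (dicts with `id`, `location`, `vid` and `name` keys) and the sample values are hypothetical, and how the records are exported from Nautobot is left open.

```python
from collections import defaultdict


def find_duplicate_vlans(vlans):
    """Return {(location, vid, name): [ids]} for every key that occurs more than once."""
    groups = defaultdict(list)
    for vlan in vlans:
        # Two VLANs count as duplicates when location, VLAN ID and name all match.
        key = (vlan["location"], vlan["vid"], vlan["name"])
        groups[key].append(vlan["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records, not taken from a real Nautobot instance.
sample = [
    {"id": "a1", "location": "site-1", "vid": 10, "name": "users"},
    {"id": "b2", "location": "site-1", "vid": 10, "name": "users"},
    {"id": "c3", "location": "site-2", "vid": 10, "name": "users"},
]

duplicates = find_duplicate_vlans(sample)
print(duplicates)
```

Only the two `site-1` records are reported here, because the third one shares the VLAN ID and name but sits in a different location.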

If you restart the worker service, that will stop all running Jobs: systemctl restart nautobot-worker

It would be helpful to know the inputs you're passing in to the problematic Jobs.

mdeng10 commented 2 months ago

I am indeed running multiple jobs simultaneously, so that must be the cause.

Is there a way to cancel all pending jobs and not just running jobs? I tried using nautobot-server nbshell on the main host and editing the JobResults to be successful/finished, but they're still slowly running one by one.

> It would be helpful to know the inputs you're passing in to the problematic Jobs.

I'll try to set up the jobs another time after I do some testing.

housepbass commented 2 months ago

Can you cancel the pending Jobs from the UI and restart the worker service?

mdeng10 commented 2 months ago

I don't see an option to cancel pending jobs from the UI. I can delete the JobResult, but I think the job will run all the same.

mdeng10 commented 2 months ago

Also, we have the Nautobot worker service in its own Docker container hosted via AWS ECS, so systemctl isn't installed or initialised. Will terminating the container and creating a new one have the same effect?

housepbass commented 2 months ago

After reviewing some docs, I think the most direct way to terminate jobs will be through celery. The celery shell should be accessible from your nautobot app container via a nautobot management command.

nautobot-server celery shell
# Remove all Pending Jobs - I have not tested this
app.control.purge()

# Stop Running Jobs - I just tested this locally
i = inspect()
jobs = i.active()
for hostname in jobs:
    tasks = jobs[hostname]
    for task in tasks:
        app.control.revoke(task['id'], terminate=True)

mdeng10 commented 2 months ago

Thanks for the info. What import is needed for inspect()? I've run

app.control.purge()

so hopefully that clears up the queue.

housepbass commented 2 months ago

Sorry about that, here you go. There are more docs about inspecting Celery workers.

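nautobot-server celery shell
# Inspect the workers; active() returns None if no worker replies
i = app.control.inspect()
jobs = i.active() or {}
for hostname in jobs:
    tasks = jobs[hostname]
    for task in tasks:
        # Revoke each running task and terminate its process
        app.control.revoke(task['id'], terminate=True)

mdeng10 commented 2 months ago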

Looks like something worked. Not sure which, but the jobs seem to have been cancelled (I can create new ones now).
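
Since it is unclear which step took effect, here is a hedged sketch of how the two approaches might be combined. As far as I understand Celery, `purge()` only discards messages still waiting in the broker queue, while tasks a worker has already prefetched are reported by `inspect().reserved()` rather than `inspect().active()`. The helper name `collect_task_ids` and the sample replies below are hypothetical. The function only flattens the reply dictionaries, so it can be tried without a running worker.

```python
def collect_task_ids(*replies):
    """Flatten Celery inspect replies ({hostname: [task, ...]}) into unique task IDs."""
    task_ids = []
    for reply in replies:
        # inspect().active() and inspect().reserved() return None when no worker answers.
        for tasks in (reply or {}).values():
            for task in tasks:
                if task["id"] not in task_ids:
                    task_ids.append(task["id"])
    return task_ids


# Hypothetical replies shaped like the output of inspect().active() / .reserved().
active = {"celery@worker-1": [{"id": "task-1"}]}
reserved = {"celery@worker-1": [{"id": "task-2"}, {"id": "task-1"}]}

ids = collect_task_ids(active, reserved)
print(ids)

# Inside `nautobot-server celery shell`, where `app` is already defined:
# inspector = app.control.inspect()
# for task_id in collect_task_ids(inspector.active(), inspector.reserved()):
#     app.control.revoke(task_id, terminate=True)
```

Revoking both running and reserved tasks, in addition to purging the queue, should cover jobs a worker has already picked up but not yet started. I have not tested this against a live Nautobot worker.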