netdisco / netdisco

A web-based network management tool.
http://netdisco.org/
BSD 3-Clause "New" or "Revised" License

Duplicate devices #311

Closed ollyg closed 4 years ago

ollyg commented 7 years ago

Related to #265 but not the same - devices can be discovered twice. Check with:

select name, serial, count(*) from device group by name, serial having count(*) > 1;

It's probably also the case that there are reciprocal device_ip A<->B / B<->A row pairs, which would give it away as well.
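
Something along these lines should surface those reciprocal rows (a sketch, assuming the stock device_ip schema where ip is the canonical device IP and alias is one of its addresses):

select a.ip, a.alias
  from device_ip a
  join device_ip b on a.ip = b.alias and a.alias = b.ip
 where a.ip < b.ip;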

ollyg commented 7 years ago

Have a report of duplicates with Delete button so the Admin can choose which to keep.

ollyg commented 7 years ago

Added report in 55edd657.

taylor5042 commented 6 years ago

Is there any way to prevent the discovery from adding additional devices that have the same serial number as another device already in the table? I don't understand all the logic behind the aliases, but I have a number of devices with this issue. This instance of Netdisco has 1400 devices, with around 500 of those having more than one management IP address. Of those 500, I have about 30 devices that seem to get past the alias-based duplicate prevention.

ollyg commented 6 years ago

Hi @taylor5042. I believe that the code referred to earlier in this ticket, which is now in the released version of Netdisco, will use the device serial number to prevent adding duplicate devices.

However, since then we have noticed that, because the poller is really efficient and fast, duplicates can appear due to parallel transactions taking place on the database. So we have some more code to prevent this, using the device lldpRemChassisId for deduplication. On large networks this seems to stop duplicates from reappearing once removed. This code is in the pre-release version of Netdisco on CPAN.

Would you like to install it?

taylor5042 commented 6 years ago

I'm willing to try it out.

ollyg commented 6 years ago

Hi @taylor5042, okay you can install the DEV version by running (as one line):

~netdisco/bin/localenv cpanm https://cpan.metacpan.org/authors/id/O/OL/OLIVER/App-Netdisco-2.036012_003.tar.gz

You should also run ~netdisco/bin/netdisco-deploy afterwards to update the DB schema, and then restart the backend daemon.
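
All together, something like this (a sketch; I'm assuming the backend control script on your install is ~netdisco/bin/netdisco-backend, as in current releases):

~netdisco/bin/localenv cpanm https://cpan.metacpan.org/authors/id/O/OL/OLIVER/App-Netdisco-2.036012_003.tar.gz
~netdisco/bin/netdisco-deploy
~netdisco/bin/netdisco-backend restart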

Either remove all devices or remove duplicate devices through the Admin menu report "Duplicate Devices". Then future discover jobs will hopefully not throw up duplicates!

Do let me know how you get on.

taylor5042 commented 6 years ago

I got really busy and sick in December. I have been running the dev patch for a few weeks now, with only a few dups popping up here and there. Thanks! When will this be included in the stable release?

netdisco-automation commented 6 years ago

Hi @taylor5042 sorry to hear you were ill - I was also busy and ill in December :-(.

Netdisco is now updated with this code (and some other improvements) and you can follow the standard upgrade instructions online to get the current release (https://metacpan.org/pod/App::Netdisco).

Many thanks Oliver.

p.s. oops I should have commented as @ollyg - never mind.

taylor5042 commented 4 years ago

I am now running version 2.044004 and I'm still seeing hundreds of duplicates.

ollyg commented 4 years ago

Hi @taylor5042 sorry to hear that. Obviously there's been a lot of development since this ticket was originally created, so perhaps it would be good to start over with the investigation.

Firstly, I do recommend looking into the device_identity configuration setting which is intended to help guide Netdisco to use the same canonical IP for devices if it discovers them multiple times.

If this does not work, even after a fresh installation (on an existing installation Netdisco will probably not want to just delete duplicates itself), let's investigate....

Can you let us know in what way they are duplicates? For example, have you wiped the database and started over and they appear, or have they recently been added by Netdisco to an existing installation? Can you check whether the Serial Number field of the duplicates is shown as the same in the user interface? Can you also check whether they share "canonical" IPs and DNS names, or whether they share addresses under the Addresses tab of the Device page?
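
If it helps, a query along these lines should list the suspects side by side (a sketch, reusing the device table columns from the query at the top of this ticket, plus ip and dns):

select ip, dns, name, serial
  from device
 where serial in (select serial from device group by serial having count(*) > 1)
 order by serial, ip;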

taylor5042 commented 4 years ago

I recently increased the CPUs and memory on my VM. I upgraded to 2.044004 about a week ago, and then I noticed Netdisco was only discovering about 100 out of 3500 devices each day. I looked in the logs and found errors similar to these messages:

DBIx::Class::Schema::Versioned::_on_connect(): Your DB is currently unversioned. Please call upgrade on your schema to sync the DB. at /home/netdisco/perl5/lib/perl5/DBICx/Sugar.pm line 121

DBI Connection failed: DBI connect('dbname=netdisco','netdisco',...) failed: FATAL: Peer authentication failed for user "netdisco" at /home/netdisco/perl5/lib/perl5/DBIx/Class/Storage/DBI.pm line 1517.

This led me to wipe out the database and redeploy. But the errors persisted, so I ended up changing the max_connections setting in postgresql.conf to 200. This seemed to fix the errors and I began discovering all 3500 devices again.

Through this process I ended up with about 200 duplicates, which I have cleaned up. Today it seems to have stabilized, but I still have 8 duplicates showing. I am using the device_identity setting with a few lines of loopback IPs in it, and this seems to be working fairly well. Initially I tried using the port loopback0 syntax, but I couldn't seem to get that working, so I commented it out and went with the IP addresses instead:

device_identity:
  # 'any': 'port:(?i)loopback0'
  - 172.16.160.0/22
  ...

Currently the 8 duplicates seem to be a combination of three separate issues. First, a site in Kyiv where the latency has been really bad over the last few days.

Second, sometimes a device gets partially discovered and the serial number or sysname ends up as a null value in the database. I can see how this will then easily lead to a duplicate.

And third, some duplicates that just appear to be a mystery. The last_discovery on the device_identity IP was from two days ago, but the duplicate IP that isn't listed in device_identity is showing a last_discovery of today.

Some of our network follows very predictable addressing standards, and I could build a very long list of IPs and put them in the discover_no section. But other parts of the network have dozens of layer three interfaces per device with no clear standards.
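
For reference, a minimal sketch of what such a discover_no list could look like (these prefixes are just placeholders, not our real addressing):

discover_no:
  - 192.0.2.0/24
  - 198.51.100.17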

To answer your questions: the serial numbers are the same in the user interface (although I'm not exactly sure what you are looking for). The IP address that I have in device_identity has an internal DNS record, but the duplicate address does not have the same DNS record, so the device name and system name are different because one of them has a domain name and the other does not. The data in the Addresses tab is identical.

taylor5042 commented 4 years ago

Oliver, I'm happy to report that the duplicate issue seems to be under control. It appears that the tweaks I have made to the server resources, postgres configuration and TCP/UDP tuning have nearly eliminated the duplicate issues. I still see a few duplicates popping up on some Nexus gear, but it seems to be very limited, and I seem to be able to prevent these duplicates by adding a single IP for each device into the discover_no section. Also, the null sysnames/serials/models issue that I have experienced in the past seems to be resolved now as well. Let me watch it for a few more days and I'll post here again.

taylor5042 commented 4 years ago

So the duplicate issues seem to be under control. I'm still seeing 1 or 2 duplicates per day, but prior to these recent changes I was seeing closer to 15 per day, adding up to about 150 or so duplicates over a 30 day period. The duplicates I'm seeing appear to be caused by the null values issue, which is also better, but I'm still seeing 5-10 of these each time a scheduled discovery runs. I'm fairly certain the duplicates are mostly the result of a null value being set for the serial number, which then allows a duplicate to be created. I'd love to solve this completely, but at least this level is manageable. I'm also noticing the Discovery Queue getting stuck in the web interface like mentioned in other issues, but I'm not sure how to resolve that issue. Hopefully I can figure that out soon.

ollyg commented 4 years ago

Hi @taylor5042, many thanks for following up!

It appears that the tweaks I have made to the server resources, postgres configuration and TCP/UDP tuning have nearly eliminated the duplicate issues.

Excellent news. I would like to add some of this to our tips/troubleshooting docs here in the github wiki, for other users. Would you be able to add a comment here with the settings, or send them to oliver@cpan.org ?

The duplicates I'm seeing appear to be caused by the null values issue, which is also better, but I'm still seeing 5-10 of these each time a scheduled discovery runs. I'm fairly certain the duplicates are mostly the result of a null value being set for the serial number, which then allows a duplicate to be created.

OK, this makes sense. We have another ticket #227 which is actually a similar issue ... some of the SNMP methods time out and Netdisco makes some changes as a result, which messes things up. The problem which stalled that other ticket is that we don't really know whether these SNMP methods should ever work for a given device ... many platforms have alternative or broken implementations, so we try to avoid Netdisco just aborting.

However... in your case I wonder whether it would make sense to have a config setting which basically aborts the discover if some really basic data cannot be retrieved, for example the serial number. Would that work for you?
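
Just to illustrate the idea, such a setting could look something like this (a hypothetical setting name, not anything that exists in Netdisco today):

# hypothetical setting, for illustration only - not implemented
discover_abort_on_missing:
  - serial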

I'm also noticing the Discovery Queue getting stuck in the web interface like mentioned in other issues, but I'm not sure how to resolve that issue. Hopefully I can figure that out soon.

Please can you elaborate on this issue?

taylor5042 commented 4 years ago

Oliver, I sent you some detailed replies via email but I have not seen a reply. Let me know if you didn't get those messages.

ollyg commented 4 years ago

I did get the messages! Apologies, we have a new baby in the house so my attention is a bit intermittent 🙃.

taylor5042 commented 4 years ago

However... in your case I wonder whether it would make sense to have a config setting which basically aborts the discover if some really basic data cannot be retrieved, for example the serial number. Would that work for you?

Hey Oliver, I have tried my very best to eliminate all UDP buffer issues, and yet I still keep seeing null serial numbers, sysnames, and other fields as well. The serial number and sysname are the most annoying to deal with. When a serial number is overwritten with a null value, that seems to lead to duplicate records which I then have to delete manually. The sysname field gets very confusing when it is missing, as I am trying to utilize the Netdisco data via integration with other systems and reports. Would you still be able to add an abort feature for these two fields? Thanks! Matt

ollyg commented 4 years ago

I think it's a fair configuration to not overwrite an existing essential datum with a null value. This can reasonably be seen as an error. I'll start working on this feature, thanks for the reminder.
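
In the meantime, a query along these lines should list the rows currently affected (a sketch, assuming the device table's name column holds the sysName, as in the queries earlier in this ticket):

select ip, dns, name, serial
  from device
 where serial is null or name is null
 order by ip;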

ollyg commented 4 years ago

Moving this to #227 instead