quattor / ncm-cdispd

Node Configuration Manager Configuration Dispatch Daemon
www.quattor.org
Other
2 stars 5 forks source link

Undefined ICLIST encountered #32

Closed jouvin closed 7 years ago

jouvin commented 7 years ago

When upgrading from 16.8.0-rc2 to 16.8.0-rc3, I get the following message:

2016/09/13-18:59:22 [INFO] new profile arrived, examining...
updated /var/lib/ccm/profile.1700/ccm-active-profile.1700-4114529
2016/09/13-18:59:22 [INFO] Undefined ICLIST encountered. Report this to the developers (including the templates under /var/lib/ccm and any logs): old cid 1699 new cid 1700
2016/09/13-18:59:22 [INFO] new (and changed) profile detected

It seems to have started after the upgrade to 16.8.0-rc2 and before the upgrade to rc3. The problem may have been triggered by the HW template updates (CPU/NIC) and not be related to ncm-cdispd 16.8 as it also seems to affect machines running older versions (16.2 for example).

This needs to be investigated before 16.8.0 release.

jouvin commented 7 years ago

I confirm that I see it on machines not upgraded to 16.8.0-rc2/3 (16.6.0). This may be the profile difference in a part not related to any component (/hardware). May be worth looking at it...

stdweird commented 7 years ago

@jouvin can you send me the /var/lib/ccm dir and any logs? can you also try to delete the 1700 cid and retry to see if it is reproducable:

new=1700
old=1699
cd /var/lib/ccm
echo $old > latest.cid
echo $old > current.cid
rm -Rf profile.$new
cat profile.$old/profile.xml > data/*

change debug = 5 in /etc/ncm-ncd.conf

and then run

ccm-fetch --debug 5

(it's best not to trigger ncm-ncd by hand)

jouvin commented 7 years ago

@stdweird I sent you privately the CCM dir contents. I restarted ncm-cdispd and I think it fixed the problem. I'll try to reproduce it the way you suggested and keep you informed.

jouvin commented 7 years ago

I should said that the modification that tirggered the problem (update of the CPU/NIC description using 16.8 templates) included a schema change (new properties added under /hardware), not only changed values. I tend to think that it is linked to this...

stdweird commented 7 years ago

diffing the 1699 and 1700 cid profile.xml anly shows changes to rpm versions and yum repos. nothing else. if you cannot repoduce this, it would be ideal if you still had a node that didn't receive the update rc2->rc3, and do that update with debug 5 in ncm-ncd.conf

jrha commented 7 years ago

I've looked at the VMs I upgraded to rc2 but not rc3 and I see it on them.

2016-09-13T10:31:25.555066+01:00 vm223 ncm-cdispd[29252]: Undefined ICLIST encountered. Report this to the developers (including the templates under /var/lib/ccm and any logs): old cid 12 new cid 14
stdweird commented 7 years ago

@jrha can you diff -u /var/lib/ccm/profile.12 /var/lib/ccm/profile.14 and report anything except the rpm version changes? also if you have an rc2 machine lying around, set debug 5 in ncm-ncd.conf and move to rc3

jrha commented 7 years ago

Nothing more than the following (no RPM version changes at all).

61,66d60
<       "features/core/sudo/config",
<       "features/core/pam/config",
<       "features/core/network-cards/config",
<       "features/nagios/client/config",
<       "features/core/syslog/config",
<       "features/local/magdb-discover/config",
73a68,74
>       "features/core/syslog/config",
>       "features/local/magdb-discover/config",
>       "features/core/network-cards/config",
>       "features/nagios/client/config",
>       "features/core/pam/config",
>       "features/core/sudo/config",
>       "features/core/ntp/config",
536,545d536
<             "command": "/usr/sbin/magdb-discover",
<             "name": "magdb-discover",
<             "timing": {
<               "hour": "13",
<               "minute": "30",
<               "smear": 30
<             },
<             "user": "root"
<           },
<           {
555a547,556
>           },
>           {
>             "command": "/usr/sbin/magdb-discover",
>             "name": "magdb-discover",
>             "timing": {
>               "hour": "13",
>               "minute": "30",
>               "smear": 30
>             },
>             "user": "root"
1550c1551
<             "url": "http://yum.quattor.org/testing/16.8.0-rc2/"
---
>             "url": "http://yum.quattor.org/16.8.0-rc2/"
1972a1974,1982
>       "ntp": {
>         "instance": "ral",
>         "server_ips": [],
>         "servers": [
>           "ntp1.rl.ac.uk",
>           "ntp2.rl.ac.uk",
>           "ntp3.rl.ac.uk"
>         ]
>       },
jouvin commented 7 years ago

I want to insist that the same error shows up with older versions of Quattor (tested with 16.2 and 16.6) so it looks like an old race condition triggered by some template changes (I expresssed my guess previously). Also note, that we should examine the first occurence of the problem as any profile change after it has been triggered will produce this error even though the next changes have nothing not related to configuration not bounded to components.

jrha commented 7 years ago

The above is the only occurrence on my test VMs.

stdweird commented 7 years ago

@jouvin sure, but if you can trigger this with a rc2->rc3 update, it narrows the possible changes down quite a bit, which will help with debugging this.

jouvin commented 7 years ago

Not sure it really helps as I'm seeing it on machines still running 16.2 or 16.6 after the profile changes that triggered the problem on all machines... So it is not related to a change in ncm-cdispd itself or it is related to a change in versions <= 16.2... This is my point!

As for the change that triggered the problem at GRIF, here it is:

diff -u profile.1687/profile.xml profile.1689/profile.xml
--- profile.1687/profile.xml    2016-09-13 01:57:57.000000000 +0200
+++ profile.1689/profile.xml    2016-09-13 09:29:32.000000000 +0200
@@ -10,8 +10,7 @@
           "maxspeed": 1000,
           "media": "Ethernet",
           "name": "Broadcom Tigon3 ethernet driver",
-          "pxe": true,
-          "type": "deprecated"
+          "pxe": true
         },
         "eth1": {
           "boot": false,
@@ -21,8 +20,7 @@
           "maxspeed": 1000,
           "media": "Ethernet",
           "name": "Broadcom Tigon3 ethernet driver",
-          "pxe": true,
-          "type": "deprecated"
+          "pxe": true
         }
       }
     },

which means it is not the change where the new HW templates where deployed but the next change where I used for NIC the new by-vendor templates rather than the legacy ones... I'll see if I missed an error occurence but I don't think...

stdweird commented 7 years ago

@jouvin i'm confused now. the message above said cid 1699 -> 1700. anyway, a debug 5 output of ncm-ncd will be very helpful.

jrha commented 7 years ago

Ok, it's not a 16.8 bug, let's figure it out for 16.10.

jouvin commented 7 years ago

I tend to agree with @jrha as I think (but I had not time to check in details) that the error is harmless...

@stdweird : it's true that I opened the issue with the last occurence of the error yesterday, a mistake of mine... But the first occurence was:

2016/09/13-09:23:19 [INFO] new profile arrived, examining...
updated /var/lib/ccm/profile.1688/ccm-active-profile.1688-4114529
2016/09/13-09:23:19 [INFO] Undefined ICLIST encountered. Report this to the developers (including the templates under /var/lib/ccm and any logs): old cid 1687 new cid 1688
2016/09/13-09:23:19 [INFO] new (and changed) profile detected
2016/09/13-09:23:22 [INFO] no components to be run by NCM - ncm-ncd won't be called

I checked again and I confirm that the problem was created by the change I reported in a previous comment, not by the initial schema update (that I backported to 16.2 and 16.6, BTW...).

stdweird commented 7 years ago

@jouvin ah, with the last line, i understand your initial remark. debug 5 output of ncm-cdispd is welcome.

jouvin commented 7 years ago

Unfortuantely, I tried your suggestion to reproduce the problem but I have not succeeded so far: ncm-cdispd now fails:

2016/09/14-11:27:47 [INFO] new profile arrived, examining...
updated /var/lib/ccm/profile.1687/ccm-active-profile.1687-2955180
2016/09/14-11:27:47 [INFO] new (and changed) profile detected
2016/09/14-11:27:47 [ERROR] fatal exception:
2016/09/14-11:27:47 [ERROR] EDG::WP4::CCM::Element->elementExists (EDG::WP4::CCM::Configuration=HASH(0x2f53c30), /software/components)
2016/09/14-11:27:47 [ERROR] exiting ncm-cdispd...

Probably a consequence of removing some of the profiles... I had no time yet to understand what the pb could be...

jrha commented 7 years ago

FWIW the upgrade from rc2 to rc3 didn't trigger another ICLIST message.

stdweird commented 7 years ago

i think i've located the issue. the initialisation and cleanup with $this_app->{ICLIST} = (); results in an undef. those should be = [] instead.

$ perl -e 'use warnings;use strict;use Data::Dumper; my $a->{x} = ();print defined($a->{x}),"\ndone\n"'

done
jouvin commented 7 years ago

Thanks for taking the time to look at it! I have been running from one meeting to another one... I'd say that if 16.8 is not released yet (or the release has not been started, it'd been worth to include a fix for the issue in it...

jrha commented 7 years ago

I saw this after I started the final build, but I don't think it was worth delaying any further for.