Closed jouvin closed 8 years ago
I confirm that I see it on machines not upgraded to 16.8.0-rc2/3 (16.6.0). This may be the profile difference in a part not related to any component (/hardware). May be worth looking at it...
@jouvin can you send me the /var/lib/ccm
dir and any logs? can you also try to delete the 1700 cid and retry to see if it is reproducable:
new=1700
old=1699
cd /var/lib/ccm
echo $old > latest.cid
echo $old > current.cid
rm -Rf profile.$new
cat profile.$old/profile.xml > data/*
change debug = 5
in /etc/ncm-ncd.conf
and then run
ccm-fetch --debug 5
(it's best not to trigger ncm-ncd
by hand)
@stdweird I sent you privately the CCM dir contents. I restarted ncm-cdispd
and I think it fixed the problem. I'll try to reproduce it the way you suggested and keep you informed.
I should said that the modification that tirggered the problem (update of the CPU/NIC description using 16.8 templates) included a schema change (new properties added under /hardware
), not only changed values. I tend to think that it is linked to this...
diffing the 1699 and 1700 cid profile.xml anly shows changes to rpm versions and yum repos. nothing else. if you cannot repoduce this, it would be ideal if you still had a node that didn't receive the update rc2->rc3, and do that update with debug 5 in ncm-ncd.conf
I've looked at the VMs I upgraded to rc2 but not rc3 and I see it on them.
2016-09-13T10:31:25.555066+01:00 vm223 ncm-cdispd[29252]: Undefined ICLIST encountered. Report this to the developers (including the templates under /var/lib/ccm and any logs): old cid 12 new cid 14
@jrha can you diff -u /var/lib/ccm/profile.12 /var/lib/ccm/profile.14
and report anything except the rpm version changes?
also if you have an rc2 machine lying around, set debug 5 in ncm-ncd.conf and move to rc3
Nothing more than the following (no RPM version changes at all).
61,66d60
< "features/core/sudo/config",
< "features/core/pam/config",
< "features/core/network-cards/config",
< "features/nagios/client/config",
< "features/core/syslog/config",
< "features/local/magdb-discover/config",
73a68,74
> "features/core/syslog/config",
> "features/local/magdb-discover/config",
> "features/core/network-cards/config",
> "features/nagios/client/config",
> "features/core/pam/config",
> "features/core/sudo/config",
> "features/core/ntp/config",
536,545d536
< "command": "/usr/sbin/magdb-discover",
< "name": "magdb-discover",
< "timing": {
< "hour": "13",
< "minute": "30",
< "smear": 30
< },
< "user": "root"
< },
< {
555a547,556
> },
> {
> "command": "/usr/sbin/magdb-discover",
> "name": "magdb-discover",
> "timing": {
> "hour": "13",
> "minute": "30",
> "smear": 30
> },
> "user": "root"
1550c1551
< "url": "http://yum.quattor.org/testing/16.8.0-rc2/"
---
> "url": "http://yum.quattor.org/16.8.0-rc2/"
1972a1974,1982
> "ntp": {
> "instance": "ral",
> "server_ips": [],
> "servers": [
> "ntp1.rl.ac.uk",
> "ntp2.rl.ac.uk",
> "ntp3.rl.ac.uk"
> ]
> },
I want to insist that the same error shows up with older versions of Quattor (tested with 16.2 and 16.6) so it looks like an old race condition triggered by some template changes (I expresssed my guess previously). Also note, that we should examine the first occurence of the problem as any profile change after it has been triggered will produce this error even though the next changes have nothing not related to configuration not bounded to components.
The above is the only occurrence on my test VMs.
@jouvin sure, but if you can trigger this with a rc2->rc3 update, it narrows the possible changes down quite a bit, which will help with debugging this.
Not sure it really helps as I'm seeing it on machines still running 16.2 or 16.6 after the profile changes that triggered the problem on all machines... So it is not related to a change in ncm-cdispd
itself or it is related to a change in versions <= 16.2... This is my point!
As for the change that triggered the problem at GRIF, here it is:
diff -u profile.1687/profile.xml profile.1689/profile.xml
--- profile.1687/profile.xml 2016-09-13 01:57:57.000000000 +0200
+++ profile.1689/profile.xml 2016-09-13 09:29:32.000000000 +0200
@@ -10,8 +10,7 @@
"maxspeed": 1000,
"media": "Ethernet",
"name": "Broadcom Tigon3 ethernet driver",
- "pxe": true,
- "type": "deprecated"
+ "pxe": true
},
"eth1": {
"boot": false,
@@ -21,8 +20,7 @@
"maxspeed": 1000,
"media": "Ethernet",
"name": "Broadcom Tigon3 ethernet driver",
- "pxe": true,
- "type": "deprecated"
+ "pxe": true
}
}
},
which means it is not the change where the new HW templates where deployed but the next change where I used for NIC the new by-vendor templates rather than the legacy ones... I'll see if I missed an error occurence but I don't think...
@jouvin i'm confused now. the message above said cid 1699 -> 1700. anyway, a debug 5 output of ncm-ncd will be very helpful.
Ok, it's not a 16.8 bug, let's figure it out for 16.10.
I tend to agree with @jrha as I think (but I had not time to check in details) that the error is harmless...
@stdweird : it's true that I opened the issue with the last occurence of the error yesterday, a mistake of mine... But the first occurence was:
2016/09/13-09:23:19 [INFO] new profile arrived, examining...
updated /var/lib/ccm/profile.1688/ccm-active-profile.1688-4114529
2016/09/13-09:23:19 [INFO] Undefined ICLIST encountered. Report this to the developers (including the templates under /var/lib/ccm and any logs): old cid 1687 new cid 1688
2016/09/13-09:23:19 [INFO] new (and changed) profile detected
2016/09/13-09:23:22 [INFO] no components to be run by NCM - ncm-ncd won't be called
I checked again and I confirm that the problem was created by the change I reported in a previous comment, not by the initial schema update (that I backported to 16.2 and 16.6, BTW...).
@jouvin ah, with the last line, i understand your initial remark. debug 5
output of ncm-cdispd is welcome.
Unfortuantely, I tried your suggestion to reproduce the problem but I have not succeeded so far: ncm-cdispd
now fails:
2016/09/14-11:27:47 [INFO] new profile arrived, examining...
updated /var/lib/ccm/profile.1687/ccm-active-profile.1687-2955180
2016/09/14-11:27:47 [INFO] new (and changed) profile detected
2016/09/14-11:27:47 [ERROR] fatal exception:
2016/09/14-11:27:47 [ERROR] EDG::WP4::CCM::Element->elementExists (EDG::WP4::CCM::Configuration=HASH(0x2f53c30), /software/components)
2016/09/14-11:27:47 [ERROR] exiting ncm-cdispd...
Probably a consequence of removing some of the profiles... I had no time yet to understand what the pb could be...
FWIW the upgrade from rc2 to rc3 didn't trigger another ICLIST message.
i think i've located the issue. the initialisation and cleanup with $this_app->{ICLIST} = ();
results in an undef. those should be = []
instead.
$ perl -e 'use warnings;use strict;use Data::Dumper; my $a->{x} = ();print defined($a->{x}),"\ndone\n"'
done
Thanks for taking the time to look at it! I have been running from one meeting to another one... I'd say that if 16.8 is not released yet (or the release has not been started, it'd been worth to include a fix for the issue in it...
I saw this after I started the final build, but I don't think it was worth delaying any further for.
When upgrading from 16.8.0-rc2 to 16.8.0-rc3, I get the following message:
It seems to have started after the upgrade to 16.8.0-rc2 and before the upgrade to rc3. The problem may have been triggered by the HW template updates (CPU/NIC) and not be related to
ncm-cdispd
16.8 as it also seems to affect machines running older versions (16.2 for example).This needs to be investigated before 16.8.0 release.