quattor / template-library-grid

Template library for configuring EMI grid middleware services

Problem with publication information by BDII-Local #135

Open jas01 opened 9 years ago

jas01 commented 9 years ago

I got an alert on midmon.egi.eu about the information my BDII publishes:

CRITICAL - errors 78, warnings 26, info 52
Summary per type of error, warning and info message:
E022 - Default value published (GLUE2ComputingShareWaitingJobs): 26
E023 - Default value published (GLUE2ComputingShareEstimatedAverageWaitingTime): 26
E024 - Default value published (GLUE2ComputingShareEstimatedWorstWaitingTime): 26
I046 - Number of seconds higher than 1 million (GLUE2ComputingShareEstimatedAverageWaitingTime): 26
I047 - Number of seconds higher than 1 million (GLUE2ComputingShareEstimatedWorstWaitingTime): 26
W025 - Incoherent number of total jobs (GLUE2ComputingShareTotalJobs): 26
Pansanel commented 9 years ago

Which version of glite-info-provider-service are you using?

jas01 commented 9 years ago

Currently on my bdii-local I have:

glite-info-provider-service-1.13.4-1.el6.noarch

Regards

Pansanel commented 9 years ago

Can you try to downgrade to 1.13.3-1 to see if it works better?

Pansanel commented 9 years ago

Disregard the previous comment. I have found where the code is: /usr/libexec/info-dynamic-pbs. Which version of lcg-info-dynamic-scheduler-pbs are you using?

jas01 commented 9 years ago

The downgrade doesn't change anything. But you already knew that ;-)

I've split my Torque server and the CREAM CE.

Both have:

lcg-info-dynamic-scheduler-pbs-2.4.5-1.el6.noarch

regards.

jrha commented 9 years ago

Is this a problem with the Quattor template library or a middleware problem?

Pansanel commented 9 years ago

It may be related to Quattor if it is a GIP parser configuration problem, but that is not certain. We have the same RPM (lcg-info-dynamic-scheduler-pbs-2.4.5-1.el6.noarch). Can you check the content of /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce and try to execute it manually?

jas01 commented 9 years ago

In /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce I have a very simple script:

[root@torque-grid ~]# more /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce
#!/bin/sh

cat /swareas/dteam/gip-ce-dynamic-info-maui.cache
[root@torque-grid ~]#

Inside /swareas/dteam/gip-ce-dynamic-info-maui.cache I have a lot of information. Restricting it to one VO:

dn: GlueCEUniqueID=cream-ce-grid.obspm.fr:8443/cream-pbs-glast.org,Mds-Vo-name=resource,o=grid
GlueCEInfoLRMSVersion: 2.5.13
GlueCEInfoTotalCPUs: 112
GlueCEPolicyAssignedJobSlots: 112
GlueCEStateFreeCPUs: 0
GlueCEStateFreeJobSlots: 0
GlueCEPolicyMaxRunningJobs: 110
GlueCEPolicyMaxWallClockTime: 4320
GlueCEPolicyMaxCPUTime: 1440
GlueCEStateStatus: Production

dn: GLUE2ShareID=GLAST-ORG_GLAST-ORG_torque-grid.obspm.fr_ComputingElement,GLUE2ServiceID=torque-grid.obspm.fr_ComputingElement,GLUE2GroupID=resource,o=glue
GLUE2ComputingShareFreeSlots: 0
GLUE2ComputingShareMaxRunningJobs: 110
GLUE2ComputingShareMaxWaitingJobs: 220
GLUE2ComputingShareMaxTotalJobs: 330
GLUE2ComputingShareServingState: Production
GLUE2EntityCreationTime: 2015-03-18T14:55:02Z
GLUE2ComputingShareMaxWallTime: 4320
GLUE2ComputingShareDefaultWallTime: 4320
GLUE2ComputingShareMaxCPUTime: 1440
GLUE2ComputingShareDefaultCPUTime: 1440
GLUE2ComputingShareMaxSlotsPerJob: 1
GLUE2ComputingShareMaxMainMemory: 2000
GLUE2ComputingShareMaxVirtualMemory: 20000

But if I look for the attributes named in the Nagios alert, they don't exist in that file, for example:

[root@torque-grid ~]# grep GLUE2ComputingShareEstimatedAverageWaitingTime /swareas/dteam/gip-ce-dynamic-info-maui.cache
[root@torque-grid ~]# grep GLUE2ComputingShareEstimatedWorstWaitingTime /swareas/dteam/gip-ce-dynamic-info-maui.cache
[root@torque-grid ~]#
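
Editorial sketch: the same check can be run for several attributes at once. The helper below is only an illustration; the function name missing_attrs is hypothetical, and it assumes the cache is plain LDIF with one "Attribute: value" pair per line. It prints every attribute that never appears in the given file.

```shell
#!/bin/sh
# Hypothetical helper (not part of any middleware package).
# Usage: missing_attrs FILE ATTR...
# Prints each ATTR that has no "ATTR:" line in FILE.
missing_attrs() {
    file=$1
    shift
    for attr in "$@"; do
        # An LDIF attribute line starts with the attribute name and a colon.
        if ! grep -q "^${attr}:" "$file"; then
            echo "$attr"
        fi
    done
}

# Example call, using the path and attribute names quoted in this thread:
# missing_attrs /swareas/dteam/gip-ce-dynamic-info-maui.cache \
#     GLUE2ComputingShareWaitingJobs \
#     GLUE2ComputingShareEstimatedAverageWaitingTime \
#     GLUE2ComputingShareEstimatedWorstWaitingTime
```

An empty result would mean every attribute is present at least once; it says nothing about whether the published values are defaults.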

Pansanel commented 9 years ago

Which version of template-library-grid are you using? A lot of work was done on GLUE2 static publishing in November. Are you sure you are up to date?

jouvin commented 9 years ago

This started as an email exchange with @jas01, and he confirmed to me that he is using 14.10.0. As you said, it is not something we see at other sites, so it is probably due to some local condition, but I guess it comes from an incorrect Quattor configuration at the site.

We are running the same version of glite-info-provider-service as @jas01 so I'll try to run the GLUE validator on our config again to check that we don't have the same problem...

jouvin commented 9 years ago

I have checked and cannot reproduce the problem on our CE... which doesn't mean that there is no problem! I'm trying to think about what could help to troubleshoot the problem...

jouvin commented 9 years ago

From what I can see running glue-validator (*), I have the feeling that something is not working with the GIP cache. This cache is made of 2 files: /swareas/dteam/gip-ce-dynamic-info-maui.cache and /swareas/dteam/gip-ce-dynamic-info-scheduler.cache. The problematic information comes from the latter (gip-ce-dynamic-info-scheduler.cache). Could you check that this file is created and that /var/spool/maui/gip-info-dynamic-scheduler-plugin.sh reports no error when running the dynamic-info-scheduler plugin?

(*) glue-validator -H topbdii.grif.fr -p 2170 -b GLUE2DomainID=OBSPM,GLUE2GroupID=grid,o=glue -k -v 3 egi-profile

jas01 commented 9 years ago

Sorry, I didn't see your email because it was tagged as spam.

From what I can see running glue-validator (*), I have the feeling that something is not working with the GIP cache. This cache is made of 2 files: /swareas/dteam/gip-ce-dynamic-info-maui.cache and /swareas/dteam/gip-ce-dynamic-info-scheduler.cache. The problematic information is coming from the latter one (gip-ce-dynamic-info-scheduler.cache). Could you check that this file is created and that you have not error in /var/spool/maui/gip-info-dynamic-scheduler-plugin.sh when running the dynamic-info-scheduler plugin.

You're right, that is the problem. I was missing the symbolic link between

/var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif

and

/var/glite/static-file-all-CE-pbs.ldif

I created that symlink manually and everything seems correct now.

I don't understand how that link disappeared. I tried running ncm-ncd and it did not create the link. Should I modify my own Quattor template so that it creates this link (for the next time I reinstall the server)?

Regards
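
Editorial sketch: the manual fix described above could be scripted roughly as follows. This is an assumption-laden illustration, not the way the templates handle it: the function name ensure_link is hypothetical, and the link direction (/var/glite/... pointing at /var/lib/bdii/gip/ldif/...) is taken from the ls output posted later in this thread. The function refuses to overwrite a regular file and does nothing if the link is already correct.

```shell
#!/bin/sh
# Hypothetical helper: ensure_link TARGET LINK
# Makes LINK a symbolic link to TARGET unless it already is one.
ensure_link() {
    target=$1
    link=$2
    if [ ! -e "$target" ]; then
        echo "missing target: $target" >&2
        return 1
    fi
    # Already a symlink pointing at the expected target: nothing to do.
    if [ -L "$link" ] && [ "$(readlink "$link")" = "$target" ]; then
        echo "ok: $link"
        return 0
    fi
    # Something that is not a symlink is in the way: leave it alone.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing to replace regular file: $link" >&2
        return 1
    fi
    # Create the link, replacing a stale or dangling symlink if present.
    ln -sfn "$target" "$link" && echo "created: $link -> $target"
}

# Example call with the paths quoted in this thread:
# ensure_link /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif \
#     /var/glite/static-file-all-CE-pbs.ldif
```

A link created this way by hand is not recorded in the Quattor configuration, so the question of where the templates should create it still stands.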

pigay commented 9 years ago

Dear all,

We ran into the same error in midmon (https://midmon.egi.eu/nagios/cgi-bin/extinfo.cgi?host=sitebdii.m3pec.u-bordeaux1.fr&type=2&service=org.bdii.GLUE2-Validate)

We run the same version of lcg-info-dynamic-scheduler-pbs as @jas01, also on the 14.10 templates, but we see no error when running /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce or /var/spool/maui/gip-info-dynamic-scheduler-plugin.sh.

Can you help us?

pierre

pigay commented 9 years ago

Hi,

We investigated our problem a bit further.

We didn't find any occurrence of (for example) GLUE2ComputingShareEstimatedAverageWaitingTime in any of the /usr/libexec/lcg-info-dynamic-* scripts we have on the CE.

we have the following rpms:

We wonder which piece of software /should/ update these entries with the dynamic value...

Thanks in advance,

Pierre

Pansanel commented 9 years ago

Hi Pierre,

Both RPMs are up to date. What is the content of the following file: /etc/glite-ce-glue2/glite-ce-glue2.conf?

Jerome

pigay commented 9 years ago

Dear Jérôme,

This file doesn't exist. We only have a /etc/glite-ce-glue2/glite-ce-glue2.conf.template from glite-ce-cream-utils-1.3.5-1.el6.x86_64.rpm

How is the .conf file supposed to be created?

Thanks,

Pierre

jouvin commented 9 years ago

Have you checked that the problem is not the same as for @jas01, the missing symlink? This symlink is needed because of a bug introduced in lcg-info-dynamic-scheduler that was recently fixed (but the fix may not be released yet).

glite-ce-glue2.conf should exist but in /etc/bdii/gip.

To troubleshoot this kind of problem, you may want to look at /var/log/maui-monitoring.ncm-cron.log. BTW, are you running the CE and the batch system on the same machine? Do you have GIP_CE_USE_CACHE true or false in your configuration? If true, you should have two plugins in /var/lib/bdii/gip/plugin/ on your CE that are just a cat command of some file: check the existence and content of these files.
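
Editorial sketch: the last check above (plugins that only cat a cache file) can be automated along these lines. This is an illustration under assumptions: the function name check_cat_plugins is hypothetical, the plugins are assumed to contain lines of the form "cat FILE", and the file paths are assumed to contain no spaces or wildcard characters. For each plugin it reports whether every file it cats exists and is non-empty.

```shell
#!/bin/sh
# Hypothetical helper: check_cat_plugins DIR
# For every plugin file in DIR, extract the arguments of its "cat" lines
# and report whether each of those files exists and is non-empty.
check_cat_plugins() {
    dir=$1
    status=0
    for plugin in "$dir"/*; do
        [ -f "$plugin" ] || continue
        # Keep only what follows a leading "cat " on each line.
        for cache in $(sed -n 's/^[[:space:]]*cat[[:space:]][[:space:]]*//p' "$plugin"); do
            if [ -s "$cache" ]; then
                echo "ok: $plugin -> $cache"
            else
                echo "missing or empty: $plugin -> $cache"
                status=1
            fi
        done
    done
    return $status
}

# Example call with the directory named in this thread:
# check_cat_plugins /var/lib/bdii/gip/plugin
```

The function returns a non-zero status if any referenced file is missing or empty, so it can be used in a cron or monitoring wrapper. It does not check how old the cache files are.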

pigay commented 9 years ago

Dear Michel, thanks for helping. Sorry for the late reply, I was on holiday.

The symlink looks correct:

[root@ce0 ~]# ls -al /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif /var/glite/static-file-all-CE-pbs.ldif
lrwxrwxrwx 1 root root    50 17 avril 16:49 /var/glite/static-file-all-CE-pbs.ldif -> /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif
-rw-r--r-- 1 ldap ldap 69924 24 déc.  14:46 /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif

/etc/bdii/gip/glite-ce-glue2.conf is present and looks OK, as far as I can see. The file /etc/glite-ce-glue2/glite-ce-glue2.conf mentioned by @Pansanel is missing, however.

In /var/log/maui-monitoring.ncm-cron.log, I see that the cron runs every 5 minutes without errors.

We run 2 CEs, each with its own batch system (and a separate set of WNs). GIP_CE_USE_CACHE is set to true in our templates.

Concerning the plugins, we have:

[root@ce0 ~]# ll /var/lib/bdii/gip/plugin/
total 12
-rwxr-xr-x 1 ldap ldap 49 14 mai    2014 lcg-info-dynamic-ce
-rwxr-xr-x 1 ldap ldap 54 14 mai    2014 lcg-info-dynamic-scheduler-wrapper
-rwxr-xr-x 1 ldap ldap 97 14 mai    2014 lcg-info-dynamic-software-wrapper

Each file contains a cat command (and runs without errors)...

jrha commented 9 years ago

I don't mind this kind of discussion here, but is this problem specific to the grid template library? If not, is there a better forum for this where more people could benefit from the discussion?

pigay commented 9 years ago

Dear @jrha, you're probably right. But I think I got it.

@jouvin: by the way, I noticed a thread I had missed on the quattor-grid mailing list, related to #113. We are not synchronized with this PR (we are on umd3-14.10, I think), so it seems normal that we are missing GLUE2 information...

I'll try to get the correct version and post in quattor-grid if I have trouble.

Sorry for the noise.

pigay commented 9 years ago

In the end, it turns out to be a template-library-grid issue.

As @jas01 stated before I polluted this thread, this is a missing symlink problem (I had the same issue with /var/glite/ComputingShare.ldif).

I couldn't find how these links are created in the templates.

jrha commented 9 years ago

☺

jouvin commented 9 years ago

I had a look at the problem and the proposed fix, and I am not completely sure this is the right solution. I now remember that the problem comes from lcg-info-dynamic-scheduler, which in EMI3 releases ignored ldif_file and ldif_file_glue2 in its config file (/etc/bdii/gip/lcg-info-dynamic-scheduler-pbs.conf): see https://ggus.eu/index.php?mode=ticket_info&ticket_id=110336 for the explanation. I am inclined to reject the PR and consider the new version of lcg-info-dynamic-scheduler to be the correct fix. I'll try to assess this before the 15.4 release...

jouvin commented 9 years ago

See #137 for the discussion explaining why it was a workaround but not a real fix for the problem introduced by the need to work around a bug in lcg-info-dynamic-scheduler. #141 should be the real fix.

jouvin commented 9 years ago

I have been very bad at following this issue, my apologies... Anyway, just checking the status of everything after comment https://github.com/quattor/template-library-grid/pull/141#issuecomment-99881571, I realize that there may be some ambiguity about whether you are running a fixed version of the GIP plugin. The real culprit is not lcg-info-dynamic-scheduler-pbs but the underlying module dynsched-generic. The problem is fixed in 2.5.5-1. As for the short term, since we are very close to the 15.4 release, without enough time to properly test something that is difficult to test (some effects take time to show up), I suggest that we merge #137 (which is an acceptable workaround) and work quietly on #141 afterwards... Any objection?

jouvin commented 9 years ago

In fact, the required version of dynsched-generic has been released only in the EMI repos (a while ago)... I am trying to understand the real status...

jouvin commented 9 years ago

For the record, the new version with the bug fix required for this PR will be released by UMD at the end of May. So let's delay it until 15.6.

jrha commented 9 years ago

#137 is a workaround, the real fix should be incorporated into #141.

jouvin commented 8 years ago

After closing #141 (see comments inside), I propose to restart the work on this issue with the latest version of GIP CE as provided in #162 and intended to be released in 16.2. The fixed UMD component should have been available for a long time now, so I suggest removing the workarounds and properly integrating the working version.

jrha commented 8 years ago

@jouvin - #162 is merged, what is left to do?

jrha commented 8 years ago

@jouvin ping?

jrha commented 8 years ago

@jouvin and @jas01 if this is fixed please close this issue, otherwise please move the milestone to 16.4.