scarlehoff / pyHepGrid

Tool for distributed computing management geared towards HEP applications.
GNU General Public License v3.0

Large copy time and non-zero exit status. GSIFTP dependence #54

Closed JBlack93 closed 4 years ago

JBlack93 commented 4 years ago

We have a dependence on gsiftp which results in large copy times and non-zero exit status if GSIFTP is down for any reason.

file_present = test_file_presence(outfile, args, protocol="gsiftp")
if retval == 0 and file_present:
    return retval
elif retval == 0 and not file_present:
    print_flush("Copy command succeeded, but failed to copy file. Retrying.")

After gfal-copy cycles through the appropriate protocols, we check for a successful copy with gfal-ls. Currently this check is hard-coded to use the gsiftp protocol. If the gsiftp server is down this causes a huge delay (the gfal-ls timeout is currently 1800s). This resulted in a "copy time" of 1 day 7 hours for HEJ over this weekend.

Two-fold solution:

This issue exists through all "run" files.

scarlehoff commented 4 years ago

The right thing would be to check which protocol works and use that one instead of randomizing. Also, 1800s is a long-ish timeout for just checking for the existence of a file I would say...
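To illustrate the timeout point, here is a minimal sketch of a presence check that gives up quickly. It is not the pyHepGrid implementation: the name `file_present`, the 60s default and the `cmd` parameter are assumptions made for illustration, and the timeout is enforced by Python's `subprocess` rather than by any gfal option.

```python
import subprocess


def file_present(url, timeout=60, cmd=("gfal-ls",)):
    """Return True if `cmd url` exits with status 0 within `timeout` seconds.

    Hypothetical helper: the name, the default timeout and the `cmd`
    parameter are illustrative and not taken from pyHepGrid.
    """
    try:
        result = subprocess.run(
            [*cmd, url],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung service or a missing executable both count as "not present".
        return False
    return result.returncode == 0
```

With a check like this, a dead service costs at most `timeout` seconds per attempt instead of 1800s.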

JBlack93 commented 4 years ago

Agreed.

The protocol is only used for gfal-ls, so all the protocols currently used (xroot, gsiftp, srm) should work unless there is a problem with the corresponding service at the time.

We could simply use the protocol used for the copy, which would cycle it as needed and avoid an unnecessary import random.
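A rough sketch of that idea, assuming the copy and the check are passed in as callables. The function name `copy_with_verification` and the `copy`/`check` signatures are hypothetical, chosen only to show the control flow, and do not mirror the actual run files.

```python
def copy_with_verification(copy, check, protocols=("xroot", "gsiftp", "srm")):
    """Return the first protocol for which copy and check both succeed, else None.

    Hypothetical interface for illustration:
      copy(protocol)  -> exit status, 0 on success
      check(protocol) -> True if the copied file is visible
    """
    for protocol in protocols:
        if copy(protocol) != 0:
            continue  # copy failed with this protocol, try the next one
        if check(protocol):
            return protocol  # verified with the same protocol used for the copy
        # Copy reported success but the file is not visible: fall through and retry.
    return None
```

Because the check reuses the protocol that has just completed a copy, a gsiftp outage is never probed on its own, and no random choice of protocol is needed.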

jcwhitehead commented 4 years ago

@JBlack93 - were you running with the new copy log feature enabled - and if so, would you mind posting the log?

JBlack93 commented 4 years ago

Unfortunately not (I'll make sure to have this enabled in future runs).

However, it was clear in this case, from the stderr (and confirmation from the sysadmins), that the GSIFTP service went down over the weekend, highlighting this particular issue.

jcwhitehead commented 4 years ago

Pity, as that would have been really useful.

Do you know if the gfal-ls command exited with an error while the gsiftp service was down?

JBlack93 commented 4 years ago

From stderr:

gfal-sum error: 110 (Connection timed out) - Operation timed out
gfal-sum error: 110 (Connection timed out) - Operation timed out
gfal-sum error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-copy error: 70 (Communication error on send) - DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send, err: [SE][PrepareToPut][] httpg://se01.dur.scotgrid.ac.uk/srm/managerv2: CGSI-gSOAP running on n127.dur.scotgrid.ac.uk reports Error reading token data: Connection closed

gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
Command timed out after 1800 seconds!
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out
gfal-ls error: 110 (Connection timed out) - Operation timed out

jcwhitehead commented 4 years ago

Hi James, Marian and I have just had a look at this - is there further log output you chopped off the beginning? It'd be good to see the first errors that arose.

JBlack93 commented 4 years ago

I only chopped the python version number: stderr.log stdout.log

jcwhitehead commented 4 years ago

Cheers @JBlack93, it was the gfal-sum errors at the top that threw me. @marianheil and I went through both the standard error and the standard output and caught a few other possible bugs. Pull request #55 should fix them.