scarlehoff / pyHepGrid

Tool for distributed computing management geared towards HEP applications.

Code Duplication #46

Closed: JBlack93 closed this issue 4 years ago

JBlack93 commented 4 years ago

There is a HUGE amount of code duplication between the different run scripts. This should be fixed if possible, so that there is less code to maintain.

For example, see #45 and #40: a pause in the copy was implemented for both hejrun.py and sherparun.py, and only later ported to nnlojet.py.
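For context, the duplicated logic looks roughly like the following. This is a minimal sketch with hypothetical names and retry parameters; it is not the actual code in hejrun.py/sherparun.py/nnlojet.py.

```python
import subprocess
import time

# Minimal sketch of the copy-with-pause logic discussed in #45/#40; the
# function name and retry parameters are illustrative, not the real code.
def copy_with_pause(source, destination, retries=3, pause_seconds=60):
    """Attempt a grid copy, pausing between retries to avoid hammering the server."""
    for attempt in range(1, retries + 1):
        result = subprocess.run(["gfal-copy", source, destination])
        if result.returncode == 0:
            return True
        print(f"Copy attempt {attempt}/{retries} failed; pausing {pause_seconds}s")
        time.sleep(pause_seconds)
    return False
```

Each run script currently carries its own copy of this kind of helper, which is why a fix in one script has to be manually ported to the others.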

jcwhitehead commented 4 years ago

Don't think there's a straightforward way around this - ARC only allows us to upload a single driver script, which has to download and run everything else, hence the duplication of the copy (etc.) functionality between scripts. The only alternative I can immediately see would be concatenating a utilities.py script onto each Xrun.py at pyHepGrid runtime, but I'm not sure it's worth it. @marianheil - any ideas?

marianheil commented 4 years ago

I also can't really think of a better solution than joining the files together in pyHepGrid. It's a bit ugly, but probably not hard to do, and reliable. The alternatives would be either to upload a file somewhere (inconvenient and error-prone) or to submit multiple files with the job (is that possible? ARC has the inputFiles option, http://www.nordugrid.org/documents/xrsl.pdf, but what about DIRAC and Slurm?).
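For reference, the inputFiles route would mean generating a job description along these lines. A minimal sketch, assuming the xRSL syntax from the spec linked above; the file names are placeholders, and the equivalent for DIRAC/Slurm would still need checking.

```python
# Illustrative xRSL job description using ARC's inputFiles attribute
# (http://www.nordugrid.org/documents/xrsl.pdf); file names are placeholders.
xrsl = """&
 (executable = "Xrun.py")
 (inputFiles = ("Xrun.py" "") ("utilities.py" ""))
 (outputFiles = ("output.tar.gz" ""))
"""
with open("job.xrsl", "w") as job_file:
    job_file.write(xrsl)
```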

jcwhitehead commented 4 years ago

We can make a general plan to hive off the utility functions and concatenate at runtime (the idea is sketched after this comment), but I suggest we give ourselves a generous timescale (if it ain't broke, etc.) and implement some tests first - one of which would be that the Xrun.py created and sent actually is the file we want it to be.

In general I'd say that maintaining the Xrun.py files is currently the easiest thing to do; we each test our own so can assess the impact of our changes. Changing the backend code is currently hard because we can't test for undesired effects on other users.
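A minimal sketch of the concatenate-at-submission idea together with the test described above; all file and function names here are hypothetical, not existing pyHepGrid API.

```python
from pathlib import Path

# Hypothetical sketch: prepend the shared utilities to a project-specific run
# script at submission time, producing the single Xrun.py that ARC accepts.
def build_runscript(runscript="hejrun.py", utilities="utilities.py",
                    output="Xrun.py"):
    combined = Path(utilities).read_text() + "\n" + Path(runscript).read_text()
    Path(output).write_text(combined)
    return output

# The kind of test suggested above: the file created and sent is the one we
# want it to be, i.e. exactly the intended concatenation.
def test_built_runscript():
    out = build_runscript()
    expected = (Path("utilities.py").read_text() + "\n"
                + Path("hejrun.py").read_text())
    assert Path(out).read_text() == expected
```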

GandalfTheWhite2 commented 4 years ago

But it is broken.

scarlehoff commented 4 years ago

All systems (usually) have ways of submitting multiple files, but in the past we found that to be unreliable: it roughly doubled the number of jobs that failed. The best solution was to have one single runscript, with any extra files taken from the storage systems.

I should also say these were tests we did back when we were using ganga, so ~2015/6?

In any case, as @marianheil pointed out, any alternative to concatenation is bug-prone in one way or another.

jcwhitehead commented 4 years ago

@GandalfTheWhite2 Could you elaborate on which aspects you think are broken?

The duplication of utility functions across the multiple Xrun.py files is potentially inconvenient for maintainers, but I'm not sure it breaks anything.

GandalfTheWhite2 commented 4 years ago

Currently, each job script is broken (see e.g. issue #44), and we cannot give ourselves a generous timescale to fix it (specifically, this is not a case of "if it ain't broke, don't fix it"). The fact that much project-independent code resides within the project-specific Xrun.py files will delay the propagation of fixes.

Xrun.py has been identified as a source of DDOS - this has to be fixed urgently. But since Xrun.py is a mix of problematic "system" code and user-specific code, the fix will require each individual user to merge the required user changes with the necessary changes to the system code. This approach is prone to delays, oversights and further problems. If code associated with pyhepgrid continues to be a source of DDOS, the sites will have no option but to block it from running.

One option is to release the system side of pyhepgrid on cvmfs, and have much smaller user-specific scripts submitted as part of the run (a sketch of what that could look like follows this comment). pyhepgrid is a small code base, so there is no problem in having several versions on cvmfs to choose from. Another option is to have the system files installed on a grid disk, just as the actual executable is.

Experience with disks from the 2015/2016 tests should be disregarded; several things have changed since then.
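To illustrate the cvmfs option mentioned above, the submitted user script could reduce to something like the sketch below. The cvmfs path, version string and module names are assumptions for illustration, not an existing deployment.

```python
import sys

# Hypothetical cvmfs layout: the version string selects which pyhepgrid
# system code the job runs against; the path is illustrative only.
PYHEPGRID_VERSION = "1.0.0"
sys.path.insert(0, f"/cvmfs/pheno.example.org/pyHepGrid/{PYHEPGRID_VERSION}")

import utilities  # shared system-side code, maintained centrally on cvmfs

def main():
    # Only the user/project-specific steps live in the submitted script.
    utilities.copy_from_grid("input.tar.gz", ".")
    # ... run the project-specific executable here ...
    utilities.copy_to_grid("output.tar.gz", "results/")

if __name__ == "__main__":
    main()
```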

jcwhitehead commented 4 years ago

Hi Jeppe. I implemented the suggestions made in #44 in #43 yesterday afternoon, for all the job scripts; they were reviewed by @marianheil this morning and have since been pushed to master.

I'm not sure there's a problem with maintaining the utilities code in the Xrun.py files, since copy-pasting any changes made suffices (or indeed the whole utilities block, as in #43). Typically we've updated the three in parallel to minimise divergence.

As for 'DDOS', I've been down to speak to Paul. The issue raised in #44 is a theoretical one; I don't think it's correct to say that 'Xrun.py has been identified as a source of DDOS'. Paul's happy with the improvements we've made, has suggested some more, and would like us to continue to run with pyHepGrid (and send him some job IDs when we do) so he can eventually establish or rule out pyHepGrid as a possible factor in some intermittent problems seen with the disk server.

I hope that clears this up!

GandalfTheWhite2 commented 4 years ago

I asked this question amongst our user base a week ago, and haven't had any reports of copy failures back. I would be interested in hearing the experience of a wider user base: has anyone seen problems with gfal-copy when executed from the command line? Try it 100 times (a sketch of such a trial follows this comment). If there are no failures (except when the servers have been brought down by a DDOS), but one experiences significant failure rates from within pyhepgrid, then I will chance the suggestion that the problem lies with the script itself....

The issue is not 'theoretical' in the sense of 'non-applied'. Our servers have been repeatedly taken down while many instances of Xrun.py have been running. "Happy with the improvements made" != "happy with how it was before".
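For anyone who wants to run the trial suggested above, a minimal sketch; the source and destination URLs are placeholders to be filled in.

```python
import subprocess

# Sketch of the 100-trial gfal-copy check; replace the placeholder URLs
# with a real storage element path and local destination before running.
SOURCE = "gsiftp://some-se.example.org/path/to/testfile"
DESTINATION = "file:///tmp/testfile"

failures = 0
for trial in range(100):
    result = subprocess.run(["gfal-copy", "-f", SOURCE, DESTINATION],
                            capture_output=True)
    if result.returncode != 0:
        failures += 1
print(f"{failures} failures out of 100 gfal-copy attempts")
```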