saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

incorrect agent directory #155

Closed andre-merzky closed 10 years ago

andre-merzky commented 10 years ago

I am creating a compute pilot via ssh://localhost, and end up with an agent directory in /home/merzky/saga/saga-pilot/~/bj-7e5dc3a4-3a4c-11e3-a846-00231582da34 -- note the tilde in between. This looks incorrect, and in fact it took me a while to find it. I don't think I specify this anywhere, but I am not sure how BJ derives that location. What am I / is BJ doing wrong?

melrom commented 10 years ago

Andre,

What is the working directory you attempted to set it to? You didn't use an env did you?

Melissa Romanus RADICAL The Cloud and Autonomic Computing Center Electrical and Computer Engineering Dept. Rutgers University Email: melissa@cac.rutgers.edu


andre-merzky commented 10 years ago

[Damned, answered by mail which does not show up here :/]

Hi Melissa,

on BJ support duty again? thanks! :)

What is the working directory you attempted to set it to?

I don't think I set anything. This is my minimal pilot description:

{'number_of_processes': 10, 'service_url': 'ssh://localhost'}

You didn't use an env did you?

Not that I can see:

 (ve)merzky@thinkie:~/saga/saga-pilot (feature/bigjob *%) $ env | grep -i bigj
COORDINATION_URL=redis://IWontTellYouThis@gw68.quarry.iu.teragrid.org:6379
REDIS_URL=redis://IWontTellYouThis@gw68.quarry.iu.teragrid.org:6379
REDIS_PASSWORD=IWontTellYouThis
BIGJOB_VERBOSE=100

What should I be looking for?

Best, Andre.

andre-merzky commented 10 years ago

I would like to up the priority on this one -- I caught myself at least twice today almost typing rm -rf ~ :( This is a disaster waiting to happen...

AndreL, any idea what is going on, and where I can look?

Thanks, Andre.
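For anyone who needs to clean up such a stray directory in the meantime, here is a minimal sketch that avoids the shell altogether. It relies on the fact that Python's os and shutil functions never expand '~'. The helper name remove_literal_tilde_dir is made up for this illustration and is not part of BigJob.

```python
import os
import shutil
import tempfile


def remove_literal_tilde_dir(parent):
    """Remove a directory literally named '~' inside `parent`.

    Hypothetical helper: os.path.join() and shutil.rmtree() treat '~' as an
    ordinary name, so the joined path points at the stray directory.
    """
    target = os.path.join(os.path.abspath(parent), "~")
    # Extra guard: refuse to continue if the path resolves to the home directory.
    if os.path.realpath(target) == os.path.realpath(os.path.expanduser("~")):
        raise ValueError("refusing to remove the home directory: " + target)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False


# Demonstration in a throwaway directory, so nothing real is touched.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "~", "bj-example"))
removed = remove_literal_tilde_dir(scratch)
still_there = os.path.exists(os.path.join(scratch, "~"))
print(removed, still_there)
shutil.rmtree(scratch)
```

In a shell, writing the name as ./~ or quoting it as '~' likewise prevents tilde expansion.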

oleweidner commented 10 years ago

Hi @andre-merzky, most definitely this comes from here: https://github.com/saga-project/BigJob/blob/master/bigjob/bigjob_manager.py#L231. This means that @drelu will argue that it's SAGA-Python's fault ;-) He would be right since Bliss actually did expand '~'.

Unless I'm completely wrong with my assessment, this probably needs to be fixed in the SAGA-Python SSH adaptor...

andre-merzky commented 10 years ago

Hi Ole,

That would be easy :) Alas, the job description we are getting from bigjob already contains the bogus path. Dumping the job description after this line (https://github.com/saga-project/BigJob/blob/master/bigjob/bigjob_manager.py#L373):

        ##############################################################################
        # Create and submit pilot job to job service
        logger.debug("Creating pilot job with description: %s" % str(jd))
        jd._attributes_dump ()
        self.job = self.js.create_job(jd)
        logger.debug("Trying to submit pilot job to: " + str(lrms_saga_url))

I see

10/22/2013 09:24:42 AM - bigjob - DEBUG - Creating pilot job with description: <class 'saga.job.description.Description'> <bound method Description.as_dict of <saga.job.description.Description object at 0x2789b50>>
---------------------------------------
<class 'saga.job.description.Description'>
---------------------------------------
 Extensible                     : True
 Private                        : True
 CamelCasing                    : True
---------------------------------------
'Registered' attributes
 CandidateHosts                 [string, vector, writeable,   0]: None
 Cleanup                        [  bool, scalar, writeable,   0]: None
 CPUArchitecture                [  enum, scalar, writeable,   0]: None
 Environment                    [string,   dict, writeable,   0]: None
 FileTransfer                   [string, vector, writeable,   0]: None
 Input                          [string, scalar, writeable,   0]: None
 Interactive                    [  bool, scalar, writeable,   0]: None
 JobContact                     [string, vector, writeable,   0]: None
 JobStartTime                   [  time, scalar, writeable,   0]: None
 Name                           [string, scalar, writeable,   0]: None
 NumberOfProcesses              [   int, scalar, writeable,   0]: None
 OperatingSystemType            [  enum, scalar, writeable,   0]: None
 ProcessesPerHost               [   int, scalar, writeable,   0]: None
 Project                        [string, scalar, writeable,   0]: None
 Queue                          [string, scalar, writeable,   0]: None
 ThreadsPerProcess              [   int, scalar, writeable,   0]: None
 TotalPhysicalMemory            [string, scalar, writeable,   0]: None
---------------------------------------
'Existing' attributes
 _env_is_list                   [   any, scalar, writeable,   0]: False
 Arguments                      [string, vector, writeable,   0]: ['python', '-c', '\'import sys\nimport os\nimport urllib\nimport sys\nimport time\nstart_time = time.time()\nhome = os.environ.get("HOME")\n#print "Home: " + home\nif home==None: home = os.getcwd()\nBIGJOB_AGENT_DIR= os.path.join(home, ".bigjob")\nif not os.path.exists(BIGJOB_AGENT_DIR): os.mkdir (BIGJOB_AGENT_DIR)\nBIGJOB_PYTHON_DIR=BIGJOB_AGENT_DIR+"/python/"\nif not os.path.exists(BIGJOB_PYTHON_DIR): os.mkdir(BIGJOB_PYTHON_DIR)\nBOOTSTRAP_URL="https://raw.github.com/saga-project/BigJob/master/bootstrap/bigjob-bootstrap.py"\nBOOTSTRAP_FILE=BIGJOB_AGENT_DIR+"/bigjob-bootstrap.py"\n#ensure that BJ in .bigjob is upfront in sys.path\nsys.path.insert(0, os.getcwd() + "/../")\np = list()\nfor i in sys.path:\n    if i.find(".bigjob/python")>1:\n          p.insert(0, i)\nfor i in p: sys.path.insert(0, i)\nprint "Python path: " + str(sys.path)\nprint "Python version: " + str(sys.version_info)\ntry: import saga\nexcept: print "SAGA not found.";\ntry: import bigjob.bigjob_agent\nexcept: \n    print "BigJob not installed. Attempt to install it."; \n    try:\n        opener = urllib.FancyURLopener({}); \n        opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE);\n    except Exception, ex:\n        print "Unable to download bootstrap script: " + str(ex) + ". Please install BigJob manually."\n    print "Execute: " + "python " + BOOTSTRAP_FILE + " " + BIGJOB_PYTHON_DIR\n    os.system("/usr/bin/env")\n    try:\n        os.system("python " + BOOTSTRAP_FILE + " " + BIGJOB_PYTHON_DIR); \n        activate_this = os.path.join(BIGJOB_PYTHON_DIR, "bin/activate_this.py"); \n        execfile(activate_this, dict(__file__=activate_this))\n    except:\n        print "BJ installation failed. 
Trying system-level python (/usr/bin/python)";\n        os.system("/usr/bin/python " + BOOTSTRAP_FILE + " " + BIGJOB_PYTHON_DIR); \n        activate_this = os.path.join(BIGJOB_PYTHON_DIR, "bin/activate_this.py"); \n        execfile(activate_this, dict(__file__=activate_this))\n#try to import BJ once again\ntry:\n    import bigjob.bigjob_agent\nexcept Exception, ex:\n        print "Unable install BigJob: " + str(ex) + ". Please install BigJob manually."   \n# execute bj agent\nargs = list()\nargs.append("bigjob_agent.py")\nargs.append("redis://ILikeBigJob_wITH-REdIS@gw68.quarry.iu.teragrid.org:6379")\nargs.append("bigjob:bj-01a91670-3aeb-11e3-aed6-00231582da34:localhost")\nargs.append("PilotComputeServiceQueue-pcs-0142a2be-3aeb-11e3-aed6-00231582da34")\nprint "Bootstrap time: " + str(time.time()-start_time)\nprint "Starting BigJob Agents with following args: " + str(args)\nbigjob_agent = bigjob.bigjob_agent.bigjob_agent(args)\n\'']
 Error                          [string, scalar, writeable,   0]: /home/merzky/saga/saga-pilot/~/stderr-bj-01a91670-3aeb-11e3-aed6-00231582da34-agent.txt
 Executable                     [string, scalar, writeable,   0]: /usr/bin/env
 Output                         [string, scalar, writeable,   0]: /home/merzky/saga/saga-pilot/~/stdout-bj-01a91670-3aeb-11e3-aed6-00231582da34-agent.txt
 SPMDVariation                  [  enum, scalar, writeable,   0]: single
 TotalCPUCount                  [   int, scalar, writeable,   0]: 10
 WallTimeLimit                  [   int, scalar, writeable,   0]: 3600
 WorkingDirectory               [string, scalar, writeable,   0]: /home/merzky/saga/saga-pilot/~
---------------------------------------
'Extended' attributes
---------------------------------------
'Deprecated' attributes (aliases)
---------------------------------------
10/22/2013 09:24:43 AM - bigjob - DEBUG - Trying to submit pilot job to: ssh://localhost

with specifically

 WorkingDirectory               [string, scalar, writeable,   0]: /home/merzky/saga/saga-pilot/~

which cannot be sanely expanded in saga-python. FWIW, the shell adaptors should be able to expand ~ w/o troubles -- all paths are interpreted by, well, the shell (they are shell adaptors after all :)
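To illustrate the point about shell expansion, here is a small sketch that asks /bin/sh to expand two paths. It assumes a POSIX shell at /bin/sh; the helper shell_echo and the bj-example path component are made up for this example. A POSIX shell only expands a tilde at the start of a word, which is why the embedded one is beyond repair.

```python
import os
import subprocess
import tempfile


def shell_echo(word, home):
    """Let /bin/sh expand `word` with HOME set to `home` and return the result."""
    env = dict(os.environ, HOME=home)
    out = subprocess.check_output(["/bin/sh", "-c", "echo " + word], env=env)
    return out.decode().strip()


# Use a temporary directory as HOME so the result is predictable.
fake_home = tempfile.mkdtemp()

# A leading tilde is replaced by the value of HOME.
leading = shell_echo("~/bj-example", fake_home)

# A tilde in the middle of a path is left untouched.
embedded = shell_echo("/some/dir/~/bj-example", fake_home)

print(leading)
print(embedded)
os.rmdir(fake_home)
```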

Best, Andre.

melrom commented 10 years ago

Andre, I'm sure you've seen it, but for the record in this ticket: the offending code is in bigjob_manager.py (line 231). It is reached when the working directory is set to None (i.e. not specified):

    else:
        # if no working dir is set assume use home directory
        # will fail if home directory is not the same on remote machine
        # but this is just a guess to avoid failing
        self.working_directory = "~"

I'd be happy to fix this if you have a better suggestion than tilde!

andre-merzky commented 10 years ago

Hi Melissa,

Ole pointed that line out earlier (see comments above). But I don't mind the '~' itself, that is fine -- I mind that BigJob somewhere, somehow converts that to <pwd>/~ which is not expandable to $HOME anymore... I looked through the code (around that line), but could not really see why that happens -- there are 4 or 5 variables floating around in that section which all seem to describe some form or part of the agent directory... :/

If nobody knows what's happening there, I can start to really debug it...
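One way a bare '~' can turn into <pwd>/~ is when it is made absolute before being expanded. This is only a guess at the mechanism; nothing in this thread confirms which call in BigJob is responsible. A minimal sketch of the effect:

```python
import os

working_directory = "~"

# os.path.abspath() treats "~" as an ordinary relative name and prefixes the
# current working directory, giving <pwd>/~.
absolute = os.path.abspath(working_directory)

# Expanding the tilde first yields the home directory instead.
expanded = os.path.abspath(os.path.expanduser(working_directory))

print(absolute)
print(expanded)
```

Once the path has the <pwd>/~ form, the tilde is no longer at the start and neither os.path.expanduser() nor a shell will expand it.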

melrom commented 10 years ago

Whoops, yeah. Don't seem to see that in the code anywhere yet. But do you happen to know what os.mkdir() does when passed a ~?

melrom commented 10 years ago

Yeah andre:

From stackoverflow: Re: os.mkdir(): "You can't simply use ~. You must use os.path.expanduser to replace the ~ with a proper path." That's my guess. Could be way off though.
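A quick way to check what os.mkdir() does with a tilde, run in a throwaway directory so nothing real is touched (a sketch, not taken from the BigJob code):

```python
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()
previous_cwd = os.getcwd()
os.chdir(scratch)
try:
    # os.mkdir() does no tilde expansion: this creates a directory that is
    # literally named "~" in the current working directory.
    os.mkdir("~")
    literal_created = os.path.isdir(os.path.join(scratch, "~"))
finally:
    # Restore the working directory and clean up the scratch area.
    os.chdir(previous_cwd)
    shutil.rmtree(scratch)

print(literal_created)
```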

andre-merzky commented 10 years ago

The problem is not really os.mkdir() -- the dir is created in SAGA as working dir for the pilot, and saga is flawless ;) But really, if we got ~/agent-uuid/, we could create the correct dir, no problem. But we get /home/merzky/saga/saga-pilot/~/agent-uuid/ -- and thus cannot expand it anymore. So, the problem is not mkdir, but how that path is constructed.

I am not sure why I am the only one seeing this, though. I am using 0.50c-10-g01f5efc, FWIW.

melrom commented 10 years ago

Okay but....

I was not referring to saga-python...

I am referring to the BigJob code:

        if not os.path.isdir(working_directory) \
            and (lrms_saga_url.scheme.startswith("fork") or
                 lrms_saga_url.scheme.startswith("condor")) \
            and working_directory.startswith("go:")==False:
            os.mkdir(working_directory)
        self.working_directory = working_directory

While your code should "not" be hitting this condition (since working_directory should be None), I still like this as the culprit (but I promise to stop guessing and defer to AL after this post). You could obviously test this by printing what working_directory is right after the if/else statement.

Also, I mean, most people define their working directory, so this is likely why no one else is having a problem? I can test the same thing if you want...

Okay, I'm done. :)
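As a workaround in the meantime, the working directory can be set explicitly in the pilot description, so the '~' default is never used. A minimal sketch: the key name working_directory is an assumption based on the variable name in bigjob_manager.py, and the bigjob-agent directory name is made up for illustration.

```python
import os

# Build an absolute path up front so nothing downstream has to interpret '~'.
# "bigjob-agent" is a hypothetical directory name.
agent_dir = os.path.join(os.path.expanduser("~"), "bigjob-agent")

pilot_compute_description = {
    "service_url": "ssh://localhost",
    "number_of_processes": 10,
    "working_directory": agent_dir,  # assumed key name
}

print(pilot_compute_description["working_directory"])
```

This works for ssh://localhost because the home directory is the same on both ends. For a real remote host, the path would have to be valid on the remote machine.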


andre-merzky commented 10 years ago

Thanks AndreL, can confirm fixed in master branch. Can that be merged into devel, please (not sure about the merge policies, happy to merge myself if that is ok)?

drelu commented 10 years ago

Yes, that's ok. Andre


andre-merzky commented 10 years ago

done.