ubccr / hpc-toolset-tutorial

Tutorial for installing Open XDMoD, OnDemand, & ColdFront
GNU General Public License v3.0

pearc21 xdmod ood integration #94

Closed: johrstrom closed this issue 3 years ago

johrstrom commented 3 years ago

The instructions for integrating XDMoD and OOD are on the OOD message of the day (and in the message of the day when you ssh into OOD).

It's fairly straightforward, and it seems to imply that everything is already set up on the XDMoD side and that the only modifications needed are to OOD itself.

In any case, when I try to run this command I get these errors. The API has the same problem: it returns errors saying the MySQL db isn't up.

[hpcadmin@xdmod ~]$ sudo -u xdmod /srv/xdmod/scripts/shred-ingest-aggregate-all.sh
2021-07-13 16:20:31 [notice] xdmod-slurm-helper start (process_start_time: 2021-07-13 16:20:31)
2021-07-13 16:20:31 [critical] Failed to create database connection: SQLSTATE[HY000] [2002] Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) (stacktrace: #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(88): PDO->__construct('mysql:host=loca...', 'xdmod', '')
#1 /usr/share/xdmod/classes/CCR/DB.php(111): CCR\DB\PDODB->connect()
#2 /usr/bin/xdmod-slurm-helper(137): CCR\DB::factory('shredder')
#3 /usr/bin/xdmod-slurm-helper(21): main()
#4 {main})
2021-07-13 16:20:32 [critical] SQLSTATE[HY000] [2002] Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) (stacktrace: #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(88): PDO->__construct('mysql:host=loca...', 'xdmod', '')
#1 /usr/share/xdmod/classes/CCR/DB.php(111): CCR\DB\PDODB->connect()
#2 /usr/bin/xdmod-ingestor(207): CCR\DB::factory('hpcdb')
#3 /usr/bin/xdmod-ingestor(21): main()
#4 {main})
2021-07-13T16:20:32.208 [INFO] archive indexer starting
2021-07-13T16:20:32.224 [ERROR] [Errno 2] No such file or directory: '/data/pcp-logs/my_cluster_name'
Traceback (most recent call last):
  File "/bin/indexarchives.py", line 11, in <module>
    load_entry_point('supremm==1.4.0', 'console_scripts', 'indexarchives.py')()
  File "/usr/lib64/python2.7/site-packages/supremm/indexarchives.py", line 473, in runindexing
    logging.debug("processed archive %s (fileio %s, dbacins %s)", archivefile, parse_end - start_time, db_end - parse_end)
  File "/usr/lib64/python2.7/site-packages/supremm/indexarchives.py", line 368, in __exit__
    dbac = XDMoDArchiveCache(self.config)
  File "/usr/lib64/python2.7/site-packages/supremm/xdmodaccount.py", line 316, in __init__
    self.con = getdbconnection(self.dbconfig)
  File "/usr/lib64/python2.7/site-packages/supremm/scripthelpers.py", line 53, in getdbconnection
    return MySQLdb.connect(**dbargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)")

johrstrom commented 3 years ago

I know you have changes in a draft PR - so maybe you're fixing the same in those, I'm not sure - I just thought I'd make the ticket on the off chance that you're unaware of the issues.
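Every failure in the log above comes back to the same missing socket, `/var/lib/mysql/mysql.sock`. As a minimal sketch, one could check for that socket before running the shred script so the failure shows up as a single clear message. The socket path and script path are taken from the log; the function names `mysql_socket_ready` and `run_shred_if_db_up` are hypothetical and not part of the tutorial.

```shell
#!/usr/bin/env bash
# Sketch only: guard the shred/ingest script behind a MySQL socket check.
# SOCKET and SCRIPT default to the paths shown in the log in this thread.
# The function names below are hypothetical, not part of XDMoD or hpcts.

SOCKET="${SOCKET:-/var/lib/mysql/mysql.sock}"
SCRIPT="${SCRIPT:-/srv/xdmod/scripts/shred-ingest-aggregate-all.sh}"

# Succeeds only if the given path exists and is a Unix domain socket.
mysql_socket_ready() {
    [ -S "$1" ]
}

# Runs the shred script as the xdmod user, or reports the missing socket.
run_shred_if_db_up() {
    if mysql_socket_ready "$SOCKET"; then
        sudo -u xdmod "$SCRIPT"
    else
        echo "MySQL socket $SOCKET not found; is the database running?" >&2
        return 1
    fi
}
```

After sourcing the file on the xdmod host, calling `run_shred_if_db_up` would either run the script or stop with the one-line message. A present socket does not guarantee that the database accepts connections, so this only catches the case seen here.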

dsajdak commented 3 years ago

@johrstrom When you started did you delete all images and docker containers and then re-download them? I did not do this at first, I just did a 'git pull' and when I started everything up I got similar error messages about the xdmod db. Just thought I'd mention it in case yours is the same issue. Once I stopped docker, deleted all containers and images, and then ran ./hpcts start, all new docker images were downloaded and everything worked.

johrstrom commented 3 years ago

:facepalm: I did refresh the images, but maybe not the volumes. Let me try that.

johrstrom commented 3 years ago

Yeah, I had to delete my volumes because that's where the mysql db is. I feel like that got me last year too...
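The reset described above (containers, images, and volumes, then `./hpcts start`) could be sketched as follows. This assumes it is run from the repository directory containing the Compose file and that `docker-compose` is installed. The function name `reset_tutorial` and the `DRY_RUN` variable are hypothetical; by default the function only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: full reset of the tutorial's Docker state.
# reset_tutorial and DRY_RUN are hypothetical names, not part of hpcts.
# With DRY_RUN unset or set to 1, the commands are printed instead of executed.

reset_tutorial() {
    local run="echo"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    fi
    # Remove containers, the images used by the services, and named volumes
    # (the volumes are where the MySQL data lives, per the comment above).
    $run docker-compose down --volumes --rmi all
    # Start again; per this thread, fresh images are downloaded on start.
    $run ./hpcts start
}
```

Running `reset_tutorial` prints the two commands for review, and `DRY_RUN=0 reset_tutorial` executes them. Removing volumes deletes the database contents, so this is only appropriate for a throwaway tutorial environment.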

In any case, I've got this error from a container I built locally. Are you saying the containers you download from Docker Hub are OK? I assume they're last year's.

xdmod        | 2021-07-13 16:37:39 [critical] Error while executing sacct: sacct: fatal: ReqGRES is deprecated, please use ReqTRES
xdmod        | 
xdmod exited with code 1

aebruno commented 3 years ago

> In any case, I've got this error from a container I built locally. Are you saying the containers you download from Docker Hub are OK? I assume they're last year's.
>
> xdmod        | 2021-07-13 16:37:39 [critical] Error while executing sacct: sacct: fatal: ReqGRES is deprecated, please use ReqTRES
> xdmod        |
> xdmod exited with code 1

I'm assuming these should be fixed in #90. If you want you can merge in that PR and test. Otherwise, we'll have to wait until @ryanrath is finished with that PR.
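Independent of whatever #90 changes, the `sacct` error itself says that the `ReqGRES` field has been replaced by `ReqTRES`. As a rough sketch, one could check which of the two field names the installed `sacct` lists before building a `--format` string. The helper name `pick_req_field` is hypothetical; it reads a field list on stdin, such as the output of `sacct --helpformat`.

```shell
#!/usr/bin/env bash
# Sketch only: pick_req_field is a hypothetical helper, not part of XDMoD.
# It reads a list of sacct field names on stdin and prints ReqTRES when
# that field is listed, falling back to ReqGRES otherwise.

pick_req_field() {
    if grep -qw 'ReqTRES'; then
        echo "ReqTRES"
    else
        echo "ReqGRES"
    fi
}
```

On a host with Slurm installed this would be used as `sacct --helpformat | pick_req_field`.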

johrstrom commented 3 years ago

I'm happy to wait.

johrstrom commented 3 years ago

I saw #90 got pulled in, so I'll work through this and let you know if there are still any issues.

aebruno commented 3 years ago

> I saw #90 got pulled in, so I'll work through this and let you know if there are still any issues.

Sounds good. I just built and pushed the new Docker images, so feel free to pull from Docker Hub and test.

johrstrom commented 3 years ago

Yep! Works just fine. I went through the integration instructions and it seems there's one step on the OnDemand side, and then we need to shred. So we'll have to work out how that's going to go in the tutorial: who says what, and what that handoff looks like.

I don't see any data after I ran the shred script, so I'll open a different ticket on that.

https://github.com/ubccr/hpc-toolset-tutorial/blob/master/ondemand/motd#L23-L44