usegalaxy-no / galaxyadmin

A repository for managing the work of the usegalaxy.no GalaxyAdmin team

upload data problem - related to nrec2? #75

Closed tothuhien closed 1 year ago

tothuhien commented 1 year ago

According to a user report, uploading data has been very slow since this morning. I can reproduce it. It seems that data uploads run on nrec2, where many jobs have been running for more than 2 days. Uploading data through the upload button of a specific tool runs on ecc1 instead, and that is as fast as normal.

sveinugu commented 1 year ago

This will not solve the issue in the short term, but it seems a long-term solution might have been implemented in Galaxy version 22.05:

"Enhanced Celery tasks and features: Galaxy can optionally delegate the data upload job to Celery, and Galaxy can run the metadata script in Celery. This results in much shorter runtime for small jobs. To enable this, set enable_celery_tasks to true and ensure that at least one celery worker is started. If Celery tasks are enabled, it is also possible to change the datatype for many history items in batch."

kjetilkl commented 1 year ago

The Celery option seems like a nice feature to consider in the future, but the problem now was that some small jobs were sent to the nrec2.usegalaxy.no compute node, where 30 of the 32 cores are currently running Salmon processes at 100% (part of Trinity apparently). I have drained this node for now, so that new jobs are run on other nodes instead. I also moved a "tp_find_and_replace" job, which had been running on nrec2 for 5 minutes, over to ecc1, where it completed after 13 seconds.
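To illustrate the effect of draining, here is a minimal Python sketch of the selection logic: drained nodes are skipped, and a job goes to the eligible node with the most free cores. This is not how Slurm or the Sorting Hat is actually implemented. The `Node` class, the `pick_node` function, and the core counts for ecc1 are hypothetical and chosen only for illustration; the nrec2 numbers (30 of 32 cores busy) are taken from the comment above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a compute node, for illustration only."""
    name: str
    total_cores: int
    busy_cores: int
    drained: bool = False  # a drained node accepts no new jobs

    @property
    def free_cores(self) -> int:
        return self.total_cores - self.busy_cores


def pick_node(nodes: List[Node], cores_needed: int = 1) -> Optional[str]:
    """Return the name of the non-drained node with the most free cores.

    Returns None if no node can take the job.
    """
    candidates = [
        n for n in nodes
        if not n.drained and n.free_cores >= cores_needed
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.free_cores).name


nodes = [
    # 30 of 32 cores busy, as described above; drained by the admin.
    Node("nrec2", total_cores=32, busy_cores=30, drained=True),
    # Core counts for ecc1 are made up for this example.
    Node("ecc1", total_cores=16, busy_cores=4),
]

chosen = pick_node(nodes)  # nrec2 is skipped because it is drained
```

With nrec2 marked as drained, a small job such as an upload or "tp_find_and_replace" would be sent to ecc1 in this sketch, which matches the behaviour described above.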

sveinugu commented 1 year ago

@kjetilkl As I read you, the issue was due to hardware overload on the nrec2 node, and not really software dependent (except for the part about offloading to nrec2). However, having a separate task queue for data uploads and other small jobs that is independent of Slurm still seems like a good idea and will probably reduce the overall experience of slowness when using usegalaxy.no (especially since data upload is often the first task a user performs).

kjetilkl commented 1 year ago

I believe we actually ran the upload tool on the main node, outside of Slurm, way back in the day. But that changed when we started using the Sorting Hat to select tool destinations. I do agree that there are plenty of optimizations that could be made to fine-tune the performance of our job executions, and we should probably look into that some day if any of us has the time.

On a positive note, the 30 Trinity "Align reads and estimate abundance" jobs that had been running for 3 days on NREC2 have now all completed successfully, so I have opened up the node for other jobs again :-)