"The cluster DRM system terminated this job" occasionally

MingChen0919 commented 6 years ago

System information

Please enter the following information (if able).

Galaxy Instance: test

Issue description

I got the "The cluster DRM system terminated this job" error several times. It does not happen every time, and I cannot predict when it will occur, which makes it difficult to reproduce. For example, I ran alignment jobs with the hisat2 tool on two reads files at the same time. One alignment completed successfully, but the other failed. The error also occurred when uploading files, but only for one file. See the screenshots below:

Alignment with hisat2: one of the two alignment jobs failed.

Uploading files: one upload failed.

heh30 commented 6 years ago

I've adjusted the following parameter in the galaxy.yml file and have restarted.

retry_job_output_collection: 10

Let me know if you continue to receive the DRM system terminated messages at random intervals.
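For context, Galaxy's configuration documentation describes `retry_job_output_collection` as the number of times the job runner retries reading a finished job's stdout and stderr files when network filesystem caching hides them, waiting about one second between tries. A minimal Python sketch of that retry pattern follows. `read_with_retries` and its parameters are hypothetical names used for illustration only; this is not Galaxy's actual code.

```python
import os
import time


def read_with_retries(path, retries=10, delay=1.0):
    """Return the contents of `path`, retrying while the file is not yet visible.

    Hypothetical helper: it illustrates the retry idea only and is not
    Galaxy's implementation.
    """
    for attempt in range(retries + 1):
        if os.path.exists(path):
            with open(path) as handle:
                return handle.read()
        if attempt < retries:
            # Give the shared filesystem time to expose the file.
            time.sleep(delay)
    raise FileNotFoundError(f"{path} not visible after {retries} retries")
```

If a retry count of 10 is still not enough on a given cluster, that would suggest the output files are missing for some reason other than cache delay.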

MingChen0919 commented 6 years ago

All 5 workflows failed with a DRM issue somewhere in the workflow.

heh30 commented 6 years ago

galaxy.jobs.runners.drmaa INFO 2018-06-08 14:02:09,296 [p:3612,w:1,m:0] [DRMAARunner.work_thread-2] (4273) queued as 3234392
galaxy.jobs DEBUG 2018-06-08 14:02:09,297 [p:3612,w:1,m:0] [DRMAARunner.work_thread-2] (4273) Persisting job destination (destination id: sge_default)
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:10,162 [p:3612,w:1,m:0] [Dummy-5] (4275/3234391) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:10,179 [p:3612,w:1,m:0] [Dummy-5] (4273/3234392) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:11,215 [p:3612,w:1,m:0] [Dummy-5] (4275/3234391) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:11,334 [p:3612,w:1,m:0] [Dummy-5] (4273/3234392) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:12,484 [p:3612,w:1,m:0] [Dummy-5] (4275/3234391) state change: job finished, but failed
galaxy.jobs DEBUG 2018-06-08 14:02:12,584 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] fail(): Moved /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_dataset_6363.dat to /galaxy/test/htdocs/database/files/006/dataset_6156.dat
galaxy.jobs DEBUG 2018-06-08 14:02:12,737 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Pausing Job '4277', Execution of this dataset's job is paused because its input datasets are in an error state.
galaxy.tools.error_reports DEBUG 2018-06-08 14:02:12,817 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7ff90cf0de10> generated response None
galaxy.model.metadata DEBUG 2018-06-08 14:02:12,820 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Cleaning up external metadata files
galaxy.model.metadata DEBUG 2018-06-08 14:02:12,852 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Failed to cleanup MetadataTempFile temp files from /galaxy/test/htdocs/database/jobs_directory/004/4275/metadata_out_HistoryDatasetAssociation_6363_osLW4t: No JSON object could be decoded
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,889 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.sh: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.sh'
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,903 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.o: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.o'
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,919 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.e: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.e'
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,936 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.ec: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.ec'
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:28,084 [p:3612,w:1,m:0] [Dummy-5] (4273/3234392) state change: job finished normally
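To follow what happened to each job in a log like this, the DRMAA "state change" entries can be grouped by Galaxy job id. The sketch below is one possible way to do that in Python, assuming the log is available with one entry per line. The function names are hypothetical, and the pattern is based only on the message format visible in this excerpt.

```python
import re
from collections import defaultdict

# Matches entries such as "(4275/3234391) state change: job finished, but failed",
# where the first number is the Galaxy job id and the second the cluster job id.
STATE_CHANGE = re.compile(r"\((\d+)/(\d+)\) state change: (.+?)\s*$")


def job_state_history(log_lines):
    """Map each Galaxy job id to the ordered list of states seen in the log."""
    history = defaultdict(list)
    for line in log_lines:
        match = STATE_CHANGE.search(line)
        if match:
            galaxy_id, _drm_id, state = match.groups()
            history[galaxy_id].append(state)
    return dict(history)


def failed_jobs(history):
    """Return the Galaxy job ids whose last recorded state reports a failure."""
    return [job for job, states in history.items() if "failed" in states[-1]]
```

Applied to the excerpt above, this would show job 4275 going from queued to running to "job finished, but failed" within about two seconds, while job 4273 finishes normally.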

heh30 commented 6 years ago

Errors are similar to what was seen before for the DRM issue. Can waits be put into the workflow? 4273 is running and 4275 gets launched; I see a message for 'created job 4274' but no indication that the job has been launched.

heh30 commented 6 years ago

I see something similar for jobs 4270-4272

MingChen0919 commented 6 years ago

I am sorry, what do you mean by putting waits into the workflows?

Could you manually delete these highlighted folders (but not the 4198 folder; that's Stephen's job, which is still running):

I suspect that these folders might be the problem. They are not related to any existing jobs. Galaxy somehow failed to delete them.

Once you delete them, I'll retry and see how it goes.
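A quick way to find candidate folders before deleting anything by hand is to compare the directory names against the ids of jobs known to be active. The following Python sketch only lists candidates and deletes nothing. `leftover_job_dirs` is a hypothetical helper, and it assumes each job working folder is named after its numeric job id, as the paths in the log suggest.

```python
from pathlib import Path


def leftover_job_dirs(jobs_subdir, active_job_ids):
    """List job working directories whose numeric name is not an active job id.

    Hypothetical helper: it only reports candidates and never deletes anything.
    """
    active = {str(job_id) for job_id in active_job_ids}
    return sorted(
        entry.name
        for entry in Path(jobs_subdir).iterdir()
        if entry.is_dir() and entry.name.isdigit() and entry.name not in active
    )
```

For example, calling it on the `004` subdirectory of `jobs_directory` with `{4198}` as the active set would leave the 4198 folder out of the result. Where the set of active ids comes from (the scheduler or Galaxy's job table) is left open here.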

heh30 commented 6 years ago

Is this still an issue or did a reload of the dependent programs resolve it?

MingChen0919 commented 6 years ago

It seems the test Galaxy has been down for a while. I thought you took it down.

heh30 commented 6 years ago

I’ll check on it. I was out of the office all last week.

Regards, Heidi


MingChen0919 commented 6 years ago

Surprisingly, I submitted 14 jobs in a short time, and all of them completed successfully!

MingChen0919 commented 6 years ago

Hi @bcuser30, I got this DRM issue again. I also saw these folders in /main/sites/galaxy/test-galaxy/htdocs/database/jobs_directory/004 again.

These job working folders (the highlighted ones) do not actually host any running jobs. The last time you deleted them, the DRM issue was fixed (please see my previous comment). Could you delete these folders again? Thanks!

Also, I don't understand why exactly the same folders were generated this time. This doesn't make sense to me.
