Open MingChen0919 opened 6 years ago
I've adjusted the following parameter in the galaxy.yml file and have restarted.
retry_job_output_collection: 10
Let me know if you continue to receive the DRM system terminated messages at random intervals.
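For readers unfamiliar with the setting: as I understand it, `retry_job_output_collection` makes the job runner re-check for a finished job's output files a number of times before giving up, which helps when a network filesystem is slow to show them. The sketch below only illustrates that idea and is not Galaxy's actual implementation; the function name `wait_for_file` and its `delay` parameter are made up for the example.

```python
import os
import time


def wait_for_file(path, retries=10, delay=1.0):
    """Return True once `path` exists, re-checking up to `retries` extra times.

    Hypothetical helper: it mimics the idea of retrying output collection,
    not the code Galaxy actually runs.
    """
    for attempt in range(retries + 1):
        if os.path.exists(path):
            return True
        if attempt < retries:
            # Give the (possibly cached) filesystem time to catch up.
            time.sleep(delay)
    return False
```

With `retries=10` and a one-second delay, a missing file is reported as missing only after roughly ten seconds of checking.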
All 5 workflows failed with a DRM issue somewhere in the workflow.
galaxy.jobs.runners.drmaa INFO 2018-06-08 14:02:09,296 [p:3612,w:1,m:0] [DRMAARunner.work_thread-2] (4273) queued as 3234392
galaxy.jobs DEBUG 2018-06-08 14:02:09,297 [p:3612,w:1,m:0] [DRMAARunner.work_thread-2] (4273) Persisting job destination (destination id: sge_default)
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:10,162 [p:3612,w:1,m:0] [Dummy-5] (4275/3234391) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:10,179 [p:3612,w:1,m:0] [Dummy-5] (4273/3234392) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:11,215 [p:3612,w:1,m:0] [Dummy-5] (4275/3234391) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:11,334 [p:3612,w:1,m:0] [Dummy-5] (4273/3234392) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:12,484 [p:3612,w:1,m:0] [Dummy-5] (4275/3234391) state change: job finished, but failed
galaxy.jobs DEBUG 2018-06-08 14:02:12,584 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] fail(): Moved /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_dataset_6363.dat to /galaxy/test/htdocs/database/files/006/dataset_6156.dat
galaxy.jobs DEBUG 2018-06-08 14:02:12,737 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Pausing Job '4277', Execution of this dataset's job is paused because its input datasets are in an error state.
galaxy.tools.error_reports DEBUG 2018-06-08 14:02:12,817 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7ff90cf0de10> generated response None
galaxy.model.metadata DEBUG 2018-06-08 14:02:12,820 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Cleaning up external metadata files
galaxy.model.metadata DEBUG 2018-06-08 14:02:12,852 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] Failed to cleanup MetadataTempFile temp files from /galaxy/test/htdocs/database/jobs_directory/004/4275/metadata_out_HistoryDatasetAssociation_6363_osLW4t: No JSON object could be decoded
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,889 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.sh: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.sh'
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,903 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.o: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.o'
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,919 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.e: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.e'
galaxy.jobs.runners DEBUG 2018-06-08 14:02:12,936 [p:3612,w:1,m:0] [DRMAARunner.work_thread-0] (4275/3234391) Unable to cleanup /galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.ec: [Errno 2] No such file or directory: '/galaxy/test/htdocs/database/jobs_directory/004/4275/galaxy_4275.ec'
galaxy.jobs.runners.drmaa DEBUG 2018-06-08 14:02:28,084 [p:3612,w:1,m:0] [Dummy-5] (4273/3234392) state change: job finished normally
Errors are similar to what was seen before for the DRM issue. Can waits be put into the workflow? 4273 is running and 4275 gets launched. I see a message for 'created job 4274' but no indication that the job has been launched.
I see something similar for jobs 4270-4272.
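When comparing several runs, it can help to pull the failed jobs out of the runner log automatically rather than reading it by eye. The sketch below is one possible way to do that, assuming the log keeps the `(galaxy_job_id/drm_job_id) state change: ...` format seen in the excerpt above. The function names `last_states` and `failed_jobs` are made up for this example, and only the four state messages that appear in that excerpt are recognized.

```python
import re

# Matches "(4275/3234391) state change: job finished, but failed" and the
# other state messages that appear in the log excerpt above.
STATE_CHANGE = re.compile(
    r"\((\d+)/(\d+)\) state change: "
    r"(job is queued and active|job is running|"
    r"job finished, but failed|job finished normally)"
)


def last_states(log_text):
    """Map (galaxy_job_id, drm_job_id) to the last state seen in the log."""
    states = {}
    for galaxy_id, drm_id, state in STATE_CHANGE.findall(log_text):
        states[(galaxy_id, drm_id)] = state
    return states


def failed_jobs(log_text):
    """Return sorted (galaxy_job_id, drm_job_id) pairs that ended in failure."""
    return sorted(
        ids
        for ids, state in last_states(log_text).items()
        if state == "job finished, but failed"
    )
```

Because the pattern searches the whole text instead of splitting on newlines, it also works when the log has been pasted as one long line.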
I am sorry, what do you mean by putting waits into the workflows?
Could you manually delete these highlighted folders (but not the 4198 folder, that's Stephen's running job):
I suspect that these folders might be the problem. They are not related to any existing jobs. Galaxy somehow failed to delete them.
Once you delete them, I'll retry and see how it goes.
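To double-check which folders are leftovers before anyone deletes anything, a small script can list the job working directories that do not belong to a known active job. This is only a sketch: it assumes each job has a numerically named folder under a directory such as `jobs_directory/004` (as in the paths in the log), and that the list of active job IDs is collected separately, for example from the Galaxy admin interface. The function name `orphaned_job_dirs` is hypothetical, and the script only lists folders; it never deletes them.

```python
import os


def orphaned_job_dirs(jobs_root, active_job_ids):
    """List numeric subdirectories of jobs_root that are not active job IDs.

    Hypothetical helper: `active_job_ids` must be supplied by the caller.
    Nothing is deleted; the result is meant for manual review.
    """
    active = {str(job_id) for job_id in active_job_ids}
    orphans = []
    for entry in sorted(os.listdir(jobs_root)):
        path = os.path.join(jobs_root, entry)
        # Only consider directories named after a job ID, e.g. "4275".
        if os.path.isdir(path) and entry.isdigit() and entry not in active:
            orphans.append(path)
    return orphans
```

Passing `active_job_ids=[4198]` would keep the folder of the running job out of the result.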
Is this still an issue or did a reload of the dependent programs resolve it?
It seems the test Galaxy has been down for a while. I thought you took it down.
I’ll check on it. I was out of the office all last week.
Regards, Heidi
Surprisingly, I submitted 14 jobs in a short time, and all of them completed successfully!
Hi @bcuser30, I got this DRM issue again. Also, I saw these folders in /main/sites/galaxy/test-galaxy/htdocs/database/jobs_directory/004 again.
These job working folders (highlighted folders) do not actually host any running jobs. Last time, when you deleted them, the DRM issue was fixed (please see my previous comment). Could you delete these folders again? Thanks!
Also, I don't understand why exactly the same folders were generated this time. This doesn't make sense to me.
System information
Please enter the following information (if able).
Galaxy Instance: test
Issue description
I got the "The cluster DRM system terminated this job" error several times. It does not always happen, and when it occurs seems unpredictable, which makes the error difficult to reproduce. For example, I ran alignment jobs with the hisat2 tool on two reads files at the same time. One alignment completed successfully, but the other failed. This error also occurred when uploading files, but it only affected one file. See the screenshots below:
Alignment with hisat2: one of the two alignment jobs failed.
Uploading files: one upload failed.