phac-nml / irida

Canada’s Integrated Rapid Infectious Disease Analysis Platform for Genomic Epidemiology
https://irida.ca
Apache License 2.0

Slowness of submitting jobs in Galaxy #301

Open apetkau opened 5 years ago

apetkau commented 5 years ago

Describe the bug

I have noticed that IRIDA can get into a state where jobs are still being submitted to Galaxy, but only very slowly. This mainly occurs when a lot of single-sample pipelines are submitted (e.g., SISTR or Assembly). The status in IRIDA will say Submitting or Preparing and won't change for a long time, despite the ample resources allocated to Galaxy.

Steps to reproduce the problem

Link IRIDA to a Galaxy instance backed by a large compute cluster and submit large numbers of SISTR or Assembly jobs. Despite the large resources provided by the cluster, only a small number of jobs (e.g., upload jobs) will be scheduled at a time.

Expected behaviour

I expected IRIDA to be able to schedule many jobs at once if the cluster resources are available.

Additional context

None.

tom114 commented 5 years ago

Fixed in #300

apetkau commented 5 years ago

It looks like this is still an issue and was not completely solved in the previous merge. I'm re-opening this issue.

JeffreyThiessen commented 4 years ago

Investigated potential problems

Blend4j run_workflow / invoke_workflow:

Work already done. (PR to be linked once added to blend4j.)

Changed the way workflows are executed, since run_workflow is deprecated. This may be faster on systems running hundreds of concurrent jobs, but testing locally and on a test server showed no additional speedup.
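For reference, a minimal sketch of the deprecated submission path through blend4j (the URL, key, and all IDs are placeholders; the underlying difference is that the deprecated path POSTs to /api/workflows, while the new invocation path POSTs to /api/workflows/{workflow_id}/invocations):

```java
import com.github.jmchilton.blend4j.galaxy.GalaxyInstance;
import com.github.jmchilton.blend4j.galaxy.GalaxyInstanceFactory;
import com.github.jmchilton.blend4j.galaxy.WorkflowsClient;
import com.github.jmchilton.blend4j.galaxy.beans.WorkflowInputs;
import com.github.jmchilton.blend4j.galaxy.beans.WorkflowInputs.ExistingHistory;
import com.github.jmchilton.blend4j.galaxy.beans.WorkflowInputs.InputSourceType;
import com.github.jmchilton.blend4j.galaxy.beans.WorkflowInputs.WorkflowInput;
import com.github.jmchilton.blend4j.galaxy.beans.WorkflowOutputs;

public class RunWorkflowSketch {
    public static void main(String[] args) {
        // Placeholder URL/key/ids; substitute values from a real Galaxy instance.
        GalaxyInstance galaxy = GalaxyInstanceFactory.get("http://localhost:8080", "api-key");
        WorkflowsClient workflowsClient = galaxy.getWorkflowsClient();

        WorkflowInputs inputs = new WorkflowInputs();
        inputs.setWorkflowId("workflow-id");
        inputs.setDestination(new ExistingHistory("history-id"));
        inputs.setInput("0", new WorkflowInput("dataset-id", InputSourceType.HDA));

        // Deprecated path: blend4j POSTs to /api/workflows.
        // The new invocation path POSTs to /api/workflows/{id}/invocations.
        WorkflowOutputs outputs = workflowsClient.runWorkflow(inputs);
        System.out.println("Outputs in history: " + outputs.getHistoryId());
    }
}
```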

Blend4j workflowClient.importWorkflow:

Work already done. (PR to be linked once added to blend4j.)

Updated the API call to include a missing parameter in the JSON/dict when invoking a workflow. Unlikely to cause a speedup, but it moves us off the deprecated invocation onto the new one.

Blend4j historiesClient.downloadDataset:

Bioblend has a separate client for downloading datasets, but we are using the histories client to download the dataset. While this way of downloading is not deprecated, it is not what bioblend itself uses.

It is used in 3 tests and 1 place in code: ca.corefacility.bioinformatics.irida.service.analysis.workspace.galaxy.AnalysisWorkspaceServiceGalaxy#buildOutputFile, via galaxyHistoriesService.downloadDatasetTo(analysisId, datasetId, outputFile), when retrieving analysis results.

Problem?: Unlikely. We are using API calls that exist in Galaxy, and a quick look through the code didn't turn up anything that would be noticeably slower or faster in the other style.

Work for change: Moderate; we would need to implement a new client with input and output objects in blend4j.
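For context, a rough sketch of a histories-client download, assuming the dataset's download URL is resolved via HistoriesClient.showDataset and streamed to disk (not the exact IRIDA implementation, and getFullDownloadUrl() is an assumption about the Dataset bean):

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import com.github.jmchilton.blend4j.galaxy.HistoriesClient;
import com.github.jmchilton.blend4j.galaxy.beans.Dataset;

public class DownloadSketch {
    // Sketch of a histories-client download: look up the dataset, then stream
    // its download URL to the output file.
    public static void downloadDatasetTo(HistoriesClient historiesClient, String historyId,
            String datasetId, Path outputFile) throws Exception {
        Dataset dataset = historiesClient.showDataset(historyId, datasetId);
        // getFullDownloadUrl() is an assumption here; the key point is that the
        // histories client, not a dedicated datasets client, does the work.
        try (InputStream in = new URL(dataset.getFullDownloadUrl()).openStream()) {
            Files.copy(in, outputFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```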

Blend4j workflowClient.exportWorkflow:

This function is deprecated, but we don't use it, so there is no actual impact.

Work for change: Low, but likely not worth the effort

Upload data to library vs upload to history/workflow.

I thought I had found a way to get data to a workflow in fewer steps, potentially speeding up the data linking.

This is a false trail: uploading to a history via a dataset is the same as uploading to a library and linking it into a history.

We should, however, look at whether a large data library count affects the speed of new libraries coming in.

Link multiple files at a time when "uploading"

Problem: When we use the blend4j call LibrariesClient.uploadFilesystemPaths, we do so one file at a time instead of in batch.

Solution: The Galaxy API supports linking multiple files at once, using newline characters to separate the file paths to link. However, the blend4j API needs to be changed to return a List<GalaxyObject> instead of a single GalaxyObject for this to work: com.github.jmchilton.blend4j.galaxy.LibrariesClientImpl#uploadFilesystemPaths returns with .get(0), which must be removed. (A sketch of the batched call follows below.)

In the IRIDA code, the linking of files occurs here: ca.corefacility.bioinformatics.irida.pipeline.upload.galaxy.GalaxyLibrariesService#fileToLibrary
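A sketch of what the batched variant could look like. uploadFilesystemPathsBatch is a hypothetical blend4j method (it does not exist yet) that would return the full List<GalaxyObject> instead of calling .get(0); the rest mirrors the existing upload beans:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

import com.github.jmchilton.blend4j.galaxy.LibrariesClient;
import com.github.jmchilton.blend4j.galaxy.beans.FilesystemPathsLibraryUpload;
import com.github.jmchilton.blend4j.galaxy.beans.GalaxyObject;
import com.github.jmchilton.blend4j.galaxy.beans.Library;
import com.github.jmchilton.blend4j.galaxy.beans.LibraryFolder;

public class BatchLinkSketch {
    // Links all paths in a single request rather than one call per file.
    public static List<GalaxyObject> filesToLibrary(List<Path> paths, LibraryFolder folder,
            Library library, LibrariesClient librariesClient) {
        FilesystemPathsLibraryUpload upload = new FilesystemPathsLibraryUpload();
        upload.setFolderId(folder.getId());
        upload.setLinkData(true); // link, don't copy, the files on disk
        // Galaxy's upload_paths create type accepts newline-separated paths.
        upload.setContent(paths.stream()
                .map(p -> p.toFile().getAbsolutePath())
                .collect(Collectors.joining("\n")));
        // Hypothetical method: like uploadFilesystemPaths, but without the
        // .get(0), so every created dataset id comes back. Caveat: the response
        // does not say which id belongs to which path (problem 1 below).
        return librariesClient.uploadFilesystemPathsBatch(library.getId(), upload);
    }
}
```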

Problems with Solution: 1. What Galaxy returns does not indicate which linked file corresponds to which returned id. We would likely need to change the Galaxy API to include this information; otherwise we would have to make additional calls to the Galaxy API to verify which file is which, and this would erase any speedup gained from batch linking. We need to know which file corresponds to which id so that when we build the collection in ca.corefacility.bioinformatics.irida.service.analysis.workspace.galaxy.AnalysisCollectionServiceGalaxy#uploadSequenceFilesSingleEnd and ...#uploadSequenceFilesPaired we can assign files to samples.

2. When we use the API to add the dataset files to a history, it must be done one at a time (see the loop sketched after this list). The Galaxy API could be changed to accept multiple dataset ids to speed this up as well.

3. Using folders to add files in batch, and then adding the folder to the history, would be an alternative solution, but we would then have to rework how we add files to a collection, since they would start off inside a folder.

4. After linking files, we still must wait for each individual file to be verified as completely linked, since each uploaded file returns an id for that specific file's status.
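For reference on problem 2, the per-dataset linking is roughly the following loop, sketched from blend4j's HistoriesClient.createHistoryDataset (one request per library dataset; names are illustrative):

```java
import java.util.List;

import com.github.jmchilton.blend4j.galaxy.HistoriesClient;
import com.github.jmchilton.blend4j.galaxy.beans.History;
import com.github.jmchilton.blend4j.galaxy.beans.HistoryDataset;
import com.github.jmchilton.blend4j.galaxy.beans.HistoryDataset.Source;
import com.github.jmchilton.blend4j.galaxy.beans.HistoryDetails;

public class HistoryLinkSketch {
    // Each library dataset is added to the history with its own request; there
    // is no batch endpoint, so this loop costs one round trip per file.
    public static void libraryDatasetsToHistory(HistoriesClient historiesClient, History history,
            List<String> libraryFileIds) {
        for (String libraryFileId : libraryFileIds) {
            HistoryDataset historyDataset = new HistoryDataset();
            historyDataset.setSource(Source.LIBRARY);
            historyDataset.setContent(libraryFileId);
            HistoryDetails details = historiesClient.createHistoryDataset(history.getId(), historyDataset);
            // The dataset's state must still be polled per file afterwards to
            // confirm the linking completed (problem 4 above).
        }
    }
}
```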

Conclusion: While this will likely produce some speedup in job submission on heavily used installations, it is more likely that our specific problem with slow job submission is an infrastructure issue.