Open apetkau opened 5 years ago
Fixed in #300
It looks like this is still an issue and was not completely solved in the previous merge. I'm re-opening this issue.
Work already done. (PR once added to blend4j)
Changed the way workflows are executed, since run_workflow was deprecated. This may be faster on systems running hundreds of concurrent jobs, but no additional speedup was found when testing locally or on a test server.
Work already done. (PR once added to blend4j)
Updated the API call to include a missing parameter in the JSON/dict when invoking a workflow. This is unlikely to cause a speedup, but it moves us from the deprecated invoke call to the new one.
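As a rough illustration only (this is not the IRIDA code, and the key names follow the Galaxy invocation API as I understand it, so treat them as assumptions), the newer invocation endpoint takes a JSON-style body naming the target history and the workflow inputs:

```java
// Hedged sketch: building the payload for Galaxy's newer workflow-invocation
// endpoint. Field names ("history_id", "inputs") are assumptions based on the
// Galaxy API, not taken from the IRIDA/blend4j code.
import java.util.LinkedHashMap;
import java.util.Map;

public class InvokePayloadSketch {
    public static Map<String, Object> buildPayload(String historyId,
                                                   Map<String, Object> inputs) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("history_id", historyId); // history that receives the outputs
        payload.put("inputs", inputs);        // step index/label -> dataset id
        return payload;
    }
}
```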
Bioblend has a separate client for downloading datasets, but we are using the history client to download the dataset. While this way of downloading is not deprecated, it is not the approach bioblend itself uses.
It is used in 3 tests and 1 place in code: ca.corefacility.bioinformatics.irida.service.analysis.workspace.galaxy.AnalysisWorkspaceServiceGalaxy#buildOutputFile calls galaxyHistoriesService.downloadDatasetTo(analysisId, datasetId, outputFile); when retrieving analysis results.
Problem?: Unlikely. We are using API calls that exist in Galaxy, and a quick look through the code didn't turn up anything that looked like it would be slower or faster in the other style.
Work for change: Moderate, would need to implement a new client with input and output objects in blend4j.
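Whichever client ends up doing the download, the work reduces to streaming the response body to a file. A minimal, client-agnostic sketch (the InputStream here is a stand-in for whatever the blend4j/bioblend client would return, so the surrounding names are hypothetical):

```java
// Client-agnostic sketch of the download step: copy a dataset's content
// stream to a local file. The caller would obtain the stream from the
// Galaxy client; that part is omitted here.
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DatasetDownloadSketch {
    /** Copy the dataset content stream to outputFile; returns bytes written. */
    public static long downloadTo(InputStream datasetContent, Path outputFile) {
        try {
            return Files.copy(datasetContent, outputFile,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```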
This function is deprecated, but we don't use it, so there is no actual impact.
Work for change: Low, but likely not worth the effort.
I had thought I had found a way to get data to a workflow in fewer steps, potentially speeding up the data linking. This is a false trail: uploading to a history via a dataset is the same as uploading to a library and linking it to a history.
We should, however, look at whether a large data library count affects the speed at which new libraries come in.
Problem: When we use the blend4j call LibrariesClient.uploadFilesystemPaths, we do so one file at a time instead of in batch.
Solution: The Galaxy API supports linking multiple files in one call, using newline characters to separate the file paths to link. However, the blend4j API needs to be changed to return a List<GalaxyObject> instead of a single GalaxyObject for this to work: com.github.jmchilton.blend4j.galaxy.LibrariesClientImpl#uploadFilesystemPaths returns with .get(0), which must be removed.
In the IRIDA code, the linking of files occurs here: ca.corefacility.bioinformatics.irida.pipeline.upload.galaxy.GalaxyLibrariesService#fileToLibrary
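To make the batching concrete, a small sketch of the payload side only (the class and method names here are hypothetical, not blend4j code): the Galaxy API expects the paths joined with newlines in a single request field, which is exactly the string a batched uploadFilesystemPaths would need to build.

```java
// Hypothetical sketch of the batch-linking payload: Galaxy's library API
// accepts multiple filesystem paths in one call when they are separated by
// newline characters. blend4j would build this string instead of issuing
// one request per path.
import java.util.List;

public class BatchPathsSketch {
    /** Join filesystem paths with newlines, as the Galaxy library API expects. */
    public static String joinPaths(List<String> paths) {
        return String.join("\n", paths);
    }
}
```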
Problems with Solution:
1. What is returned by Galaxy does not indicate which linked file corresponds to which returned id. We would likely need to change the Galaxy API to include this information; otherwise we would need to make additional calls to the Galaxy API to verify which file is which, and this would cancel out any speedup we would have gotten from batch linking the files. We need to know which file corresponds to which id so that when we build the collection in ca.corefacility.bioinformatics.irida.service.analysis.workspace.galaxy.AnalysisCollectionServiceGalaxy#uploadSequenceFilesSingleEnd and ...#uploadSequenceFilesPaired we can assign files to samples.
2. When we use the API to add the dataset files to a history, it must be done one at a time. The Galaxy API could be changed to accept multiple dataset ids to speed this up as well.
3. Using folders to add files in batch, and then adding the folder to the history, would be an alternative solution, but we would have to rework how we add files to a collection, since they would now start off inside a folder.
4. After linking files, we still must wait for each individual file to be verified as completely linked, since each uploaded file returns an id for that specific file's status.
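Point 4 above is essentially a per-file polling check. A simplified sketch with a pluggable status lookup (the method names are made up for illustration, and "ok" as the terminal state name is an assumption about Galaxy's dataset states):

```java
// Hypothetical sketch of problem 4: because each uploaded file returns its
// own id, link completion must be verified per id -- there is no batch
// status call. The stateOf function stands in for a Galaxy status query.
import java.util.List;
import java.util.function.Function;

public class LinkPollingSketch {
    /** True only when every linked dataset id reports the terminal "ok" state. */
    public static boolean allLinked(List<String> datasetIds,
                                    Function<String, String> stateOf) {
        return datasetIds.stream().allMatch(id -> "ok".equals(stateOf.apply(id)));
    }
}
```

In real code this check would run inside a polling loop with a delay, stopping once allLinked returns true or a timeout is reached.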
Conclusion: While this would likely produce some speedup when submitting jobs on heavily used installations, it is more likely that our specific problem with slow job submission is an infrastructure issue.
Describe the bug
I have noticed that IRIDA can get into a state where jobs are still being submitted to Galaxy, but only very slowly. This mainly occurs if there are a lot of single-sample pipelines submitted (e.g., SISTR or Assembly). The status in IRIDA will say Submitting or Preparing and it won't change for a long time, despite the many resources given to Galaxy.
Steps to reproduce the problem
Link IRIDA to Galaxy backed by a large compute cluster and submit large numbers of SISTR or Assembly jobs. Despite the large resources provided by the cluster, only a small number of jobs will be scheduled at once (e.g., upload jobs).
Expected behaviour
I expected IRIDA to be able to schedule many jobs at once if the cluster resources are available.
Additional context
None.