I may not be understanding it properly, but it seems like a level of parallelism that should be handled by a bash script. How would you like to change the behavior of the --batch option? To me, --batch works fine to do parallel tasks equivalent to this script, at least in getfastq. Isn't --batch mode implemented in quant yet?
https://github.com/kfuku52/amalgkit/blob/master/util/kallisto_20180207.sh
--batch mode is not implemented for quant yet. I need to modify that script for quant, since it handles download, quality control, and quantification.
OK, please go ahead with implementing --batch in quant. Your plan sounds good. --batch should take an integer as input, and quant should then look up the corresponding row of the metadata and run the SRA download -> QC -> kallisto steps for that entry.
Ah, we may want different things here. I already handled download and QC through amalgkit getfastq, so I only need automated quantification at this stage. I was thinking of something similar, but covering only the quantification step: the script above combines quant and getfastq, whereas I was looking to preserve the compartmentalised nature of amalgkit.
Ah, yes, getfastq should manage the download and QC. Sounds good. One thing: the species fasta indexing would compete between jobs, so we may need some tricks to avoid a mess.
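One such trick would be to build every species index once, serially, before any parallel quant job starts, so no two jobs ever index the same fasta. A rough sketch of that pre-indexing step (the directory layout and file naming here are placeholders, not anything amalgkit prescribes):

```sh
#!/usr/bin/env bash
# Build one kallisto index per species up front, so that parallel
# quantification jobs never race to index the same fasta.
# FASTA_DIR/INDEX_DIR and the <species>.fasta naming are placeholders.
set -euo pipefail

FASTA_DIR=fasta
INDEX_DIR=index
mkdir -p "${INDEX_DIR}"

for fasta in "${FASTA_DIR}"/*.fasta; do
    species=$(basename "${fasta}" .fasta)
    idx="${INDEX_DIR}/${species}.idx"
    # Only build indices that are missing or empty.
    if [ ! -s "${idx}" ]; then
        kallisto index -i "${idx}" "${fasta}"
    fi
done
```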
Wait, I did use amalgkit quant --batch more than a year ago. The option is indeed available in the current version... I had no idea what we were talking about here...
It seems I have confused what --batch is supposed to do, because it only processes a single ID.
--batch was there, but I thought it was supposed to call a shell script that sends an array job for all viable entries in the metadata file to a SLURM or SGE queue.
On my devel branch I changed --batch to read in the metadata file, check if all necessary Index files are there (based on species name) and if all SRR-IDs can actually be found and contain data. All SRR-IDs that satisfy both conditions get passed to a shell script, which sends an array job to SLURM (currently only SLURM is supported) that processes everything in parallel. Currently testing the shell script.
I will move my changed --batch to a different flag/mode and keep the current --batch as is, so a user can make their own array script.
We shouldn't touch any job management systems like SLURM/SGE from amalgkit. There are as many resource request formats as there are computing centers. You may know that our SLURM setups already differ between Julia and CCTB.
I see what you mean.
I still find it useful to confirm that all relevant input data is there. With the amount of data I have, chances are a download was not successful or a species index was overlooked. I'll move that to its own function for the time being and revert the changes made to --batch.
These parts would be useful for a sanity check and for reporting the number of array jobs actually needed. Could you move this part to a new subcommand like amalgkit sanity? I wouldn't keep the SLURM job submission because it seems like a task that should be done at a higher level. If you need a ready-to-use template for amalgkit in SLURM, you can write a script in gfe_pipeline, which is developed exactly for such purposes with our local computing environments.
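A ready-to-use template along those lines would sit outside amalgkit, for example as a user-side SLURM array script. A rough sketch (the sbatch directives have to be adapted to each cluster, and any quant options besides --batch are omitted because they depend on the installed version):

```sh
#!/usr/bin/env bash
#SBATCH --array=1-2000      # one array task per metadata row; adjust to the real row count
#SBATCH --cpus-per-task=4
# Each SLURM array task quantifies one metadata entry. The integer passed to
# --batch selects the corresponding row of the metadata table; any further
# options (metadata path, output directory, threads) are left out here and
# depend on the amalgkit version in use.
set -euo pipefail

amalgkit quant --batch "${SLURM_ARRAY_TASK_ID}"
```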
> On my devel branch I changed --batch to read in the metadata file, check if all necessary Index files are there (based on species name) and if all SRR-IDs can actually be found and contain data.
For the same reason, we need to rework the --batch option in curate.
amalgkit check would be a better name for such a function, which:
- can be called at any stage of the analysis.
- takes --work_dir and --metadata as minimal input.
- checks reference index files and creates them if missing.
- checks how many fastqs were downloaded/safely-deleted-after-quant. If download failures should be counted, you can let getfastq leave a log file, as quant does for safely deleted fastqs; check can then read the log to report a summary.
- checks how many kallisto quant runs succeeded/failed.
- optional: checks the availability of optional/mandatory dependencies that pip cannot handle, e.g., R packages.

Any thought?
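To make the reporting part concrete, a rough sketch of the kind of summary check could print is below. Everything in it, from the metadata columns to the directory and output file names, is a placeholder layout rather than the actual amalgkit conventions:

```sh
#!/usr/bin/env bash
# Sketch of a per-run completeness check: does the species index exist, and
# did kallisto quant produce output? (Counting downloaded vs safely-deleted
# fastqs would need the log file discussed above.) The two-column metadata
# format and all paths below are assumptions for illustration only.
set -euo pipefail

METADATA=metadata.tsv     # assumed columns: run_id <TAB> species
INDEX_DIR=index
QUANT_DIR=quant

ok=0; incomplete=0
while IFS=$'\t' read -r run species; do
    idx="${INDEX_DIR}/${species}.idx"
    abundance="${QUANT_DIR}/${run}/abundance.tsv"
    if [ -s "${idx}" ] && [ -s "${abundance}" ]; then
        ok=$((ok + 1))
    else
        incomplete=$((incomplete + 1))
        echo "incomplete: ${run} (${species})"
    fi
done < <(tail -n +2 "${METADATA}")

echo "complete runs: ${ok}, incomplete runs: ${incomplete}"
```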
Yeah, I was thinking the same. My idea would be to have options like amalgkit check --quant to check everything quant-related, or amalgkit check --curate to check everything curate-related.
Sounds good! They could default to "yes" for lazy people like me who want to get all the reports anyway.
continued in https://github.com/kfuku52/amalgkit/issues/33
I'm currently looking at ~2000 SRR runs from about 20 species. Up until now I have manually ordered all downloaded files by species and let a local bash script run amalgkit quant for me. For this amount of data, though, that seems impractical.
Since the metadata file contains all relevant information, it should be fairly easy to implement automatic quantification, as long as the user provides the folder containing all getfastq outputs, as well as an index folder containing the relevant kallisto indices.
While I'm at it, I can turn this into a --batch mode, where all quantification processes are sent to a cluster queue.