kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

Metadata-assisted quantification (batch mode) #20

Closed. Hego-CCTB closed this issue 3 years ago.

Hego-CCTB commented 3 years ago

I'm currently looking at ~2,000 SRR runs from about 20 species. Until now I have manually sorted all downloaded files by species and let a local bash script run amalgkit quant for me, but at this scale that approach is impractical.
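For context, the manual approach amounts to something like this (a rough sketch; the per-species directory layout and the quant flags are placeholders, not necessarily amalgkit's real interface):

```bash
#!/bin/bash
# Manual setup: fastq files pre-sorted into one directory per species,
# then one quantification call per run. Layout and flags are illustrative.
for species_dir in fastq_by_species/*/; do
    species=$(basename "${species_dir}")
    for fq in "${species_dir}"*.fastq.gz; do
        amalgkit quant --index "index/${species}.idx" --fastq "${fq}"
    done
done
```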

Since the metadata file contains all the relevant information, it should be fairly easy to implement automatic quantification, as long as the user provides the folder containing all getfastq outputs as well as an index folder containing the relevant kallisto indices.

While I'm at it, I can turn this into a --batch mode, where all quantification processes are sent to a cluster queue.

kfuku52 commented 3 years ago

I may not be understanding it properly, but it seems like a level of parallelism that should be handled by a bash script. How would you like to change the behavior of the --batch option? To me, --batch works fine for running parallel tasks equivalent to this script, at least in getfastq: https://github.com/kfuku52/amalgkit/blob/master/util/kallisto_20180207.sh. Isn't --batch mode implemented in quant yet?
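For example, fanning getfastq out over metadata rows on a single machine only takes a few lines of shell (a sketch; the row count is a placeholder, and only the --batch integer semantics come from amalgkit itself):

```bash
#!/bin/bash
# Run getfastq for metadata rows 1..N, a few at a time, on one machine.
# --batch selects a single metadata row, so parallelism is just a matter
# of launching several independent invocations.
N=2000     # number of metadata rows (placeholder)
JOBS=8     # concurrent downloads
seq 1 "${N}" | xargs -P "${JOBS}" -I{} amalgkit getfastq --batch {}
```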

Hego-CCTB commented 3 years ago

--batch mode is not implemented for quant yet. And I would need to modify that script for quant, since it handles download, quality control, and quantification all in one.

kfuku52 commented 3 years ago

OK, please go ahead with implementing --batch in quant. Your plan sounds good. --batch should take an integer as input, and quant should then look up the corresponding row of the metadata and start the SRA download -> QC -> kallisto chain.

Hego-CCTB commented 3 years ago

Ah, we may want different things here. I have already handled download and QC through amalgkit getfastq, so I only need automated quantification at this stage. I was thinking of something like the sketch below.

The script you linked combines quant and getfastq, whereas I was looking to preserve the compartmentalised nature of amalgkit.
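A minimal sketch of that idea, assuming one kallisto index per species and the getfastq outputs collected under a common directory (the metadata column order and all paths are assumptions):

```bash
#!/bin/bash
# Metadata-driven quantification only: getfastq has already downloaded
# and quality-controlled the reads. Column order and paths are assumed.
METADATA="metadata/metadata.tsv"
while IFS=$'\t' read -r srr species _; do
    kallisto quant \
        -i "index/${species// /_}.idx" \
        -o "quant/${srr}" \
        getfastq/"${srr}"/*.fastq.gz   # single-end runs would also need --single -l -s
done < <(tail -n +2 "${METADATA}")
```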

kfuku52 commented 3 years ago

Ah, yes, getfastq should manage the download and QC. Sounds good. One thing: the species FASTA indexing would compete between jobs, so we may need some trick to avoid a mess.
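One common trick for this (an illustration, not something amalgkit does) is to serialise index creation behind a per-species lock file with flock(1):

```bash
# Build each species index exactly once, even if many jobs start together.
# flock(1) lets the first job build the index while the others wait on the
# lock and then skip the build because the index file already exists.
idx="index/${SPECIES}.idx"
(
    flock 9                          # block until the lock is free
    if [ ! -s "${idx}" ]; then       # only the first job sees a missing index
        kallisto index -i "${idx}" "fasta/${SPECIES}.fasta"
    fi
) 9>"${idx}.lock"
```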

kfuku52 commented 3 years ago

Wait, I did use amalgkit quant --batch more than a year ago. The option is indeed available in the current version... I had no idea what we were talking about here...

Hego-CCTB commented 3 years ago

It seems I had confused what --batch is supposed to do, because it only processes a single ID.

--batch was there, but I thought it was supposed to call a shell script that sends an array job for all viable entries in the metadata file to a SLURM or SGE queue.

Hego-CCTB commented 3 years ago

On my devel branch I changed --batch to read in the metadata file, check whether all necessary index files are there (based on species name), and whether all SRR IDs can actually be found and contain data. All SRR IDs that satisfy both conditions get passed to a shell script, which sends an array job to SLURM (currently the only supported scheduler) that processes everything in parallel. I'm currently testing the shell script.

I will move my changed --batch to a different flag/mode and keep the current --batch as-is, so users can write their own array scripts, as in the sketch below.
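A user-side SLURM array script can then stay minimal (a sketch; the #SBATCH resource lines are site-specific placeholders, and only the --batch integer semantics are amalgkit's):

```bash
#!/bin/bash
#SBATCH --job-name=amalgkit_quant
#SBATCH --array=1-2000%50   # one task per metadata row, at most 50 at once
#SBATCH --cpus-per-task=4   # placeholder; adjust to your cluster
#SBATCH --mem=8G            # placeholder; adjust to your cluster

# Each array task quantifies the metadata row matching its task ID.
amalgkit quant --batch "${SLURM_ARRAY_TASK_ID}"
```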

kfuku52 commented 3 years ago

We shouldn't touch any job management system like SLURM/SGE from inside amalgkit. There are as many resource request formats as there are computing centers; as you may know, our SLURM setups already differ between Julia and CCTB.

Hego-CCTB commented 3 years ago

I see what you mean.

I still find it useful to confirm that all relevant input data is there. With the amount of data I have, chances are a download was not successful or a species index was overlooked. I'll move that to its own function for the time being and revert the changes made to --batch.

kfuku52 commented 3 years ago

On my devel branch I changed --batch to read in the metadata file, check whether all necessary index files are there (based on species name), and whether all SRR IDs can actually be found and contain data.

These parts would be useful for a sanity check and for reporting the number of array jobs actually needed. Could you move this part to a new subcommand, like amalgkit sanity? I wouldn't keep the SLURM job submission, because that seems like a task that should be handled at a higher level. If you need a ready-to-use template for amalgkit on SLURM, you can write a script in gfe_pipeline, which is developed exactly for such purposes with our local computing environments.

kfuku52 commented 3 years ago

amalgkit check would be a better name for such a function, which:

  • can be called at any stage of the analysis.
  • takes --work_dir and --metadata as minimal input.
  • checks the reference index files and creates them if missing.
  • checks how many fastqs were downloaded or safely deleted after quant. If download failures should be counted, getfastq could leave a log file, as quant does for safely deleted fastqs; check can read the log and report a summary.
  • checks how many kallisto quant runs succeeded or failed.
  • optionally, checks the availability of optional/mandatory dependencies that pip cannot handle, e.g., R packages. Any thoughts?

kfuku52 commented 3 years ago

For the same reason, we need to rework the --batch option in curate.

Hego-CCTB commented 3 years ago

amalgkit check would be a better name for such a function, which:

  • can be called at any stage of the analysis.
  • takes --work_dir and --metadata as minimal input.
  • checks the reference index files and creates them if missing.
  • checks how many fastqs were downloaded or safely deleted after quant. If download failures should be counted, getfastq could leave a log file, as quant does for safely deleted fastqs; check can read the log and report a summary.
  • checks how many kallisto quant runs succeeded or failed.
  • optionally, checks the availability of optional/mandatory dependencies that pip cannot handle, e.g., R packages. Any thoughts?

Yeah, I was thinking the same. My idea would be to have options like amalgkit check --quant to check everything quant-related, or amalgkit check --curate to check everything curate-related.
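A rough pre-flight version of the quant check in shell could look like this (everything here, from the metadata column order to the file layout, is an assumption for illustration):

```bash
#!/bin/bash
# Pre-flight check: report which metadata rows are ready for quant,
# i.e. have a non-empty species index and non-empty fastq files.
METADATA="metadata/metadata.tsv"
ready=0; not_ready=0

while IFS=$'\t' read -r srr species _; do
    idx="index/${species// /_}.idx"
    fq=( getfastq/"${srr}"/*.fastq.gz )   # unmatched glob stays literal and fails -s
    if [ -s "${idx}" ] && [ -s "${fq[0]}" ]; then
        ready=$((ready + 1))
    else
        echo "NOT READY: ${srr} (${species})" >&2
        not_ready=$((not_ready + 1))
    fi
done < <(tail -n +2 "${METADATA}")

echo "ready: ${ready}   not ready: ${not_ready}"
```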

kfuku52 commented 3 years ago

Sounds good! They could default to "yes" for lazy people like me who want to get all the reports anyway.

Hego-CCTB commented 3 years ago

continued in https://github.com/kfuku52/amalgkit/issues/33