kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

Adding amalgkit sanity #33

Closed Hego-CCTB closed 2 years ago

Hego-CCTB commented 3 years ago

Ideas for adding a new functionality to amalgkit, called amalgkit sanity. The original idea was a functionality that is basically an automatic checklist: it looks for the presence or absence of the required inputs and outputs of the core amalgkit functions (metadata, getfastq, quant and curate). In other issues here, we found other purposes for this as well, so here is a list of things that can be handled by amalgkit sanity:

I think this is more than enough to justify a separate functionality, rather than expanding existing ones. Any other ideas for tasks that could be handled by sanity?

kfuku52 commented 3 years ago

amalgkit quant --index "infer": I'm not sure how sanity and quant should interact for this. sanity checks the availability of the reference FASTA (and returns a warning if it is not found), and quant --index "infer" tries to locate the reference FASTA (and returns an error if it is not found). It's reasonable to reuse the same function, but do you expect any further interaction?

Hego-CCTB commented 3 years ago

No, not really. But depending on how I implement the "lookup function" in sanity, it could be reused/repurposed/called within quant.
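A minimal sketch of how such a shared lookup could be reused, assuming reference FASTA files are named with the species name as a prefix. The function names, the suffix list, and the naming convention are all hypothetical and are not taken from the amalgkit code base. The lookup itself only reports what it finds; sanity turns a miss into a warning, while quant turns it into an error.

```python
import warnings
from pathlib import Path
from typing import Optional

# Assumed file suffixes for reference FASTA files (hypothetical).
FASTA_SUFFIXES = (".fa", ".fasta", ".fa.gz", ".fasta.gz")


def find_ref_fasta(fasta_dir, species) -> Optional[Path]:
    """Return the first FASTA file whose name starts with the species prefix, or None."""
    prefix = species.replace(" ", "_")
    directory = Path(fasta_dir)
    if not directory.is_dir():
        return None
    for path in sorted(directory.iterdir()):
        if path.name.startswith(prefix) and path.name.endswith(FASTA_SUFFIXES):
            return path
    return None


def sanity_check_fasta(fasta_dir, species) -> Optional[Path]:
    """sanity-style use: a missing file only produces a warning."""
    path = find_ref_fasta(fasta_dir, species)
    if path is None:
        warnings.warn(f"No reference FASTA found for {species} in {fasta_dir}")
    return path


def quant_infer_fasta(fasta_dir, species) -> Path:
    """quant-style use: a missing file is a hard error."""
    path = find_ref_fasta(fasta_dir, species)
    if path is None:
        raise FileNotFoundError(f"No reference FASTA found for {species} in {fasta_dir}")
    return path
```

Keeping the warning/error decision in the callers means the lookup stays free of side effects and can be called from either subcommand.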

Hego-CCTB commented 3 years ago

Work in progress, stable version 0.5.0.1:

Generally assumes a folder structure like this:

But the user can specify custom paths to any of these directories with the appropriate flags.

d71822eb88b861105632c1b9ca10c6ea8b4c78cb

ToDo:

Hego-CCTB commented 3 years ago

Fixed an issue where amalgkit sanity --quant did not print the correct list of SRA-IDs missing quant output files. https://github.com/kfuku52/amalgkit/commit/36fd6c0b0db5a310590d7f5893c11c37c6be4bd3

kfuku52 commented 3 years ago

I just tried sanity and detected some duplicated metadata entries that caused trouble. So useful! It also reported missing getfastq/quant outputs as below.

Looking for  SRR584192
Could not find getfastq output for:  SRR584192

Perhaps it would be more user-friendly if sanity could suggest rerun commands for missing runs. In particular, identifying --batch manually is laborious when dealing with a big dataset, so such a feature would be very much appreciated. Here's an example:

Looking for  SRR584192
Could not find getfastq output for:  SRR584192
Example command for rerun: amalgkit getfastq -w ./amalgkit_out/ --batch 867 --metadata ./amalgkit_out/metadata/metadata/metadata_03_curated_20210623.tsv --entrez_email aaa@bbb.com --threads 4
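
One way the --batch value could be derived automatically is sketched below. It rests on two assumptions that are not confirmed in this thread: that --batch is the 1-based row number of the run in the metadata table, and that the run IDs sit in a column named `run`. The function name is hypothetical, and only flags that appear in the example command above are used.

```python
import csv
import io


def suggest_getfastq_reruns(metadata_tsv, missing_runs, out_dir, metadata_path):
    """Map each missing run ID to a rerun command carrying its metadata row number."""
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    suggestions = {}
    # Assumption: --batch equals the 1-based position of the row in the metadata table.
    for batch, row in enumerate(reader, start=1):
        run_id = row["run"]  # assumed column name
        if run_id in missing_runs:
            suggestions[run_id] = (
                f"amalgkit getfastq -w {out_dir} --batch {batch} --metadata {metadata_path}"
            )
    return suggestions


# Toy metadata table; SRR000001 is a placeholder ID used only for illustration.
demo_metadata = io.StringIO(
    "run\tscientific_name\n"
    "SRR000001\tHomo sapiens\n"
    "SRR584192\tHomo sapiens\n"
)
demo = suggest_getfastq_reruns(demo_metadata, {"SRR584192"}, "./amalgkit_out/", "metadata.tsv")
for run_id, command in demo.items():
    print("Could not find getfastq output for: ", run_id)
    print("Example command for rerun:", command)
```

Options such as --entrez_email and --threads are user-specific, so a sketch like this would leave them for the user to append.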

Hego-CCTB commented 3 years ago

That sounds like a good idea! Would it make sense to put those example commands into a separate file? Or would you prefer to have it just in the STDOUT?

kfuku52 commented 3 years ago

STDOUT would be sufficient for me.

kfuku52 commented 2 years ago

@Hego-CCTB bump

Hego-CCTB commented 2 years ago

Ah, I forgot to close this issue. This was implemented as part of a different update (I'll have to look for the exact commit). For grepping from STDOUT, the lines start with: "Suggested command for rerun:".

print("Could not find getfastq output for: ", sra_id, "\n")
print("Suggested command for rerun: getfastq -e email@adress.com --id ", sra_id, " -w ", args.out_dir, "--redo yes --gcp yes --aws yes --ncbi yes")
data_unavailable.append(sra_id)
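
For collecting those suggestions from captured STDOUT without grep, a small filter along these lines might do. It assumes only the line prefix quoted above; the function name is hypothetical. Because `print` with several arguments inserts a space between them, the printed command contains doubled spaces, which the sketch collapses.

```python
PREFIX = "Suggested command for rerun:"


def extract_rerun_commands(stdout_text):
    """Return the command part of every line that starts with the rerun prefix."""
    commands = []
    for line in stdout_text.splitlines():
        if line.startswith(PREFIX):
            # Drop the prefix and collapse runs of whitespace into single spaces.
            commands.append(" ".join(line[len(PREFIX):].split()))
    return commands
```

The printed suggestion starts with `getfastq` rather than `amalgkit getfastq`, so the extracted strings would presumably need that prefix added before being run.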

kfuku52 commented 2 years ago

Thank you!