Closed Hego-CCTB closed 2 years ago
amalgkit quant --index "infer": I'm not sure how sanity and quant should interact for this. sanity checks the availability of the reference FASTA (and returns a warning if it is not found), and quant --index "infer" tries to locate the reference FASTA (and returns an error if it is not found). It's reasonable to reuse the same function, but do you expect the two to interact beyond that?
No, not really. But depending on how I implement the "lookup function" in sanity, it could be reused/repurposed/called within quant.
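A minimal sketch of what such a shared lookup could look like, assuming reference FASTA files are named after the species with underscores. The function name `find_reference_fasta` and the `required` switch are hypothetical and not part of amalgkit; the switch only illustrates the warning-versus-error difference discussed above.

```python
import glob
import os
import warnings


def find_reference_fasta(fasta_dir, sci_name, required=False):
    """Return sorted paths of files in fasta_dir whose names start with sci_name.

    Hypothetical helper, not part of amalgkit. With required=False (sanity-like)
    a missing file only produces a warning; with required=True (quant-like)
    it raises FileNotFoundError.
    """
    # Assumption: "Homo sapiens" is stored as "Homo_sapiens*" on disk.
    prefix = sci_name.replace(" ", "_")
    pattern = os.path.join(glob.escape(fasta_dir), glob.escape(prefix) + "*")
    hits = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if hits:
        return hits
    message = f"No reference FASTA found for {sci_name} in {fasta_dir}"
    if required:
        raise FileNotFoundError(message)
    warnings.warn(message)
    return []
```

With this shape, sanity would call the helper with `required=False` and quant with `required=True`, so both share one search rule.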
Work in progress, stable version 0.5.0.1:
amalgkit sanity is now a new subfunctionality of amalgkit.
getfastq files, based on metadata entries, i.e. fastq files and updated metadata files
amalgkit sanity --getfastq
/out_dir/sanity/
amalgkit sanity --index
genus_species_subspecies.* or genus_species.*
out_dir/sanity/
amalgkit sanity --quant
/out_dir/sanity/
--all runs amalgkit sanity as if --quant, --index, and --getfastq were all set.
Generally assumes a folder structure like this:
But the user can specify custom paths to any of these directories with the appropriate flags:
--updated_metadata_dir FULL_PATH
--index_dir FULL_PATH
--getfastq_dir FULL_PATH
--quant_dir FULL_PATH
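As a rough illustration of how the four override flags could fall back to defaults, here is a sketch. The default subdirectory names are placeholders chosen for the example, since the actual folder layout is not shown in this thread, and `resolve_dirs` is a hypothetical helper rather than amalgkit code.

```python
import os

# Placeholder subdirectory names: the thread does not show the real default layout.
ASSUMED_DEFAULT_SUBDIRS = {
    "updated_metadata_dir": "updated_metadata",
    "index_dir": "index",
    "getfastq_dir": "getfastq",
    "quant_dir": "quant",
}


def resolve_dirs(out_dir, overrides=None):
    """Map each directory flag to a path.

    A user-supplied full path wins; otherwise the path is out_dir/<assumed subdir>.
    """
    overrides = overrides or {}
    resolved = {}
    for flag, subdir in ASSUMED_DEFAULT_SUBDIRS.items():
        custom = overrides.get(flag)
        resolved[flag] = custom if custom else os.path.join(out_dir, subdir)
    return resolved
```

For example, `resolve_dirs("./amalgkit_out", {"quant_dir": "/data/quant"})` keeps the three other directories under `./amalgkit_out` and only redirects the quant lookup.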
d71822eb88b861105632c1b9ca10c6ea8b4c78cb
ToDo:
Curate inputs/outputs
cstmm inputs/outputs
Fixed an issue where amalgkit sanity --quant did not print the correct list of SRA IDs missing quant output files.
https://github.com/kfuku52/amalgkit/commit/36fd6c0b0db5a310590d7f5893c11c37c6be4bd3
I just tried sanity and detected some duplicated metadata entries that caused trouble. So useful! It also reported missing getfastq/quant outputs as below.
Looking for SRR584192
Could not find getfastq output for: SRR584192
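For reference, a duplicate check of the kind described here can be sketched in a few lines. This is only an illustration of the idea, not the check that sanity actually runs; `find_duplicate_runs` is a hypothetical name, and the input is assumed to be the list of run IDs taken from the metadata table.

```python
from collections import Counter


def find_duplicate_runs(run_ids):
    """Return run IDs that occur more than once, in order of first appearance."""
    counts = Counter(run_ids)
    # dict.fromkeys keeps the first occurrence of each ID and preserves order.
    return [run_id for run_id in dict.fromkeys(run_ids) if counts[run_id] > 1]
```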
Perhaps it would be more user-friendly if sanity could suggest rerun commands for missing runs. In particular, identifying --batch manually is laborious when dealing with a big dataset, so such a feature would be very much appreciated. Here's an example:
Looking for SRR584192
Could not find getfastq output for: SRR584192
Example command for rerun: amalgkit getfastq -w ./amalgkit_out/ --batch 867 --metadata ./amalgkit_out/metadata/metadata/metadata_03_curated_20210623.tsv --entrez_email aaa@bbb.com --threads 4
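A sketch of how the --batch value in such a suggestion could be derived automatically. It assumes that the metadata table has a `run` column holding the SRA run ID and that --batch is the 1-based row index of that run in the table; both are assumptions made for this example, and `suggest_rerun_commands` is a hypothetical helper.

```python
import csv


def suggest_rerun_commands(metadata_tsv, missing_ids, out_dir, metadata_path):
    """Build one 'amalgkit getfastq' rerun line per missing run ID.

    metadata_tsv is an open text file (or any iterable of lines) in TSV format.
    Assumes a 'run' column and that --batch is the 1-based row index of the run.
    Run IDs that are not in the table are skipped.
    """
    rows = csv.DictReader(metadata_tsv, delimiter="\t")
    batch_of = {}
    for i, row in enumerate(rows, start=1):
        # Keep the first row if a run ID is duplicated.
        batch_of.setdefault(row["run"], i)
    commands = []
    for sra_id in missing_ids:
        if sra_id in batch_of:
            commands.append(
                f"amalgkit getfastq -w {out_dir} --batch {batch_of[sra_id]} "
                f"--metadata {metadata_path}"
            )
    return commands
```

Other options from the example above, such as --entrez_email and --threads, would have to be appended by the caller, since they are user-specific.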
That sounds like a good idea! Would it make sense to put those example commands into a separate file? Or would you prefer to have it just in the STDOUT?
STDOUT would be sufficient for me.
@Hego-CCTB bump
Ah, I forgot to close this issue. This was implemented as part of a different update (I'll have to look for the exact commit). For grepping from STDOUT, the relevant lines start with: "Suggested command for rerun:".
print("Could not find getfastq output for: ", sra_id, "\n")
print("Suggested command for rerun: getfastq -e email@adress.com --id ", sra_id, " -w ", args.out_dir, "--redo yes --gcp yes --aws yes --ncbi yes")
data_unavailable.append(sra_id)
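If the sanity output has been captured as text, the suggested commands can be pulled out by that prefix. The following is a small sketch of that filtering step; `extract_rerun_commands` is a hypothetical helper, and it relies only on the prefix quoted above.

```python
PREFIX = "Suggested command for rerun:"


def extract_rerun_commands(stdout_text):
    """Return the command part of every line that starts with PREFIX."""
    commands = []
    for line in stdout_text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            # Drop the prefix and surrounding whitespace, keep the command itself.
            commands.append(line[len(PREFIX):].strip())
    return commands
```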
Thank you!
Ideas for adding a new functionality to amalgkit, called amalgkit sanity. The original idea was to have a functionality that's basically an automatic checklist: looking for the presence/absence of required inputs and outputs of the various core amalgkit functionalities (metadata, getfastq, quant, and curate). In other issues here, we found other purposes for this as well, so here is a list of things that can be handled by amalgkit sanity:
amalgkit quant --index "infer" may call upon amalgkit sanity (as mentioned in https://github.com/kfuku52/amalgkit/issues/28), although this could be handled by quant alone.
I think this is more than enough to justify a separate functionality, rather than expanding existing ones. Any other ideas for tasks that could be handled by sanity?