Closed aclum closed 9 months ago
@aclum @poeli @hubin-keio is this work planned for this week or next sprint?
@aclum @ssarrafan I am confused about "human readable description". Examples and/or scenarios will help me understand what to implement exactly. I chatted with @hubin-keio today and will have more discussions.
example IMG annotation methods: cat *imgap.info
IMGAP Version: 5.1.13
Structural Annotation Programs Used: GeneMark.hmm-2 v1.25_lic; INFERNAL 1.1.3 (Nov 2019); Prodigal v2.6.3
Structural Annotation DBs Used: Rfam 13.0
Functional Annotation Programs Used: HMMER 3.1b2; lastal 1256
Functional Annotation DBs Used: COG 2003; Cath-Funfam v4.2.0; IMG-NR 20211118; Pfam v34.0; SMART 01_06_2016; SuperFamily v1.75; TIGRFAM v15.0
The make_info_file task in https://github.com/microbiomedata/mg_annotation/blob/a8c172beeb4ce93e8f8373c11e348181ade47e79/annotation_full.wdl is how I've implemented generating this file for the annotation workflow.
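As a rough illustration of the file format above (not the actual make_info_file task, which lives in the linked WDL), a Python sketch could assemble the same "Label: item; item" lines from a mapping. The function name `format_info` and the dictionary layout are assumptions made for this example.

```python
def format_info(version, sections):
    """Render an IMGAP-style info text.

    `sections` maps a label (e.g. "Structural Annotation DBs Used")
    to a list of "name version" strings. Both the function name and
    this input layout are hypothetical.
    """
    lines = [f"IMGAP Version: {version}"]
    for label, items in sections.items():
        # Sort for a stable order and join with "; " as in the example file.
        lines.append(f"{label}: {'; '.join(sorted(items))}")
    return "\n".join(lines) + "\n"


info = format_info(
    "5.1.13",
    {
        "Structural Annotation Programs Used": [
            "Prodigal v2.6.3",
            "GeneMark.hmm-2 v1.25_lic",
            "INFERNAL 1.1.3 (Nov 2019)",
        ],
        "Structural Annotation DBs Used": ["Rfam 13.0"],
    },
)
print(info, end="")
```

Sorting the entries keeps the output stable between runs regardless of the order in which tools are registered.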
example metatranscriptome assembly methods: "The readset was assembled with megahit version v1.2.9(1). This was run using the following command line options: megahit -t 16 --k-list 23,43,63,83,103,123 -m 100000000000 -o out.megahit --12 reads.input.fastq.gz.
The input read set was mapped to the final assembly and coverage information generated with bbmap version 38.86(2). This was run using the following command line options: bbmap.sh build=1 overwrite=true fastareadlen=500 -Xmx100g threads=16 nodisk=true interleaved=true ambiguous=random rgid=filename in=reads.fastq.gz ref=reference.fasta out=pairedMapped.bam.
(1) MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015. (2) B. Bushnell: BBTools software package, http://bbtools.jgi.doe.gov/ "
@poeli and @aclum any update on this? Can this issue be closed? Is it actively being worked on?
This request is for a readme for reads based analysis; what I saw being worked on was the reads QC process. We need both, but the reads based analysis is higher priority because that is the main workflow we want to run on Bioscales and GROW for GSP. @hubin-keio
@ssarrafan @aclum I committed the updated version c91abd1 to the development branch. The changes include:
New output file: [outdir]/profiler.info.
Additional output example:
{ "ReadbasedAnalysis.info_file": "test/output/profiler.info",
"ReadbasedAnalysis.info": "Taxonomy profiling tools and databases used:\nKraken2 v2.1.2 (database version: k2_standard_08gb_20221209)\nCentrifuge v1.0.4 (database version: RS_bahv_compressed_201612)"
}
A version file, db_ver.info, needs to be added to each database directory.
SingleM has been added to the WDL.
Great, thanks.
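A minimal sketch of how the info text in the example output could be put together, assuming each database directory holds a one-line db_ver.info file. The function name, the argument layout, and the "unknown" fallback are assumptions for illustration, not the code in commit c91abd1.

```python
from pathlib import Path
import tempfile


def build_profiler_info(tools):
    """Build the profiler info text.

    `tools` is a list of (tool_name, tool_version, database_dir) tuples.
    The database version is read from <database_dir>/db_ver.info; a
    missing file is reported as "unknown" (a choice made for this sketch).
    """
    lines = ["Taxonomy profiling tools and databases used:"]
    for name, version, db_dir in tools:
        ver_file = Path(db_dir) / "db_ver.info"
        db_ver = ver_file.read_text().strip() if ver_file.is_file() else "unknown"
        lines.append(f"{name} v{version} (database version: {db_ver})")
    return "\n".join(lines)


# Demo with a temporary directory standing in for a real database directory.
with tempfile.TemporaryDirectory() as tmp:
    kraken_db = Path(tmp) / "kraken2"
    kraken_db.mkdir()
    (kraken_db / "db_ver.info").write_text("k2_standard_08gb_20221209\n")
    info = build_profiler_info([("Kraken2", "2.1.2", kraken_db)])
    print(info)
```

Keeping the version string in a file next to the database means the text changes whenever the database is swapped, without touching the workflow code.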
@aclum @ssarrafan I don't have permission to write to the centrifuge database directory. Please help move centrifuge's db_ver.info.
mv /global/cfs/projectdirs/m3408/aim2/database/db_ver.info /global/cfs/projectdirs/m3408/aim2/database/centrifuge
@poeli is this still a problem? I see the file there.
Moving to current sprint. Please remove from sprint if you're not actively working on this.
Closing this per @aclum
@poeli @hubin-keio Gottcha2 still does not list a version. Please update this. For example, nmdc_wfrbt-11-vq06gn88.1_profiler.info:
Taxonomy profiling tools and databases used:
Kraken2 v2.1.2 (database version: Refseq: bacteria, archaea, viral, human 2020/01)
Centrifuge v1.0.4 (database version: Refseq: bacteria, archaea (compressed) 2018/04)
Gottcha2 v (database version: RefSeq-r90 Bacteria Archaea Viruses (complete genomes))
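A check along these lines could catch an empty version before a release. This is only a sketch: it assumes one tool per line in the "Tool vX (database version: ...)" layout quoted in this thread, and the function name `tools_missing_version` is made up for the example.

```python
import re

# Matches "<Tool> v<version> (database version: ...)"; the version may be empty.
ENTRY = re.compile(r"^(?P<tool>\S+) v(?P<version>\S*) \(database version: (?P<db>.*)\)$")


def tools_missing_version(info_text):
    """Return the names of tools whose version field is empty."""
    missing = []
    for line in info_text.splitlines():
        match = ENTRY.match(line.strip())
        if match and not match.group("version"):
            missing.append(match.group("tool"))
    return missing


example = """Taxonomy profiling tools and databases used:
Kraken2 v2.1.2 (database version: Refseq: bacteria, archaea, viral, human 2020/01)
Centrifuge v1.0.4 (database version: Refseq: bacteria, archaea (compressed) 2018/04)
Gottcha2 v (database version: RefSeq-r90 Bacteria Archaea Viruses (complete genomes))"""

print(tools_missing_version(example))  # prints ['Gottcha2']
```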
@aclum The issue has been resolved (output buffer didn't flush). I will proceed to rebuild the container and initiate the testing process.
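For readers unfamiliar with this class of bug, here is a generic Python sketch of how unflushed output can end up misplaced when a parent process and a child process write to the same file. It is not the actual fix in the container; the function name, the tool name, and the placeholder version 0.0.0 are invented for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def write_version_line(path, tool, version_cmd):
    """Write "<tool> v<version>" where the version comes from a child process."""
    with open(path, "w") as fh:
        fh.write(f"{tool} v")
        # Flush Python's buffer before the child writes to the same file;
        # otherwise the buffered prefix can land after the child's output.
        fh.flush()
        subprocess.run(version_cmd, stdout=fh, check=True)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "version.info"
    # Stand-in for a real "--version" call; prints a placeholder version.
    fake_version_cmd = [sys.executable, "-c", "print('0.0.0')"]
    write_version_line(out, "ExampleTool", fake_version_cmd)
    result = out.read_text()
    print(result, end="")
```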
@aclum A new Docker container has been built and tested. I don't have access to the NMDC Docker Hub, so I pushed it to my own Docker Hub account. Please let me know if you have any questions or additional issues.
@poeli I pulled your image and pushed it back to the nmdc repo https://hub.docker.com/layers/microbiomedata/nmdc_taxa_profilers/1.0.5/images/sha256-808a5194b42503d50b93c06a6f4dd5ab83fdc85453872abae91e653a8a2c26c6?context=explore so you can reference it in the workflow.
Updated to the master branch
@Michal-Babins @mbthornton-lbl We'll need a new release of https://github.com/microbiomedata/ReadbasedAnalysis repo and the nmdc automation repo needs to be updated to use this new version. With this fix the expected behavior is that the *_profiler.info file has a version number for Gottcha2 populated.
Assets were updated to include the updated version. v1.0.5 is the current release, and it is now correctly reflected in ReadbasedAnalysis.wdl and bundle.zip. Moving forward, we need to make sure that any change merged to master is released as a major, minor, or patch version update.
@poeli, does this version of gottcha2 never write out ${prefix}.full.tsv, or only skip it when nothing is found? I am seeing some workflows fail because Cromwell is unable to find ${prefix}.full.tsv.
It has been determined that the sequencing workflows output a human-readable description of the workflow. ReadbasedAnalysis is higher priority because we'll need to run it on datasets for GSP. FYI @ssarrafan