microbiomedata / ReadbasedAnalysis

create README details tools, versions, and parameters #14

Closed aclum closed 9 months ago

aclum commented 1 year ago

It has been determined that the sequencing workflows should output a human-readable description of the workflow. ReadbasedAnalysis is higher priority b/c we'll need to run it on datasets for GSP. FYI @ssarrafan

ssarrafan commented 1 year ago

@aclum @poeli @hubin-keio is this work planned for this week or next sprint?

poeli commented 1 year ago

@aclum @ssarrafan I am confused about "human readable description". Examples and/or scenarios will help me understand what to implement exactly. I chatted with @hubin-keio today and will have more discussions.

aclum commented 1 year ago

example IMG annotation methods: cat *imgap.info

IMGAP Version: 5.1.13
Structural Annotation Programs Used: GeneMark.hmm-2 v1.25_lic; INFERNAL 1.1.3 (Nov 2019); Prodigal v2.6.3
Structural Annotation DBs Used: Rfam 13.0
Functional Annotation Programs Used: HMMER 3.1b2; lastal 1256
Functional Annotation DBs Used: COG 2003; Cath-Funfam v4.2.0; IMG-NR 20211118; Pfam v34.0; SMART 01_06_2016; SuperFamily v1.75; TIGRFAM v15.0

The make_info_file task in https://github.com/microbiomedata/mg_annotation/blob/a8c172beeb4ce93e8f8373c11e348181ade47e79/annotation_full.wdl is how I've implemented generating this file for the annotation workflow.

example metatranscriptome assembly methods: "The readset was assembled with megahit version v1.2.9(1). This was run using the following command line options: megahit -t 16 --k-list 23,43,63,83,103,123 -m 100000000000 -o out.megahit --12 reads.input.fastq.gz.

The input read set was mapped to the final assembly and coverage information generated with bbmap version 38.86(2). This was run using the following command line options: bbmap.sh build=1 overwrite=true fastareadlen=500 -Xmx100g threads=16 nodisk=true interleaved=true ambiguous=random rgid=filename in=reads.fastq.gz ref=reference.fasta out=pairedMapped.bam.

(1) MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015. (2) B. Bushnell: BBTools software package, http://bbtools.jgi.doe.gov/ "

ssarrafan commented 1 year ago

@poeli and @aclum any update on this? Can this issue be closed? Is it actively being worked on?

aclum commented 1 year ago

This request is for a README for the reads-based analysis; what I saw being worked on was the reads QC process. We need both, but the reads-based analysis is higher priority because that is the main workflow we want to run on Bioscales and GROW for GSP. @hubin-keio

poeli commented 1 year ago

@ssarrafan @aclum I committed the updated version c91abd1 to the development branch. The changes include:

Additional output example:

{ "ReadbasedAnalysis.info_file": "test/output/profiler.info",
  "ReadbasedAnalysis.info": "Taxonomy profiling tools and databases used:\nKraken2 v2.1.2 (database version: k2_standard_08gb_20221209)\nCentrifuge v1.0.4 (database version: RS_bahv_compressed_201612)"
}
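
For illustration only, here is a minimal Python sketch of how a string in the format shown above could be assembled from a list of tools. The names `ProfilerTool` and `render_info` are hypothetical and are not taken from the ReadbasedAnalysis code; the actual workflow may build this text differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProfilerTool:
    """One profiler entry (hypothetical structure, for illustration)."""
    name: str
    version: str
    db_version: str


HEADER = "Taxonomy profiling tools and databases used:"


def render_info(tools: List[ProfilerTool]) -> str:
    """Return the info text: a header line, then one line per tool."""
    lines = [HEADER]
    for tool in tools:
        lines.append(
            f"{tool.name} v{tool.version} (database version: {tool.db_version})"
        )
    return "\n".join(lines)
```

Calling `render_info` with the Kraken2 and Centrifuge entries from the example reproduces the `ReadbasedAnalysis.info` value shown above.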

aclum commented 1 year ago

Great thanks

poeli commented 1 year ago

@aclum @ssarrafan I don't have permission to write to the Centrifuge database directory. Please help move Centrifuge's db_ver.info:

mv /global/cfs/projectdirs/m3408/aim2/database/db_ver.info /global/cfs/projectdirs/m3408/aim2/database/centrifuge

aclum commented 1 year ago

@poeli is this still a problem? I see the file there.

ssarrafan commented 1 year ago

Moving to current sprint. Please remove from sprint if you're not actively working on this.

ssarrafan commented 1 year ago

Closing this per @aclum

aclum commented 10 months ago

@poeli @hubin-keio Gottcha2 still does not list a version. Please update this. For example, nmdc_wfrbt-11-vq06gn88.1_profiler.info:

Taxonomy profiling tools and databases used:
Kraken2 v2.1.2 (database version: Refseq: bacteria, archaea, viral, human 2020/01)
Centrifuge v1.0.4 (database version: Refseq: bacteria, archaea (compressed) 2018/04)
Gottcha2 v (database version: RefSeq-r90 Bacteria Archaea Viruses (complete genomes))
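
A check like the following could catch an empty version field before a release. This is only a sketch: it assumes each tool line follows the `<tool> v<version> (database version: <db>)` layout seen in the file above, and the function name `tools_missing_version` is hypothetical, not part of the workflow.

```python
import re
from typing import List

# Matches "<tool> v<version> (database version: <db>)"; the version may be empty.
LINE_RE = re.compile(
    r"^(?P<tool>\S+) v(?P<version>\S*) \(database version: (?P<db>.*)\)$"
)


def tools_missing_version(info_text: str) -> List[str]:
    """Return the names of tools whose version field is empty."""
    missing = []
    for line in info_text.splitlines():
        match = LINE_RE.match(line.strip())
        # Lines that do not match the layout (e.g. the header) are skipped.
        if match and not match.group("version"):
            missing.append(match.group("tool"))
    return missing
```

Run against the profiler.info content above, this would report only Gottcha2.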

poeli commented 10 months ago

@aclum The issue has been resolved (output buffer didn't flush). I will proceed to rebuild the container and initiate the testing process.

poeli commented 10 months ago

@aclum A new Docker container has been built and tested. I don't have access to the NMDC Docker Hub, so I pushed it to my own Docker account. Please let me know if you have any questions or additional issues.

aclum commented 10 months ago

@poeli I pulled your image and pushed it back to the nmdc repo https://hub.docker.com/layers/microbiomedata/nmdc_taxa_profilers/1.0.5/images/sha256-808a5194b42503d50b93c06a6f4dd5ab83fdc85453872abae91e653a8a2c26c6?context=explore so you can reference it in the workflow.

poeli commented 10 months ago

Updated to the master branch

aclum commented 9 months ago

@Michal-Babins @mbthornton-lbl We'll need a new release of the https://github.com/microbiomedata/ReadbasedAnalysis repo, and the nmdc automation repo needs to be updated to use this new version. With this fix, the expected behavior is that the *_profiler.info file has a version number populated for Gottcha2.

Michal-Babins commented 9 months ago

Assets were updated to include the updated version. v1.0.5 is the current release, and it is now correctly reflected by the ReadbasedAnalysis.wdl and bundle.zip. Moving forward, we need to make sure that any changes made to any branch, when merged to master, are reflected as a major, minor, or patch version update.

Michal-Babins commented 9 months ago

@poeli, does this version of gottcha2 never write out ${prefix}.full.tsv, or does it skip it only when nothing is found? I am seeing some workflows fail because Cromwell is unable to find ${prefix}.full.tsv