jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0

[Question] <Is there documentation on how to run VEBA using the docker containers?> #86

Closed by grendon 5 months ago

grendon commented 6 months ago

Please confirm that you've checked the FAQ section: https://github.com/jolespin/veba/blob/main/FAQ.md

If you still have a question, feel free to ask here.

I can't run Docker on the cluster, but I can run Singularity as long as I first pull the containers to the cluster, as shown here:

singularity pull docker://jolespin/veba_binning-viral:2.0.0

Could you tell me where I can find details on how to run that container (veba_binning-viral:2.0.0)? In particular, how do I set the --path_config and --veba_database parameters?

Do I need to run the veba_database container first? If so, which command should I run?

$ singularity run veba_binning-viral_2.0.0.sif binning-viral.py -h
usage: binning-viral.py -f  -l  -n  -o  [Requires at least 20GB]

    Running: binning-viral.py v2023.11.30 via Python v3.10.9 | /opt/conda/bin/python

options:
  -h, --help            show this help message and exit

Required I/O arguments:
  -f FASTA, --fasta FASTA
                        path/to/scaffolds.fasta
  -n NAME, --name NAME  Name of sample
  -o PROJECT_DIRECTORY, --project_directory PROJECT_DIRECTORY
                        path/to/project_directory [Default: veba_output/binning/viral]
  -b BAM [BAM ...], --bam BAM [BAM ...]
                        path/to/mapped.sorted.bam files separated by spaces.

Utility arguments:
  --path_config PATH_CONFIG
                        path/to/config.tsv [Default: CONDA_PREFIX]
  -p N_JOBS, --n_jobs N_JOBS
                        Number of threads [Default: 1]
  --random_state RANDOM_STATE
                        Random state [Default: 0]
  --restart_from_checkpoint RESTART_FROM_CHECKPOINT
                        Restart from a particular checkpoint [Default: None]
  -v, --version         show program's version number and exit

Database arguments:
  --veba_database VEBA_DATABASE
                        VEBA database location.  [Default: $VEBA_DATABASE environment variable]

Binning arguments:
  -a ALGORITHM, --algorithm ALGORITHM
                        Binning algorithm to use: {genomad, virfinder}  [Default: genomad]
  -m MINIMUM_CONTIG_LENGTH, --minimum_contig_length MINIMUM_CONTIG_LENGTH
                        Minimum contig length.  [Default: 1500]
  --include_provirus_detection
                        Include provirus viral detection

Gene model arguments:
  --prodigal_genetic_code PRODIGAL_GENETIC_CODE
                        Prodigal-GV -g translation table (https://github.com/apcamargo/prodigal-gv) [Default: 11]

geNomad arguments
Using --relaxed mode by default.  Adjust settings according to the following table: https://portal.nersc.gov/genomad/post_classification_filtering.html#default-parameters-and-presets:
  --genomad_qvalue GENOMAD_QVALUE
                        Maximum accepted false discovery rate. [Default: 1.0; 0.0 < x ≤ 1.0]
  --sensitivity SENSITIVITY
                        MMseqs2 marker search sensitivity. Higher values will annotate more proteins, but the search will be slower and consume more memory. [Default: 4.0; x ≥ 0.0]
  --splits SPLITS       Split the data for the MMseqs2 search. Higher values will reduce memory usage, but will make the search slower. If the MMseqs2 search is failing, try to increase the number of splits. Also used for VirFinder. [Default: 0; x ≥ 0]
  --composition COMPOSITION
                        Method for estimating sample composition. (auto|metagenome|virome) [Default: auto]
  --minimum_score MINIMUM_SCORE
                        Minimum score to flag a sequence as virus or plasmid. By default, the sequence is classified as virus/plasmid if its virus/plasmid score is higher than its chromosome score, regardless of the value. [Default: 0; 0.0 ≤ x ≤ 1.0]
  --minimum_plasmid_marker_enrichment MINIMUM_PLASMID_MARKER_ENRICHMENT
                        Minimum allowed value for the plasmid marker enrichment score, which represents the total enrichment of plasmid markers in the sequence. Sequences with multiple plasmid markers will have higher values than the ones that encode few or no markers.[Default: -100]
  --minimum_virus_marker_enrichment MINIMUM_VIRUS_MARKER_ENRICHMENT
                        Minimum allowed value for the virus marker enrichment score, which represents the total enrichment of plasmid markers in the sequence. Sequences with multiple plasmid markers will have higher values than the ones that encode few or no markers. [Default: -100]
  --minimum_plasmid_hallmarks MINIMUM_PLASMID_HALLMARKS
                        Minimum number of plasmid hallmarks in the identified plasmids.  [Default: 0; x ≥ 0]
  --minimum_virus_hallmarks MINIMUM_VIRUS_HALLMARKS
                        Minimum number of virus hallmarks in the identified viruses.  [Default: 0; x ≥ 0]
  --maximum_universal_single_copy_genes MAXIMUM_UNIVERSAL_SINGLE_COPY_GENES
                        Maximum allowed number of universal single copy genes (USCGs) in a virus or a plasmid. Sequences with more than this number of USCGs will not be classified as viruses or plasmids, regardless of their score.  [Default: 100]
  --genomad_options GENOMAD_OPTIONS
                        geNomad | More options (e.g. --arg 1 ) [Default: '']

VirFinder arguments:
  --virfinder_pvalue VIRFINDER_PVALUE
                        VirFinder statistical test threshold [Default: 0.05]
  --mmseqs2_evalue MMSEQS2_EVALUE
                        Maximum accepted E-value in the MMseqs2 search. Used by genomad annotate when VirFinder is used as binning algorithm [Default: 1e-3]
  --use_qvalue          Use qvalue (FDR) instead of pvalue
  --use_minimal_database_for_taxonomy
                        Use a smaller marker database to annotate proteins. This will make execution faster but sensitivity will be reduced.
  --virfinder_options VIRFINDER_OPTIONS
                        VirFinder | More options (e.g. --arg 1 ) [Default: '']

CheckV arguments:
  --checkv_options CHECKV_OPTIONS
                        CheckV | More options (e.g. --arg 1 ) [Default: '']
  --multiplier_viral_to_host_genes MULTIPLIER_VIRAL_TO_HOST_GENES
                        Minimum number of viral genes [Default: 5]
  --checkv_completeness CHECKV_COMPLETENESS
                        Minimum completeness [Default: 50.0]
  --checkv_quality CHECKV_QUALITY
                        Comma-separated string of acceptable arguments between {High-quality,Medium-quality,Complete} [Default: High-quality,Medium-quality,Complete]
  --miuvig_quality MIUVIG_QUALITY
                        Comma-separated string of acceptable arguments between {High-quality,Medium-quality,Complete} [Default: High-quality,Medium-quality,Complete]

featureCounts arguments:
  --long_reads          featureCounts | Use this if long reads are being used
  --featurecounts_options FEATURECOUNTS_OPTIONS
                        featureCounts | More options (e.g. --arg 1 ) [Default: ''] | http://bioinf.wehi.edu.au/featureCounts/

Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)

Thanks in advance for your help.

jolespin commented 5 months ago

Apologies for the delay. For some reason I’m not getting notifications for new issues. Here’s a Docker tutorial I made. I’m also going to make a simplified version next week, along with a YouTube video walking through the process: https://github.com/jolespin/veba/blob/main/walkthroughs/docs/adapting_commands_for_docker.md

jolespin commented 5 months ago

Also, I don’t have a Singularity tutorial yet, but here is a thread where another researcher and I implemented VEBA via Singularity using 2 different approaches: https://github.com/jolespin/veba/issues/45#issuecomment-1933221335
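In the meantime, here is a rough sketch of how the pieces from the question might fit together under Singularity. It is not the method from the linked thread, only an illustration under stated assumptions: the VEBA database has already been downloaded somewhere on the cluster, the host paths (HOST_WORK, HOST_DB), the container-side mount points (/data, /db), the sample name and the thread count are all hypothetical, and --path_config is left at its default on the assumption that the environment bundled in the container resolves it. The run_or_print helper is also made up for this example; it only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: every path and mount point below is a placeholder.
set -eu

SIF="${SIF:-veba_binning-viral_2.0.0.sif}"      # image pulled with `singularity pull` in the question
HOST_WORK="${HOST_WORK:-$PWD}"                  # hypothetical: host directory holding the FASTA and BAM inputs
HOST_DB="${HOST_DB:-/path/to/veba_database}"    # hypothetical: host directory holding the VEBA database

# Print the command by default; execute it only when DRY_RUN=0.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

veba_binning_viral() {
  # Bind the host directories into the container, then hand the
  # container-side database path to --veba_database explicitly rather
  # than relying on the $VEBA_DATABASE environment variable.
  # The :ro (read-only) option on the database bind is an assumption;
  # drop it if a tool turns out to need write access there.
  run_or_print singularity run \
    --bind "${HOST_WORK}:/data" \
    --bind "${HOST_DB}:/db:ro" \
    "${SIF}" \
    binning-viral.py \
      -f /data/scaffolds.fasta \
      -b /data/mapped.sorted.bam \
      -n sample_1 \
      -o /data/veba_output/binning/viral \
      --veba_database /db \
      -p 4
}

veba_binning_viral
```

Run it once as-is to inspect the printed command, then rerun with DRY_RUN=0 once the paths look right. An alternative to passing --veba_database would be to export SINGULARITYENV_VEBA_DATABASE=/db before the call, since the help text says the option defaults to the $VEBA_DATABASE environment variable.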

Feel free to reopen if this doesn’t address your questions.