Release v1.3.0:

`VEBA` Modules:

* Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from de novo genomes and annotations. Essentially, a reads-based functional profiling method via `HUMAnN` using binned genomes as the database.
* Added `marker_gene_clustering.py` script which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to only that genome cluster. Clusters in either protein or nucleotide space.
* Added `module_completion_ratios.py` script which calculates KEGG module completion ratios for genomes and pangenomes. Automatically run in the backend of `annotate.py`.
* Updated `annotate.py` and `merge_annotations.py` to provide better annotations for clustered proteins.
* Added `merge_genome_quality.py` and `merge_taxonomy_classifications.py` which compile genome quality and taxonomy, respectively, for all organisms.
* Added BGC clustering in protein and nucleotide space to `biosynthetic.py`. It also produces prevalence tables that can be used for further clustering of BGCs.
* Added `pangenome_core_sequences` in `cluster.py` which writes both protein and CDS sequences for each genome cluster.
* Added PDF visualization of newick trees in `phylogeny.py`.
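
The marker definition above (core to one genome cluster and absent from every other) can be illustrated with a small set-based sketch. This is not the code in `marker_gene_clustering.py`; the function name, the input dictionaries, and the `>=` comparison against the prevalence threshold are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def core_marker_clusters(genome_to_slc, genome_to_proteins, minimum_core_prevalence=1.0):
    """Hypothetical helper: return {genome_cluster: marker protein clusters}.

    A protein cluster is treated as a marker when it is core to one genome
    cluster (prevalence >= minimum_core_prevalence, an assumed rule) and is
    not observed in any genome of any other genome cluster.
    """
    # Group genomes by their genome cluster (SLC / pangenome)
    slc_to_genomes = defaultdict(list)
    for genome, slc in genome_to_slc.items():
        slc_to_genomes[slc].append(genome)

    core, pan = {}, {}
    for slc, genomes in slc_to_genomes.items():
        # Count in how many genomes of this cluster each protein cluster occurs
        counts = Counter()
        for genome in genomes:
            counts.update(set(genome_to_proteins[genome]))
        pan[slc] = set(counts)
        core[slc] = {
            protein_cluster
            for protein_cluster, n in counts.items()
            if n / len(genomes) >= minimum_core_prevalence
        }

    markers = {}
    for slc in core:
        # Everything seen in any other genome cluster disqualifies a marker
        elsewhere = set().union(*(pan[other] for other in pan if other != slc))
        markers[slc] = core[slc] - elsewhere
    return markers


# Toy input with made-up identifiers
genome_to_slc = {"G1": "SLC-0", "G2": "SLC-0", "G3": "SLC-1"}
genome_to_proteins = {
    "G1": {"PC1", "PC2", "PC3"},
    "G2": {"PC1", "PC2"},
    "G3": {"PC2", "PC4"},
}
markers = core_marker_clusters(genome_to_slc, genome_to_proteins)
print(markers)
```

In this toy input, `PC2` is core to `SLC-0` but also occurs in `SLC-1`, so only `PC1` remains as a marker for `SLC-0`.
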
`VEBA` Database (`VDB_v5.2`):

* Added `CAZy`
* Added `MicrobeAnnotator-KEGG`
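
As a rough illustration of what a KEGG module completion ratio measures, the sketch below treats a module as an ordered list of steps, each satisfied by any one of several alternative KOs. This is a deliberate simplification: real KEGG module definitions are nested expressions, and the `MicrobeAnnotator`-derived script may compute the ratio differently. The function names, the KO placeholders, and the choice to treat a pangenome as the union of its genomes' KOs are all assumptions.

```python
def module_completion_ratio(module_steps, kos_present):
    """Hypothetical helper: fraction of module steps with at least one KO present.

    module_steps: list of sets, one set of alternative KOs per step (simplified).
    kos_present: iterable of KOs annotated in a genome or pangenome.
    """
    if not module_steps:
        return 0.0
    kos_present = set(kos_present)
    satisfied = sum(1 for alternatives in module_steps if kos_present & set(alternatives))
    return satisfied / len(module_steps)


def pangenome_kos(genome_to_kos):
    """Assumed pangenome rule: pool the KOs of all member genomes."""
    return set().union(*genome_to_kos.values())


# Placeholder module with four steps; KO names are made up, not real KEGG IDs
example_module = [{"KO_A"}, {"KO_B", "KO_C"}, {"KO_D"}, {"KO_E"}]
genome_to_kos = {
    "G1": {"KO_A", "KO_C"},
    "G2": {"KO_D"},
}

for genome, kos in genome_to_kos.items():
    print(genome, module_completion_ratio(example_module, kos))
print("pangenome", module_completion_ratio(example_module, pangenome_kos(genome_to_kos)))
```

Under these assumptions a pangenome can reach a higher completion ratio than any single member genome, because steps missing from one genome may be covered by another.
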
**Release v1.3.0 Details**
* Updated `annotate.py` and `merge_annotations.py` to handle `CAZy`. Both now also handle annotations for clustered proteins properly.
* Added `module_completion_ratios.py` script, a fork of `MicrobeAnnotator` [`ko_mapper.py`](https://github.com/cruizperez/MicrobeAnnotator/blob/master/microbeannotator/pipeline/ko_mapper.py). The accompanying database ([Zenodo: 10020074](https://zenodo.org/records/10020074)) will be included in `VDB_v5.2`.
* Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
* Added `profile-pathway.py` module and `VEBA-profile_env` environment. The module is a wrapper around `HUMAnN` for the custom database created from `annotate.py` and `compile_custom_humann_database_from_annotations.py`.
* Added `GenoPype version` to log output
* Added `merge_genome_quality.py` which combines `CheckV`, `CheckM2`, and `BUSCO` results.
* Added `compile_custom_humann_database_from_annotations.py` which compiles a `HUMAnN` protein database table from the output of `annotate.py` and taxonomy classifications.
* Added `--no_domain` and `--no_header` options to `merge_taxonomy_classifications.py` so that its output can serve as input to `compile_custom_humann_database_from_annotations.py`.
* Added `marker_gene_clustering.py` script which gets core marker genes unique to each SLC (i.e., pangenome). Added `average_number_of_copies_per_genome` to protein clusters.
* Added `--minimum_core_prevalence` in `global_clustering.py`, `local_clustering.py`, and `cluster.py` which sets the prevalence ratio at which a protein cluster in an SLC is considered core. Also removed `--no_singletons` from `cluster.py` to avoid complications with marker genes. Relabeled `--input` to `--genomes_table` in clustering scripts/module.
* Added a check in `coverage.py` that skips samples whose `mapped.sorted.bam` files have already been created. Not yet implemented for the GNU parallel option.
* Changed default representative sequence format from table to fasta for `mmseqs2_wrapper.py`.
* Added `--nucleotide_fasta_output` to `antismash_genbank_to_table.py` which outputs the actual BGC DNA sequence. Changed `--fasta_output` to `--protein_fasta_output` and added output to `biosynthetic.py`. Changed BGC component identifiers to `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])` to match `MetaEuk` identifiers. Changed `bgc_type` to `protocluster_type`. `biosynthetic.py` now supports GFF files from `MetaEuk` (exon and gene features not supported by `antiSMASH`). Fixed an error where `antismash_genbank_to_table.py` failed when `antiSMASH` added CDS (i.e., `allorf_[start]_[end]`) that are not in the GFF.
* Added `ete3` to `VEBA-phylogeny_env.yml`; trees are now automatically rendered to PDF.
* Added presets for `MEGAHIT` using the `--megahit_preset` option.
* The change for using `--mash_db` with `GTDB-Tk` violated the assumption that all prokaryotic classifications had a `msa_percent` field, which caused the cluster-level taxonomy to fail. `compile_prokaryotic_genome_cluster_classification_scores_table.py` fixes this by using `fastani_ani` as the weight when genomes were classified using ANI and `msa_percent` for everything else. The initial error caused every cluster-level classification to be reported as unclassified prokaryotic.
* Fixed small error where empty GFF files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
* Fixed critical error where header descriptions were not being removed in `eukaryota.scaffolds.list`, so `seqkit grep` did not remove eukaryotic scaffolds and `DAS_Tool` output eukaryotic MAGs in `identifier_mapping.tsv` and `__DASTool_scaffolds2bin.no_eukaryota.txt`.
* Fixed `krona.html` in `biosynthetic.py` which was being created incorrectly by the `compile_krona.py` script.
* Created `pangenome_core_sequences` in `global_clustering.py` and `local_clustering.py` which writes both protein and CDS sequences for each SLC. Also changed the default in `cluster.py` to NOT do local clustering, switching `--no_local_clustering` to `--local_clustering`.
* Fixed `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` in `biosynthetic.py`, raised when `Diamond` finds multiple matching regions in one hit. Added `--sort_by` and `--ascending` to `concatenate_dataframes.py` along with automatic detection and removal of duplicate indices. Also added `--sort_by bitscore` in `biosynthetic.py`.
* Added core pangenome and singleton hits to clustering output
* Updated `--megahit_memory` default from 0.9 to 0.99
* Fixed error in `genomad_taxonomy_wrapper.py` where `viral_taxonomy.tsv` should have been `taxonomy.tsv`.
* Fixed minor error in `assembly.py`, caused by incorrect string formatting, that prevented users from using `SPAdes` programs other than `spades.py`, `metaspades.py`, or `rnaspades.py`.
* Updated `bowtie2` in preprocess, assembly, and mapping modules. Updated `fastp` and `fastq_preprocessor` in preprocess module.
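
The BGC component identifier layout noted above, `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])`, can be built and parsed with a short sketch like the one below. The function names, the example `bgc_id`, and the assumption that strand is written as `+` or `-` are illustrative and are not taken from `antismash_genbank_to_table.py`.

```python
import re

# Greedy bgc_id so that underscores inside the BGC identifier are preserved
_COMPONENT_ID = re.compile(
    r"^(?P<bgc_id>.+)_(?P<position_in_bgc>\d+)"
    r"\|(?P<start>\d+):(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def format_bgc_component_id(bgc_id, position_in_bgc, start, end, strand):
    """Hypothetical helper: assemble an identifier in the layout from the notes."""
    return f"{bgc_id}_{position_in_bgc}|{start}:{end}({strand})"


def parse_bgc_component_id(identifier):
    """Hypothetical helper: split an identifier back into its fields."""
    match = _COMPONENT_ID.match(identifier)
    if match is None:
        raise ValueError(f"Not a BGC component identifier: {identifier!r}")
    fields = match.groupdict()
    # Numeric fields are converted so they can be used as coordinates
    for key in ("position_in_bgc", "start", "end"):
        fields[key] = int(fields[key])
    return fields


# Made-up example values
identifier = format_bgc_component_id("example_bgc", 3, 1200, 2450, "+")
print(identifier)
print(parse_bgc_component_id(identifier))
```

Keeping the coordinates and strand in the identifier means a component can be located on its contig without consulting a separate table.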