jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0

v1.3.0 #31

Closed jolespin closed 1 year ago

jolespin commented 1 year ago

Release v1.3.0:

**Release v1.3.0 Details**

* Updated `annotate.py` and `merge_annotations.py` to handle `CAZy`. Both now also handle clustered protein annotations properly.
* Added `module_completion_ratio.py` script, which is a fork of `MicrobeAnnotator` [`ko_mapper.py`](https://github.com/cruizperez/MicrobeAnnotator/blob/master/microbeannotator/pipeline/ko_mapper.py). Also included a database [Zenodo: 10020074](https://zenodo.org/records/10020074), which will be included in `VDB_v5.2`.
* Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
* Added the `profile-pathway.py` module, a wrapper around `HUMAnN` for the custom database created from `annotate.py` and `compile_custom_humann_database_from_annotations.py`, along with the `VEBA-profile_env` environment.
* Added the `GenoPype` version to log output.
* Added `merge_genome_quality.py`, which combines `CheckV`, `CheckM2`, and `BUSCO` results.
* Added `compile_custom_humann_database_from_annotations.py`, which compiles a `HUMAnN` protein database table from the output of `annotate.py` and taxonomy classifications.
* Added `--no_domain` and `--no_header` options to `merge_taxonomy_classifications.py`; the resulting output serves as input to `compile_custom_humann_database_from_annotations.py`.
* Added `marker_gene_clustering.py` script, which gets core marker genes unique to each SLC (i.e., pangenome). Added `average_number_of_copies_per_genome` to protein clusters.
* Added `--minimum_core_prevalence` in `global_clustering.py`, `local_clustering.py`, and `cluster.py`, which sets the prevalence ratio at which protein clusters in a SLC are considered core. Also removed `--no_singletons` from `cluster.py` to avoid complications with marker genes. Relabeled `--input` to `--genomes_table` in clustering scripts/module.
* Added a check in `coverage.py` to see whether the `mapped.sorted.bam` files have already been created; if they have, they are skipped. Not yet implemented for the GNU parallel option.
* Changed default representative sequence format from table to fasta for `mmseqs2_wrapper.py`.
* Added `--nucleotide_fasta_output` to `antismash_genbank_to_table.py`, which outputs the actual BGC DNA sequence. Changed `--fasta_output` to `--protein_fasta_output` and added output to `biosynthetic.py`. Changed BGC component identifiers to `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])` to match `MetaEuk` identifiers. Changed `bgc_type` to `protocluster_type`. `biosynthetic.py` now supports GFF files from `MetaEuk` (exon and gene features not supported by `antiSMASH`). Fixed error where `antiSMASH` added CDS (i.e., `allorf_[start]_[end]`) that are not in the GFF, which caused `antismash_genbank_to_table.py` to fail in those cases.
* Added `ete3` to `VEBA-phylogeny_env.yml`; trees are now automatically rendered to PDF.
* Added presets for `MEGAHIT` using the `--megahit_preset` option.
* The change for using `--mash_db` with `GTDB-Tk` violated the assumption that all prokaryotic classifications had a `msa_percent` field, which caused the cluster-level taxonomy to fail. `compile_prokaryotic_genome_cluster_classification_scores_table.py` fixes this by using `fastani_ani` as the weight when genomes were classified using ANI and `msa_percent` for everything else. The initial error caused all cluster-level classifications to be reported as unclassified prokaryotic.
* Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
* Fixed critical error where descriptions in the header were not being removed in `eukaryota.scaffolds.list`, so `seqkit grep` did not remove eukaryotic scaffolds and `DAS_Tool` output eukaryotic MAGs in `identifier_mapping.tsv` and `__DASTool_scaffolds2bin.no_eukaryota.txt`.
* Fixed `krona.html` in `biosynthetic.py`, which was being created incorrectly from the `compile_krona.py` script.
* Created `pangenome_core_sequences` in `global_clustering.py` and `local_clustering.py`, which writes both protein and CDS sequences for each SLC. Also changed the default in `cluster.py` to NOT do local clustering, switching `--no_local_clustering` to `--local_clustering`.
* Fixed `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` in `biosynthetic.py`, raised when `Diamond` finds multiple matching regions in one hit. Added `--sort_by` and `--ascending` to `concatenate_dataframes.py` along with automatic detection and removal of duplicate indices. Also added `--sort_by bitscore` in `biosynthetic.py`.
* Added core pangenome and singleton hits to clustering output.
* Updated `--megahit_memory` default from 0.9 to 0.99.
* Fixed error in `genomad_taxonomy_wrapper.py` where `viral_taxonomy.tsv` should have been `taxonomy.tsv`.
* Fixed minor error in `assembly.py`, caused by incorrect string formatting, that prevented users from using `SPAdes` programs other than `spades.py`, `metaspades.py`, or `rnaspades.py`.
* Updated `bowtie2` in preprocess, assembly, and mapping modules. Updated `fastp` and `fastq_preprocessor` in preprocess module.
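
To illustrate the `--mash_db` fix noted above, here is a minimal sketch of the weighting rule (`fastani_ani` for ANI-classified genomes, `msa_percent` otherwise). It is not the code in `compile_prokaryotic_genome_cluster_classification_scores_table.py`. The function names, the missing-value placeholders, the rule that a genome counts as ANI-classified when `msa_percent` is missing but `fastani_ani` is present, and the weighted-vote consensus are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed placeholders for missing values in a classification table
MISSING = {"", "N/A", "NA", "nan", None}


def to_float(value):
    """Return value as a float, or None if it is a missing-value placeholder."""
    if value in MISSING:
        return None
    return float(value)


def classification_weight(row):
    """Pick the weight for one genome classification record (a dict).

    Assumption: a genome classified via ANI has no usable msa_percent,
    so fastani_ani is used; otherwise msa_percent is used.
    """
    msa = to_float(row.get("msa_percent"))
    ani = to_float(row.get("fastani_ani"))
    if msa is None and ani is not None:
        return ani
    if msa is not None:
        return msa
    return 0.0  # neither field is usable


def cluster_consensus(rows):
    """Hypothetical weighted vote: return (taxonomy, share of total weight)."""
    scores = defaultdict(float)
    for row in rows:
        scores[row["classification"]] += classification_weight(row)
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    return best, (scores[best] / total if total else 0.0)


# Toy records for one genome cluster (made-up values)
rows = [
    {"classification": "s__A", "fastani_ani": "98.5", "msa_percent": "N/A"},
    {"classification": "s__A", "fastani_ani": "N/A", "msa_percent": "80.0"},
    {"classification": "s__B", "fastani_ani": "N/A", "msa_percent": "60.0"},
]
taxonomy, share = cluster_consensus(rows)
```

With a rule like this, a record that lacks `msa_percent` still contributes a non-zero weight, so the cluster-level vote no longer collapses when ANI-classified genomes are present.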
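
The `--minimum_core_prevalence` option described above can be pictured with a small sketch. This is not the logic in `global_clustering.py` or `local_clustering.py`. The function name, the input layout (a mapping from genome to its protein clusters within one SLC), and the use of `>=` for the threshold comparison are assumptions.

```python
from collections import Counter


def core_protein_clusters(genome_to_clusters, minimum_core_prevalence):
    """Return protein clusters whose prevalence within one SLC meets the threshold.

    Prevalence is the fraction of genomes in the SLC that contain the cluster.
    """
    number_of_genomes = len(genome_to_clusters)
    if number_of_genomes == 0:
        return set()
    counts = Counter()
    for clusters in genome_to_clusters.values():
        # set() ensures multi-copy clusters are counted once per genome
        counts.update(set(clusters))
    return {
        cluster
        for cluster, n in counts.items()
        if n / number_of_genomes >= minimum_core_prevalence
    }


# Toy SLC with three genomes (identifiers are made up)
slc = {
    "genome_1": ["pc_a", "pc_b", "pc_c"],
    "genome_2": ["pc_a", "pc_b"],
    "genome_3": ["pc_a", "pc_a"],
}
strict_core = core_protein_clusters(slc, 1.0)
relaxed_core = core_protein_clusters(slc, 0.6)
```

In this toy case, a threshold of 1.0 keeps only clusters present in every genome, while 0.6 also admits clusters found in two of the three genomes.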
jolespin commented 1 year ago

v1.3.0