Visualise set-level workflow results against a phylogeny

Context

Users can run the TheiaProk workflow to characterize bacterial pathogens, obtaining information such as MLST, ARGs, plasmid types, and serotypes. They can also generate phylogenetic trees with the mashtree, KSNP3, core_gene_snp, and Snippy_Tree workflows. Currently, visualizing the characteristics of each bacterial isolate next to the phylogenetic tree, for comparative analysis and interpretation within a dataset, is a manual and complex process.

Proposal

Add a task to the phylogenetics workflows that enables users to format sample-level outputs for visualization against the phylogenetic tree.

Workflow inputs: the column headers from the sample-level data table whose data should be formatted for visualization against the phylogenetic tree
Task processing: Create an output file with one row for every sample in the dataset. For columns that hold a single item of data per sample (e.g. an MLST ST), copy that value into the output file for each sample, keeping the same column header as the original data table. For columns that hold a comma-separated list of data items per sample (e.g. ARGs, plasmid rep types), find all unique items in that column across the whole dataset, use each item as a column header, and record a presence/absence matrix of these items per sample.
Workflow outputs: two files: 1) a CSV/TSV file containing the summarised data for every column and sample specified in the workflow inputs, with a row per sample and a column per distinct data type; 2) the same file in Phandango-compatible format (https://github.com/jameshadfield/phandango/wiki/Input%20data%20formats#metadata)
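
To make the task processing concrete, here is a minimal Python sketch of the reshaping step. It is an illustration under stated assumptions, not an existing task in the repository: the function names (`summarise`, `to_csv`), the example column names, and the rule that a column counts as list-valued when at least one sample holds more than one comma-separated item are all hypothetical. It writes presence/absence as 1/0 and does not apply any Phandango-specific header conventions.

```python
import csv
import io


def split_items(value, sep=","):
    """Split a cell into its non-empty, whitespace-trimmed items."""
    return [part.strip() for part in (value or "").split(sep) if part.strip()]


def summarise(rows, sample_col, columns, sep=","):
    """Reshape sample-level rows into one output record per sample.

    rows       -- list of dicts, one per sample (e.g. from csv.DictReader)
    sample_col -- name of the column holding the sample name
    columns    -- the user-selected column headers to include
    Assumption: a column is treated as list-valued if any sample has more
    than one item in it; item names are assumed unique across columns.
    """
    # Collect the unique items of every list-valued column.
    list_items = {}
    for col in columns:
        cells = [split_items(row.get(col), sep) for row in rows]
        if any(len(cell) > 1 for cell in cells):
            list_items[col] = sorted({item for cell in cells for item in cell})

    # Single-valued columns keep their header; list columns expand to items.
    header = [sample_col]
    for col in columns:
        header.extend(list_items.get(col, [col]))

    records = []
    for row in rows:
        record = {sample_col: row[sample_col]}
        for col in columns:
            if col in list_items:
                present = set(split_items(row.get(col), sep))
                for item in list_items[col]:
                    record[item] = 1 if item in present else 0
            else:
                record[col] = (row.get(col) or "").strip()
        records.append(record)
    return header, records


def to_csv(header, records):
    """Serialise the summary as CSV text, sample name in the first column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    example = [
        {"sample": "s1", "mlst_st": "ST131", "args": "blaCTX-M-15,tetA"},
        {"sample": "s2", "mlst_st": "ST10", "args": "tetA"},
    ]
    print(to_csv(*summarise(example, "sample", ["mlst_st", "args"])))
```

One design question this sketch leaves open is how the task should know which columns are list-valued. Inferring it from the data, as above, misclassifies a list column in which every sample happens to carry a single item, so an explicit input naming the list columns may be more robust.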

theiagen / public_health_bacterial_genomics

Visualise set-level workflow results against a phylogeny #203
