[FEATURE REQUEST] Outputs for function- and pathway-level variability analysis

ivagljiva commented 1 year ago

The need

I am opening this issue on behalf of @tdelmont, who wants a strategy that connects the SNV and SAAV data stored in the profile database to functions and pathways. The output provided by anvi-gen-variability-profile is a table of information per variant (in each sample), which includes the corresponding gene call for the variant but not its functional annotation.

Instead, it would be nice to get a table of functions (that is, genes with their annotations) with the variability summarized per gene call, as in a SNV density (# SNV positions / length of gene in nucleotide base pairs) or SAAV density (# SAAV positions / length of gene in amino acids). We could also include pN/pS of the gene if we want (this is already reported by anvi-gen-pn-ps-ratio, but I guess it doesn't hurt to add a flag for this to report it along with SNV density so the user doesn't have to run two programs).

We also discussed another table output, this time at the level of pathways (ie, modules from KEGG, or user-defined pathways). We could either compute SNV density for a pathway as (# SNV positions across all genes in pathway / cumulative length of all genes in pathway) or we could take the average of the SNV density across all genes in the pathway. But for this, I am not sure of the best way to handle genes encoding alternative enzymes - we could 1) use a 'pathwise' strategy to identify sets of non-redundant genes (and compute SNV density for one or more individual paths) or 2) we could do something a bit more complex to aggregate data for the alternative enzymes before computing the density for the entire pathway (something like 'stepwise') or 3) we could just concatenate all the genes regardless of their relationship within the pathway.

With these outputs, the functions/pathways could be sorted according to their level of diversity so the user could identify the most variable ones in their samples.

The solution

I see two possibilities here. Either we update anvi-gen-variability-profile to have a flag for generating function-level output (and one for pathway-level output), or we make an entirely new program to do these things. Regardless, we will need to add a couple of methods for computing the densities per gene call. Generating the pathway-level output will be a bit more complex and will require the new methods to use metabolism data (for module definitions) and some of the methods that are currently stored in kegg.py.

Beneficiaries

@tdelmont and anyone else who wants to analyze variability of functions and pathways.

@meren, on the surface, this seems rather easy to implement, but I am not so familiar with the variability and variabilityops part of the codebase and I'm sure I am missing some of the nuances :) What do you think of this?

jreveillaud commented 1 year ago

This would be awesome, thanks @ivagljiva @tdelmont ! The team here in Montpellier (@HansSCHR and @btrouche) would be really interested by this and happy beneficiaries too. Please let us know if we can help in any way (providing feedback on our datasets or something else)

meren commented 1 year ago

I think the way to do this is to follow the footsteps of anvi-get-pn-ps-ratio or anvi-gen-fixation-index-matrix and implement something that works with a variability-profile-txt artifact directly.

If we can come up with a good design, all programs that work with variability-profile-txt would accept flags that produce such summaries at the function- or metabolism-level :)

merenlab / anvio