merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

A new design for self table of anvi'o databases to retain their history #1673

Open meren opened 3 years ago

meren commented 3 years ago

I wanted to mention this as a potential future design effort. I will use the contigs database as an example, but it is applicable to any anvi'o database.

The problem

The codebase includes many classes that operate on db artifacts and update them with new data. For instance, when you run anvi-run-kegg-kofams, you get your KOfams, which influence the results of anvi-estimate-metabolism. The class that anvi-run-kegg-kofams inherits is configured with many default or user-defined parameters. Yet these key details, which are needed to make sense of the results stored in a contigs database that has lost its connection to its creator (like these ones, for instance), are forever lost in the log files of whoever ran a program on that anvi'o database.

Currently we keep key information for a given contigs database in its self table (the contents of which are printed out anytime someone runs anvi-db-info on a database, and which are used by many programs). But the current design of this two-column table does not have much room for expansion.

The solution

We could solve this problem by adding one more column to the self table and editing all classes to take advantage of it. For instance, this is an example self table from a v7 contigs db:

| key | value |
| --- | --- |
| version | 20 |
| db_type | contigs |
| db_variant | unknown |
| project_name | Contigs DB for anvi'o mini self-test |
| description | |
| contigs_db_hash | hash71162a87 |
| split_length | 1000 |
| kmer_size | 4 |
| num_contigs | 6 |
| total_length | 57030 |
| num_splits | 38 |
| gene_level_taxonomy_source | |
| genes_are_called | 1 |
| external_gene_calls | 0 |
| external_gene_amino_acid_seqs | 0 |
| skip_predict_frame | 0 |
| splits_consider_gene_calls | 1 |
| scg_taxonomy_was_run | 0 |
| scg_taxonomy_database_version | |
| trna_taxonomy_was_run | 0 |
| trna_taxonomy_database_version | |
| creation_date | 1611341932.01437 |
| gene_function_sources | ProSiteProfiles,Coils,TIGRFAM,Pfam,SMART,Hamap,SUPERFAMILY,Gene3D,PRINTS,PIRSF,ProSitePatterns |

I think this would've been a better design:

| key | value | data_group |
| --- | --- | --- |
| version | 20 | self |
| db_type | contigs | self |
| db_variant | unknown | self |
| project_name | Contigs DB for anvi'o mini self-test | self |
| description | | self |
| contigs_db_hash | hash71162a87 | self |
| split_length | 1000 | self |
| kmer_size | 4 | self |
| num_contigs | 6 | self |
| total_length | 57030 | self |
| num_splits | 38 | self |
| gene_level_taxonomy_source | | self |
| genes_are_called | 1 | self |
| external_gene_calls | 0 | self |
| external_gene_amino_acid_seqs | 0 | self |
| skip_predict_frame | 0 | self |
| splits_consider_gene_calls | 1 | self |
| scg_taxonomy_was_run | 0 | self |
| scg_taxonomy_database_version | | self |
| trna_taxonomy_was_run | 0 | self |
| trna_taxonomy_database_version | | self |
| creation_date | 1611341932.01437 | self |
| gene_function_sources | ProSiteProfiles,Coils,TIGRFAM,Pfam,SMART,Hamap,SUPERFAMILY,Gene3D,PRINTS,PIRSF,ProSitePatterns | self |
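
For an existing database, moving to the three-column layout could be a small migration. The following is only a sketch: it assumes the current self table has two text columns named key and value, and the helper name add_data_group_column is made up for illustration. It is not part of anvi'o.

```python
import sqlite3

# Assumption: the existing self table has two text columns, 'key' and 'value'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE self (key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO self VALUES (?, ?)",
    [("version", "20"), ("db_type", "contigs"), ("description", "")],
)


def add_data_group_column(conn):
    """Hypothetical migration: add a data_group column if it is missing.

    Rows that already exist are assigned to the 'self' group through the
    column default, so they keep their current meaning.
    """
    columns = [row[1] for row in conn.execute("PRAGMA table_info(self)")]
    if "data_group" not in columns:
        conn.execute(
            "ALTER TABLE self ADD COLUMN data_group TEXT NOT NULL DEFAULT 'self'"
        )
    conn.commit()


add_data_group_column(conn)
rows = conn.execute("SELECT key, value, data_group FROM self").fetchall()
```

Checking the column list first makes the migration safe to run more than once.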

Practical implications

This design would enable any anvi'o program or external programs that modify things in anvi'o contigs databases to store their configuration this way:

db.store_configuration(key_value_dict, data_group=data_group)

And any other program that needs the configuration of a particular program (such as anvi-gen-genomes-storage, which doesn't want to create a genomes storage from contigs dbs that contain incompatible data) could retrieve it this way:

data_config = db.read_configuration(data_group=data_group)
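
To make the two calls above concrete, here is a minimal sketch of how they might behave on top of SQLite. Everything in it is an assumption made for illustration: the SelfTable class, the column types, and the choice to store every value as text. Neither method exists in anvi'o as written here.

```python
import sqlite3


class SelfTable:
    """Hypothetical wrapper around a three-column self table."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS self "
            "(key TEXT, value TEXT, data_group TEXT NOT NULL DEFAULT 'self')"
        )

    def store_configuration(self, key_value_dict, data_group):
        # Replace whatever this data group stored before, in one transaction.
        with self.conn:
            self.conn.execute(
                "DELETE FROM self WHERE data_group = ?", (data_group,)
            )
            self.conn.executemany(
                "INSERT INTO self (key, value, data_group) VALUES (?, ?, ?)",
                [(k, str(v), data_group) for k, v in key_value_dict.items()],
            )

    def read_configuration(self, data_group):
        # Values come back as text because the value column is TEXT.
        rows = self.conn.execute(
            "SELECT key, value FROM self WHERE data_group = ?", (data_group,)
        )
        return dict(rows.fetchall())


db = SelfTable(sqlite3.connect(":memory:"))
db.store_configuration(
    {"hmmer_program": "hmmsearch", "keep_all_hits": False},
    data_group="RunKOfams",
)
data_config = db.read_configuration(data_group="RunKOfams")
```

In this sketch, storing a configuration replaces any earlier rows for the same data group, so the table always reflects the most recent run of each program.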

Continuing with the example of anvi-run-kegg-kofams, when it is done running, it would update the self table with the following information:

| key | value | data_group |
| --- | --- | --- |
| version | 20 | self |
| db_type | contigs | self |
| db_variant | unknown | self |
| project_name | Contigs DB for anvi'o mini self-test | self |
| description | | self |
| contigs_db_hash | hash71162a87 | self |
| split_length | 1000 | self |
| kmer_size | 4 | self |
| num_contigs | 6 | self |
| total_length | 57030 | self |
| (...) | (...) | (...) |
| num_threads | 4 | RunKOfams |
| hmmer_program | hmmsearch | RunKOfams |
| keep_all_hits | False | RunKOfams |
| log_bitscores | False | RunKOfams |
| skip_bitscore_heuristic | False | RunKOfams |
| bitscore_heuristic_e_value | 1e-05 | RunKOfams |
| bitscore_heuristic_bitscore_fraction | 0.5 | RunKOfams |
| kegg_db_version | 0.1 | RunKOfams |

ekiefl commented 3 years ago

I like this. Certainly this would be very useful for users (anvi-db-info) and for programmers (with the store_configuration and read_configuration API).

However, in my opinion this is not necessarily storing a database's history, but rather storing its current state. For example, if someone ran KOfams again on this database, it would overwrite these rows with the updated information.

I'm being pedantic only because the mention of history made me think of storing all operations that a database partakes in, maybe in a table called history. Any program that takes the db artifact as input or output could add an entry to this history table, so that a complete log of what the DB has undergone is available. One use case would be letting a user retrieve prior commands they have run. Another, more ambitious use case would be parsing the history in order to create a reproducible workflow.
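
As a rough illustration of that idea, an append-only history table could look something like the sketch below. The table layout (timestamp, program, parameters as JSON) and the function names log_operation and read_history are assumptions made up for this example, not an existing anvi'o API.

```python
import json
import sqlite3
import time


def log_operation(conn, program, parameters):
    """Hypothetical helper: append one entry to the history table.

    Earlier entries are never modified, so re-running a program adds a
    new row instead of overwriting the previous one.
    """
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS history "
            "(timestamp REAL, program TEXT, parameters TEXT)"
        )
        conn.execute(
            "INSERT INTO history VALUES (?, ?, ?)",
            (time.time(), program, json.dumps(parameters, sort_keys=True)),
        )


def read_history(conn):
    """Return (program, parameters) pairs in the order they were logged."""
    rows = conn.execute("SELECT program, parameters FROM history ORDER BY rowid")
    return [(program, json.loads(parameters)) for program, parameters in rows]


conn = sqlite3.connect(":memory:")
log_operation(conn, "anvi-run-kegg-kofams", {"num_threads": 4})
log_operation(conn, "anvi-run-kegg-kofams", {"num_threads": 8})
history = read_history(conn)
```

Both runs remain in the log, which is what would make it possible to retrieve prior commands or reconstruct a workflow later.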

meren commented 3 years ago

> However, in my opinion this is not necessarily storing a database's history, but rather storing its current state

Yes, indeed. I meant the state (but couldn't say it since state is so so associated with the interactive interface) :)