merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

The need for a standard table to keep additional data for items in anvi'o profile databases #662

Closed meren closed 6 years ago

meren commented 6 years ago

Currently anvi-interactive (and anvi-display-pan) let the user add extra per-item layers of information to the display through the --additional-layers flag.

The advantage of this flag is that it lets the user expand their investigation with additional information in an ad hoc and painless manner. However, keeping it as is has two disadvantages: (1) the additional data files have to be carried around with the profile databases to keep things reproducible, and (2) without a standard table for such information, we depend on external files, which prevents elegant implementations that add new layers into profile databases smoothly (see #661).

What do we need?

We need a new standard table, item_additional_data, and programs that manipulate the information in this table through a nice class in anvio/dbops. These programs should be able to import data from the current additional data files, show the contents of the table, export it, or remove one or more columns.

With this infrastructure, wandering into less explored territories will be much more straightforward.
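To make the four operations above concrete, here is a minimal sketch using only sqlite3. The schema (one row per item and key), the class name ItemAdditionalData, and its method names are all assumptions made for illustration; this is not the anvi'o implementation.

```python
import sqlite3

# Hypothetical long-format schema: one row per (item, key) pair.
SCHEMA = """CREATE TABLE IF NOT EXISTS item_additional_data (
    item_name TEXT NOT NULL,
    data_key TEXT NOT NULL,
    data_value TEXT,
    PRIMARY KEY (item_name, data_key)
)"""


class ItemAdditionalData:
    """Illustrative stand-in, not the class in anvio/dbops."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(SCHEMA)

    def add(self, data):
        """Import {item_name: {key: value}}; existing pairs are overwritten."""
        rows = [(item, key, str(value))
                for item, keys in data.items()
                for key, value in keys.items()]
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO item_additional_data VALUES (?, ?, ?)", rows)

    def get(self):
        """Return the table as {item_name: {key: value}} (values as text)."""
        result = {}
        query = "SELECT item_name, data_key, data_value FROM item_additional_data"
        for item, key, value in self.conn.execute(query):
            result.setdefault(item, {})[key] = value
        return result

    def remove(self, keys):
        """Drop one or more data keys (display columns) for all items."""
        with self.conn:
            self.conn.executemany(
                "DELETE FROM item_additional_data WHERE data_key = ?",
                [(key,) for key in keys])

    def export(self, path):
        """Write a TAB-delimited file with one row per item."""
        data = self.get()
        keys = sorted({key for values in data.values() for key in values})
        with open(path, "w") as out:
            out.write("\t".join(["item_name"] + keys) + "\n")
            for item in sorted(data):
                out.write("\t".join([item] + [data[item].get(k, "") for k in keys]) + "\n")
```

Storing one row per item and key means a new layer never requires altering the table structure, at the cost of having to pivot the rows whenever a matrix is needed.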

meren commented 6 years ago

Some notes for future hackers.

For gene cluster items in anvi'o pan genomes, we were generating a table on the fly for some additional data:

additional_data_structure = ['gene_cluster', 'num_genomes_gene_cluster_has_hits', 'num_genes_in_gene_cluster', 'SCG']
dbops.TablesForViews(self.pan_db_path).create_new_view(
                                data_dict=self.additional_view_data,
                                table_name='additional_data',
                                table_structure=additional_data_structure,
                                table_types=['text', 'numeric', 'numeric', 'numeric'],
                                view_name = None)

# add additional data structure to the self table, so we can have them initially ordered
# in the interface the way additional_data_structure suggests:
pan_db = dbops.PanDatabase(self.pan_db_path, quiet=True)
pan_db.db.set_meta_value('additional_data_headers', ','.join(additional_data_structure[1:]))
pan_db.disconnect()

We were then generating ad hoc clustering recipes that access these data, and merging them with the existing clustering recipes for pan genomes. The default clustering recipes contained instructions to order gene clusters based either on their binary occurrence across genomes or on the number of genes contributed by each genome. For instance, the clustering recipe for presence-absence looked like this:

[general]

[PresenceAbsenceData !PAN.db::gene_cluster_presence_absence]
normalize = False

And the following code in panops.py 'enhanced' these config files prior to clustering, using the additional data that was generated and stored in the additional data table:

for config_name in constants.clustering_configs['pan']:
    config_path = constants.clustering_configs['pan'][config_name]

    # now we have the config path. we first get a temporary file path:
    enhanced_config_path = filesnpaths.get_temp_file_path()

    # setup the additional section based on the number of genomes we have:
    if config_name == 'presence-absence':
        additional_config_section="""\n[AdditionalData !PAN.db::item_additional_data]\ntable_form=dataframe\ncolumns_to_use = %s\nnormalize = False\n""" \
                                % ','.join(['num_genomes_gene_cluster_has_hits'] * (int(round(len(self.genomes) / 2))))
    elif config_name == 'frequency':
        additional_config_section="""\n[AdditionalData !PAN.db::item_additional_data]\ntable_form=dataframe\ncolumns_to_use = %s\nnormalize = False\nlog=True\n""" \
                                % ','.join(['num_genes_in_gene_cluster'] * (int(round(math.sqrt(len(self.genomes))))))

    # write the content down in to file at the new path:
    open(enhanced_config_path, 'w').write(open(config_path).read() + additional_config_section)

    # update the clustering configs:
    updated_clustering_configs[config_name] = enhanced_config_path

    dbops.do_hierarchical_clustering_of_items(self.pan_db_path, updated_clustering_configs, database_paths={'PAN.db': self.pan_db_path},\
                                              input_directory=self.output_dir, default_clustering_config=constants.pan_default,\
                                              distance=self.distance, linkage=self.linkage, run=self.run, progress=self.progress)

So the ini file at the top looked like this prior to clustering:

[general]

[PresenceAbsenceData !PAN.db::gene_cluster_presence_absence]
normalize = False

[AdditionalData !PAN.db::additional_data]
columns_to_use = num_genomes_gene_cluster_has_hits,num_genomes_gene_cluster_has_hits,(...)
normalize = False

This was probably one of the least embarrassing hacks in the codebase, and it gave us nice-looking orders for gene clusters.
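For future hackers who want to play with the repetition trick in isolation, here is a small sketch of how the extra section can be built. The function name additional_section and the SECTION template are hypothetical; the column names, the repeat counts, and the log=True line are taken from the panops.py code quoted above. Listing the same column several times presumably gives it that much more weight when distances between gene clusters are computed, which is why the count scales with the number of genomes.

```python
import math

# Template mirroring the section that was appended to the default recipes.
SECTION = "\n[AdditionalData !PAN.db::{table}]\ncolumns_to_use = {columns}\nnormalize = False\n{extra}"


def additional_section(config_name, num_genomes, table="additional_data"):
    """Return the extra recipe section for a given number of genomes."""
    if config_name == "presence-absence":
        column = "num_genomes_gene_cluster_has_hits"
        repeats = int(round(num_genomes / 2))
        extra = ""
    elif config_name == "frequency":
        column = "num_genes_in_gene_cluster"
        repeats = int(round(math.sqrt(num_genomes)))
        extra = "log=True\n"
    else:
        raise ValueError("no additional section defined for '%s'" % config_name)

    # The same column name is listed `repeats` times on purpose.
    return SECTION.format(table=table, columns=",".join([column] * repeats), extra=extra)


if __name__ == "__main__":
    print(additional_section("presence-absence", num_genomes=10))
```

Note that Python 3 rounds halves to the nearest even number, so an odd number of genomes such as 5 yields 2 repeats rather than 3.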

But now that we have a standard item_additional_data table, the separate table created with table_name='additional_data' is no longer necessary, and the code at the very top can be replaced with this:

item_additional_data_table = dbops.TableForItemAdditionalData(self.args)
item_additional_data_keys = ['num_genomes_gene_cluster_has_hits', 'num_genes_in_gene_cluster', 'SCG']
item_additional_data_table.add(item_additional_data_keys, self.additional_view_data)

But because item_additional_data is not stored in matrix form, the class that handles clustering configurations was freaking out. As of commit 52629da119f165afcfb99cccd6f6a47026ef7fe0, panops.py generates configs that look like this one:

[general]

[PresenceAbsenceData !PAN.db::gene_cluster_presence_absence]
normalize = False

[AdditionalData !PAN.db::item_additional_data]
table_form = dataframe
columns_to_use = num_genomes_gene_cluster_has_hits,num_genomes_gene_cluster_has_hits,(...)
normalize = False

And thanks to commit a59b5f9887f358136a80c4ef4cefe461e8cc1be7, which introduces a new variable, 'table_form', for matrix sections of clustering recipes, clusteringconfigurations.py knows how to deal with the situation.
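The difference between the two table forms can be sketched as a pivot. This is only an illustration of the idea, assuming the table holds one (item, key, value) row per data point; long_to_matrix is a hypothetical helper, not the code in clusteringconfigurations.py, and treating missing values as 0.0 is an assumption of this sketch.

```python
def long_to_matrix(rows, columns_to_use):
    """Pivot (item, key, value) rows into {item: [one value per requested column]}.

    A column listed more than once in columns_to_use is repeated in the
    output, matching the repeated names in the recipes above. Missing
    values become 0.0 (an assumption made for this sketch).
    """
    lookup = {}
    for item, key, value in rows:
        lookup.setdefault(item, {})[key] = float(value)

    return {item: [values.get(column, 0.0) for column in columns_to_use]
            for item, values in lookup.items()}


if __name__ == "__main__":
    rows = [("GC_1", "num_genomes_gene_cluster_has_hits", "3"),
            ("GC_1", "SCG", "1"),
            ("GC_2", "num_genomes_gene_cluster_has_hits", "5")]
    print(long_to_matrix(rows, ["num_genomes_gene_cluster_has_hits"] * 2))
```

A table that is already in matrix form can skip this step, which is presumably why the recipe needs a way to say which form it is pointing at.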

A Sunday fun.

meren commented 6 years ago

Some info: http://merenlab.org/2017/12/11/items-additional-data-tables/