merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
444 stars 145 forks source link

Interactive gene view for bins: --gene-view flag for anvi-interactive #663

Closed meren closed 5 years ago

meren commented 6 years ago

This is the summary of our meeting today (@ozcan, @ShaiberAlon, @meren) during which we concluded that anvi'o needs a native 'gene view' mode, where every split is a gene with all perks of anvi-interactive (except binning; see below). The following is mostly Alon's notes from the meeting, which contain some of the implementation details and plans for the future.

Target

This mode will require a collection name and a bin name.

Views

Only views that will be available in the interface are going to be those in gene_coverge_stats_dictionary (which is recovered through anvio/dbops). Only after everything works completely we may consider adding all views that are available through the atomic data table for single profiles (such as variability view, max-normalized-ratio view, etc).

Ordering/clustering

Gene view would have the clusterings calculated on-the-fly (similarly to the strategy implemented in refine_mode in anvio/interactive). The available item orders would be:

Interactive right-click options

Sequence

When right-clicking the same options that are available for contigs, will be available for the gene.

Inspection

In gene view mode there will be to inspect options in the right-click menu:

• “inspect global” - opens the split (to which the gene belongs) with zoom-in on the gene of interest. • “inspect local” - splices out the gene start/stop as the only context to display coverages (+ 100 nucleotides in each direction). This would allow much faster loading of the inspection page when there are many samples.

Collections

No selections would be available, and no collections would be available at this time.

Future

"gene-mode" for anvi-split

We envision that there may be a “gene-mode” also for the program anvi-split in the future. This would create a gene profile database for a given bin. The gene profile database would be familiar with the parent contigs and profile databases (through their hash). This would allow us to use all anvi'o functionality (including creating collections and anvi-summarize tasks) on gene profile databases. When the parent databases are provided, it would then be possible to inspect a gene in the context of its parent contig.

searching for functions in genes

Allow to search for functions in "gene view".

Gene caller

Genes would be computed using the results from the default gene caller, but anvi-interactive will require a --gene-caller flag if the user wishes to use a different gene caller for gene view. This will wait for Meren to finish redesigning that is occurring in this branch: https://github.com/merenlab/anvio/tree/gene_caller_option

ShaiberAlon commented 6 years ago

I'm putting a note here that we didn't discuss storing states in the profile database.

We need to make sure that we can save states for --gene-view, and that these states are not confused with the full mode view.

ShaiberAlon commented 6 years ago

We worked on this today and committed all changes to gene-view branch. So far: We created load_gene_view, in which we:

  1. initialize the gene_level_coverage_stats_dict
  2. get all the possible views from the gene_level_coverage_stats_dict
  3. get all data from gene_level_coverage_stats_dict for the genes in the splits_of_interest (according to the collection and bin that was provided by the user), and use this to populate the dict in self.views

@meren, we hit a roadblock when trying to create the newick for ordering. The way ordering is created for other interactive modes is by referring to a table in the database, and then clustering is computed , but for --gene-view we don't have a table in the database, so we need to use the dictionary. So right now the solution seems a little ugly, we will use the dict that we create in self.views[view]['dict'], convert it into a pandas dataframe, then we will create a class in clustering.py that takes a dataframe and returns a newick tree. @meren, if you maybe see a different solution, please let us know.

meren commented 6 years ago

There is already a solution for that. Please see the source code of anvi-matrix-to-newick :)

Best,

ShaiberAlon commented 6 years ago

Got it. @ozcan, I'm putting a not for us here, it seems that once we create a pandas dataframe (or simply a numpy matrix) then we run the following lines (from clusteting.get_newick_tree_data):

    vectors = get_normalized_vectors(vectors, norm=norm, progress=progress)

    tree = get_clustering_as_tree(vectors, linkage, distance, progress)
    newick = get_tree_object_in_newick(tree, id_to_sample_dict)

It seems to me that we should actually change clusteting.get_newick_tree_data so instead of accepting a path to a matrix, it would accept a matrix, and then have a wrapper for it for the case of anvi-matrix-to-newick.

ShaiberAlon commented 6 years ago

Something to consider: anvi-mcg-classifier computes a table of the presence/absence of genes in samples, I think that this is something that people would definitely want to use in gene views. This means that we need to add it as an available view, and I also think we should add an ordination according to this information. My suggestion is to add a flag --skip-mcg-classification, and that unless it is provoked then mcgclassifier would run (without producing plots), and would add this gene presence/absence information in an ad-hoc manner. @ozcan, @meren, let me know what you think.

meren commented 6 years ago

This is very doable. Generating additional views, and newick item orders based on views is something we do in both pangenomic and metagenomic workflows, so there are tools to do that in place :) We can take a look at this together if you guys like.

ShaiberAlon commented 6 years ago

Update on the current state: Bottom line: we reached a road-block that probably requires Meren's help.

We have all the view data and the items orders created in the proper format (including the newick trees).

The problem we have is because we have a contigs database, then the Interactive class is kind of treating our gene_mode in a similar manner to full and refine modes. As an example to the source of the problem see these lines: https://github.com/merenlab/anvio/blob/7826b132c682a906959759fda7e236098ec439af/anvio/interactive.py#L971

https://github.com/merenlab/anvio/blob/7826b132c682a906959759fda7e236098ec439af/anvio/interactive.py#L998

Because we initialize the contigs database, we can expect both self.genes_in_splits_summary_dict to exist and also self.splits_taxonomy_dict (if there is taxonomy), even though we don't want to show split summaries nor taxonomy in gene_mode.

@meren, if you want to see the exact status, just go into the test-output folder, and run: anvi-interactive -c CONTIGS.db -p SAMPLES-MERGED/PROFILE.db --gene-view -C CONCOCT -b Bin_1

meren commented 6 years ago

I want to look into this, but currently the gene-view can't run the mini test due to this error:

Traceback (most recent call last):
  File "/Users/meren/github/anvio/bin/anvi-merge", line 50, in <module>
    merger.MultipleRuns(args).merge()
  File "/Users/meren/github/anvio/anvio/merger.py", line 404, in merge
    self.populate_layers_additional_data_and_layer_orders()
  File "/Users/meren/github/anvio/anvio/merger.py", line 482, in populate_layers_additional_data_and_layer_orders
    TableForLayerOrders(args).add(layer_orders_data_dict)
  File "/Users/meren/github/anvio/anvio/tables/miscdata.py", line 339, in add
    TableForLayerOrders.check_names(self, data_dict)
  File "/Users/meren/github/anvio/anvio/tables/miscdata.py", line 629, in check_names
    data_dict[data_key]['order_type'],
KeyError: 'order_type'

This error is due to changes in clustering.py. with latest changes, it returns the clustering of items, rather than layers, even when the transpose parameter is sent.

meren commented 6 years ago

(going back to master, running mini test, and coming back to gene-view branch works).

meren commented 6 years ago

we don't want to show split summaries nor taxonomy in gene_mode.

The best way to achieve this is to reset those dictionaries (until we find a smart way to populate them with other information) in gene_mode. see the workaround in 613e50f.

ShaiberAlon commented 6 years ago

@ozcan, I played with the gene-view today. And I realize that we need to address what will happen when the user hits "next in inspect context. Right now it goes to the next gene, which means that it is very likely to be in the same split, and then the page is reloaded just to show up with the same information.

This is what I suggest: A simple solution could be that it would continue to go to the next gene, but if the next gene is in the same split then the page will not be reloaded, and the only thing that would happen is that the marker showing which gene is inspected would move to the next gene (we said that there would be a color showing which gene in the split is that one that was inspected, so simply the color would move to the next gene). @meren, @ozcan, let me know what you think about this solution.

In inspect gene, the solution should be simple and the page would just reload with the information relevant to the next gene.

In both cases we want the "next gene" to be the one following according to the currently chosen "items order".

ShaiberAlon commented 6 years ago

We forgot to talk about this:

Only after everything works completely we may consider adding all views that are available through the atomic data table for single profiles (such as variability view, max-normalized-ratio view, etc).

(this is copied from the views section of the top comment in this page).

Specifically I think it would be nice to allow to visualize variability of genes, and from what I remember I think @ekiefl built the infrastructure that would allow that.

ShaiberAlon commented 6 years ago

@ozcan , things that need to be done:

ozcan commented 6 years ago

(anvio-python-3) ➜  INFANT-GUT-TUTORIAL anvi-interactive -p PROFILE.db -c CONTIGS.db --gene-view
Contigs DB ...................................: Initialized: CONTIGS.db (v. 10)
Interactive mode .............................: gene
Auxiliary Data ...............................: Found: AUXILIARY-DATA.db (v. 2)
Profile Super ................................: Initialized with all 4784 splits: PROFILE.db (v. 23)

Config Error: There is no "None" I know of. Probably something is spelled wrong somewhere? In
              case you are a programmer and accessing to the collections from your program,
              here is a reminder for you: are you sure `populate_collections_dict` was called
              for whatever database you are trying to get collections from? If you are a user,
              you can always try to use the `--list-collections` flag and hope for the best.

giving no parameters gives this error after spending a lot of time recovering gene level coverage dict

ShaiberAlon commented 6 years ago

Ozcan, gene mode should only work when given a collection and a bin. I guess we forgot to put a basic sanity check to make sure that a collection and bin were specified.

meren commented 6 years ago

Yes, also whatever function is looking for that "None" should complain much earlier in its context about the fact that it is None :)

ShaiberAlon commented 6 years ago

@ozcan I added this sanity check: https://github.com/merenlab/anvio/commit/f0cd5d6512f5fcd3273dadc59a8e9f1e3cd20842

ShaiberAlon commented 6 years ago

I'm now using gene view for my real data (oral metagenomes) and I see it is quite annoying to wait for the gene level coverage stats to be computed. I think we should provide a way for this data to be written to the disk so that if a user is repeatedly using gene view, they don't have to wait every time from this to be computed. @meren, @ozcan, what do you think about this? And if you agree what would be the best way to do this?

meren commented 6 years ago

I think the best way to do it is to implement a --gene-mode flag into anvi-split, so you generate a self-contained profile for genes.

ShaiberAlon commented 6 years ago

That makes perfect sense to Meren. So this will wait to the next version I guess.

meren commented 6 years ago

I think it should. Unless we are ahead of our schedule and have a full day to work on this.

The design is quite streamlined in anvi-split, and it may be easy to extend it with genes without breaking things. But we need to look at the code to see whether this is the optimist in me.

ShaiberAlon commented 6 years ago

I wanted to close this issue, but then I saw that the first comment on this pages includes the following things:

Future "gene-mode" for anvi-split We envision that there may be a “gene-mode” also for the program anvi-split in the future. This would create a gene profile database for a given bin. The gene profile database would be familiar with the parent contigs and profile databases (through their hash). This would allow us to use all anvi'o functionality (including creating collections and anvi-summarize tasks) on gene profile databases. When the parent databases are provided, it would then be possible to inspect a gene in the context of its parent contig.

searching for functions in genes Allow to search for functions in "gene view".

Gene caller Genes would be computed using the results from the default gene caller, but anvi-interactive will require a --gene-caller flag if the user wishes to use a different gene caller for gene view. This will wait for Meren to finish redesigning that is occurring in this branch: https://github.com/merenlab/anvio/tree/gene_caller_option

The first thing has been completed. @ozcan, are you interested in completing the second thing (function search)? @meren, what about the gene-caller remark? It refers to a branch that hasn't been touched in 10 months (https://github.com/merenlab/anvio/tree/gene_caller_option)

meren commented 6 years ago

The function search already works I think. Is it not? In which case there is nothing to do here for @ozcan at this time :)

We can close it. Gene caller option will require some design decisions I will make with @ozcan at some point within the next 3-6 months.