anvi-get-sequences-for-gene-clusters improvements

rbartelme commented 6 years ago

Could you add these flags, found in the hmm export function, to export pc alignments as well?

  --max-num-genes-missing-from-bin INTEGER
                        This filter removes bins (or genomes) from your
                        analysis. If you have a list of gene names, you can
                        use this parameter to omit any bin (or external
                        genome) that is missing more than a number of genes
                        you desire. For instance, if you have 100 genome bins,
                        and you are interested in working with 5 ribosomal
                        proteins, you can use '--max-num-genes-missing-from-
                        bin 4' to remove remove the bins that are missing more
                        than 4 of those 5 genes. This is especially useful for
                        phylogenomic analyses. Parameter 0 will remove any bin
                        that is missing any of the genes.
  --min-num-bins-gene-occurs INTEGER
                        This filter removes genes from your analysis. Let's
                        assume you have 100 bins to get sequences for HMM
                        hits. If you want to work only with genes among all
                        the hits that occur in at least X number of bins, and
                        discard the rest of them, you can use this flag. If
                        you say '--min-num-bins-gene-occurs 90', each gene in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene occurs in less than that number of
                        genomes, it simply will not be reported. This is
                        especially useful for phylogenomic analyses, where you
                        may want to only focus on genes that are prevalent
                        across the set of genomes you wish to analyze.

meren commented 6 years ago

I see. I think this is a brilliant idea. I feel like we should allow users to be able to select a subset of gene clusters in the interface by playing with these parameters, too. So they can create their own collections of "core-like" genes, etc.

meren commented 6 years ago

I will take a stab at the algorithm, and then will talk to @ozcan to see what he thinks.

rbartelme commented 6 years ago

I also briefly mentioned to @ozcan, since he's here today, that having the partition data from the concatenations would be very useful. It let's one tell a phylogenetic program that the concatenated alignment consists of multiple genes, and where those genes start and end.

So if I had a concatenation of geneA, geneB, and geneC, and all were 300 characters long the partition would look something like: geneA: 1-300 geneB: 300-600 geneC: 600-900

This would significantly improve the interface with standard external phylogenetics programs like RAxML, IQ-TREE, MrBayes, BEAST, etc. Even if it was just output as a separate text file.

meren commented 6 years ago

Would it be useful if we reported in the FASTA defline the beginning and end position of each concatenated gene? Or do you need a fixed number?

rbartelme commented 6 years ago

I think it might make the deflines really messy, unless they were added in between the gene name pipes (if that's still a thing the export does?). If it was like this...

Bacteria_Rthebest GenA:1-300|GenB:300-600|GenC:600-900 MNFDKFJSOFIJSIDFHSIDFKSDFSKFSKHAFJASHFASH----JKNKNS...etc

That could work. Just thinking about how many non-FASTA formats phylogenetics software require, but I think what you're suggestion might be workable!

meren commented 6 years ago

It would be more like this:

GENOME GenA,1:257;GenB,257:555;GenC,555:724

Or it would be a fixed distance (i.e., the longest gene with gaps, such as a length of 644, and the defline would look like this:

GENOME genes:GenA,GenB,GenC|gene_block:644

The second one is wasteful. The first one is complex. I don't know what is the solution to this.

rbartelme commented 6 years ago

Thinking more like the first one.

meren commented 6 years ago

Yes, although it looks painful, it is extremely easy to parse with a Python script. I APPROVE YOUR ATTITUDE.

osvatic commented 6 years ago

Adding to this issue, it would be really nice if there was a way to select all unique ortholog groups (belonging to a selection genome or group of genomes) and output them in some way.

meren commented 6 years ago

Thanks, Jay. I think we can use the collections infrastructure to store this kind of information.

meren commented 6 years ago

Dear @rbartelme and @osvatic,

During my flight from France to the US I had some time to improve anvi-export-pc-alignments. It is now caled anvi-get-sequences-for-gene-clusters (similar to anvi-get-sequences-for-hmm-hits.

The program now contains the following filters:

 $ anvi-get-sequences-for-gene-clusters -h

    Do cool stuff with gene clusters in anvi'o pan genomes

(...)

ADVANCED FILTERS:
  If you are here you must be looking for ways to specify exactly what you
  want from that database of gene clusters. These filters will be applied to
  what your previous selections reported.

  --min-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        least X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 90', each gene cluster in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene occurs in less than that number of
                        genomes, it simply will not be reported. This is
                        especially useful for phylogenomic analyses, where you
                        may want to only focus on gene clusters that are
                        prevalent across the set of genomes you wish to
                        analyze.
  --max-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        most X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 1', you will get gene
                        clusters that are singletons. Combining this paramter
                        with --min-num-genomes-gene-cluster-occurs can give
                        you a very precise way to filter your gene clusters.
  --min-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--min-num-genes-from-each-genome
                        2', this filter will remove every gene cluster, to
                        which every genome in your analysis contributed less
                        than 2 genes. This can be useful to find out gene
                        clusters with many genes from many genomes (such as
                        conserved multi-copy genes within a clade).
  --max-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--max-num-genes-from-each-genome
                        1', every gene cluster that has more than one gene
                        from any genome that contributes to it will be removed
                        from your analysis. This could be useful to remove
                        gene clusters with paralogs from your report for
                        appropriate phylogenomic analyses. For instance, using
                        '--max-num-genes-from-each-genome 1' and 'min-num-
                        genomes-gene-cluster-occurs X' where X is the total
                        number of your genomes, would give you the single-copy
                        gene cluters in your pan genome.
  --max-num-gene-clusters-missing-from-genome INTEGER
                        This filter will remove genomes from your report. If
                        you have a list of gene cluster names, you can use
                        this parameter to omit any genome from your report if
                        it is missing more than a number of genes you desire.
                        For instance, if you have 100 genomes in your pan
                        genome, and you are interested in working only with
                        genomes that have all 5 specific gene clusters of your
                        choice, you can use '--max-num-gene-clusters-missing-
                        from-genome 4' to remove remove the bins that are
                        missing more than 4 of those 5 genes. This is
                        especially useful for phylogenomic analyses. Parameter
                        0 will remove any genome that is missing any of the
                        genes.
  --add-into-items-additional-data-table NAME
                        If you use any of the filters, and would like to add
                        the resulting item names into the items additional
                        data table of your database, you can use this
                        parameter. You will need to give a name for these
                        results to be saved. If the given name is already in
                        the items additoinal data table, its contents will be
                        replaced with the new one. Then you can run anvi-
                        interactive or anvi-display-pan to 'see' the results
                        of your filters.

(...)

As you can see, the program can also store resulting gene clusters after applying all filters into item additional data table in the pan database with the parameter --add-into-items-additional-data-table (which means they can be visualized as additional layers --see this post for more information on these new tables and learn how those tables can be manipulated).

The code is a bit ugly and will likely change as we move forward, but those changes will not affect the user experience. So you can help us by testing it using the master repo, and suggest more ways to improve your experience.

I will talk to @ozcan about improving the interface using this new functionality. Then it will be much more fun!

Thanks,

rbartelme commented 6 years ago

So does " --add-into-items-additional-data-table NAME If you use any of the filters, and would like to add the resulting item names into the items additional data table of your database, you can use this parameter. You will need to give a name for these results to be saved. If the given name is already in the items additoinal data table, its contents will be replaced with the new one. Then you can run anvi- interactive or anvi-display-pan to 'see' the results of your filters."

List the gene names and potentially the AA positions in a concatenated aligned output??

meren commented 6 years ago

@rbartelme, are you referring to this:

https://github.com/merenlab/anvio/issues/668#issuecomment-351567826

The answer is no, it is not yet in :)

merenlab / anvio

anvi-get-sequences-for-gene-clusters improvements #668