Closed rbartelme closed 6 years ago
I see. I think this is a brilliant idea. I feel like we should allow users to be able to select a subset of gene clusters in the interface by playing with these parameters, too. So they can create their own collections of "core-like" genes, etc.
I will take a stab at the algorithm, and then will talk to @ozcan to see what he thinks.
I also briefly mentioned to @ozcan, since he's here today, that having the partition data from the concatenations would be very useful. It let's one tell a phylogenetic program that the concatenated alignment consists of multiple genes, and where those genes start and end.
So if I had a concatenation of geneA, geneB, and geneC, and all were 300 characters long the partition would look something like: geneA: 1-300 geneB: 300-600 geneC: 600-900
This would significantly improve the interface with standard external phylogenetics programs like RAxML, IQ-TREE, MrBayes, BEAST, etc. Even if it was just output as a separate text file.
Would it be useful if we reported in the FASTA defline the beginning and end position of each concatenated gene? Or do you need a fixed number?
I think it might make the deflines really messy, unless they were added in between the gene name pipes (if that's still a thing the export does?). If it was like this...
Bacteria_Rthebest GenA:1-300|GenB:300-600|GenC:600-900 MNFDKFJSOFIJSIDFHSIDFKSDFSKFSKHAFJASHFASH----JKNKNS...etc
That could work. Just thinking about how many non-FASTA formats phylogenetics software require, but I think what you're suggestion might be workable!
It would be more like this:
GENOME GenA,1:257;GenB,257:555;GenC,555:724
Or it would be a fixed distance (i.e., the longest gene with gaps, such as a length of 644, and the defline would look like this:
GENOME genes:GenA,GenB,GenC|gene_block:644
The second one is wasteful. The first one is complex. I don't know what is the solution to this.
Thinking more like the first one.
Yes, although it looks painful, it is extremely easy to parse with a Python script. I APPROVE YOUR ATTITUDE.
Adding to this issue, it would be really nice if there was a way to select all unique ortholog groups (belonging to a selection genome or group of genomes) and output them in some way.
Thanks, Jay. I think we can use the collections infrastructure to store this kind of information.
Dear @rbartelme and @osvatic,
During my flight from France to the US I had some time to improve anvi-export-pc-alignments
. It is now caled anvi-get-sequences-for-gene-clusters
(similar to anvi-get-sequences-for-hmm-hits
.
The program now contains the following filters:
$ anvi-get-sequences-for-gene-clusters -h
Do cool stuff with gene clusters in anvi'o pan genomes
(...)
ADVANCED FILTERS:
If you are here you must be looking for ways to specify exactly what you
want from that database of gene clusters. These filters will be applied to
what your previous selections reported.
--min-num-genomes-gene-cluster-occurs INTEGER
This filter will remove gene clusters from your
report. Let's assume you have 100 genomes in your pan
genome analysis. You can use this parameter if you
want to work only with gene clusters that occur in at
least X number of genomes. If you say '--min-num-
genomes-gene-cluster-occurs 90', each gene cluster in
the analysis will be required at least to appear in 90
genomes. If a gene occurs in less than that number of
genomes, it simply will not be reported. This is
especially useful for phylogenomic analyses, where you
may want to only focus on gene clusters that are
prevalent across the set of genomes you wish to
analyze.
--max-num-genomes-gene-cluster-occurs INTEGER
This filter will remove gene clusters from your
report. Let's assume you have 100 genomes in your pan
genome analysis. You can use this parameter if you
want to work only with gene clusters that occur in at
most X number of genomes. If you say '--min-num-
genomes-gene-cluster-occurs 1', you will get gene
clusters that are singletons. Combining this paramter
with --min-num-genomes-gene-cluster-occurs can give
you a very precise way to filter your gene clusters.
--min-num-genes-from-each-genome INTEGER
This filter will remove gene clusters from your
report. If you say '--min-num-genes-from-each-genome
2', this filter will remove every gene cluster, to
which every genome in your analysis contributed less
than 2 genes. This can be useful to find out gene
clusters with many genes from many genomes (such as
conserved multi-copy genes within a clade).
--max-num-genes-from-each-genome INTEGER
This filter will remove gene clusters from your
report. If you say '--max-num-genes-from-each-genome
1', every gene cluster that has more than one gene
from any genome that contributes to it will be removed
from your analysis. This could be useful to remove
gene clusters with paralogs from your report for
appropriate phylogenomic analyses. For instance, using
'--max-num-genes-from-each-genome 1' and 'min-num-
genomes-gene-cluster-occurs X' where X is the total
number of your genomes, would give you the single-copy
gene cluters in your pan genome.
--max-num-gene-clusters-missing-from-genome INTEGER
This filter will remove genomes from your report. If
you have a list of gene cluster names, you can use
this parameter to omit any genome from your report if
it is missing more than a number of genes you desire.
For instance, if you have 100 genomes in your pan
genome, and you are interested in working only with
genomes that have all 5 specific gene clusters of your
choice, you can use '--max-num-gene-clusters-missing-
from-genome 4' to remove remove the bins that are
missing more than 4 of those 5 genes. This is
especially useful for phylogenomic analyses. Parameter
0 will remove any genome that is missing any of the
genes.
--add-into-items-additional-data-table NAME
If you use any of the filters, and would like to add
the resulting item names into the items additional
data table of your database, you can use this
parameter. You will need to give a name for these
results to be saved. If the given name is already in
the items additoinal data table, its contents will be
replaced with the new one. Then you can run anvi-
interactive or anvi-display-pan to 'see' the results
of your filters.
(...)
As you can see, the program can also store resulting gene clusters after applying all filters into item additional data table in the pan database with the parameter --add-into-items-additional-data-table
(which means they can be visualized as additional layers --see this post for more information on these new tables and learn how those tables can be manipulated).
The code is a bit ugly and will likely change as we move forward, but those changes will not affect the user experience. So you can help us by testing it using the master
repo, and suggest more ways to improve your experience.
I will talk to @ozcan about improving the interface using this new functionality. Then it will be much more fun!
Thanks,
So does " --add-into-items-additional-data-table NAME If you use any of the filters, and would like to add the resulting item names into the items additional data table of your database, you can use this parameter. You will need to give a name for these results to be saved. If the given name is already in the items additoinal data table, its contents will be replaced with the new one. Then you can run anvi- interactive or anvi-display-pan to 'see' the results of your filters."
List the gene names and potentially the AA positions in a concatenated aligned output??
@rbartelme, are you referring to this:
https://github.com/merenlab/anvio/issues/668#issuecomment-351567826
The answer is no, it is not yet in :)
Could you add these flags, found in the hmm export function, to export pc alignments as well?