meren closed this issue 6 years ago
Some 'top' functions with most significant enrichment stats appear to occur in other clades in my tests, which should not happen :)
So this is not really happening, but it does point to a problematic feature of our current approach:
$ grep LL_II enriched-per-clade.txt | sort -nrk 2 | head -n 40
LL_III 1.00 -6.37 glyceraldehyde-3-phosphate dehydrogenase
LL_III 1.00 -6.37 WD40 domain-containing protein
LL_III 1.00 -6.37 Phospholipase/Carboxylesterase
LL_III 1.00 -6.37 Nucleic-acid-binding protein containing Zn-ribbon domain (DUF2082)
LL_III 1.00 -6.37 Inherit from COG: biosynthesis protein CelD
LL_III 1.00 -6.37 Formyl transferase
LL_II 1.00 -19.76 virion core protein (Lumpy skin disease
LL_II 1.00 -19.76 type I restriction-modification
LL_II 1.00 -19.76 methylase
LL_II 1.00 -19.76 heme oxygenase
LL_II 1.00 -19.76 gtra family
LL_II 1.00 -19.76 growth
LL_II 1.00 -19.76 YeeC-like protein
LL_II 1.00 -19.76 Restriction modification system DNA (Specificity
LL_II 1.00 -19.76 Peptidase family M50
LL_II 1.00 -19.76 One of two assembly initiator proteins, it binds directly to the 5'-end of the 23S rRNA, where it nucleates assembly of the 50S subunit (By similarity)
LL_II 1.00 -19.76 Leucine rich repeat variant
LL_II 1.00 -19.76 Glucose-6-phosphate dehydrogenase subunit
LL_II 1.00 -19.76 DNA polymerase iii delta prime subunit
LL_II 1.00 -19.76 6-carboxy-5,6,7,8-tetrahydropterin synthase
We chose to search for `Peptidase family M50` in the interactive interface and found four different GCs:
We then inspected the GC on the left (marked by a selection named `Peptidase family M50`):
We can see that the gene cluster contains genes from other clades. But is this a bug or a feature? Sadly, it is a feature, because if we look at the non-clade members (MIT9303, MIT9313, and MIT9211), then we get, for example:
Namely, EGGNOG_BACT annotated this gene as `Peptidase M50` and not `Peptidase family M50`. If we look at COG or Pfam, the annotations are all the same, which suggests that maybe we should use those annotations for this purpose instead of EGGNOG.
We can add another filter for the results: `--min-gene-clusters-enrichment`, which will take the GCs that are associated with each enriched function and will only return the result if the GC has an enrichment greater than the threshold (how to calculate enrichment for each GC is described below).
Similarly, we can add a `--min-gene-cluster-portion-occurence-in-group` (similar to `--min-portion-occurence-in-group`; see `anvi-script-get-enriched-functions-per-pan-group -h` for details).
Since a function can be associated with multiple gene clusters, it is not straightforward to calculate gene cluster enrichment. To address this, I propose to merge the occurrences of the gene clusters, treat the result as an artificial gene cluster, and calculate its enrichment in the same manner as enrichment for functions. The merging would be taking an `or` of two (or more) boolean presence/absence vectors.
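A minimal sketch of the proposed merge, assuming each gene cluster is represented as a list of booleans with one entry per genome, in the same genome order. The function name `merge_occurrences` and the example vectors are hypothetical and not part of anvi'o.

```python
def merge_occurrences(vectors):
    """Return the element-wise OR of equal-length presence/absence vectors."""
    if not vectors:
        raise ValueError("need at least one occurrence vector")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all occurrence vectors must cover the same genomes")
    # A genome carries the function if any of the gene clusters occurs in it.
    return [any(column) for column in zip(*vectors)]


# Hypothetical example: two gene clusters annotated with the same function,
# observed across five genomes.
gc_a = [True, True, False, False, False]
gc_b = [False, False, True, True, False]
merged = merge_occurrences([gc_a, gc_b])
print(merged)
```

The merged vector can then be scored like any single gene cluster, which is the point of treating it as an artificial one.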
This is what we decided to do:
[x] We will first assign a function to each gc, using the worst method (except for all other methods)
[x] Then for each function we will create a fake merged gc, with an occurrence vector which is the "or" product of all gcs that match this function.
[x] We will calculate the enrichment score for this fake merged gc.
[ ] And then we will go home victorious.
After playing with the presence/absence results for functions in Prochlorococcus, I can see some nice things and some problematic things.
Let's start with the nice things: here is a function that is a core function of all the Prochlorococcus genomes, but it doesn't belong to a core gene cluster: `UPF0367 protein`. If you search for it you get:
But our method detects it as a core function by merging the occurrence vectors of the two gene clusters. Very nice.
When we search in the interactive interface, we get every match, even if the query is only a substring of the hit function.
So in the Prochlorococcus case, the function `restriction endonuclease` (with annotation source `EGGNOG_BACT`) is an exact match for only two gene clusters, `PC_00001532` and `PC_00005630`, but when we search in the interactive interface it matches 17 gene clusters, because it also matches genes that were annotated as `5-methylcytosine-specific restriction endonuclease McrA` (plus it uses all sources and not just one source).
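The difference between the two matching modes can be sketched as follows. This is only an illustration: the helper names are hypothetical, the `GC_hypothetical_*` entries are made up for the example, and the plain case-sensitive substring test is an assumption about how the interactive search behaves.

```python
# Gene cluster -> function annotation from a single source (illustrative only).
annotations = {
    "PC_00001532": "restriction endonuclease",
    "PC_00005630": "restriction endonuclease",
    "GC_hypothetical_1": "5-methylcytosine-specific restriction endonuclease McrA",
    "GC_hypothetical_2": "heme oxygenase",
}


def exact_hits(annotations, query):
    """Gene clusters whose annotation equals the query exactly."""
    return sorted(gc for gc, function in annotations.items() if function == query)


def substring_hits(annotations, query):
    """Gene clusters whose annotation contains the query anywhere."""
    return sorted(gc for gc, function in annotations.items() if query in function)


query = "restriction endonuclease"
print(exact_hits(annotations, query))      # only the two exact matches
print(substring_hits(annotations, query))  # also picks up the McrA annotation
```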
We decided to add the following columns: `occurence_in_group`, `occurence_in_outgroup`, `portion_occurence_in_group`, `portion_occurence_in_outgroup`, `wilcoxin_p_value`, `wilcoxin_p_value_corrected` (and I still need to add the GC ids).
In addition, we will add the flag `--export-functional-occurence-table FILE` to save the full table. When a filter is invoked, only the filtered functions will be included in the output.
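As a rough sketch of how the four occurrence columns could be derived for one function and one group, assuming a per-genome presence/absence mapping and a per-genome group label. The function `occurrence_summary`, the genome names, and the group labels are hypothetical; the statistical columns are left out on purpose.

```python
def occurrence_summary(presence, groups, group):
    """Count and proportion of genomes carrying a function inside and outside `group`.

    presence: genome name -> True/False (function present in that genome)
    groups:   genome name -> group label
    """
    in_group = [g for g, label in groups.items() if label == group]
    out_group = [g for g, label in groups.items() if label != group]
    occ_in = sum(presence[g] for g in in_group)
    occ_out = sum(presence[g] for g in out_group)
    return {
        "occurence_in_group": occ_in,
        "occurence_in_outgroup": occ_out,
        "portion_occurence_in_group": occ_in / len(in_group) if in_group else 0.0,
        "portion_occurence_in_outgroup": occ_out / len(out_group) if out_group else 0.0,
    }


# Hypothetical genomes and labels, for illustration only.
groups = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "B"}
presence = {"g1": True, "g2": True, "g3": True, "g4": False, "g5": False}
row = occurrence_summary(presence, groups, "A")
print(row)
```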
Since our data is Boolean, instead of the Wilcoxon test, maybe we should try the McNemar test.
As Wikipedia states, for small sample sizes we should calculate the p-value according to a binomial distribution. We should check if this implementation makes sense: https://gist.github.com/kylebgorman/c8b3fb31c1552ecbaafb
If we want, for large sample sizes, we will use the chi-square statistic to compute the p-value.
The McNemar test is suited to paired data, i.e. when the two compared groups are naturally paired, so it doesn't fit our case because there is no pairing between a plaque genome and a tongue genome. To clarify, since McNemar is run on a contingency table, there is no unique way to generate a contingency table from our data (if we just reorder the genomes, the contingency table will look different).
We could just do multiple iterations, where in each iteration a random pairing would be generated.
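To make the pairing problem concrete, here is a small sketch under stated assumptions: the two groups have the same number of genomes, the pairing is simply positional, and the p-value is the standard two-sided exact binomial form of McNemar's test on the discordant pairs. The function names and the toy vectors are hypothetical and are not taken from the linked gist.

```python
from math import comb


def discordant_counts(group_x, group_y):
    """Count pairs present only in x (b) and only in y (c) for a positional pairing."""
    b = sum(1 for x, y in zip(group_x, group_y) if x and not y)
    c = sum(1 for x, y in zip(group_x, group_y) if y and not x)
    return b, c


def mcnemar_exact_p(b, c):
    """Two-sided exact binomial p-value for McNemar's test on discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy presence/absence of one function in two groups of four genomes each.
plaque = [True, True, True, False]
tongue = [False, False, False, True]
tongue_reordered = [True, False, False, False]  # same genomes, different order

for other in (tongue, tongue_reordered):
    b, c = discordant_counts(plaque, other)
    print(b, c, mcnemar_exact_p(b, c))
```

The same data give different discordant counts, and therefore different p-values, depending only on how genomes happen to be ordered, which is why an arbitrary pairing is hard to justify here.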
In the Assumptions and formal statement of hypotheses section of the wikipedia page it says: All of the above assumptions hold for boolean data.
Done.
To make it easier to offer suggestions, I am first going to build a work environment that anyone who uses the `master` repository can follow.
Setting the stage
I downloaded the Prochlorococcus metapangenome from our recent study (details are here: http://merenlab.org/data/2018_Delmont_and_Eren_Metapangenomics/).
Then I displayed the pan:
Which gave me this:
Good. We have that `Clade` layer there. But it is too busy. Let's simplify it.
Simplifying the additional layer data section
Here I created a simpler set of additional data layers by removing everything and adding only two categories: `Clade` and `Light`.
Exported the additional data layers:
Remove extra columns and add a new one
Then I went full Voldemort to remove everything but 'Clade' column from this file, and add a new 'Light' column with an AWK one-liner:
So the `new-layers.txt` looked like this:
Replace old layer data with the new one
Then I deleted previous layers from the pan db, and added the new ones:
Re-visualize
Then I visualized the pan again, and saw this:
Good.
Run functional enrichment analysis program
Then I went back to my terminal, and ran this:
So far so good. The file goes like this:
When I do this to see top enriched functions for HL group II,
And then, say, search for `FR47-like protein` in the pan genome, there is a perfect, singular hit that is exactly where you would like it to be:
Suggestions:
[x] The output file could be sorted from most to least significant function per clade (so we first sort by clade, and within each clade by enrichment factor).
[x] The output file could contain a column of comma-separated list of gene cluster IDs for each entry.
[x] Some 'top' functions with most significant enrichment stats appear to occur in other clades in my tests, which should not happen :)
[x] Add the following columns: occurence_in_group, occurence_in_outgroup, portion_occurence_in_group, portion_occurence_in_outgroup, wilcoxon_p_value, wilcoxon_statistic, wilcoxon_p_value_corrected.
[x] Add flag --export-functional-occurence-table FILE to save the full table. When a filter is invoked then only the filtered functions would be included in the output.
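The sorting suggestion in the list above could look roughly like this, assuming each output row carries a clade label and a numeric enrichment value where larger means more significant. The field names and the rows are hypothetical placeholders, not actual program output.

```python
# Hypothetical rows; real output would be read from the enrichment file.
rows = [
    {"clade": "LL_II", "enrichment": 0.5, "function": "function_a"},
    {"clade": "HL_II", "enrichment": 0.9, "function": "function_b"},
    {"clade": "LL_II", "enrichment": 1.0, "function": "function_c"},
    {"clade": "HL_II", "enrichment": 1.0, "function": "function_d"},
]

# Sort by clade first, then by enrichment from highest to lowest within a clade.
ordered = sorted(rows, key=lambda r: (r["clade"], -r["enrichment"]))

for r in ordered:
    print(r["clade"], r["enrichment"], r["function"], sep="\t")
```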