meren commented 6 years ago

To make it easier to offer suggestions, I am first going to build a work environment anyone who uses the master repository follow.

Setting the stage

I downloaded the Prochlorococcus metapangenome from our recent study (details are here: http://merenlab.org/data/2018_Delmont_and_Eren_Metapangenomics/).

# download stuff
wget https://ndownloader.figshare.com/files/9416623 -O ANVIO-METAPANGENOME-FOR-PROCHLOROCOCCUS-ISOLATES.tar.gz
tar -zxvf ANVIO-METAPANGENOME-FOR-PROCHLOROCOCCUS-ISOLATES.tar.gz
cd ANVIO-METAPANGENOME-FOR-PROCHLOROCOCCUS-ISOLATES

# migrate DBs
export ANVIO_SAMPLES_DB=Prochlorococcus-METAPAN-SAMPLES.db
anvi-migrate-db Prochlorococcus-PAN-PAN.db
anvi-migrate-db Prochlorococcus-GENOMES.h5

Then I displayed the pan:

anvi-display-pan -p Prochlorococcus-PAN-PAN.db -g Prochlorococcus-GENOMES.db

Which gave me this:

Good. We have that Clade layer there.

But it is too busy. Let's simplify it.

Simplifying the additional layer data section

Here I created a simpler additional data layers by removing everything, and adding only two categories: Clade, and Light.

Exported the additional data layers:

anvi-export-misc-data -p Prochlorococcus-PAN-PAN.db -t layers -o layers.txt

Remove extra columns and add a new one

Then I went full Voldemort to remove everything but 'Clade' column from this file, and add a new 'Light' column with an AWK one-liner:

awk '
{
    if (NR==1) {
        for (i=1; i<=NF; i++) {
            if ($i == "Clade") {
                c=i; print $1 "\t" $c "\tLight"
            }
        }
     } if (NR>1) {
        print $1 "\t" $c "\t" substr($c, 1, 2)
    }
}' layers.txt > new-layers.txt

So the new-layers.txt looked like this:

 $ column -t new-layers.txt
layers    Clade   Light
AS9601    HL_II   HL
CCMP1375  LL_II   LL
EQPAC1    HL_I    HL
GP2       HL_II   HL
LG        LL_II   LL
MED4      HL_I    HL
MIT9107   HL_II   HL
MIT9116   HL_II   HL
MIT9123   HL_II   HL
MIT9201   HL_II   HL
MIT9202   HL_II   HL
MIT9211   LL_III  LL
MIT9215   HL_II   HL
MIT9301   HL_II   HL
MIT9302   HL_II   HL
MIT9303   LL_IV   LL
MIT9311   HL_II   HL
MIT9312   HL_II   HL
MIT9313   LL_IV   LL
MIT9314   HL_II   HL
MIT9321   HL_II   HL
MIT9322   HL_II   HL
MIT9401   HL_II   HL
MIT9515   HL_I    HL
NATL1A    LL_I    LL
NATL2A    LL_I    LL
PAC1      LL_I    LL
SB        HL_II   HL
SS2       LL_II   LL
SS35      LL_II   LL
SS51      LL_II   LL

Replace old layer data with the new one

Then I deleted previous layers from the pan db, and added the new ones:

anvi-delete-misc-data -p Prochlorococcus-PAN-PAN.db -t layers --just-do-it
anvi-import-misc-data -p Prochlorococcus-PAN-PAN.db -t layers new-layers.txt

Re-visualize

Then I visualized the pan again, and saw this:

anvi-display-pan -p Prochlorococcus-PAN-PAN.db -g Prochlorococcus-GENOMES.db

Good.

Run functional enrichment analysis program

Then I went back to my terminal, and run this:

anvi-script-get-enriched-functions-per-pan-group -p Prochlorococcus-PAN-PAN.db \
                                                 -g Prochlorococcus-GENOMES.db \
                                                 --category Clade \
                                                 --annotation EGGNOG_BACT \
                                                 -o enriched-per-clade.txt

anvi-script-get-enriched-functions-per-pan-group -p Prochlorococcus-PAN-PAN.db \
                                                 -g Prochlorococcus-GENOMES.db \
                                                 --category Light \
                                                 --annotation EGGNOG_BACT \
                                                 -o enriched-per-light-condition.txt

So far so good. The file goes like this:

category	enrichment	weighted_enrichment	EGGNOG_BACT
HL_II	0.06	-1.81	activity domain protein
HL_II	0.79	-24.19	pirin domain protein
HL_II	0.14	-4.40	YGGT family
HL_II	0.50	-15.40	OsmC family
HL_II	0.06	-1.81	surface antigen
HL_II	0.18	-5.43	Bacterial sugar transferase
HL_II	0.11	-3.23	phosphoribulokinase uridine kinase
HL_II	0.06	-1.81	peptidase u35 phage prohead
HL_II	0.14	-4.40	PKHD-type hydroxylase
(...)	(...)	(...)	(...)

When I do this to see top enriched functions for HL group II,

$ grep HL_II enriched-per-clade.txt | sort -nrk 2 | head
HL_II   0.86    -26.39  Photosystem II reaction centre N protein (psbN)
HL_II   0.86    -26.39  FR47-like protein
HL_II   0.82    -25.36  fructokinase (EC 2.7.1.4)
HL_II   0.79    -24.19  transcriptional activator (TenA
HL_II   0.79    -24.19  secreted proteins, Ya
HL_II   0.79    -24.19  pirin domain protein
HL_II   0.79    -24.19  phosphomethylpyrimidine kinase
HL_II   0.79    -24.19  of the drug metabolite transporter
HL_II   0.79    -24.19  Steroid 5-alpha reductase, C-terminal
HL_II   0.79    -24.19  MazG family

And then say search for FR47-like protein in the pan genome, there is a perfect, singular hit that is exactly where you would like it to be:

Suggestions:

[x] The output file could be sorted (from most to least significant functions per clade (so we first sort by clades, and within them by enrichment factor)
[x] The output file could contain a column of comma-separated list of gene cluster IDs for each entry.
[x] Some 'top' functions with most significant enrichment stats appear to occur in other clades in my tests, which should not happen :)
[x] Add the following columns: occurence_in_group, occurence_in_outgroup, portion_occurence_in_group, portion_occurence_in_outgroup, wilcoxon_p_value, wilcoxon_statistic, wilcoxon_p_value_corrected.
[x] Add flag --export-functional-occurence-table FILE to save the full table. When a filter is invoked then only the filtered functions would be included in the output.

ShaiberAlon commented 6 years ago

Steps to reproduce the "error" mentioned above

Some 'top' functions with most significant enrichment stats appear to occur in other clades in my tests, which should not happen :)

So this is not really happening, but it does point to a problematic feature of our current approach:

$ grep LL_II enriched-per-clade.txt | sort -nrk 2 | head -n 40
LL_III  1.00    -6.37   glyceraldehyde-3-phosphate dehydrogenase
LL_III  1.00    -6.37   WD40 domain-containing protein
LL_III  1.00    -6.37   Phospholipase/Carboxylesterase
LL_III  1.00    -6.37   Nucleic-acid-binding protein containing Zn-ribbon domain (DUF2082)
LL_III  1.00    -6.37   Inherit from COG: biosynthesis protein CelD
LL_III  1.00    -6.37   Formyl transferase
LL_II   1.00    -19.76  virion core protein (Lumpy skin disease
LL_II   1.00    -19.76  type I restriction-modification
LL_II   1.00    -19.76  methylase
LL_II   1.00    -19.76  heme oxygenase
LL_II   1.00    -19.76  gtra family
LL_II   1.00    -19.76  growth
LL_II   1.00    -19.76  YeeC-like protein
LL_II   1.00    -19.76  Restriction modification system DNA (Specificity
LL_II   1.00    -19.76  Peptidase family M50
LL_II   1.00    -19.76  One of two assembly initiator proteins, it binds directly to the 5'-end of the 23S rRNA, where it nucleates assembly of the 50S subunit (By similarity)
LL_II   1.00    -19.76  Leucine rich repeat variant
LL_II   1.00    -19.76  Glucose-6-phosphate dehydrogenase subunit
LL_II   1.00    -19.76  DNA polymerase iii delta prime subunit
LL_II   1.00    -19.76  6-carboxy-5,6,7,8-tetrahydropterin synthase

We choose to search Peptidase family M50 in the interactive and find four different GCs:

We select to inspect the GC on the left (marked by a selection named Peptidase family M50):

We can see that the gene cluster contains genes from other clades. But is this a bug or a feature? Sadly, it is a feature, because if we look at the non-clade members (MIT9303, MIT9313, and MIT9211), then we get, for example:

Namely, EGGNOG_BACT annotated this gene as Peptidase M50 and not Peptidase family M50. If we look at COG or pfam then they are all the same, which means that maybe we should use these annotations for this purpose instead of EGGNOG.

A possible solution

We can add another filter for the results: --min-gene-clusters-enrichment which will take the GCs that are associated with each enriched function and will only return the result if the GC has an enrichment greater than the threshold (how to calculate enrichment for each GC is described below).

Similarly, we can add a --min-gene-cluster-portion-occurence-in-group (similar to --min-portion-occurence-in-group (see anvi-script-get-enriched-functions-per-pan-group -h for details).

Calculating gene cluster enrichment in a functional analysis context

Since a function could be associated with multiple gene clusters then it is not straight-forward to calculate gene clusters enrichment. To address this, I propose to merge the occurrences of the gene clusters, and then treat it as an artificial gene cluster, and calculate enrichment in the same manner as enrichment for functions. The merging would be taking an or of two (or more) boolean presence/absence vectors.

ShaiberAlon commented 6 years ago

This is what we decided to do:

[x] We will first assign a function to each gc, using the worst method (except for all other methods)
[x] Then for each function we will create a fake merged gc, with an occurence vector which is the "or" product of all gcs that match this function.
[x] We will calculate the enrichment score for this fake merged gc.
[ ] And then we will go home victorious.

ShaiberAlon commented 6 years ago

After playing with the results of the presence absence of functions in Prochlorococcus, I can see some nice things and some problematic things.

Gene clusters are not core, but function is core

Let's start with the nice things: here is a function that is a core function of all the Prochlorococcus, but it doesn't belong to a core gene cluster: UPF0367 protein. If you search for it you get:

But our method detects it as a core function by merging the occurrence vectors of the two gene clusters. Very nice.

Unspecific names of functions makes them seem unique to a subset of gene clusters, but searching in the interactive shows that it is clearly not specific

When we search in the interactive then we get every match even if the query is a sub-string of the hit function.

So in the Prochlorococcus case, the function restriction endonuclease (with annotation source EGGNOG_BACT) matches an exact match to only two gene clusters 'PC_00001532', 'PC_00005630', but when we search on the interactive it matches 17 gene clusters, because it matches genes that were annotated as 5-methylcytosine-specific restriction endonuclease McrA (plus it used all sources and not just one source).

ShaiberAlon commented 6 years ago

We decided to add the following columns: occurence_in_group, occurence_in_outgroup, portion_occurence_in_group, portion_occurence_in_outgroup, wilcoxin_p_value, wilcoxin_p_value_corrected, (and I still need to add the GC ids).

In addition, we will add flag --export-functional-occurence-table FILE to save the full table. When a filter is invoked then only the filtered functions would be included in the output.

ShaiberAlon commented 6 years ago

Since our data is Boolean, instead of wilcoxin test, maybe we should try McNemar test.

As Wikipedia states for small sample size, we should calculate p value according to a binomial distribution. We should check if this implementation makes sense: https://gist.github.com/kylebgorman/c8b3fb31c1552ecbaafb

If we want, for large sample size, we will use chi square statistics to compute p value.

Why we shouldn't use McNemar test (#lol)

McNemar test is fitting for paired data, i.e. when the two compared groups are naturally paired, so in our case it doesn't match because there is no pairing between a plaque genome and a tongue genome. To clarify, since McNemar is ran on a contingency table, there is no one unique way to generate a contingency table from our data (since if we just reorder the genome the contingency table will look different).

How it might be ok to use McNemar test

We could just do multiple iteration, where in each iteration a random pairing would be generated.

Why I chose to just do Wilcoxon

In the Assumptions and formal statement of hypotheses section of the wikipedia page it says: All of the above assumptions hold for boolean data.

ShaiberAlon commented 6 years ago

Done.

merenlab / anvio

Improvements for the search for enriched functions in pangenomes #831

Setting the stage

Simplifying the additional layer data section

Exported the additional data layers:

Remove extra columns and add a new one

Replace old layer data with the new one

Re-visualize

Run functional enrichment analysis program

Suggestions:

Steps to reproduce the "error" mentioned above

A possible solution

Calculating gene cluster enrichment in a functional analysis context

Gene clusters are not core, but function is core

Unspecific names of functions makes them seem unique to a subset of gene clusters, but searching in the interactive shows that it is clearly not specific

Why we shouldn't use McNemar test (#lol)

How it might be ok to use McNemar test

Why I chose to just do Wilcoxon