Dataset:
From UniProtKB, we downloaded Homo sapiens proteins (73,947).
A search for Camelus dromedarius returned only a very small set of proteins. To enlarge this dataset, we included the closely related organisms of the whole Camelus genus: Camelus dromedarius, Camelus bactrianus, and Camelus ferus (20,745 proteins).
Since the two datasets are very large, we removed redundant sequences with CD-HIT at a 60% similarity threshold. That resulted in 22,168 proteins for Homo sapiens and 18,338 for Camelus.
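The redundancy-removal step can be sketched as a small helper that assembles the CD-HIT command line. This is only an illustration: the file names are hypothetical, and it assumes that the 60% similarity threshold maps to CD-HIT's sequence identity option (-c 0.6). The word size (-n) follows the ranges recommended in the CD-HIT user guide for protein sequences.

```python
from typing import List


def cdhit_word_size(identity: float) -> int:
    """Word size (-n) suggested by the CD-HIT user guide for a protein identity threshold."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not support protein identity thresholds below 0.4")


def build_cdhit_command(fasta_in: str, fasta_out: str, identity: float = 0.6) -> List[str]:
    """Return the argument list for a cd-hit run (pass it to subprocess.run to execute)."""
    return [
        "cd-hit",
        "-i", fasta_in,      # input FASTA file
        "-o", fasta_out,     # output FASTA with representative sequences
        "-c", str(identity), # sequence identity threshold (assumed: 0.6 for "60% similarity")
        "-n", str(cdhit_word_size(identity)),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_cdhit_command("camelus.fasta", "camelus_nr60.fasta")))
```

The command is returned as a list rather than run directly, so it can be inspected or logged before the (potentially long) clustering job is started.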
Table 1S. Summary of intrinsic disorder metrics for Camel and Human datasets. Results shown for IUPred prediction methods (short and long).
Representation of the GO “Biological Processes” significantly enriched in disordered proteins in the Camel dataset. Disordered proteins here correspond to those with one or more “long disordered windows” (LDW) based on IUPred predictions. Figure adapted from REVIGO, a system for summarizing and visualizing lists of GO terms. Each rectangle represents a cluster of related terms labeled according to a representative term. Rectangles are grouped in “superclusters” (identified with the same color) based on the SimRel semantic similarity measure.
Table 1S with ESpritz results included. Summary of intrinsic disorder metrics for Camel and Human datasets.
Figure 1. Overall predicted global disorder and disordered binding regions in Camel and H. sapiens proteins. Left: percentages of disordered proteins (disordered proteins criterion: those proteins containing at least 50% disordered residues based on Disopred predictions). Right: average percentages of disordered residues involved in binding (DBRs), as predicted by Disopred.
Figure 2-A: Fraction of proteins with different degrees of predicted disorder in Camel and H. sapiens. Protein disorder (as the percentage of disordered residues with respect to the sequence length) is binned into different ranges. Data based on Disopred predictions.
Figure 2-B: Fraction of proteins with different degrees of predicted disordered binding regions in Camelus and H. sapiens (using Disopred).
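The 50% criterion of Figure 1 and the binning of Figure 2 can be sketched as follows. This is a minimal sketch, assuming each protein is available as a per-residue list of booleans (True = residue predicted disordered); parsing the predictor output is not shown. The bin edges are hypothetical, since the exact ranges used in the figures are not stated here.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical bin edges (percent disordered residues); not taken from the figures.
BIN_EDGES = (0, 10, 30, 50, 70, 100)


def percent_disorder(flags: Sequence[bool]) -> float:
    """Percentage of residues predicted as disordered, relative to sequence length."""
    if not flags:
        raise ValueError("empty sequence")
    return 100.0 * sum(flags) / len(flags)


def is_disordered_protein(flags: Sequence[bool], cutoff: float = 50.0) -> bool:
    """Figure 1 criterion: at least 50% of the residues are predicted disordered."""
    return percent_disorder(flags) >= cutoff


def bin_label(pct: float, edges: Sequence[int] = BIN_EDGES) -> str:
    """Label of the half-open bin [lo, hi) containing pct; the last bin also includes 100."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= pct < hi or (hi == edges[-1] and pct == hi):
            return f"{lo}-{hi}%"
    raise ValueError(f"percentage out of range: {pct}")


def fraction_per_bin(proteins: Dict[str, Sequence[bool]]) -> Dict[str, float]:
    """Fraction of proteins falling into each disorder bin."""
    counts = Counter(bin_label(percent_disorder(f)) for f in proteins.values())
    return {label: n / len(proteins) for label, n in counts.items()}
```

The same binning applies to disordered binding regions if the boolean flags mark residues predicted to be in a DBR instead.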
ANCHOR should give better insight into the disordered binding regions.
Table 1. Summary of intrinsic disorder metrics for the Human and Camel datasets. Results shown for Disopred (disorder prediction) and ANCHOR (disordered binding regions, DBRs).
Figure 2-B: Fraction of proteins with different degrees of predicted disordered binding regions in Camelus and H. sapiens (using ANCHOR)
Figure 1. Overall predicted global disorder and disordered binding regions in Camel and H. sapiens proteins. Left: percentages of disordered proteins (disordered proteins criterion: those proteins containing at least 50% disordered residues based on Disopred predictions). Right: average percentages of disordered residues involved in binding (DBRs), as predicted by ANCHOR.
GO: Biological Process terms associated with disordered proteins in the Camel dataset were summarized and visualized using Revigo. Disordered proteins here correspond to those with one or more “long disordered regions” (LDR) based on DISOPRED predictions.
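The “one or more long disordered regions” criterion can be sketched as a run-length check over per-residue predictions. This is a sketch under assumptions: the input is a list of booleans (True = residue predicted disordered), and the minimum length of 30 consecutive residues is a placeholder, since the cutoff used for LDW/LDR is not stated in these notes.

```python
from itertools import groupby
from typing import Sequence


def longest_disordered_run(flags: Sequence[bool]) -> int:
    """Length of the longest stretch of consecutive disordered residues (0 if none)."""
    return max(
        (sum(1 for _ in run) for is_disordered, run in groupby(flags) if is_disordered),
        default=0,
    )


def has_long_disordered_region(flags: Sequence[bool], min_length: int = 30) -> bool:
    """True if the protein has at least one disordered stretch of min_length residues.

    min_length=30 is an assumed placeholder, not a value taken from the analysis.
    """
    return longest_disordered_run(flags) >= min_length
```

Proteins for which has_long_disordered_region returns True would form the “disordered” set whose GO terms are passed to Revigo.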
To perform a comparative analysis between Human (H.) and Camel (C.):
• I ran PANNZER on four sets: all proteins of H., all proteins of C., disordered H. proteins, and disordered C. proteins. I considered only GO terms with a PPV of 0.7 or above.
• I detected 1993 common (shared) GO terms, i.e. terms present in both H. and C.
• I extracted the common GO terms and counted them in the all-H. dataset, the all-C. dataset, disordered H., and disordered C.
• I computed a contingency table (observed/expected values) for each common GO term.
• I then computed the chi-square test (P-value) for all common GO terms.
• I kept only the GO terms whose observed count in disordered C. is greater than expected. I ended up with 495 GO terms.
• I computed the average of the GO terms from the previous step to see the enrichment percentage of each term in both disorder datasets, and kept only the terms whose percentage in C. is greater than in H. This gave 495 GO terms.
• This list of GO terms is exactly the same as the one obtained with Observed C. > Expected C. The check was done to verify that the eventual differences in disorder are maintained when considering only the “comparable” proteins, and to rule out that these differences are due to biases in the GO annotations of the two genomes.
• To conclude, these 495 GO terms are more enriched in Camel disordered proteins than in Human disordered proteins.
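The contingency-table and chi-square steps for a single GO term can be sketched as below. The exact table layout used in the analysis is not spelled out in these notes, so this sketch assumes a 2x2 table within one organism (protein has the term or not, protein is disordered or not). It uses the plain Pearson chi-square statistic without continuity correction and only the standard library; all function and field names are hypothetical.

```python
import math
from typing import NamedTuple


class TermResult(NamedTuple):
    observed: int     # disordered proteins annotated with the term
    expected: float   # count expected if term and disorder were independent
    chi2: float       # Pearson chi-square statistic (1 degree of freedom)
    p_value: float


def go_term_chi_square(n_total: int, n_disordered: int,
                       n_term: int, n_term_disordered: int) -> TermResult:
    """Chi-square test for one GO term on an assumed 2x2 layout (term x disorder)."""
    if not (0 < n_term < n_total and 0 < n_disordered < n_total):
        raise ValueError("all row and column totals must be non-zero")
    a = n_term_disordered              # term, disordered
    b = n_term - a                     # term, not disordered
    c = n_disordered - a               # no term, disordered
    d = n_total - n_term - c           # no term, not disordered
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")
    observed = [[a, b], [c, d]]
    rows = [n_term, n_total - n_term]
    cols = [n_disordered, n_total - n_disordered]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            exp_ij = rows[i] * cols[j] / n_total
            chi2 += (observed[i][j] - exp_ij) ** 2 / exp_ij
    expected = rows[0] * cols[0] / n_total
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return TermResult(a, expected, chi2, p_value)


def enriched_in_disorder(result: TermResult) -> bool:
    """Filter used in the steps above: observed count greater than expected."""
    return result.observed > result.expected
```

As a usage example with made-up counts, go_term_chi_square(100, 20, 10, 6) gives an expected count of 2.0 against 6 observed, so the term would pass the observed > expected filter.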
Updated summary of the PANNZER results after applying the PPV > 0.7 threshold.
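The PPV filter and the detection of shared GO terms can be sketched as below. The record layout (protein ID, GO ID, PPV) is a hypothetical simplification and not PANNZER's actual output format, which would need its own parser. The sketch keeps predictions with PPV of 0.7 or above; switch the comparison to a strict one if the threshold is meant as PPV > 0.7.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record: (protein_id, go_id, ppv); not PANNZER's real output layout.
Prediction = Tuple[str, str, float]


def terms_passing_ppv(predictions: Iterable[Prediction],
                      min_ppv: float = 0.7) -> Dict[str, Set[str]]:
    """Map each GO term to the set of proteins annotated with it at PPV >= min_ppv."""
    kept: Dict[str, Set[str]] = {}
    for protein_id, go_id, ppv in predictions:
        if ppv >= min_ppv:
            kept.setdefault(go_id, set()).add(protein_id)
    return kept


def shared_terms(human: Dict[str, Set[str]], camel: Dict[str, Set[str]]) -> Set[str]:
    """GO terms present in both datasets after filtering."""
    return set(human) & set(camel)
```

The per-term protein sets returned by terms_passing_ppv give the counts needed for the contingency tables directly, via len().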
Revigo representation of disordered proteins in Camel using GO:BP terms with PPV>0.7. The treemap appearance was improved using DrasticData.
The full list of GO terms with their descriptions (colored boxes) and their general representatives (white boxes) is provided by Revigo as a table, together with other columns such as frequency and uniqueness.
Revigo representation of GO:BP terms (with PPV>0.7) that are more enriched in Camel disordered proteins than in Human disordered proteins. The treemap appearance was improved using DrasticData.
The full list of GO terms with their descriptions (colored boxes) and their general representatives (white boxes) is provided by Revigo as a table, together with other columns such as frequency and uniqueness.
Revigo representation of GO:BP terms (with PPV>0.7) that are more enriched in DBRs in Camel than in Human. The treemap appearance was improved using DrasticData.
The full list of GO terms with their descriptions (colored boxes) and their general representatives (white boxes) is provided by Revigo as a table, together with other columns such as frequency and uniqueness.
Main reference paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3567104/pdf/pone.0055524.pdf