git clone https://github.com/tlawrence3/tSFM.git
tSFM
directory
cd tSFM
conda create -n tSFM python=3.8 pandas numpy cython pytest-runner pytest scipy patsy mpmath statsmodels
conda activate tSFM
python setup.py install
tsfm -h
tSFM may be used to analyze any protein or RNA families. As a quick introduction to the functionality of tSFM we will be utilizing data from:
Let us recreate function logos for the human tRNA data provided by this study.
First we need to change into the example data directory. This directory contains all of the aligned tRNA sequences used for analysis in Kelly et al. 2020.
cd Kelly2020_data
To create single site and basepair RNA function logos for the human tRNA data we can use the command:
tsfm -e NSB -x 1 -c tRNA_L_skel_Leish.sites74.struct.cove --logo -p 10 HOMO/HOMO
Let us break this command down so we can understand the options:
a. The below -e
option sets the entropy estimator to the Bayesian estimator NSB (which is the default). This can also be set to the older and more inaccurate Miller-Madow estimator with argument Miller
.
-e NSB
b. The -x
option indicates the maximum sample size for the calculation of the exact entropy correction (larger sample sizes use the estimator set with the -e
option). Correction values can feasibly be calculated for sample sizes up to ~16, but the NSB estimator is preferred except for the smallest sample sizes.
-x 1
c. The -c
option provides the path to the file containing the
secondary structure annotation in cove format for RNA input
files. This is required for basepair function logos and can be
provided in infernal
, cove
, or text
formats. These formats
are described in below in the Appendix section. The secondary
structure annotation must conform to and encode the RNA
structural alignment common to the input alignment files and
usually must be prepared together with them.
-c tRNA_L_skel_Leish.sites74.struct.cove
d. The below option indicates that we want graphical output of function logos in EPS format. If we excluded this option, tSFM would only produce the text output file with statistics on tRNA CIFs.
--logo
e. The -p
option indicates the number of processor cores we want to utilize during the calculations. Below we have indicated we want to use 10 cores.
-p 10
f. Lastly, the below argument provides the path and clade prefix
to the input alignment files containing our tRNA
sequences. Multiple sets of input files for multiple clades can
be processed at the same time by providing multiple
arguments. tSFM expects input alignment files in ClustalW
format, all named with a common prefix corresponding to their
clade followed by a distinct one-letter symbol for the
functional class of sequences in the alignment, separated by an
underscore character. For example, in the HOMO
directory, we
have multiple alignment files including HOMO_A.aln
, in which
the prefix HOMO
indicates the clade and the postfix _A
indicates the alignment file containing sequences for the
tRNA-Ala functional class.
HOMO/HOMO
For each clade on input, tSFM produces a text output file in the
same directory where the program is run. This command produces the
output file HOMO_CIFs.txt
, more generally
<clade-prefix>_CIFs.txt
. The text output file has two sections,
one for paired-site features and one for single-site features.
Examples of records from each of these two sections is shown below.
:-: | :-: | :-: | :--: | :-: | :-: | :-: | :-: bp:| (0, 72)| AU| 14| 2.789| NA| NA| X:0.861:NA:NA L:0.108:NA:NA K:0.032:NA:NA
The first record in the output file is for feature (0, 72)AU
: the
number of sequences sharing this features is in column N
, the
total information (the total height of the stack (in bits) from
function logo of feature AU
at coordinate (0,72)
) is in column
info
, the p-value of the calculated info is in column p-value
and the multiple-test corrected significance of the p-value is in
the next column (this column is named with the method used for
multiple test correction for significance calculations). Also, the
last column shows a list of information for each tRNA class at
feature (0, 72)AU
with the format
<class>:<symbol-height>:<pvalue>:<BH>
which refers to the tRNA
functional class single letter symbol, the height of the symbol
within the stack, the p-value and the corrected significance value,
respectively.
A record for the single site is similar to the paired-site except that the first column is ss
instead of bp
.
:-: | :-: | :-: | :--: | :-: | :-: | :-: | :-: ss: | 0 | C | 13| 4.131 |NA |NA |Y:1.000:NA:NA
The values for the p-values and the significance are NA because we did not calculate p-values in the command described above. See the section Statistical significance testing for CIFs
below, which describes options for p-value calculations.
To avoid having to manually enter the names and prefixes for each of the parasite clades we can take a shortcut by using bash loops. The below command will loop through each folder and produce function logos for each clade in the running directory.
for d in */; do tsfm -e NSB -x 5 -c tRNA_L_skel_Leish.sites74.struct.cove --logo -p 10 $d${d%/}; done
To create ID/KLD logos and the data table for the bubble plots in
Kelly et al. 2020. for clades HOMO
versus MAJOR
, we can run
the command below. Note that option --bubbles or -B requires
designation of a specific clade to use as a reference clade (for
gains and losses), using option --clade.
tsfm -c tRNA_L_skel_Leish.sites74.struct.cove -e MM -x 5 --idlogos --kldlogos -B --clade HOMO MAJOR/MAJOR HOMO/HOMO
The text output of this command will be two files
F_HOMO_B_MAJOR_Table.txt
and F_MAJOR_B_HOMO_Table.txt
which can
be used to create the bubble plots. Here "F" stands for
"Foreground" and "B" for "Background." An example of a record from
the output text file is shown below.
aa | coord | state | fbits | fht | gainbits | gainfht | lossbits | lossfht | convbits | convfht | x | y | sprinzl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
K | 1 | A | 2.7602 | 0.0 | 0.7285 | 0.0 | 0.0 | 0.0 | 0.2913 | 0.0 | 0.0 | 0.0 | 0.0 |
This example shows at feature 1A of ID logo: the total height of
stack-bar in column gainbits
, and the height of symbol K in this
stack-bar in column gainfht
. Also, it shows at feature 1A of KLD
logo: the total height of stack-bar in column convbits
and the
height of symbol K in this stack-bar in column convfht
. The
columns x
, y
and sprinzl
are set to 0 and will be filled
later prior to creating the bubble plots by mapping each feature to
the tRNA sprinzl coordinates.
tSFM implements statisitical significance testing for CIFs and
function logos using permutation-based null distributions and corrects
for multiple tests using multiple user-determined FDR or FWER
methods. The -P
option indicates the number of permutations
generated for building the null distributions and the -C
option
indicates the method for multiple testing correction. Options include
the Bonferroni, Sidak, Holm, Holm-Sidak, Simes-Hochberg, or Hommel
methods for FWER control, or the Benjamini-Hochberg,
Benjamini-Yekutieli or Gavrilov-Benjamini-Sarkar methods for FDR
control. The default option is BH
for the Benjamini-Hochberg FDR.
To calculate the significance of CIFs stack-heights with 100 permutations for the humans tRNA data using the NSB entropy estimator we can use the command:
tsfm -e NSB -x 5 -c tRNA_L_skel_Leish.sites74.struct.cove --logo -T stacks -P 100 HOMO/HOMO
The -T
option will set the significance test for only CIF total stack-heights. This can also be set to letters
to calculate the significance for only CIF letter-heights within stacks. The default is to calculate significance for both stacks and letters, but for many applications and comparisons, we recommend to test only stacks
. The difference plays into the number of comparisons or tests requiring correction, where generally fewer is better. If one is interested in only which CIFs are significant, use stacks.
If one wants to test specific functional associations, use letters,
otherwise use both
.
Similarly, you can restrict CIF calculation and significance testing
to only single-site features with the --single
option or to only
paired-site features with the --nosingle
option. The -P
option
will set the number of permutations for significance test of CIFs to
100.
tSFM implements statistical significance testing for CIF divergences, specifically Information Difference (ID) logos and Kullback-Leibler Divergence (KLD) logos
using permutation-based null distributions. P-values can be calculated
by one of three algorithms called ECDF
, ECDF_pseudo
, and GPD
chosen by the user through
the option -m
. The default method is GPD
.
The algorithm ECDF_pseudo
is the Monte Carlo permutation-based
approximation calculated according to the formula (number of
exceedances + 1)/(number of permutations + 1), where exceedances
means the fraction of permutation replicates yielding statistics as
large or larger than the observed divergence. Algorithm ECDF
is
similar to ECDF_pseudo
except that when the number of permutation
exceedances reaches the value of option --exceedances
(with default
10) the p-value will be calculated according to the formula (number of
exceedances)/(number of permutations) and returned, otherwise when
there are fewer exceedances, the ECDF_pseudo
formula is
used. Algorithm GPD
is implemented according to the
"Peaks-over-Threshhold" (PoT) tail-approximation approach described as
algorithm APPROXIMATE in the manuscript, which includes the ECDF
and
ECDF_pseudo
algorithms as special cases. tSFM also calculates the confidence intervals for all the
p-values except for those calculated with pseudocounts.
To calculate the significance of KLD divergence values with 100 permutations for the contrast of the human and MAJOR clades, we can specify which algorithm to use as demonstrated in the three examples below:
Method ECDF_pseudo
:
tsfm --kldperms 100 -m ECDF_pseudo -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO MAJOR/MAJOR
The --kldperms
option will set the number of permutations to compute the significance of KLD values to 100. More than one input clade must be specified.
Method ECDF
:
tsfm --kldperms 100 -m ECDF --alpha 0.01 -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO MAJOR/MAJOR
The --alpha
option will set the significance level to compute the confidence interval of p-values. Its default is 0.05.
Method GPD
:
tsfm --kldperms 100 -m GPD --targetperms 70 --exceedances 5 --peaks 50 -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO MAJOR/MAJOR
In addition to the previous options, there are options
--targetperms
and --peaks
specific to method GPD
. Options
targetperms
and peaks
are referred to as variables T and U,
respectively in the algorithm APPROXIMATE in the tSFM paper. Also,
the option exceedances
is referred to as parameter S and is used
for both methods ECDF
and GPD
. The default value for option
targetperms
is 500. The value of the option targetperms
should
be less than the maximum permutation number indicated with option
--kldperms
or --idperms
. The default value of exceedances
is
peaks
is 250; In the algorithm APPROXIMATE
the number of peaks over threshold will be set to the minimum of this value and one-third of the
permutations. The value of option peaks needs to be less than the
maximum permutation number.The output of KLD and ID logo significance from the examples described
above will be two text files named KLD_HOMO_MAJOR_stats.txt
and
KLD_MAJOR_HOMO_stats.txt
. An example of a record from the output
text file is shown below.
Coord | State | Statistic | Sample-Sz-Back | Sample-Sz-Fore | P-value | CI.Lower | CI.Upper | Adjusted-P | Permutations | P-Val-Method | GPD-shape | GPD-scale | Peaks | ADtest-P-val | Freqs-Back | Freqs-Fore |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | U | 0.75 | 30 | 72 | 2.46e-14 | 0.0 | 3.22e-05 | 7.07e-13 | 70.0 | p_gpd | 0.15 | 0.00037 | 13.0 | 0.97 | D13E16Y1 | D24E24N24s |
This record shows the significance of the KLD statistic at feature U2
along with other information at this feature including: confidence
interval (with level determined by option --alpha
) in columns
CI.Lower
and CI.Upper
, multiple-test adjusted p-value in column
Adjusted-P
, number of permutations with which the p-value is
calculated in column Permutations
, the method used for calculating
the p-value in column P-Val-Method
which can take the values:
p_ecdf
, p_ecdf_with_pseudo
, p_ecdf_with_pseudo (p_gpd=0)
and
p_gpd
. If the p-value is calculated with GPD, the parameters of the
GPD calculation will be shown in columns GPD-shape
, GPD-scale
and
Peaks
describing the maximum likelihood estimated parameters of the
GPD distribution and the number of peaks over threshold. Also the
column ADtest-P-val
shows the pvalue of the goodness-of-fit test of
the extreme permutation values to the GPD distribution.
To calculate the significance of ID-values of single-site features only, for clade HOMO against clades MAJOR and ENRIETTII with 100 permutations using the method GPD we can use this command (note that running this command can take about 30 minutes.)
tsfm --idperms 100 -m GPD --targetperms 50 --exceedances 5 --peaks 50 --single -e NSB -x 5 --clade HOMO HOMO/HOMO MAJOR/MAJOR ENRIETTII/ENRIETTII
The option --idperms
will set the number of permutations to compute significance of ID values to 100
--idperms 100
The option --clade
will contrast two clades MAJOR and ENRIETTII against HOMO rather than all-against-all:
--clade HOMO
The option --targetperms
will set the permutation number at which we attempt to fit the extreme permutation values to GPD.
--targetperms 50
To calculate the significance of KLD-values of paired-site features only, for clade HOMO against clades MAJOR with 100 permutations using the method GPD use the command below. The consensus secondary structure of tRNAs indicated by option -c is required to calculate functional information of base-pair features.
tsfm --kldperms 100 -m GPD --targetperms 50 --exceedances 5 --peaks 50 --nosingle -c tRNA_L_skel_Leish.sites74.struct.cove --clade HOMO HOMO/HOMO MAJOR/MAJOR
file_prefix One or more paths/file-prefix strings corresponding to
sets of input files compiled for a single clade in
clustalW format. Input files should be named
<path>/<prefix>_<functional-class>.<extension>, where
<functional_class> is a single letter
optional arguments:
-h, --help show this help message and exit
-i INFERNAL, --infernal INFERNAL
Use secondary structure file INFERNAL, required to
calculate functional information of base-pair
features, in Infernal format
-c COVE, --cove COVE Use secondary structure file COVE, required to
calculate functional information of base-pair
features, in COVE format. Example:
#=CS >>>>>>>..>>>>...........<<<<.>>>>>.......<<<<<.....>>>>>....
#=CS ...<<<<<<<<<<<<.
-t TEXT, --text TEXT Use secondary structure file TEXT, required to
calculate functional information of base-pair
features, is in text format. Example:
"A:0,72,1,71,2,70,3,69,4,68,5,67,6,66
D:9,25,10,24,11,23,12,22
C:27,43,28,42,29,41,30,40,31,39
T:49,65,50,64,51,63,52,62,53,61"
-s, --single Do not calculate functional information for paired
features. Calculate for single-site features only.
-n, --nosingle Do not calculate functional information for single-
site features. Calculate for paired-site features
only.
-V, --version show program's version number and exit
-p PROCESSES, --processes PROCESSES
Set the maximum number of concurrent processes.
Default is the number of cores reported by the
operating system.
-e {NSB,MM}, --entropy {NSB,MM}
Use entropy estimator ENTROPY when conditional sample
sizes exceed maximum for exact calculation. If value
is "NSB", use Nemenman-Shafee-Bialek estimator. If
value is "MM", use Miller-Madow estimator. Default is
NSB
-x EXACT, --exact EXACT
Maximum conditional sample size to exactly calculate
entropy correction. Default = 5
-l, --logos Visualize feature information using function/inverse
function logos in extended postscript format.
-v, --inverse Additionally calculate functional information for
inverse features/logos (features under-represented in
specific functional classes)
-P PERMUTATIONS, --permutations PERMUTATIONS
Calculate the significance of CIFs by a permutation
test, with a number of permutations equal to
PERMUTATIONS (an integer). Default is to not calculate
significance of CIFs.
-C, --correction {bonferroni,sidak,holm,holm-sidak,simes-hochberg,hommel,BH,BY,GBS}
Specify a method for multiple test correction for
significance calculations: bonferroni, sidak, holm,
holm-sidak, simes-hochberg, hommel, BH (Benjamini-
Hochberg FDR), BY (Benjamini-Yekutieli FDR) or GBS
(Gavrilov-Benjamini-Sarkar FDR). Default is BH
-T {stacks,letters,both}
Test the significance of only CIF stack-heights, only
CIF letter-heights, or both. Default is both.
-I, --idlogos Compute Information Differece logos for each pair of
clades
-K, --kldlogos Compute Kullback-Liebler Divergence logos for each
pair of clades
-B, --bubbles Compute input table for structural bubble-plots (to be
computed in R) like those appearing in Kelly et al.
(2020).
--kldperms KLDPERMS Set the number of permutations to compute significance
of Kullback-Leibler Divergences. Default is to not
calculate signifiance.
--idperms IDPERMS Set the number of permutations to compute significance
of Information Differences (SLOW). Default is to not
calculate signifiance.
-J, --JSD Produce pairwise distance matrices between function
logos for different taxa based on Jensen-Shannon
Divergences
--clade CLADE Contrast clade CLADE against all others. CLADE should
be one of the file-prefix strings passed as a required
argument to the program, stripped of its path. Default
is to compute contrasts for all pairs of clades.
-m {GPD,ECDF_pseudo,ECDF}, --pmethod {GPD,ECDF_pseudo,ECDF}
Set the p-value calculation algorithm for KLD/ID CIF Divergence significance calculations.
If value is "GPD", p-values for large divergences will be estimated by the Peaks-over-
Threshold method based on the Generalized Pareto Distribution, those for small divergences
with E exceedances (default 10) will be calculated as a binomial proportion (ECDF method)
and p-values for divergences that can't be estimated by either of those methods will be
estimated as a binomial proportion with pseudo-counts (ECDF_pseudo method). If value is
"ECDF", all p-values will be calculated by ECDF method or ECDF_pseudo method. If value is
"ECDF_pseudo", all the p-values will be calculated using ECDF_pseudo method. Default is GPD
--targetperms TARGETPERMS
Set the initial target number of permutations at which GPD-pvalue will be initially
calculated. Default is 500.
--exceedances EXCEEDANCES
Set the number of exceedances for which ECDF-based p-values will be calculated without
pseudo-counts. Default is 10.
--peaks PEAKS Set the number of Peaks-over-Threshold to use to initially estimate GPD. The actual value
used will be the minimum of this value and one-third of permutations. Default is 250.
--alpha ALPHA Set the significance level to compute the confidence interval of pvalues. Default is 0.05
Creating the supplemental figure
Create the table of statistics for KLD logos of clade ENRIETTII
vs HOMO
using the pvalue-calculation-method GPD
with maximum permutation number = 10000, target permutation number 500, and calculate the Confidence Interval of p-values with 5% significance level. The running time on an Intel Core i7 Dell XPS was ~34 minutes. The outputs are two text files: KLD_ENRIETTII_HOMO_stats.txt and KLD_HOMO_ENRIETTII_stats.txt
mkdir GPD
cd Kelly2020_data
tsfm --kldperms 10000 -m GPD -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO ENRIETTII/ENRIETTII
mv KLD_ENRIETTII_HOMO_stats.txt KLD_HOMO_ENRIETTII_stats.txt GPD
Create the table of statistics for KLD logos of clade MAJOR
vs HOMO
using the pvalue-calculation-method GPD
with maximum permutation number = 10000, target permutation number 500, and calculate the Confidence Interval of p-values with 5% significance level. The running time on an Intel Core i7 Dell XPS was ~52.6 minutes. The outputs are two text files: KLD_MAJOR_HOMO_stats.txt and KLD_HOMO_MAJOR_stats.txt
tsfm --kldperms 10000 -m GPD -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO MAJOR/MAJOR
mv KLD_MAJOR_HOMO_stats.txt KLD_HOMO_MAJOR_stats.txt GPD
Create the figure using the KLD tables in GPD folder by running this R script:
library(latex2exp)
library(ggplot2)
library(ggpubr)
#setwd("PATH/TO/WHERE/THE/FOLDERS/GPD/ECDF/ECDF_pseudo/ARE/LOCATED") # make sure to set your working directory to where the input data folders GPD, ECDF and ECDF_pseudo are located.
spetralc <-c("#D53E4F" ,"#F46D43" ,"#FDAE61" , "#66C2A5" ,"#3288BD","#313695")
breaksarray <- c(1,2,4,10,30,80,200,500)
legendbreaks <- logb(breaksarray, base = 2)
legendlables <- as.character(breaksarray)
KLD_paths <-
c(
"GPD/KLD_MAJOR_HOMO_stats.txt",
"GPD/KLD_HOMO_MAJOR_stats.txt",
"GPD/KLD_ENRIETTII_HOMO_stats.txt",
"GPD/KLD_HOMO_ENRIETTII_stats.txt"
)
xlabels <- c("Kullback-Leibler Divergence","Kullback-Leibler Divergence","Kullback-Leibler Divergence","Kullback-Leibler Divergence")
clades <- c("MAJOR (n=664 sequences) against Homo (m=431 sequences)", "HOMO against MAJOR","ENRIETTII (n=160 sequences) against Homo (m=431 sequences)", "HOMO against ENRIETTII")
plotls <- list()
for (i in 1:length(KLD_paths)) {
signTable <- read.table(KLD_paths[i], header = TRUE, sep = "\t")
signTable <- signTable[signTable$Sample.Sz.Back != 0 & signTable$Sample.Sz.Fore != 0 & signTable$Statistic != 0, ]
signTable$harmonicmean <- 0
for (j in 1:nrow(signTable)) {
signTable$harmonicmean[j] = 1 / mean(1 / c(signTable$Sample.Sz.Back[j], signTable$Sample.Sz.Fore[j]))
}
p <- ggplot(signTable, aes_string( x = round(signTable$Statistic, 4), y = -log(signTable$P.value, base = 2),colour = logb(signTable$harmonicmean, 2))) +
geom_point(aes_string( shape=signTable$P.Val.Method, colour = logb(signTable$harmonicmean, 2),size = (signTable$Permutations)))+
scale_colour_gradientn( colours = spetralc, breaks = legendbreaks, labels = legendlables, guide = "colourbar", limits = c(0, 9.2)) + ylim(0,60)+# xlim(0,xlimits[i])+
scale_size_continuous(limits = c(10,10001),breaks = c(10,2000,5000,10001),labels = c("10","2000","5000","10000"))+
geom_errorbar(aes(ymin=-log(P.value + CI.Upper, base = 2), ymax=-log(P.value - CI.Lower, base = 2)))+
scale_shape_manual( values = c(2,3,4,1),
breaks = c( "p_ecdf", "p_ecdf_with_pseudo", "p_ecdf_with_pseudo (p_gpd=0)", "p_gpd" ),
labels = c( "Pecdf", "Pecdf with pseudo", "Pecdf with_pseudo (pgpd=0)", "Pgpd" ))+
ggtitle(clades[i]) + ylab(TeX("$Surprisal (-log_2(p))$")) + xlab(xlabels[i]) + labs(color = "Sample Size", shape = "PmethodType",size = "Permutation number") +
theme(
axis.text = element_text(size = 13, family = "serif"),
axis.title = element_text(size = 14, family = "serif"),
legend.text = element_text(size = 12.5, family = "serif"),
legend.title = element_text(size = 14, family = "serif"),
plot.title = element_text(hjust = 0.5, family = "serif", size = 15, face = "bold" ),
legend.key.height = unit(0.9, "cm"),
plot.margin=unit(c(0.2,0.2,0.5,0.2), "cm")
)
plotls[[i]] <- p
}
setEPS()
postscript("KLD-Significance_ENRIETTII_MAJOR.eps", width = 16, height = 8,fonts=c("serif", "Palatino"))
p <- ggarrange( plotls[[1]], plotls[[2]], plotls[[3]], plotls[[4]],ncol = 2, nrow = 2, common.legend = TRUE,legend.grob = get_legend(plotls[[2]]), legend = "right",labels = c("A", "B", "C","D"))
print(p)
dev.off()
Creating the Supplemental Figure
Create the table of statistics for ID logos of clade MAJOR
vs HOMO
using the pvalue-calculation-method GPD
with maximum permutation number = 10000, target permutation number 500, and calculate the Confidence Interval of p-values with 5% significance level. This will use the default entropy estimator NSB with default Maximum conditional sample size 5. The running time on an Intel Core i7 Dell XPS was ~4.76 hrs. The outputs are two text files: ID_MAJOR_HOMO_stats.txt and ID_HOMO_MAJOR_stats.txt
tsfm --idperms 10000 -m GPD -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO MAJOR/MAJOR
mv ID_MAJOR_HOMO_stats.txt ID_HOMO_MAJOR_stats.txt GPD
Create the table of statistics for ID logos of clade ENRIETTII
vs HOMO
using the pvalue-calculation-method GPD
with maximum permutation number = 10000, target permutation number 500, and calculate the Confidence Interval of p-values with 5% significance level. Default -e NSB -x 5. The running time on an Intel Core i7 Dell XPS was ~4.89 hrs. The outputs are two text files: ID_ENRIETTII_HOMO_stats.txt and ID_HOMO_ENRIETTII_stats.txt
tsfm --idperms 10000 -m GPD -c tRNA_L_skel_Leish.sites74.struct.cove HOMO/HOMO ENRIETTII/ENRIETTII
mv ID_ENRIETTII_HOMO_stats.txt ID_HOMO_ENRIETTII_stats.txt GPD
Create the figure using the ID tables in GPD folder by running this R script:
library(latex2exp)
library(ggplot2)
library(ggpubr)
#setwd("PATH/TO/WHERE/THE/FOLDERS/GPD/ECDF/ECDF_pseudo/ARE/LOCATED") # make sure to set your working directory to where the input data folders GPD, ECDF and ECDF_pseudo are located.
spetralc <-c("#D53E4F" ,"#F46D43" ,"#FDAE61" , "#66C2A5" ,"#3288BD","#313695")
breaksarray <- c(1,2,4,10,30,80,200,500)
legendbreaks <- logb(breaksarray, base = 2)
legendlables <- as.character(breaksarray)
ID_paths <-
c(
"GPD/ID_MAJOR_HOMO_stats.txt",
"GPD/ID_HOMO_MAJOR_stats.txt",
"GPD/ID_ENRIETTII_HOMO_stats.txt",
"GPD/ID_HOMO_ENRIETTII_stats.txt"
)
xlabels <- c("Information Difference","Information Difference","Information Difference","Information Difference")
clades <- c("MAJOR (n=664 sequences) against Homo (m=431 sequences)", "HOMO against MAJOR", "ENRIETTII (n=160 sequences) against Homo (m=431 sequences)", "HOMO against ENRIETTII")
plotls <- list()
for (i in 1:length(ID_paths)) {
signTable <- read.table(ID_paths[i], header = TRUE, sep = "\t")
signTable <- signTable[signTable$Sample.Sz.Back != 0 & signTable$Sample.Sz.Fore != 0 & signTable$Statistic != 0, ]
signTable$harmonicmean <- 0
for (j in 1:nrow(signTable)) {
signTable$harmonicmean[j] = 1 / mean(1 / c(signTable$Sample.Sz.Back[j], signTable$Sample.Sz.Fore[j]))
}
p <- ggplot(signTable, aes_string( x = round(signTable$Statistic, 4), y = -log(signTable$P.value, base = 2),colour = logb(signTable$harmonicmean, 2))) +
geom_point(aes_string( shape=signTable$P.Val.Method, colour = logb(signTable$harmonicmean, 2),size = (signTable$Permutations)))+
scale_colour_gradientn( colours = spetralc, breaks = legendbreaks, labels = legendlables, guide = "colourbar", limits = c(0, 9.2)) + ylim(0,60)+# xlim(0,xlimits[i])+
scale_size_continuous(limits = c(10,10001),breaks = c(10,2000,5000,10001),labels = c("10","2000","5000","10000"))+
geom_errorbar(aes(ymin=-log(P.value + CI.Upper, base = 2), ymax=-log(P.value - CI.Lower, base = 2)))+
scale_shape_manual( values = c(2,3,4,1),
breaks = c( "p_ecdf", "p_ecdf_with_pseudo", "p_ecdf_with_pseudo (p_gpd=0)", "p_gpd" ),
labels = c( "Pecdf", "Pecdf with pseudo", "Pecdf with_pseudo (pgpd=0)", "Pgpd" ))+
ggtitle(clades[i]) + ylab(TeX("$Surprisal (-log_2(p))$")) + xlab(xlabels[i]) + labs(color = "Sample Size", shape = "PmethodType",size = "Permutation number") +
theme(
axis.text = element_text(size = 13, family = "serif"),
axis.title = element_text(size = 14, family = "serif"),
legend.text = element_text(size = 12.5, family = "serif"),
legend.title = element_text(size = 14, family = "serif"),
plot.title = element_text(hjust = 0.5, family = "serif", size = 15, face = "bold" ),
legend.key.height = unit(0.9, "cm"),
plot.margin=unit(c(0.2,0.2,0.5,0.2), "cm")
)
plotls[[i]] <- p
}
setEPS()
postscript("ID-Significance_ENRIETTII_MAJOR.eps", width = 16, height = 8,fonts=c("serif", "Palatino"))
p <- ggarrange( plotls[[1]], plotls[[2]], plotls[[3]], plotls[[4]],ncol = 2, nrow = 2, common.legend = TRUE,legend.grob = get_legend(plotls[[2]]), legend = "right",labels = c("A", "B", "C","D"))
print(p)
dev.off()
Creating the slope graph from tSFM manuscript.
Create the table of statistics for KLD logos of clade ENRIETTII
vs HOMO
using the pvalue-calculation-method ECDF_pseudo
(naive Monte Carlo method using pseudo-counts) with maximum permutation number = 10000. The running time on a Linux compute node with 20 cores at 2301 MHz and 120 GB RAM was ~7.72 hrs. The outputs are two text files: KLD_ENRIETTII_HOMO_stats.txt and KLD_HOMO_ENRIETTII_stats.txt
mkdir ECDF_pseudo
tsfm --kldperms 10000 -m ECDF_pseudo -c tRNA_L_skel_Leish.sites74.struct.cove --clade HOMO HOMO/HOMO MAJOR/MAJOR
mv KLD_ENRIETTII_HOMO_stats.txt KLD_HOMO_ENRIETTII_stats.txt ECDF_pseudo
Create the table of statistics for KLD logos of clade ENRIETTII
vs HOMO
using the pvalue-calculation-method ECDF
(Monte Carlo method terminating after M = 10 exceedances) with maximum permutation number = 10000. The running time on a Linux compute node with 20 cores at 2301 MHz and 120 GB RAM was ~4.86 hrs. The outputs are two text files: KLD_ENRIETTII_HOMO_stats.txt and KLD_HOMO_ENRIETTII_stats.txt
mkdir ECDF
tsfm --kldperms 10000 -m ECDF -c tRNA_L_skel_Leish.sites74.struct.cove --clade HOMO HOMO/HOMO MAJOR/MAJOR
mv KLD_ENRIETTII_HOMO_stats.txt KLD_HOMO_ENRIETTII_stats.txt ECDF
Create the figure using the KLD tables for ENRIETTII against HOMO from three folders GPD, ECDF and ECDF_pseudo
library(latex2exp)
library(ggplot2)
library(ggpubr)
#setwd("PATH/TO/WHERE/THE/FOLDERS/GPD/ECDF/ECDF_pseudo/ARE/LOCATED") # make sure to set your working directory to where the input data folders GPD, ECDF and ECDF_pseudo are located.
signTable0 <- read.table("ECDF_pseudo/KLD_ENRIETTII_HOMO_stats.txt", header = TRUE, sep = "\t")
signTable1 <- read.table("ECDF/KLD_ENRIETTII_HOMO_stats.txt", header = TRUE, sep = "\t")
signTable2 <- read.table("GPD/KLD_ENRIETTII_HOMO_stats.txt", header = TRUE, sep = "\t")
signTable0 <- signTable0[signTable0$Sample.Sz.Back != 0 & signTable0$Sample.Sz.Fore != 0 & signTable0$Statistic != 0,]
signTable0$method <- "ECDF-with-pseudo"
signTable1 <- signTable1[signTable1$Sample.Sz.Back != 0 & signTable1$Sample.Sz.Fore != 0 & signTable1$Statistic != 0,]
signTable1$method <- "ECDF"
signTable2 <- signTable2[signTable2$Sample.Sz.Back != 0 & signTable2$Sample.Sz.Fore != 0 & signTable2$Statistic != 0,]
signTable2$method <- "GPD with Initial-Target-Permnum = 500"
signTable0$P.Val.Method <- "p_ecdf_with_pseudo"
common_cols <- intersect(colnames(signTable0), colnames(signTable1))
bindeddf <- rbind(
subset(signTable0, select = common_cols),
subset(signTable1, select = common_cols),
subset(signTable2, select = common_cols)
)
bindeddf$coordstate <- paste(bindeddf$Coord,bindeddf$State,sep = "")
bindeddf <- bindeddf[order(bindeddf$coordstate),]
bindeddf$method <- factor(bindeddf$method,levels = c("ECDF-with-pseudo","ECDF","GPD with Initial-Target-Permnum = 500"))
bindeddf$harmonicmean <- 0
for (j in 1:nrow(bindeddf)) {
bindeddf$harmonicmean[j] = 1 / mean(1 / c(bindeddf$Sample.Sz.Back[j], bindeddf$Sample.Sz.Fore[j]))
}
spetralc <-c("#D53E4F" ,"#F46D43" ,"#FDAE61" , "#66C2A5" ,"#3288BD","#313695")
breaksarray <- c(1,2,4,10,30,80,200,500)
legendbreaks <- logb(breaksarray, base = 2)
legendlables <- as.character(breaksarray)
p <- ggplot(data = bindeddf, aes( x = method , y = jitter(-log(P.value, base = 2),factor = 80), group = coordstate )) +
geom_line(aes(color = logb(harmonicmean, base = 2)), size = 0.5) +
geom_point(aes(color = logb(harmonicmean, base = 2),shape=factor(P.Val.Method)), size = 2) +
scale_x_discrete(position = "top") +
scale_colour_gradientn( colours = spetralc, breaks = legendbreaks, labels = legendlables, guide = "colourbar", limits = c(0, 9.2) ) +
theme( legend.key.height = unit(0.7, 'in'), plot.title = element_text(hjust = 0.5), text = element_text(size = 24, family = "serif")) +
scale_shape_manual( values = c(2,3,4,1), breaks = c( "p_ecdf", "p_ecdf_with_pseudo", "p_ecdf_with_pseudo (p_gpd=0)", "p_gpd"),
labels = c( "pecdf", "pecdf with pseudo", "pecdf with pseudo (pgpd=0)", "pgpd" ))+
labs(title = "" , colour = "Sample Size",shape="Pmethod", size = "Permutation number") +
xlab("") + ylab(TeX("$Surprisal (-log_2(p))$"))
setEPS()
postscript("KLD_ENRIETTII_Slopegraph_GPDvsECDF.eps", width = 18, height = 10,fonts=c("serif", "Palatino"))
print(p)
dev.off()