Open ndreey opened 1 year ago
There were no mismatches for the genomes of the endophytes, only in the host genome. Indicating that for the gsa of the endophytes, hc levels do not correlate with less or more mismatches.
As there is no host genome in the 00 samples, the 0.0 value for the host genome should be excluded from the correlation calculation. Therefore, I will use sample 01:095
.
To determine if the data is parametric or not I used the Shapiro-Wilk test as the sample size is low (n=10, n < 50)
Shapiro-Wilk normality test
data: host_mm
W = 0.91374, p-value = 0.3077
The test concluded that there is not sufficient evidence to reject the null hypothesis being that the data is normally distributed. Therefore, I will use a parametric test to determine if there exists a statistically significant correlation between the number of mismatches per 100kbp and increased HC levels.
Pearson's product-moment correlation
data: hc[2:11] and host_mm
t = -4.0165, df = 8, p-value = 0.00386
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9554058 -0.3874369
sample estimates:
cor
-0.8176141
We see that there is indeed a negative correlation. The number of mismatches decreases with the host genome. In summary, the number of mismatches has a statistically significant negative correlation (p = 0.00386, r = -0.82) with higher HC levels (FOR HOST). This is understandable as we get more host genome data with higher HC. However, overall, GSA shows no correlation with HC levels.
It is the ratio of how many contigs map to multiple sites of the genome. Thus, the closer you are to 1, the better. Higher than 1 indicates that contigs map to multiple sites of the genome.
What we see is that contigs map to multiple places on the pl_vi_unk
genomes. pl_vi_unk
are the plasmids, virus, and unknown plasmids. In other words, quite small genomes, so it is not odd that they are getting higher duplication ratio.
Here is the duplication ratio if we exclude the pl_vi_unk
What we can conclude is that the median duplication ratio for the gsa for each sample is 1.
There were zero misassemblies for the GSA except for P. zijnensis (host) genome in samples 03, 07, and 08.
The number of aligned bp divided by the reference genome size. Good indicator of assembly quality.
What is clear from the figure is that there is a negative trend with higher HC. The *Kendall'srank correlation tau test confirms the trend. Kendal tau seems to be more advocated than Spearman**, but I will have to go in deeper to answer why.
> cor.test(hc, median, method ="kendall", exact = T)
Kendall's rank correlation tau
data: hc and median
T = 1, p-value = 5.511e-07
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.9636364
The reason for this is that the proportion of endophytic reads are diminishing with higher HC proportions. Thus, we have the negative correlation. If we look at only the host genome, P. zijinensis a perfect positive correlation is observed.
> cor.test(hc, as.numeric(host), method = "kendall")
Kendall's rank correlation tau
data: hc and as.numeric(host)
T = 45, p-value = 5.511e-07
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
1
The NGA50 is a metric used to evaluate the quality of a genome assembly that takes into account gaps in the reference genome. It is the length of the smallest set of aligned contigs (or scaffolds) that, when combined and including gaps, cover at least 50% of the size of the reference genome.
As the figure shows, a strong negative correlation exists.
> cor.test(hc, median, method ="kendall", exact = T)
Kendall's rank correlation tau
data: hc and median
T = 1, p-value = 0.0003968
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.9285714
Firstly i have to assemble the raw and filtered reads. Then run them through metaQUAST. At the moment, we have:
Ideally, we maybe can find that the correlations are opposite to GSA
There is a positive correlation with increased HC and num mismatches per 100kbp.
Kendall's rank correlation tau
data: hc and raw_median
T = 49, p-value = 0.0003334
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.7818182
All values were NA for raw...
Believe its the lack of reads...
Significant strong negative corr for GSA p-value = 5.511e-07 r = -0.9636364
Barley sign. weak negative corr for raw p-value = 0.04053 r = -0.4909091
I need to run concoct to generate bins that i can run through BUSCO(??) to see if there is a difference.
Here we see the distribution of contig lengths. x00 is the raw sample that all will use, as it has not been tampered with.
Because of the limitation to CPU power, i will bin based on 700bp length.
Data analysis of gsa, raw, x1000c, x700c, and maxbin. Evaluating if post-sequencing reads removal improves assembly quality.
Replicant | p-value | Correlation | Test |
---|---|---|---|
raw | 0.0003334 | 0.78 | Kendall's Tau |
x1000c | 0.0003334 | 0.78 | Kendall's Tau |
x700c | 0.0003334 | 0.78 | Kendall's Tau |
maxbin | 0.001591 | 0.71 | Kendall's Tau |
Fig1. Median num. mismatches per 100kbp, per replicant. (Potting all lines on one plot created hard-to-read spaghetti. GSA is zero, think it looks better without plotting it and just focusing on the filtered Fig2. Dispersion of num. mismatches per 100kbp for each replicant.
Good discussion point on why this turned out to not be a good metric (low data size). Replicant | p-value | Correlation | Test |
---|---|---|---|
raw | 0.02616 | 0.59 | Kendall's Tau |
x1000c | 0.02616 | 0.59 | Kendall's Tau |
x700c | 0.2059 | 0.34 | Kendall's Tau |
maxbin | NA | NA | Kendall's Tau |
maxbin medians were all 1
Fig3. Dispersion of the duplication ratios per genome per replicant. B is misleading. The difference is minimal.
No dice... Everything is NA, as there is too little data. Adding --min-contig 250
to metaquast
might salvage some data (default is set to 500), but there is no time + it would be forcing something out of the data. Further, contigs less than 500 is quite small..(?)
Replicant | p-value | Correlation | Test |
---|---|---|---|
gsa | 5.511e-07 | -0.96 | Kendall's Tau |
raw | 0.04053 | -0.49 | Kendall's Tau |
x1000c | 0.03556 | -0.49 | Kendall's Tau |
x700c | 0.02638 | -0.53 | Kendall's Tau |
maxbin | 0.2183 | -0.31 | Kendall's Tau |
Fig5. Dispersion (FIX THE LEGEND, COLORS DO NOT MATCH)
Fig6. Linear plot of the means
However, i believe it is more correct to set NA's to zero as we know all genomes have a set proportion. Then the plot looks like this...
Quick overview / initial idea
Correlation
Determine if there exists a correlation with HC and metric.
Non-parametric
Parametric
Host depletion by bin removal
In the best-possible situation, we were able to remove all host DNA post-sequencing and assembly... do we see a trend where performance is increased?