ndreey commented 1 year ago

Quick overview / initial idea

Correlation

Determine if there exists a correlation with HC and metric.

Determine if data is parametric or non-parametric.
Do a correlation test.
Do a significance test.
Non-parametric
- Spearman's rank correlation
- Wilcoxon rank-sum test or Mann-Whitney U test
  Parametric
- Pearsons correlation
- t-test

Host depletion by bin removal

In the best-possible situation, we were able to remove all host DNA post-sequencing and assembly... do we see a trend where performance is increased?

Benchmark (worst-possible situation, extreme neg value).
Remove all host bins to create a best-possible situation (extreme pos value).
- Remove the contigs belonging to the host genome. in the GSA.
Do the binning to get an estimate of a real-life situation?

ndreey commented 1 year ago

Assembly benchmark

Num mismatches per 100kbp

There were no mismatches for the genomes of the endophytes, only in the host genome. Indicating that for the gsa of the endophytes, hc levels do not correlate with less or more mismatches.

As there is no host genome in the 00 samples, the 0.0 value for the host genome should be excluded from the correlation calculation. Therefore, I will use sample 01:095.

To determine if the data is parametric or not I used the Shapiro-Wilk test as the sample size is low (n=10, n < 50)

Shapiro-Wilk normality test

data:  host_mm
W = 0.91374, p-value = 0.3077

The test concluded that there is not sufficient evidence to reject the null hypothesis being that the data is normally distributed. Therefore, I will use a parametric test to determine if there exists a statistically significant correlation between the number of mismatches per 100kbp and increased HC levels.

Pearson's product-moment correlation

data:  hc[2:11] and host_mm
t = -4.0165, df = 8, p-value = 0.00386
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9554058 -0.3874369
sample estimates:
       cor 
-0.8176141

We see that there is indeed a negative correlation. The number of mismatches decreases with the host genome. In summary, the number of mismatches has a statistically significant negative correlation (p = 0.00386, r = -0.82) with higher HC levels (FOR HOST). This is understandable as we get more host genome data with higher HC. However, overall, GSA shows no correlation with HC levels.

Duplication ratio

It is the ratio of how many contigs map to multiple sites of the genome. Thus, the closer you are to 1, the better. Higher than 1 indicates that contigs map to multiple sites of the genome.

What we see is that contigs map to multiple places on the pl_vi_unk genomes. pl_vi_unk are the plasmids, virus, and unknown plasmids. In other words, quite small genomes, so it is not odd that they are getting higher duplication ratio.

Here is the duplication ratio if we exclude the pl_vi_unk

What we can conclude is that the median duplication ratio for the gsa for each sample is 1.

Misassemblies

There were zero misassemblies for the GSA except for P. zijnensis (host) genome in samples 03, 07, and 08.

Genome Fraction

The number of aligned bp divided by the reference genome size. Good indicator of assembly quality.

What is clear from the figure is that there is a negative trend with higher HC. The *Kendall'srank correlation tau test confirms the trend. Kendal tau seems to be more advocated than Spearman**, but I will have to go in deeper to answer why.

> cor.test(hc, median, method ="kendall", exact = T)

    Kendall's rank correlation tau

data:  hc and median
T = 1, p-value = 5.511e-07
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
-0.9636364

The reason for this is that the proportion of endophytic reads are diminishing with higher HC proportions. Thus, we have the negative correlation. If we look at only the host genome, P. zijinensis a perfect positive correlation is observed.

> cor.test(hc, as.numeric(host), method = "kendall")

    Kendall's rank correlation tau

data:  hc and as.numeric(host)
T = 45, p-value = 5.511e-07
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau 
  1

NGA50

The NGA50 is a metric used to evaluate the quality of a genome assembly that takes into account gaps in the reference genome. It is the length of the smallest set of aligned contigs (or scaffolds) that, when combined and including gaps, cover at least 50% of the size of the reference genome.

As the figure shows, a strong negative correlation exists.

> cor.test(hc, median, method ="kendall", exact = T)

    Kendall's rank correlation tau

data:  hc and median
T = 1, p-value = 0.0003968
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
-0.9285714

ndreey commented 1 year ago

Evaluating if filtered have a difference.

Firstly i have to assemble the raw and filtered reads. Then run them through metaQUAST. At the moment, we have:

GSA
Raw
1000c_filt
700c_filt
max_filt

Ideally, we maybe can find that the correlations are opposite to GSA

While we wait for the filtered data

There is a positive correlation with increased HC and num mismatches per 100kbp.

Kendall's rank correlation tau

data:  hc and raw_median
T = 49, p-value = 0.0003334
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.7818182

All values were NA for raw...

Believe its the lack of reads...

Significant strong negative corr for GSA p-value = 5.511e-07 r = -0.9636364 Barley sign. weak negative corr for raw p-value = 0.04053 r = -0.4909091

ndreey commented 1 year ago

Bins to compare

I need to run concoct to generate bins that i can run through BUSCO(??) to see if there is a difference.

CONCOCT contig length threshold

Here we see the distribution of contig lengths. x00 is the raw sample that all will use, as it has not been tampered with.

Because of the limitation to CPU power, i will bin based on 700bp length.

ndreey commented 1 year ago

Assembly plots

Data analysis of gsa, raw, x1000c, x700c, and maxbin. Evaluating if post-sequencing reads removal improves assembly quality.

Raw: Assembly of raw trimmed reads.
x1000c: Assembly of bin filtered reads (binned by contig length 1000bp)
x700c: Assembly of bin filtered reads (binned by contig length 700bp)
maxbin: Assembly of optimal post-sequencing filtration. All reads mapping to host were removed.
Number of mismatches per 100kbp

Replicant p-value Correlation Test

raw 0.0003334 0.78 Kendall's Tau

x1000c 0.0003334 0.78 Kendall's Tau

x700c 0.0003334 0.78 Kendall's Tau

maxbin 0.001591 0.71 Kendall's Tau

Replicant	p-value	Correlation	Test
raw	0.0003334	0.78	Kendall's Tau
x1000c	0.0003334	0.78	Kendall's Tau
x700c	0.0003334	0.78	Kendall's Tau
maxbin	0.001591	0.71	Kendall's Tau

Fig1. Median num. mismatches per 100kbp, per replicant. (Potting all lines on one plot created hard-to-read spaghetti. GSA is zero, think it looks better without plotting it and just focusing on the filtered Fig2. Dispersion of num. mismatches per 100kbp for each replicant.

Duplication Ratio

Good discussion point on why this turned out to not be a good metric (low data size). Replicant	p-value	Correlation	Test
raw	0.02616	0.59	Kendall's Tau
x1000c	0.02616	0.59	Kendall's Tau
x700c	0.2059	0.34	Kendall's Tau
maxbin	NA	NA	Kendall's Tau

maxbin medians were all 1

Fig3. Dispersion of the duplication ratios per genome per replicant. B is misleading. The difference is minimal.

NGA50

No dice... Everything is NA, as there is too little data. Adding --min-contig 250 to metaquastmight salvage some data (default is set to 500), but there is no time + it would be forcing something out of the data. Further, contigs less than 500 is quite small..(?)

Genome Fraction

Replicant	p-value	Correlation	Test
gsa	5.511e-07	-0.96	Kendall's Tau
raw	0.04053	-0.49	Kendall's Tau
x1000c	0.03556	-0.49	Kendall's Tau
x700c	0.02638	-0.53	Kendall's Tau
maxbin	0.2183	-0.31	Kendall's Tau

Fig5. Dispersion (FIX THE LEGEND, COLORS DO NOT MATCH)

Fig6. Linear plot of the means

NA's

However, i believe it is more correct to set NA's to zero as we know all genomes have a set proportion. Then the plot looks like this...

ndreey / ghost-magnet

Analysing the data #70

Quick overview / initial idea

Correlation

Non-parametric

Parametric

Host depletion by bin removal

Assembly benchmark

Num mismatches per 100kbp

Duplication ratio

Misassemblies

Genome Fraction

NGA50

Evaluating if filtered have a difference.

While we wait for the filtered data

Bins to compare

CONCOCT contig length threshold

Assembly plots

Number of mismatches per 100kbp

Duplication Ratio

NGA50

Genome Fraction

NA's