ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

Analysing the data #70

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Quick overview / initial idea

Correlation

Determine if there exists a correlation with HC and metric.

  1. Determine if data is parametric or non-parametric.
  2. Do a correlation test.
  3. Do a significance test.

    Non-parametric

    • Spearman's rank correlation
    • Wilcoxon rank-sum test or Mann-Whitney U test

      Parametric

    • Pearsons correlation
    • t-test

Host depletion by bin removal

In the best-possible situation, we were able to remove all host DNA post-sequencing and assembly... do we see a trend where performance is increased?

ndreey commented 1 year ago

Assembly benchmark

Num mismatches per 100kbp

There were no mismatches for the genomes of the endophytes, only in the host genome. Indicating that for the gsa of the endophytes, hc levels do not correlate with less or more mismatches.

As there is no host genome in the 00 samples, the 0.0 value for the host genome should be excluded from the correlation calculation. Therefore, I will use sample 01:095.

To determine if the data is parametric or not I used the Shapiro-Wilk test as the sample size is low (n=10, n < 50)

Shapiro-Wilk normality test

data:  host_mm
W = 0.91374, p-value = 0.3077

The test concluded that there is not sufficient evidence to reject the null hypothesis being that the data is normally distributed. Therefore, I will use a parametric test to determine if there exists a statistically significant correlation between the number of mismatches per 100kbp and increased HC levels.

Pearson's product-moment correlation

data:  hc[2:11] and host_mm
t = -4.0165, df = 8, p-value = 0.00386
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9554058 -0.3874369
sample estimates:
       cor 
-0.8176141

We see that there is indeed a negative correlation. The number of mismatches decreases with the host genome. In summary, the number of mismatches has a statistically significant negative correlation (p = 0.00386, r = -0.82) with higher HC levels (FOR HOST). This is understandable as we get more host genome data with higher HC. However, overall, GSA shows no correlation with HC levels. image

Duplication ratio

It is the ratio of how many contigs map to multiple sites of the genome. Thus, the closer you are to 1, the better. Higher than 1 indicates that contigs map to multiple sites of the genome.

What we see is that contigs map to multiple places on the pl_vi_unk genomes. pl_vi_unk are the plasmids, virus, and unknown plasmids. In other words, quite small genomes, so it is not odd that they are getting higher duplication ratio. image

Here is the duplication ratio if we exclude the pl_vi_unk image

What we can conclude is that the median duplication ratio for the gsa for each sample is 1.

image

Misassemblies

There were zero misassemblies for the GSA except for P. zijnensis (host) genome in samples 03, 07, and 08. image

Genome Fraction

The number of aligned bp divided by the reference genome size. Good indicator of assembly quality. image

What is clear from the figure is that there is a negative trend with higher HC. The *Kendall'srank correlation tau test confirms the trend. Kendal tau seems to be more advocated than Spearman**, but I will have to go in deeper to answer why.

> cor.test(hc, median, method ="kendall", exact = T)

    Kendall's rank correlation tau

data:  hc and median
T = 1, p-value = 5.511e-07
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
-0.9636364

The reason for this is that the proportion of endophytic reads are diminishing with higher HC proportions. Thus, we have the negative correlation. If we look at only the host genome, P. zijinensis a perfect positive correlation is observed.

> cor.test(hc, as.numeric(host), method = "kendall")

    Kendall's rank correlation tau

data:  hc and as.numeric(host)
T = 45, p-value = 5.511e-07
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau 
  1

NGA50

The NGA50 is a metric used to evaluate the quality of a genome assembly that takes into account gaps in the reference genome. It is the length of the smallest set of aligned contigs (or scaffolds) that, when combined and including gaps, cover at least 50% of the size of the reference genome. image

As the figure shows, a strong negative correlation exists.

> cor.test(hc, median, method ="kendall", exact = T)

    Kendall's rank correlation tau

data:  hc and median
T = 1, p-value = 0.0003968
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
-0.9285714 
ndreey commented 1 year ago

Evaluating if filtered have a difference.

Firstly i have to assemble the raw and filtered reads. Then run them through metaQUAST. At the moment, we have:

Ideally, we maybe can find that the correlations are opposite to GSA

While we wait for the filtered data

image

There is a positive correlation with increased HC and num mismatches per 100kbp.

Kendall's rank correlation tau

data:  hc and raw_median
T = 49, p-value = 0.0003334
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.7818182 

All values were NA for raw... image

Believe its the lack of reads... image

Significant strong negative corr for GSA p-value = 5.511e-07 r = -0.9636364 Barley sign. weak negative corr for raw p-value = 0.04053 r = -0.4909091 image

ndreey commented 1 year ago

Bins to compare

I need to run concoct to generate bins that i can run through BUSCO(??) to see if there is a difference.

CONCOCT contig length threshold

Here we see the distribution of contig lengths. x00 is the raw sample that all will use, as it has not been tampered with. image

Because of the limitation to CPU power, i will bin based on 700bp length. image

ndreey commented 1 year ago

Assembly plots

Data analysis of gsa, raw, x1000c, x700c, and maxbin. Evaluating if post-sequencing reads removal improves assembly quality.

Fig1. Median num. mismatches per 100kbp, per replicant. (Potting all lines on one plot created hard-to-read spaghetti. GSA is zero, think it looks better without plotting it and just focusing on the filtered image Fig2. Dispersion of num. mismatches per 100kbp for each replicant. image

Duplication Ratio

Good discussion point on why this turned out to not be a good metric (low data size). Replicant p-value Correlation Test
raw 0.02616 0.59 Kendall's Tau
x1000c 0.02616 0.59 Kendall's Tau
x700c 0.2059 0.34 Kendall's Tau
maxbin NA NA Kendall's Tau

maxbin medians were all 1

Fig3. Dispersion of the duplication ratios per genome per replicant. B is misleading. The difference is minimal. image

NGA50

No dice... Everything is NA, as there is too little data. Adding --min-contig 250 to metaquastmight salvage some data (default is set to 500), but there is no time + it would be forcing something out of the data. Further, contigs less than 500 is quite small..(?)

Genome Fraction

Replicant p-value Correlation Test
gsa 5.511e-07 -0.96 Kendall's Tau
raw 0.04053 -0.49 Kendall's Tau
x1000c 0.03556 -0.49 Kendall's Tau
x700c 0.02638 -0.53 Kendall's Tau
maxbin 0.2183 -0.31 Kendall's Tau

Fig5. Dispersion (FIX THE LEGEND, COLORS DO NOT MATCH) image

Fig6. Linear plot of the means image

NA's

However, i believe it is more correct to set NA's to zero as we know all genomes have a set proportion. Then the plot looks like this... image

image