ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Final approach #69

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Final approach before data analysis

Objective

The main objective of this study is to demonstrate a correlation between high levels of host contamination (HC) and poor assembly and binning performance. Additionally, the study aims to determine whether host bins can serve as a reference genome to reduce host contamination and improve assembly and binning performance. The proposed method involves assembling and binning the host-contaminated samples, concatenating the resulting host bins into a reference genome, filtering out reads that map to this reference, reassembling the remaining reads, and then binning again. The goal is to evaluate whether this approach is a valid way to improve assembly and binning performance.
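As a rough outline (not the final workflow), the per-sample loop could look like this, assuming MEGAHIT for assembly and CONCOCT for binning as used later in this issue; file names are placeholders.

    # 1) Assemble the host-contaminated reads and bin the contigs.
    megahit -1 sample_R1.fastq -2 sample_R2.fastq -o asm_raw
    # ... run CONCOCT on asm_raw/final.contigs.fa ...

    # 2) Concatenate the resulting host bins into a reference genome.
    cat host_bins/*.fa > bin_ref.fa

    # 3) Map the reads to bin_ref.fa and keep only the unmapped (putatively non-host)
    #    read pairs (bowtie2 + SAMtools + BEDtools; see the read-removal sketch further down).

    # 4) Reassemble the filtered reads, bin again, and re-evaluate with metaQUAST/AMBER.
    megahit -1 filt_R1.fastq -2 filt_R2.fastq -o asm_filt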

To determine if there is a correlation with HC, I have accumulated performance data over 11 different samples.

SampleID  Size (Gb)  Tot. reads  Host reads  HC (%)
00        1.84       12325434    0           0.0
01        1.92       12804618    3251606     25.4
02        1.95       13063198    5606170     42.9
03        1.95       13045134    7389790     56.6
04        1.98       13233904    8787688     66.4
05        1.95       13053776    9912918     75.9
06        1.98       13252530    10837938    81.8
07        1.98       13218802    11611980    87.8
08        1.99       13288036    12269324    92.3
090       1.99       13310662    12834170    96.4
095       1.99       13316552    13087900    98.3

For these samples, I will evaluate the following metrics:

Assembly

Current status

After determining, installing, and troubleshooting the different programs, there is now a working workflow to generate the data (although it is not yet automated).

Assembly

Each sample has been benchmarked against the gold standard assembly (GSA) with metaQUAST. The sequencing depth per sample and the relative proportions of the microbes have been held constant; the only thing that differs is the proportion of HC, which in turn reduces the effective sequencing depth for each individual endophyte. I believe enough data has been accumulated to draw statistically supported conclusions about whether a correlation exists.
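For reference, the metaQUAST runs take roughly this shape (a sketch; assembly file names, the reference directory, and thread count are placeholders):

    # Evaluate an assembly (and the GSA alongside it) against the known reference genomes.
    metaquast.py asm_raw/final.contigs.fa gsa.fasta \
        -r reference_genomes/ \
        -o metaquast_${hc_prefix} \
        -t 6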

Binning

A benchmark for each HC level has been set using the trimmed reads, the GSA, and the gold standard bin map (GSB). However, the benchmark was set using contigs >= 1000 bp. In hindsight, I wonder whether I should change this.
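For completeness, a sketch of how such a benchmark can be scored with AMBER, assuming the bin assignments have already been converted to the CAMI .binning format (file names and the label are placeholders):

    # Score a binning result against the gold standard bin map (GSB).
    amber.py -g gsb_gold_standard.binning \
        -l "CONCOCT_default" \
        -o amber_benchmark/ \
        concoct_bins.binning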

The number of contigs with length >= x can be found in the table below, and Figure 1 shows the distribution.

Threshold (bp)  Count    Proportion (%)
0               4957330  100
250             3353322  67.6
400             778146   15.7
500             370909   7.5
600             163248   3.3
700             77714    1.6
800             35012    0.7
900             16395    0.3
1000            7601     0.15

Figure 1: distribution of contig lengths.

The proportion of contigs shorter than 500 bp is roughly 92.5%. When I binned using the default threshold of 1000 bp, I discarded more than 99.8% of the contigs. This highlights the limitation of the ~2 Gb sequencing depth and raises the question of whether I should use a lower contig length threshold.
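The counts in the table can be reproduced with a short awk pass over the assembly (a sketch; the FASTA file name is a placeholder):

    # Print one length per contig, then count how many reach each threshold.
    awk '/^>/ { if (len) print len; len = 0; next }
               { len += length($0) }
         END   { if (len) print len }' gsa.fasta |
    awk 'BEGIN { n = split("0 250 400 500 600 700 800 900 1000", thr) }
               { total++; for (i = 1; i <= n; i++) if ($1 >= thr[i]) count[i]++ }
         END   { for (i = 1; i <= n; i++)
                     printf "%s\t%d\t%.2f%%\n", thr[i], count[i], 100 * count[i] / total }'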

Evaluating bin removal

Using CONCOCT with the default threshold (-l 1000) resulted in 69 bins, with P. zijinensis being the most abundant genome. Every bin had a purity of 1.00 except one, which had 0.961. These bins were concatenated into a single FASTA file, bin_ref.fa.
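One way to pull out and concatenate the bins, assuming CONCOCT's standard helper scripts and that the contigs were cut up beforehand (otherwise the merge step can be skipped); input FASTA and output paths are placeholders:

    # Turn the CONCOCT clustering into per-bin FASTA files.
    merge_cutup_clustering.py concoct_out/clustering_gt1000.csv > concoct_out/clustering_merged.csv
    mkdir -p concoct_out/fasta_bins
    extract_fasta_bins.py gsa.fasta concoct_out/clustering_merged.csv \
        --output_path concoct_out/fasta_bins

    # Concatenate every bin into the host reference used for read filtering.
    cat concoct_out/fasta_bins/*.fa > bin_ref.fa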

Assembly

It would be incorrect to remove the contigs of bin_ref.fa from the GSA, as that would not tell me how the assembly performed. Instead, I assembled the raw 095 reads (095_raw) and the filtered 095 reads (095_filt) using MEGAHIT. Using bowtie2, SAMtools, and BEDtools, I could remove the reads aligning to bin_ref.fa. With the GSA, 095_raw, and 095_filt, a true comparison can be made. Perhaps this should therefore be done for samples 06 to 095 (81-98% HC) to see whether a trend in performance gain exists.
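A sketch of that read-removal step; the exact flags are my assumption of how the unmapped pairs were kept, and file names are placeholders:

    # Index the concatenated host bins and map the raw 095 reads against them.
    bowtie2-build bin_ref.fa bin_ref_index
    bowtie2 -p 6 -x bin_ref_index -1 095_R1.fastq -2 095_R2.fastq -S 095_vs_binref.sam

    # Keep only pairs where both mates are unmapped (-f 12), i.e. putative non-host reads,
    # sort them by name, and write them back to FASTQ with BEDtools.
    samtools view -@ 6 -b -f 12 095_vs_binref.sam | samtools sort -n -@ 6 -o 095_unmapped.bam -
    bedtools bamtofastq -i 095_unmapped.bam -fq 095_filt_R1.fastq -fq2 095_filt_R2.fastq

    # Assemble both read sets with MEGAHIT for the comparison against the GSA.
    megahit -1 095_R1.fastq -2 095_R2.fastq -o asm_095_raw
    megahit -1 095_filt_R1.fastq -2 095_filt_R2.fastq -o asm_095_filt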

Binning

As AMBER needs a GSB, I cannot bin the 095_filt assembly, because its contig names do not correspond to the GSB. Therefore, a Python script was used to filter out all contigs belonging to bin_ref.fa from the GSA. Now that the binned contigs are removed, I can bin against it using the 095_filt reads... or can I?
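The filtering step boils down to something like this shell equivalent of the script (a sketch; it assumes the contig ID is the first whitespace-separated token of each header, and file names are placeholders):

    # Collect the contig IDs present in bin_ref.fa ...
    grep '^>' bin_ref.fa | sed 's/^>//; s/[[:space:]].*//' | sort -u > bin_ref_ids.txt

    # ... and keep only the GSA contigs whose ID is NOT in that list.
    awk 'NR == FNR { ids[$1]; next }
         /^>/      { keep = !(substr($1, 2) in ids) }
         keep' bin_ref_ids.txt gsa.fasta > gsa_filtered.fasta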

Binning

It was here I noticed that I could not use CONCOCT with the same settings as before. Because I had removed the >= 1000 bp contigs (they had all been binned into bin_ref.fa), CONCOCT could not determine the coverage at the default threshold. I therefore set --length_threshold 500. This resulted in a CONCOCT run that took 22 h 32 min on my 8-core, 16 GB RAM laptop and produced 390 new bins (all belonging to P. zijinensis), with purities ranging from 0.83 to 1 (mean 0.998).

As mentioned, this persuades me to re-evaluate my binning evaluation methodology.

ndreey commented 1 year ago

Binning evaluation 2.0

Let's take a deeper look into CONCOCT.

optional arguments:
  --coverage_file COVERAGE_FILE
                        specify the coverage file, containing a table where
                        each row correspond to a contig, and each column
                        correspond to a sample. The values are the average
                        coverage for this contig in that sample. All values
                        are separated with tabs.
  --composition_file COMPOSITION_FILE
                        specify the composition file, containing sequences in
                        fasta format. It is named the composition file since
                        it is used to calculate the kmer composition (the
                        genomic signature) of each contig.
  -l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
                        specify the sequence length threshold, contigs shorter
                        than this value will not be included. Defaults to
                        1000.
  -r READ_LENGTH, --read_length READ_LENGTH
                        specify read length for coverage, default 100
  -s SEED, --seed SEED  Specify an integer to use as seed for clustering. 0
                        gives a random seed, 1 is the default seed and any
                        other positive integer can be used. Other values give
                        ArgumentTypeError.

COVERAGE_FILE

The *_coverage_table.tsv is generated by concoct_coverage_table.py, which takes a BED file describing the contigs and a BAM file with the reads mapped to the assembly. Thus, the coverage information to be clustered is determined before the CONCOCT run.

    # Builds index of assembly
    bowtie2-build --threads 6 ${assembly} ${hc_prefix}_index

    # Mapping reads
    bowtie2 -p 6 -q \
        -x ${hc_prefix}_index \
        -1 output_R1.fastq \
        -2 output_R2.fastq \
        -S ${hc_prefix}_map.sam
    echo "$(date)    Bowtie2 Disengaged!"

    echo "$(date)    SAMtools Engaged!"    
    samtools view -@ 6 -b -S ${hc_prefix}_map.sam > ${hc_prefix}_map.bam
    echo "$(date)    SAMtools view complete" 
    samtools sort -@ 6 ${hc_prefix}_map.bam -o ${hc_prefix}_map_sorted.bam

Hence, I should be able to build the index from the filtered GSA and align the filtered reads against it to create the BAM file.
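The remaining steps for the coverage file would then be, roughly (this assumes the filtered GSA is cut up with CONCOCT's cut_up_fasta.py to get the BED file; fragment size and names are placeholders):

    # Cut up the filtered GSA into uniform fragments and write the corresponding BED file.
    cut_up_fasta.py gsa_filtered.fasta -c 10000 -o 0 --merge_last \
        -b ${hc_prefix}_contigs_10K.bed > ${hc_prefix}_contigs_10K.fa

    # Index the sorted BAM and build the coverage table from the BED + BAM.
    samtools index ${hc_prefix}_map_sorted.bam
    concoct_coverage_table.py ${hc_prefix}_contigs_10K.bed \
        ${hc_prefix}_map_sorted.bam > ${hc_prefix}_coverage_table.tsv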

COMPOSITION_FILE

The filtered GSA should be used here as well, since this is the file CONCOCT uses to calculate the k-mer composition (the genomic signature) of each contig.

LENGTH_THRESHOLD

As mentioned, setting it to 500 resulted in 390 bins with a mean purity of 99.8%, all belonging to the host. This suggests I should have set the benchmark threshold to 500 bp as well, to create a larger bin_ref.fa.

READ_LENGTH

After trimming, R1 and R2 have mean lengths of 90 and 68 bp, respectively. If I understand it correctly, the -r option is a way to normalize the data. From the paper:

To make the over- and underestimation more equal, I'll set this to the mean of the means, (90 + 68) / 2 = 79.

SEED

I will set the seed to 13371337.
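Putting these choices together, the planned CONCOCT call would look roughly like this (the composition file is the cut-up filtered GSA from the coverage sketch above; output basename and thread count are placeholders):

    concoct \
        --composition_file ${hc_prefix}_contigs_10K.fa \
        --coverage_file ${hc_prefix}_coverage_table.tsv \
        --length_threshold 500 \
        --read_length 79 \
        --seed 13371337 \
        --threads 6 \
        -b concoct_${hc_prefix}/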

ndreey commented 1 year ago

Analysis (overview)

Correlation

Determine whether a correlation exists between HC and each metric.

  1. Determine if the data is parametric or non-parametric.
  2. Do a correlation test.
  3. Do a significance test.

  Non-parametric

    • Spearman's rank correlation
    • Wilcoxon rank-sum test (Mann-Whitney U test)

  Parametric

    • Pearson's correlation
    • t-test

Host depletion by bin removal