ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Final approach #69

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Final approach before data analysis

Objective

The main objective of this study is to demonstrate a correlation between high levels of host contamination (HC) and poor assembly and binning performance. Additionally, the study aims to determine whether host bins can serve as a reference genome to reduce host contamination and improve assembly and binning performance. The proposed method involves assembling and binning the host-contaminated samples, concatenating the resulting host bins into a reference genome, filtering out reads that map to this reference, reassembling the remaining reads, and then binning again. The goal is to evaluate whether this approach is a valid way to improve assembly and binning performance.
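As a rough outline (not the final workflow), the per-sample loop could look like this, assuming MEGAHIT for assembly and CONCOCT for binning as used later in this issue; file names are placeholders.

    # 1) Assemble the host-contaminated reads and bin the contigs.
    megahit -1 sample_R1.fastq -2 sample_R2.fastq -o asm_raw
    # ... run CONCOCT on asm_raw/final.contigs.fa ...

    # 2) Concatenate the resulting host bins into a reference genome.
    cat host_bins/*.fa > bin_ref.fa

    # 3) Map the reads to bin_ref.fa and keep only the unmapped (putatively non-host)
    #    read pairs (bowtie2 + SAMtools + BEDtools; see the read-removal sketch further down).

    # 4) Reassemble the filtered reads, bin again, and re-evaluate with metaQUAST/AMBER.
    megahit -1 filt_R1.fastq -2 filt_R2.fastq -o asm_filt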

To determine if there is a correlation with HC, I have accumulated performance data over 11 different samples.

SampleID  Size (Gb)  Tot. reads  Host reads  HC (%)
00        1.84       12325434    0           0.0
01        1.92       12804618    3251606     25.4
02        1.95       13063198    5606170     42.9
03        1.95       13045134    7389790     56.6
04        1.98       13233904    8787688     66.4
05        1.95       13053776    9912918     75.9
06        1.98       13252530    10837938    81.8
07        1.98       13218802    11611980    87.8
08        1.99       13288036    12269324    92.3
090       1.99       13310662    12834170    96.4
095       1.99       13316552    13087900    98.3

For these samples, I will evaluate the following metrics:

Assembly

Current status

After determining, installing, and troubleshooting the different programs, there is now a working workflow to generate the data (although it is not yet automated).

Assembly

Each sample has been benchmarked against the gold standard assembly (GSA) with metaQUAST. The sequencing depth per sample and the relative proportions of the microbes have been held constant; the only thing that differs is the proportion of HC, which in turn reduces the effective sequencing depth for each individual endophyte. I believe enough data has been accumulated to draw statistically supported conclusions about whether a correlation exists.
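For reference, the metaQUAST runs take roughly this shape (a sketch; assembly file names, the reference directory, and thread count are placeholders):

    # Evaluate an assembly (and the GSA alongside it) against the known reference genomes.
    metaquast.py asm_raw/final.contigs.fa gsa.fasta \
        -r reference_genomes/ \
        -o metaquast_${hc_prefix} \
        -t 6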

Binning

A benchmark for each HC level has been set using the trimmed reads, the GSA, and the gold standard bin map (GSB). However, the benchmark was set using contigs >= 1000 bp. In hindsight, I wonder whether I should change this.
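For completeness, a sketch of how such a benchmark can be scored with AMBER, assuming the bin assignments have already been converted to the CAMI .binning format (file names and the label are placeholders):

    # Score a binning result against the gold standard bin map (GSB).
    amber.py -g gsb_gold_standard.binning \
        -l "CONCOCT_default" \
        -o amber_benchmark/ \
        concoct_bins.binning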

The number of contigs with length >= x can be found in the table below, and Figure 1 shows the distribution.

Threshold (bp)  Count    Proportion (%)
0               4957330  100
250             3353322  67.6
400             778146   15.7
500             370909   7.5
600             163248   3.3
700             77714    1.6
800             35012    0.7
900             16395    0.3
1000            7601     0.15

Figure 1: distribution of contig lengths.

The proportion of contigs shorter than 500 bp is roughly 92.5%. When I binned using the default threshold of 1000 bp, I discarded more than 99.8% of the contigs. This highlights the limitation of the ~2 Gb sequencing depth and raises the question of whether I should use a lower contig length threshold.
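The counts in the table can be reproduced with a short awk pass over the assembly (a sketch; the FASTA file name is a placeholder):

    # Print one length per contig, then count how many reach each threshold.
    awk '/^>/ { if (len) print len; len = 0; next }
               { len += length($0) }
         END   { if (len) print len }' gsa.fasta |
    awk 'BEGIN { n = split("0 250 400 500 600 700 800 900 1000", thr) }
               { total++; for (i = 1; i <= n; i++) if ($1 >= thr[i]) count[i]++ }
         END   { for (i = 1; i <= n; i++)
                     printf "%s\t%d\t%.2f%%\n", thr[i], count[i], 100 * count[i] / total }'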

Evaluating bin removal

Using CONCOCT with the default threshold (-l 1000) resulted in 69 bins, with P. zijinensis being the most abundant genome. Every bin had a purity of 1.00 except one, which had 0.961. These bins were concatenated into a single FASTA file, bin_ref.fa.
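One way to pull out and concatenate the bins, assuming CONCOCT's standard helper scripts and that the contigs were cut up beforehand (otherwise the merge step can be skipped); input FASTA and output paths are placeholders:

    # Turn the CONCOCT clustering into per-bin FASTA files.
    merge_cutup_clustering.py concoct_out/clustering_gt1000.csv > concoct_out/clustering_merged.csv
    mkdir -p concoct_out/fasta_bins
    extract_fasta_bins.py gsa.fasta concoct_out/clustering_merged.csv \
        --output_path concoct_out/fasta_bins

    # Concatenate every bin into the host reference used for read filtering.
    cat concoct_out/fasta_bins/*.fa > bin_ref.fa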

Assembly

It would be incorrect to remove the contigs of bin_ref.fa from the GSA, as that would not tell me how the assembly performed. Instead, I assembled the raw 095 reads (095_raw) and the filtered 095 reads (095_filt) using MEGAHIT. Using bowtie2, SAMtools, and BEDtools, I could remove the reads aligning to bin_ref.fa. With the GSA, 095_raw, and 095_filt, a true comparison can be made. Perhaps this should therefore be done for samples 06 to 095 (81-98% HC) to see whether a trend in performance gain exists.
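A sketch of that read-removal step; the exact flags are my assumption of how the unmapped pairs were kept, and file names are placeholders:

    # Index the concatenated host bins and map the raw 095 reads against them.
    bowtie2-build bin_ref.fa bin_ref_index
    bowtie2 -p 6 -x bin_ref_index -1 095_R1.fastq -2 095_R2.fastq -S 095_vs_binref.sam

    # Keep only pairs where both mates are unmapped (-f 12), i.e. putative non-host reads,
    # sort them by name, and write them back to FASTQ with BEDtools.
    samtools view -@ 6 -b -f 12 095_vs_binref.sam | samtools sort -n -@ 6 -o 095_unmapped.bam -
    bedtools bamtofastq -i 095_unmapped.bam -fq 095_filt_R1.fastq -fq2 095_filt_R2.fastq

    # Assemble both read sets with MEGAHIT for the comparison against the GSA.
    megahit -1 095_R1.fastq -2 095_R2.fastq -o asm_095_raw
    megahit -1 095_filt_R1.fastq -2 095_filt_R2.fastq -o asm_095_filt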

Binning

As AMBER needs a GSB, I cannot bin the 095_filt assembly, because its contig names do not correspond to the GSB. Therefore, a Python script was used to filter out all contigs belonging to bin_ref.fa from the GSA. Now that the binned contigs are removed, I can bin against it using the 095_filt reads... or can I?
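The filtering step boils down to something like this shell equivalent of the script (a sketch; it assumes the contig ID is the first whitespace-separated token of each header, and file names are placeholders):

    # Collect the contig IDs present in bin_ref.fa ...
    grep '^>' bin_ref.fa | sed 's/^>//; s/[[:space:]].*//' | sort -u > bin_ref_ids.txt

    # ... and keep only the GSA contigs whose ID is NOT in that list.
    awk 'NR == FNR { ids[$1]; next }
         /^>/      { keep = !(substr($1, 2) in ids) }
         keep' bin_ref_ids.txt gsa.fasta > gsa_filtered.fasta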

Binning

It was here I noticed that I could not use CONCOCT with the same settings as before. Because I had removed the >= 1000 bp contigs (they had all been binned into bin_ref.fa), CONCOCT could not determine the coverage at the default threshold. I therefore set --length_threshold 500. This resulted in a CONCOCT run that took 22 h 32 min on my 8-core, 16 GB RAM laptop and produced 390 new bins (all belonging to P. zijinensis), with purities ranging from 0.83 to 1 (mean 0.998).

As mentioned, this persuades me to re-evaluate my binning evaluation methodology.

ndreey commented 1 year ago

Binning evaluation 2.0

Let's take a deeper look into CONCOCT.

optional arguments:
  --coverage_file COVERAGE_FILE
                        specify the coverage file, containing a table where
                        each row correspond to a contig, and each column
                        correspond to a sample. The values are the average
                        coverage for this contig in that sample. All values
                        are separated with tabs.
  --composition_file COMPOSITION_FILE
                        specify the composition file, containing sequences in
                        fasta format. It is named the composition file since
                        it is used to calculate the kmer composition (the
                        genomic signature) of each contig.
  -l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
                        specify the sequence length threshold, contigs shorter
                        than this value will not be included. Defaults to
                        1000.
  -r READ_LENGTH, --read_length READ_LENGTH
                        specify read length for coverage, default 100
  -s SEED, --seed SEED  Specify an integer to use as seed for clustering. 0
                        gives a random seed, 1 is the default seed and any
                        other positive integer can be used. Other values give
                        ArgumentTypeError.

COVERAGE_FILE

The *_coverage_table.tsv is generated by concoct_coverage_table.py, which takes a BED file describing the contigs and a BAM file with the reads mapped to the assembly. Thus, the coverage information to be clustered is determined before the CONCOCT run.

    # Builds index of assembly
    bowtie2-build --threads 6 ${assembly} ${hc_prefix}_index

    # Mapping reads
    bowtie2 -p 6 -q \
        -x ${hc_prefix}_index \
        -1 output_R1.fastq \
        -2 output_R2.fastq \
        -S ${hc_prefix}_map.sam
    echo "$(date)    Bowtie2 Disengaged!"

    echo "$(date)    SAMtools Engaged!"    
    samtools view -@ 6 -b -S ${hc_prefix}_map.sam > ${hc_prefix}_map.bam
    echo "$(date)    SAMtools view complete" 
    samtools sort -@ 6 ${hc_prefix}_map.bam -o ${hc_prefix}_map_sorted.bam

Hence, I should be able to build the index from the filtered GSA and align the filtered reads against it to create the BAM file.
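The remaining steps for the coverage file would then be, roughly (this assumes the filtered GSA is cut up with CONCOCT's cut_up_fasta.py to get the BED file; fragment size and names are placeholders):

    # Cut up the filtered GSA into uniform fragments and write the corresponding BED file.
    cut_up_fasta.py gsa_filtered.fasta -c 10000 -o 0 --merge_last \
        -b ${hc_prefix}_contigs_10K.bed > ${hc_prefix}_contigs_10K.fa

    # Index the sorted BAM and build the coverage table from the BED + BAM.
    samtools index ${hc_prefix}_map_sorted.bam
    concoct_coverage_table.py ${hc_prefix}_contigs_10K.bed \
        ${hc_prefix}_map_sorted.bam > ${hc_prefix}_coverage_table.tsv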

COMPOSITION_FILE

The filtered GSA should be used here as well, since this is the file CONCOCT uses to calculate the k-mer composition (the genomic signature) of each contig.

LENGTH_THRESHOLD

As mentioned, setting it to 500 resulted in 390 bins with a mean purity of 99.8%, all belonging to the host. This suggests I should have set the benchmark threshold to 500 bp as well, to create a larger bin_ref.fa.

READ_LENGTH

After trimming, R1 and R2 have mean lengths of 90 and 68 bp, respectively. If I understand it correctly, the -r option is a way to normalize the data. From the paper:

To make the over- and underestimation more equal, I'll set this to the mean of the means, (90 + 68) / 2 = 79.

SEED

I will set the seed to 13371337.
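Putting these choices together, the planned CONCOCT call would look roughly like this (the composition file is the cut-up filtered GSA from the coverage sketch above; output basename and thread count are placeholders):

    concoct \
        --composition_file ${hc_prefix}_contigs_10K.fa \
        --coverage_file ${hc_prefix}_coverage_table.tsv \
        --length_threshold 500 \
        --read_length 79 \
        --seed 13371337 \
        --threads 6 \
        -b concoct_${hc_prefix}/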

ndreey commented 1 year ago

Analysis (overview)

Correlation

Determine whether a correlation exists between HC and each metric.

  1. Determine if the data is parametric or non-parametric.
  2. Do a correlation test.
  3. Do a significance test.

  Non-parametric

    • Spearman's rank correlation
    • Wilcoxon rank-sum test (Mann-Whitney U test)

  Parametric

    • Pearson's correlation
    • t-test

Host depletion by bin removal