Bin removal and evaluation

Assembly

The bins that are to be removed are to be concatenated and then used as a reference genome.

Concatenate the bins to a separate .fa file.
- cat fasta_bins/*.fa > all_bins.fa
Create a hash index of the bin reference.
- bowtie2-build --threads 6 all_bins.fa bin_index

Map and align the reads to the bin_index.

  bowtie2 -p 6 -q \
     -x bin}_index \
     -1 ${reads}/${hc_prefix}_trim_R1.fq.gz \
     -2 ${reads}/${hc_prefix}_trim_R2.fq.gz \
     -S ${hc_prefix}_map.sam

Generate a .bam file of the SAM
- samtools view -@ 6 -b -S ${hc_prefix}_map.sam > ${hc_prefix}_map.bam
Keep the reads that did not align using the -f 4 flag.
- samtools view -h -f 4 ${hc_prefix}_map.bam > ${hc_prefix}_filtered_map.bam
Use bedtools to separate the R1 and R2 reads into separate FASTQ files.
- bedtools bamtofastq -i ${hc_prefix}_filtered_map.bam -fq output_R1.fastq -fq2 output_R2.fastq

I can then assemble and re-run the MABQ using the host-depleted reads. Originally, 095_trim_R1.fq.gz had 4521544 reads. After host bin filtering, we now have 2995189 reads,

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "@" | wc -l
4521544

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "Chr" | wc -l
4443896

[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "@" | wc -l
2995189

[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "Chr" | wc -l
2917562

We can clearly see that the reads from two different microbes are withheld (some reads have been taken, although).

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "tig" | wc -l
11743
[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "tig" | wc -l
11733

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "scaffold" | wc -l
39677
[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "scaffold" | wc -l
39672

Binning

I cannot use the new assembly as the contigs wont be able to map to the gsb. Therefore, I created a Python script to remove the bins from the gsa using the Biopython package SeqIO.

from Bio import SeqIO

# set filenames
bin_headers_file = "bin_headers.txt"
input_file = "095_gsa.fasta"
output_file = "filtered.fasta"

bin_header = set()

with open(bin_headers_file, "r") as bin_handle:
    for line in bin_handle:
        header = line.strip()[1:]  # remove the '>' character from the beginning of the header
        bin_header.add(header)

with open(output_file, "w") as out_handle:
    for record in SeqIO.parse(input_file, "fasta"):
        if record.id not in bin_header:
            SeqIO.write(record, out_handle, "fasta")
        else:
            print("Excluding sequence with ID:", record.id)

With the filtered gsa, bin_filtered_gsa.fasta.gz i can run concoct_run.sh using the filtered reads output_R1.fastq and output_R2.fastq together with the bin_filtered_gsa.fasta.gz.

This will show if the removal of bins increased the binning. However, because only 7259 reads were removed... I think that multiple bin filtration steps will be needed to see a significant change

ndreey / ghost-magnet

Bin removal #68

Bin removal and evaluation

Assembly

Binning