ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

Bin removal #68

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Bin removal and evaluation

Assembly

The bins that are to be removed are to be concatenated and then used as a reference genome.

  1. Concatenate the bins to a separate .fa file.
    • cat fasta_bins/*.fa > all_bins.fa
  2. Create a hash index of the bin reference.
    • bowtie2-build --threads 6 all_bins.fa bin_index
  3. Map and align the reads to the bin_index.
    •   bowtie2 -p 6 -q \
           -x bin}_index \
           -1 ${reads}/${hc_prefix}_trim_R1.fq.gz \
           -2 ${reads}/${hc_prefix}_trim_R2.fq.gz \
           -S ${hc_prefix}_map.sam
  4. Generate a .bam file of the SAM
    • samtools view -@ 6 -b -S ${hc_prefix}_map.sam > ${hc_prefix}_map.bam
  5. Keep the reads that did not align using the -f 4 flag.
    • samtools view -h -f 4 ${hc_prefix}_map.bam > ${hc_prefix}_filtered_map.bam
  6. Use bedtools to separate the R1 and R2 reads into separate FASTQ files.
    • bedtools bamtofastq -i ${hc_prefix}_filtered_map.bam -fq output_R1.fastq -fq2 output_R2.fastq

I can then assemble and re-run the MABQ using the host-depleted reads. Originally, 095_trim_R1.fq.gz had 4521544 reads. After host bin filtering, we now have 2995189 reads,

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "@" | wc -l
4521544

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "Chr" | wc -l
4443896

[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "@" | wc -l
2995189

[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "Chr" | wc -l
2917562

We can clearly see that the reads from two different microbes are withheld (some reads have been taken, although).

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "tig" | wc -l
11743
[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "tig" | wc -l
11733

[gfr152@mjolnirhead01fl gfr152]$ zcat bsc_thesis/data/subsample/reads/01_trimmed/095_trim_R1.fq.gz | grep "scaffold" | wc -l
39677
[gfr152@mjolnirhead01fl gfr152]$ cat bin_remove/output_R1.fastq | grep "scaffold" | wc -l
39672

Binning

I cannot use the new assembly as the contigs wont be able to map to the gsb. Therefore, I created a Python script to remove the bins from the gsa using the Biopython package SeqIO.

from Bio import SeqIO

# set filenames
bin_headers_file = "bin_headers.txt"
input_file = "095_gsa.fasta"
output_file = "filtered.fasta"

bin_header = set()

with open(bin_headers_file, "r") as bin_handle:
    for line in bin_handle:
        header = line.strip()[1:]  # remove the '>' character from the beginning of the header
        bin_header.add(header)

with open(output_file, "w") as out_handle:
    for record in SeqIO.parse(input_file, "fasta"):
        if record.id not in bin_header:
            SeqIO.write(record, out_handle, "fasta")
        else:
            print("Excluding sequence with ID:", record.id)

With the filtered gsa, bin_filtered_gsa.fasta.gz i can run concoct_run.sh using the filtered reads output_R1.fastq and output_R2.fastq together with the bin_filtered_gsa.fasta.gz.

This will show if the removal of bins increased the binning. However, because only 7259 reads were removed... I think that multiple bin filtration steps will be needed to see a significant change

ndreey commented 1 year ago

With CONCOCT i was able to bin based on contig lengths of 1000 and 700. These generated different amount of bins. To get a perfect bin removal i removed all host reads using the gsb. Here we see how many reads are after filtering, starting with 1000c, 700c, and lastly the perfect bin removal. Maybe, we can see a trend of imporvment

[gfr152@mjolnirhead01fl bin_filter]$ cat bin_refs/1000c_bin/reads/06_filt_R1.fastq | grep "Chr" | wc -l
2897406

[gfr152@mjolnirhead01fl bin_filter]$ cat bin_refs/700c_bin/06_filt_R1.fastq | grep "Chr" | wc -l
1927245

[gfr152@mjolnirhead01fl bin_filter]$ cat bin_refs/max_filt/reads/06_filt_R1.fastq | grep "Chr" | wc -l
1149