ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Re-run: Benchmark #65

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Complete re-run of benchmark

Because of #64, where the Platanthera zijinensis genome FASTA file was formatted with 100 nucleotides per line, the gold standard assembly (gsa.fasta.gz) was generated with N's for every contig belonging to P. zijinensis. This makes sense in hindsight, as we noticed in the metaQUAST reports that no reads matched the gold standard assembly. This was first believed to be due to the low sequencing depth, but it turned out that A, T, G, and C bases simply could not match N's. After reformatting the genome FASTA file and generating a new fasta.fai, a complete re-run of all steps was started.
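A quick sanity check for this kind of problem is to measure the fraction of N's per contig in the gold standard assembly before running anything downstream. The following is a minimal sketch, not the check actually used in this project; the function names are hypothetical and it assumes a plain FASTA layout (header lines starting with `>`).

```python
import gzip
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Record id is the first whitespace-separated token of the header.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_fraction_per_record(lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of N/n bases for every record."""
    fractions = {}
    for name, seq in read_fasta(lines):
        n_count = seq.upper().count("N")
        fractions[name] = n_count / len(seq) if seq else 0.0
    return fractions


def open_text(path: str):
    """Open a FASTA file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

With a file at hand, something like `with open_text("gsa.fasta.gz") as fh: print(n_fraction_per_record(fh))` would list contigs that are entirely N's (fraction 1.0); the path is only an example.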

CAMISIM

Because pooled_assembly cannot be disabled, I am limited in how much I can simulate on my Windows laptop (16 GB RAM, 8-core AMD Ryzen 7 4700U, 200 GB of available SSD space).

CAMISIM output

Quality Control and Processing

As the same error profile was used, the FastQC analysis was equivalent to the previous run, so the same trim parameters were used (fastp_trim.sh).

MEGAHIT

Runtime for assembly.

JobID       JobName  Partition  AllocCPUS  State      ExitCode  Elapsed
1159511_1   megahit  cpuqueue   6          COMPLETED  0:0       00:45:29
1159511_2   megahit  cpuqueue   6          COMPLETED  0:0       01:09:57
1159511_3   megahit  cpuqueue   6          COMPLETED  0:0       01:01:28
1159511_4   megahit  cpuqueue   6          COMPLETED  0:0       00:40:24
1159511_5   megahit  cpuqueue   6          COMPLETED  0:0       00:45:22
1159511_6   megahit  cpuqueue   6          COMPLETED  0:0       00:43:32
1159511_7   megahit  cpuqueue   6          COMPLETED  0:0       00:42:54
1159511_8   megahit  cpuqueue   6          COMPLETED  0:0       00:58:10
1159511_9   megahit  cpuqueue   6          COMPLETED  0:0       00:43:27
1159511_10  megahit  cpuqueue   6          COMPLETED  0:0       00:55:30
1159511_11  megahit  cpuqueue   6          COMPLETED  0:0       00:55:21
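To summarize the assembly runtimes, the Elapsed column can be parsed and aggregated. This is a small sketch using only the values listed above; it assumes the plain HH:MM:SS form shown here and does not handle the day prefix that longer jobs may carry.

```python
from datetime import timedelta
from typing import List, Tuple

# Elapsed values copied from the job table, in job order (_1 to _11).
ELAPSED = [
    "00:45:29", "01:09:57", "01:01:28", "00:40:24", "00:45:22", "00:43:32",
    "00:42:54", "00:58:10", "00:43:27", "00:55:30", "00:55:21",
]


def parse_elapsed(text: str) -> timedelta:
    """Parse an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def summarize(elapsed: List[str]) -> Tuple[timedelta, timedelta, timedelta, timedelta]:
    """Return (total, mean, shortest, longest) over the elapsed strings."""
    durations = [parse_elapsed(e) for e in elapsed]
    total = sum(durations, timedelta())
    return total, total / len(durations), min(durations), max(durations)


if __name__ == "__main__":
    total, mean, shortest, longest = summarize(ELAPSED)
    print(f"total={total} mean={mean} shortest={shortest} longest={longest}")
```

For the eleven jobs listed, the summed wall-clock time is 9:21:34, with individual jobs ranging from 0:40:24 to 1:09:57.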

metaQUAST

Queued 2023-04-28 09:15. Benchmarking both the assemblies and the gsa.

CONCOCT

The gold standard binning benchmark took 8 h 32 min (run on my PC). CONCOCT failed to bin the 06, 07, and 090 samples. In the previous run, where all P. zijinensis contigs were N's, the 095 binning resulted in 2 bins: one belonging to P. zijinensis and the other to Ceratobasidium spp. With the issue resolved, CONCOCT produced 69 bins.

AMBER

With the issue resolved, CONCOCT produced 69 bins, with P. zijinensis as the most abundant genome in each. In fact, every bin had a purity of 1.00.
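For reference, purity of a bin can be understood as the share of the bin that comes from its most abundant genome. The sketch below is a simplified illustration weighted by contig length, not AMBER's own implementation; the function name and the input layout are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Tuple


def bin_purity(
    contigs: Dict[str, Tuple[str, str, int]]
) -> Dict[str, Tuple[str, float]]:
    """Compute the dominant genome and purity for every bin.

    `contigs` maps contig id -> (bin id, true genome id, length in bp).
    Returns bin id -> (most abundant genome, purity in [0, 1]).
    """
    # Sum base pairs per (bin, genome).
    per_bin = defaultdict(lambda: defaultdict(int))
    for bin_id, genome_id, length in contigs.values():
        per_bin[bin_id][genome_id] += length

    result = {}
    for bin_id, genome_bp in per_bin.items():
        top_genome, top_bp = max(genome_bp.items(), key=lambda kv: kv[1])
        result[bin_id] = (top_genome, top_bp / sum(genome_bp.values()))
    return result


if __name__ == "__main__":
    # Toy data only, not results from this benchmark.
    toy = {
        "c1": ("bin_1", "host", 300),
        "c2": ("bin_1", "fungus", 100),
        "c3": ("bin_2", "host", 500),
    }
    print(bin_purity(toy))
```

A purity of 1.00 for a bin means every contig in it maps back to a single genome, which is the situation reported for all 69 bins.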

This raises the question of which and how many of the bins should be used as a reference to remove host contamination.