Complete re-run of benchmark

Because of #64, where the Platanthera zijinensis genome FASTA was formatted with 100 nucleotides per line, the gold standard assembly (gsa.fasta.gz) contained only N's for every contig belonging to P. zijinensis. This explains why the metaQUAST reports showed no reads matching the gold standard assembly. It was first believed to be due to the low sequencing depth, but it turned out that A, T, G, and C could not match N's. After reformatting the genome FASTA file and generating a new fasta.fai, a complete re-run of all steps was started.
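A quick sanity check for this failure mode is to measure the fraction of N bases per contig before benchmarking. The sketch below is one possible way to do that with awk; the function name n_fraction is illustrative and not part of the original workflow, and it assumes an uncompressed FASTA (or a decompressed stream on stdin).

```shell
# n_fraction FILE...: for each FASTA record print
#   name <TAB> sequence length <TAB> fraction of N/n bases
# Pass "-" to read from stdin, e.g. zcat gsa.fasta.gz | n_fraction -
n_fraction() {
  awk '
    # Print the summary line for the record collected so far.
    function emit() {
      if (name != "") printf "%s\t%d\t%.4f\n", name, len, (len ? n / len : 0)
    }
    /^>/ { emit(); name = substr($1, 2); len = 0; n = 0; next }
    {
      seq = $0
      len += length(seq)
      n += gsub(/[Nn]/, "", seq)   # gsub returns the number of replacements
    }
    END { emit() }
  ' "$@"
}
```

A contig with a fraction of 1.0000 consists entirely of N's, which is the symptom described above.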

CAMISIM

Because the pooled_assembly step cannot be disabled, I am limited in how much I can simulate on my Windows laptop (16 GB RAM, 8-core AMD Ryzen 7 4700U, 200 GB of available SSD space).

2 gigabases of data will be simulated per sample: 1 Gb forward and 1 Gb reverse.
I tried changing the config after the run had started to manually disable the pooled_assembly step, but it did not work.
One sample took 1 h to generate.
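To put the 2 Gb per sample into perspective, the number of read pairs can be estimated from the read length. The read length of 150 bp used below is an assumption for illustration only (the actual value depends on the error profile configured in CAMISIM), and read_pairs is a hypothetical helper.

```shell
# read_pairs TOTAL_BP READ_LEN: read pairs needed to reach TOTAL_BP
# when each pair contributes 2 * READ_LEN bases (integer division).
read_pairs() {
  local total_bp=$1 read_len=$2
  echo $(( total_bp / (2 * read_len) ))
}

# 2 Gb per sample; 150 bp reads is an ASSUMPTION, not taken from the config.
read_pairs 2000000000 150   # prints 6666666
```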

CAMISIM output

bunch_up.sh was placed in platanthera_mock/ and executed, generating *_all_R*.fq.gz. These are FASTQ files that hold all R1 reads and all R2 reads in separate files.
I then ran mkdir -p reads/00_raw reads/01_trimmed, followed by the command below:
- find . -name '*_all_*' -type f -exec mv {} reads/00_raw/ \;
I then manually renamed the files in *_output/contigs to the hc prefix and moved them to gold_standards. I also added .gz to the gsa_mapping.tsv files. All of this can be automated.
- mkdir -p gold_standards/gsa gold_standards/gsb
- mv gold_standards/*mapping* gold_standards/gsb
- mv gold_standards/*fasta* gold_standards/gsa
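The manual steps above could be automated roughly as follows. This is only a sketch under stated assumptions: the function name organize_camisim is hypothetical, the prefix is derived from the name of each *_output directory (the actual hc naming scheme is not spelled out here), and files are moved straight into gsa/ and gsb/ rather than via gold_standards/ first.

```shell
# organize_camisim ROOT: sort CAMISIM outputs under ROOT into reads/ and gold_standards/.
organize_camisim() {
  local root=$1 dir prefix f name
  mkdir -p "$root/reads/00_raw" "$root/reads/01_trimmed" \
           "$root/gold_standards/gsa" "$root/gold_standards/gsb"

  # Move the pooled read files written by bunch_up.sh; skip reads/ itself.
  find "$root" -path "$root/reads" -prune -o \
       -name '*_all_*' -type f -exec mv {} "$root/reads/00_raw/" \;

  for dir in "$root"/*_output; do
    [ -d "$dir/contigs" ] || continue
    # ASSUMPTION: the sample prefix is the directory name without "_output".
    prefix=$(basename "$dir" _output)

    for f in "$dir"/contigs/*fasta*; do
      [ -e "$f" ] || continue
      mv "$f" "$root/gold_standards/gsa/${prefix}_$(basename "$f")"
    done

    for f in "$dir"/contigs/*mapping*; do
      [ -e "$f" ] || continue
      name=$(basename "$f")
      # Add the .gz suffix only when it is missing.
      case $name in *.gz) ;; *) name="$name.gz" ;; esac
      mv "$f" "$root/gold_standards/gsb/${prefix}_$name"
    done
  done
}
```

It would be called as organize_camisim . from the directory that holds the *_output folders. Files with identical names in different subdirectories would overwrite each other in reads/00_raw, just as with the find command above.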

Quality Control and Processing

Because the same error profile was used, the FastQC results were equivalent to those of the previous run, so the same trimming parameters were used (fastp_trim.sh).

I activated the environment with conda activate qc and then ran fastqc and fastp.

MEGAHIT

Runtime for assembly.

JobID	JobName	Partition	AllocCPUS	State	ExitCode	Elapsed
1159511_1	megahit	cpuqueue	6	COMPLETED	0:0	00:45:29
1159511_2	megahit	cpuqueue	6	COMPLETED	0:0	01:09:57
1159511_3	megahit	cpuqueue	6	COMPLETED	0:0	01:01:28
1159511_4	megahit	cpuqueue	6	COMPLETED	0:0	00:40:24
1159511_5	megahit	cpuqueue	6	COMPLETED	0:0	00:45:22
1159511_6	megahit	cpuqueue	6	COMPLETED	0:0	00:43:32
1159511_7	megahit	cpuqueue	6	COMPLETED	0:0	00:42:54
1159511_8	megahit	cpuqueue	6	COMPLETED	0:0	00:58:10
1159511_9	megahit	cpuqueue	6	COMPLETED	0:0	00:43:27
1159511_10	megahit	cpuqueue	6	COMPLETED	0:0	00:55:30
1159511_11	megahit	cpuqueue	6	COMPLETED	0:0	00:55:21
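To summarise a table like the one above, the Elapsed column can be totalled with awk. The helper elapsed_summary below is a hypothetical sketch; it assumes tab-separated input with a header row, Elapsed as the last column, and times in HH:MM:SS without a day prefix.

```shell
# elapsed_summary: read a tab-separated sacct table from stdin and print
# the number of jobs plus the total and mean elapsed time.
elapsed_summary() {
  awk -F'\t' '
    # Format a number of seconds as HH:MM:SS.
    function hms(s) {
      return sprintf("%02d:%02d:%02d", int(s / 3600), int(s % 3600 / 60), s % 60)
    }
    NR > 1 && NF {                      # skip the header and blank lines
      split($NF, t, ":")                # ASSUMPTION: HH:MM:SS, no "D-" prefix
      total += t[1] * 3600 + t[2] * 60 + t[3]
      n++
    }
    END {
      if (n) printf "jobs=%d total=%s mean=%s\n", n, hms(total), hms(int(total / n))
    }
  '
}
```

For the eleven MEGAHIT jobs listed above, the elapsed times add up to 09:21:34, a mean of about 00:51:03 per job.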

metaQUAST

Queued 2023-04-28 09:15. Benchmarking both the assembly and the gsa.

CONCOCT

The gold standard binning benchmark took 8 h 32 min (run on my PC). CONCOCT failed to bin the 06, 07, and 090 samples. In the previous run, where all P. zijinensis contigs were N's, the 095 binning resulted in 2 bins: one belonged to P. zijinensis and the other to Ceratobasidium spp. With the issue resolved, CONCOCT produced 69 bins.

AMBER

With the issue resolved, CONCOCT produced 69 bins, and P. zijinensis was the most abundant genome in each of them. In fact, every bin had a purity of 1.00.
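As a reminder of what a purity of 1.00 means: per bin, purity is the number of base pairs from the most abundant genome divided by the total base pairs in the bin. The sketch below recomputes this from a simple table; the three-column layout (bin, genome, contig length) and the name bin_purity are assumptions for illustration and do not match AMBER's own input or output format.

```shell
# bin_purity: stdin is TSV "bin_id <TAB> genome_id <TAB> contig_length"
# (hypothetical layout). Prints "bin_id <TAB> dominant_genome <TAB> purity".
bin_purity() {
  awk -F'\t' '
    { bp[$1 SUBSEP $2] += $3; size[$1] += $3 }
    END {
      # Find the genome contributing the most base pairs to each bin.
      for (k in bp) {
        split(k, p, SUBSEP)
        if (bp[k] > best[p[1]]) { best[p[1]] = bp[k]; who[p[1]] = p[2] }
      }
      for (b in size)
        if (size[b] > 0) printf "%s\t%s\t%.2f\n", b, who[b], best[b] / size[b]
    }
  ' | sort
}
```

A bin that contains only contigs from one genome gets a purity of 1.00, which is the case for all 69 bins here.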

This raises the question of which and how many of the bins should be used as a reference to remove host contamination.

ndreey / ghost-magnet

Re-run: Benchmark #65