Complete re-run of benchmark
In #64, the Platanthera zijinensis genome FASTA file was formatted with 100 nucleotides per line, which caused the gold standard assembly (gsa.fasta.gz) to contain only N's for every contig belonging to P. zijinensis. This explains why the metaQUAST reports showed no reads matching the gold standard assembly. We first believed this was due to the low sequencing depth, but it turned out that the A, T, G, and C bases simply could not match the N's. After reformatting the genome FASTA file and generating a new fasta.fai, a complete re-run of all steps was started.
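A quick check of the N content per contig would flag this kind of problem before any downstream step runs. The following is only a minimal sketch: the function name n_fraction is hypothetical, and it assumes a plain FASTA stream on stdin (decompress gsa.fasta.gz first).

```shell
# Print contig name, length, and fraction of N bases for each FASTA record.
# Hypothetical helper; reads FASTA text from stdin.
n_fraction() {
  awk '
    function report() {
      if (name != "") printf "%s\t%d\t%.3f\n", name, len, (len ? n / len : 0)
    }
    /^>/ { report(); name = substr($1, 2); len = 0; n = 0; next }
    {
      seq = $0
      len += length(seq)
      n += gsub(/[Nn]/, "", seq)   # gsub returns the number of replacements
    }
    END { report() }
  '
}
```

For example, gzip -dc gsa.fasta.gz | n_fraction | awk '$3 == "1.000"' would list the contigs that consist entirely of N's.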
CAMISIM
Because the pooled_assembly step cannot be disabled, I am limited in how much I can simulate on my Windows laptop (AMD Ryzen 7 4700U, 8 cores, 16 GB RAM, 200 GB of available SSD space).
2 Gb of data will be simulated per sample: 1 Gb forward and 1 Gb reverse.
I tried changing the config after the run had started to manually disable the pooled_assembly step, but this did not work.
One sample took 1 h to generate.
CAMISIM output
bunch_up.sh was placed in platanthera_mock/ and executed, generating *_all_R*.fq.gz. These are FASTQ files holding all R1 and all R2 reads in separate files.
I then created the read directories with mkdir -p reads/00_raw reads/01_trimmed and moved the combined files into reads/00_raw:
find . -name '*_all_*' -type f -exec mv {} reads/00_raw/ \;
I then manually renamed the files in *_output/contigs to the hc prefix and moved them to gold_standards. I also added .gz to the gsa_mapping.tsv files. All of this can be automated.
mkdir -p gold_standards/gsa gold_standards/gsb
mv gold_standards/*mapping* gold_standards/gsb
mv gold_standards/*fasta* gold_standards/gsa
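The manual rename-and-move step could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used: it assumes each sample directory is named <prefix>_output and contains a contigs/ subdirectory, it takes the part before _output as the hc prefix, and the function name collect_gold_standards is hypothetical. It does not cover adding .gz to the mapping files.

```shell
# Move files from every <prefix>_output/contigs/ into gold_standards/,
# prefixing each file name with <prefix> and sorting by file type.
# Assumed layout; adjust the patterns to the real CAMISIM output names.
collect_gold_standards() {
  root=${1:-.}
  mkdir -p "$root/gold_standards/gsa" "$root/gold_standards/gsb"
  for dir in "$root"/*_output/contigs; do
    [ -d "$dir" ] || continue
    prefix=$(basename "$(dirname "$dir")")
    prefix=${prefix%_output}
    for f in "$dir"/*; do
      [ -f "$f" ] || continue
      base=$(basename "$f")
      case "$base" in
        *mapping*) dest="$root/gold_standards/gsb" ;;  # contig-to-genome mappings
        *fasta*)   dest="$root/gold_standards/gsa" ;;  # gold standard assemblies
        *)         continue ;;                         # leave anything else in place
      esac
      mv -- "$f" "$dest/${prefix}_${base}"
    done
  done
}
```

Running collect_gold_standards . from the directory holding the *_output folders would replace the manual renaming and the two mv commands above.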
Quality Control and Processing
As the same error profile was used, the FastQC results were equivalent to those of the previous run, so the same trim parameters were used (fastp_trim.sh).
I activated the environment with conda activate qc and then ran fastqc and fastp.
MEGAHIT
Runtime for assembly.
JobID        JobName  Partition  AllocCPUS  State      ExitCode  Elapsed
1159511_1    megahit  cpuqueue   6          COMPLETED  0:0       00:45:29
1159511_2    megahit  cpuqueue   6          COMPLETED  0:0       01:09:57
1159511_3    megahit  cpuqueue   6          COMPLETED  0:0       01:01:28
1159511_4    megahit  cpuqueue   6          COMPLETED  0:0       00:40:24
1159511_5    megahit  cpuqueue   6          COMPLETED  0:0       00:45:22
1159511_6    megahit  cpuqueue   6          COMPLETED  0:0       00:43:32
1159511_7    megahit  cpuqueue   6          COMPLETED  0:0       00:42:54
1159511_8    megahit  cpuqueue   6          COMPLETED  0:0       00:58:10
1159511_9    megahit  cpuqueue   6          COMPLETED  0:0       00:43:27
1159511_10   megahit  cpuqueue   6          COMPLETED  0:0       00:55:30
1159511_11   megahit  cpuqueue   6          COMPLETED  0:0       00:55:21
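To summarise the assembly runtimes, the Elapsed values can be summed and averaged. The helper below is a small sketch (elapsed_summary is a hypothetical name); it only handles the HH:MM:SS form shown in the table and skips anything else.

```shell
# Sum and average HH:MM:SS values, one per line on stdin.
elapsed_summary() {
  awk -F: '
    NF == 3 { total += $1 * 3600 + $2 * 60 + $3; n++ }
    END {
      if (n == 0) exit 1
      mean = int(total / n + 0.5)   # rounded to the nearest second
      printf "jobs=%d total=%02d:%02d:%02d mean=%02d:%02d:%02d\n", n,
        int(total / 3600), int((total % 3600) / 60), total % 60,
        int(mean / 3600), int((mean % 3600) / 60), mean % 60
    }
  '
}

# Elapsed values of the 11 MEGAHIT array tasks listed above.
printf '%s\n' 00:45:29 01:09:57 01:01:28 00:40:24 00:45:22 00:43:32 \
  00:42:54 00:58:10 00:43:27 00:55:30 00:55:21 | elapsed_summary
```

For the 11 tasks this prints jobs=11 total=09:21:34 mean=00:51:03, i.e. roughly 51 minutes per sample with 6 CPUs each.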
metaQUAST
Queued 2023-04-28 09:15.
Benchmarking both the assembly and the gsa.
CONCOCT
The gold standard binning benchmark took 8 h 32 min (run on my PC).
CONCOCT failed to bin the 06, 07, and 090 samples.
In the previous run, where all P. zijinensis contigs were N's, the 095 binning resulted in 2 bins: one belonging to P. zijinensis and the other to Ceratobasidium spp. With the issue resolved, CONCOCT produced 69 bins.
AMBER
With the issue resolved, CONCOCT produced 69 bins, and P. zijinensis was the most abundant genome in each of them. In fact, all of them had a purity of 1.00.
This raises the question of which and how many of the bins should be used as a reference to remove host contamination.
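For reference, bin purity can be read as the share of a bin's base pairs that belongs to its most abundant genome, so a purity of 1.00 means the bin contains a single genome only. The sketch below illustrates that calculation and is not AMBER's implementation: the three-column input (bin ID, genome ID, contig length in bp, tab-separated) and the name bin_purity are assumptions made for the example.

```shell
# Per bin: most abundant genome and purity = bp of that genome / total bp in bin.
# Input on stdin: bin_id <TAB> genome_id <TAB> contig_length_bp (assumed format).
bin_purity() {
  awk -F'\t' '
    { bp[$1 SUBSEP $2] += $3; total[$1] += $3 }
    END {
      for (key in bp) {
        split(key, part, SUBSEP)
        b = part[1]
        if (bp[key] > best[b]) { best[b] = bp[key]; top[b] = part[2] }
      }
      for (b in total)
        if (total[b] > 0) printf "%s\t%s\t%.2f\n", b, top[b], best[b] / total[b]
    }
  ' | sort
}
```

Listing the bins this way, together with their total size, could help decide which of the 69 bins are worth keeping as a host reference.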