ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Assembly: megahit #58

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Various settings will be tested, as choosing the appropriate kmer size is not always straightforward.

| K-mer size | Pros | Cons |
| --- | --- | --- |
| Small (e.g. 21-31) | Reduce memory requirements and computation time | More errors due to sequencing errors or low-quality reads |
| | Preserve information in regions with high levels of variation or low coverage | Incomplete assemblies or merging of different regions of the genome in complex regions |
| | Help identify and correct sequencing errors | Misassemblies or chimeric contigs if reads do not overlap enough |
| Large (e.g. 61-81) | Increase assembly accuracy by reducing errors in the reads | Increase memory requirements and computation time |
| | Help resolve complex regions of the genome containing repeats, transposable elements, etc. | Fragmented assemblies if reads do not overlap enough |
| | Help distinguish between closely related species or strains in the sample | Loss of information in low-coverage regions of the genome or regions with high levels of variation |

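The overlap trade-off can be made concrete with a little arithmetic: a read of length L contains L - k + 1 k-mers, and two reads can only share a k-mer if they overlap by at least k bases. A small sketch, assuming 150 bp reads (the read length is not stated in this thread, and `kmers_per_read` is a helper name made up for illustration):

```shell
# Number of k-mers in a read: L - k + 1, or 0 if the read is shorter than k.
# $1 = read length, $2 = k-mer size
kmers_per_read() {
    if [ "$2" -gt "$1" ]; then
        echo 0
    else
        echo $(( $1 - $2 + 1 ))
    fi
}

read_len=150   # assumed read length, not taken from the thread
for k in 21 31 61 81 91; do
    printf 'k=%s\tk-mers per read: %s\n' "$k" "$(kmers_per_read "$read_len" "$k")"
done
```

Larger k leaves fewer k-mers per read and demands longer overlaps between reads, which is why low-coverage regions tend to fragment first.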
ndreey commented 1 year ago

In the CAMI II they used these settings:

ndreey commented 1 year ago

First run, using these settings:

megahit -t 6 -m 0.8 --k-min 21 --k-max 91 \
    -1 /mnt/c/Users/andbo/thesis_andbo/CAMISIM/platanthera_mock/reads/01_trimmed/02_trim_R1.fq.gz \
    -2 /mnt/c/Users/andbo/thesis_andbo/CAMISIM/platanthera_mock/reads/01_trimmed/02_trim_R2.fq.gz \
    -o /home/andbo/megahit_results/platanthera_mock_assembly/02 \
    --out-prefix 02

-t: number of CPU threads
-m: maximum memory to use; a value between 0 and 1 is taken as a fraction of the machine's total memory (here 80%)
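As a rough check of what the memory fraction amounts to on a given machine, the value can be converted to bytes. A minimal sketch, assuming a Linux host where /proc/meminfo is readable (this includes WSL); `mem_fraction_bytes` is a hypothetical helper, not part of MEGAHIT:

```shell
# Convert a memory fraction into bytes.
# $1 = fraction between 0 and 1, $2 = total memory in kB
mem_fraction_bytes() {
    awk -v frac="$1" -v total_kb="$2" \
        'BEGIN { printf "%.0f\n", frac * total_kb * 1024 }'
}

# Report what 0.8 corresponds to on this machine, if /proc/meminfo exists.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    echo "0.8 of total memory: $(mem_fraction_bytes 0.8 "$total_kb") bytes"
fi
```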

ndreey commented 1 year ago

Because the computation became more demanding, I have switched to Mjölnir and now run the jobs with SLURM. Two SLURM array jobs were created, one per parameter set: megahit_k21_array.sh and megahit_meta_sensi_array.sh.

The two scripts are nearly identical; the meta-sensitive one sets --presets meta-sensitive instead of --k-min/--k-max and requests more resources.

#!/bin/bash

#SBATCH --job-name=k21_megahit            # name that will show up in the queue
#SBATCH --array=1-11%4                    # 11 array tasks, at most 4 running at once
#SBATCH --output=slurm-%j.out             # stdout file; %j expands to the job ID
#SBATCH --error=slurm-%j.err              # stderr file
#SBATCH --partition=cpuqueue              # partition (queue) to submit to
#SBATCH --ntasks=1                        # number of tasks (analyses) to run
#SBATCH --cpus-per-task=6                 # number of threads allocated to each task
#SBATCH --mem-per-cpu=8G                  # memory per CPU core
#SBATCH --time=01:30:00                   # time limit (hour:min:sec)
#SBATCH --mail-type=ALL                   # send all types of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}

# Information
echo "$(date)    Submitted from ${submitdir}"
echo "$(date)    Accessed ${workdir}"
echo "$(date)    ArrayID: ${jobid}"

# 1. Lock and load module and data
# ============================================
module load megahit

# Directory with the trimmed reads
reads="${submitdir}/bsc_thesis/data/subsample/reads/01_trimmed"

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# Host-contamination levels, one per array task
hc_level=("00" "01" "02" "03" "04" "05" "06" "07" "08" "09" "095")

# Pick the prefix for this task (array IDs start at 1, bash arrays at 0)
hc_prefix=${hc_level[$((jobid - 1))]}

# Thread count follows --cpus-per-task (falls back to 6 if unset)
megahit -t "${SLURM_CPUS_PER_TASK:-6}" --k-min 21 --k-max 91 \
    -1 "${reads}/${hc_prefix}_trim_R1.fq.gz" \
    -2 "${reads}/${hc_prefix}_trim_R2.fq.gz" \
    -o "k21/${hc_prefix}" \
    --out-prefix "${hc_prefix}_k21"
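Since megahit_meta_sensi_array.sh is not shown here, the difference between the two scripts can only be sketched. The snippet below assumes the meta-sensitive run reuses the same host-contamination prefixes and 8 threads per task; the helper names `hc_prefix_for_task` and `megahit_mode_args`, the shortened read file names, and the output naming (`-o`, `--out-prefix`) are all illustrative assumptions, not the actual script. It prints the command rather than running it:

```shell
# Map a 1-based array task ID to its host-contamination prefix.
hc_prefix_for_task() {
    case $1 in
        ''|*[!0-9]*) echo "task ID must be a positive integer: '$1'" >&2; return 1 ;;
    esac
    task_id=$1
    # The positional parameters stand in for the hc_level array.
    set -- 00 01 02 03 04 05 06 07 08 09 095
    if [ "$task_id" -lt 1 ] || [ "$task_id" -gt "$#" ]; then
        echo "task ID ${task_id} is outside 1-$#" >&2
        return 1
    fi
    shift $((task_id - 1))
    printf '%s\n' "$1"
}

# Print the MEGAHIT options that differ between the two runs.
megahit_mode_args() {
    case $1 in
        k21)            printf '%s\n' '--k-min 21 --k-max 91' ;;
        meta-sensitive) printf '%s\n' '--presets meta-sensitive' ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the command for one array task instead of executing it.
mode=meta-sensitive
prefix=$(hc_prefix_for_task "${SLURM_ARRAY_TASK_ID:-1}")
# $(megahit_mode_args ...) is left unquoted so the options split into words.
# The -o and --out-prefix values are assumed naming, not taken from the script.
echo megahit -t "${SLURM_CPUS_PER_TASK:-8}" $(megahit_mode_args "$mode") \
    -1 "${prefix}_trim_R1.fq.gz" -2 "${prefix}_trim_R2.fq.gz" \
    -o "${mode}/${prefix}" --out-prefix "${prefix}_${mode}"
```

Keeping the mode-specific options in one place means a single array script could serve both runs, with only the `#SBATCH` resource lines differing.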
ndreey commented 1 year ago

MEGAHIT K21 RUNTIME

       JobID    JobName  Partition  AllocCPUS      State ExitCode    Elapsed
------------ ---------- ---------- ---------- ---------- -------- ----------
893970_1     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:16:23
893970_1.ba+      batch                     6  COMPLETED      0:0   00:16:23
893970_2     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:12:41
893970_2.ba+      batch                     6  COMPLETED      0:0   00:12:41
893970_3     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:11:30
893970_3.ba+      batch                     6  COMPLETED      0:0   00:11:30
893970_4     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:10:43
893970_4.ba+      batch                     6  COMPLETED      0:0   00:10:43
893970_5     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:06:01
893970_5.ba+      batch                     6  COMPLETED      0:0   00:06:01
893970_6     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:05:57
893970_6.ba+      batch                     6  COMPLETED      0:0   00:05:57
893970_7     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:12:40
893970_7.ba+      batch                     6  COMPLETED      0:0   00:12:40
893970_8     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:13:23
893970_8.ba+      batch                     6  COMPLETED      0:0   00:13:23
893970_9     k21_megah+   cpuqueue          6  COMPLETED      0:0   00:05:59
893970_9.ba+      batch                     6  COMPLETED      0:0   00:05:59
893970_10    k21_megah+   cpuqueue          6  COMPLETED      0:0   00:06:49
893970_10.b+      batch                     6  COMPLETED      0:0   00:06:49
893970_11    k21_megah+   cpuqueue          6  COMPLETED      0:0   00:06:27
893970_11.b+      batch                     6  COMPLETED      0:0   00:06:27

MEGAHIT META-SENSITIVE RUNTIME

Mjölnir had a minor disruption, which may explain the long runtime for the 00_reads (893983_1).

       JobID    JobName  Partition  AllocCPUS      State ExitCode    Elapsed
------------ ---------- ---------- ---------- ---------- -------- ----------
893983_1     metasens_+   cpuqueue          8  COMPLETED      0:0   02:10:17
893983_1.ba+      batch                     8  COMPLETED      0:0   02:10:17
893983_2     metasens_+   cpuqueue          8  COMPLETED      0:0   00:29:07
893983_2.ba+      batch                     8  COMPLETED      0:0   00:29:07
893983_3     metasens_+   cpuqueue          8  COMPLETED      0:0   00:55:57
893983_3.ba+      batch                     8  COMPLETED      0:0   00:55:57
893983_4     metasens_+   cpuqueue          8  COMPLETED      0:0   00:41:11
893983_4.ba+      batch                     8  COMPLETED      0:0   00:41:11
893983_5     metasens_+   cpuqueue          8  COMPLETED      0:0   00:25:52
893983_5.ba+      batch                     8  COMPLETED      0:0   00:25:52
893983_6     metasens_+   cpuqueue          8  COMPLETED      0:0   00:25:02
893983_6.ba+      batch                     8  COMPLETED      0:0   00:25:02
893983_7     metasens_+   cpuqueue          8  COMPLETED      0:0   00:27:37
893983_7.ba+      batch                     8  COMPLETED      0:0   00:27:37
893983_8     metasens_+   cpuqueue          8  COMPLETED      0:0   00:29:23
893983_8.ba+      batch                     8  COMPLETED      0:0   00:29:23
893983_9     metasens_+   cpuqueue          8  COMPLETED      0:0   00:30:01
893983_9.ba+      batch                     8  COMPLETED      0:0   00:30:01
893983_10    metasens_+   cpuqueue          8  COMPLETED      0:0   00:27:52
893983_10.b+      batch                     8  COMPLETED      0:0   00:27:52
893983_11    metasens_+   cpuqueue          8  COMPLETED      0:0   00:29:08
893983_11.b+      batch                     8  COMPLETED      0:0   00:29:08
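
To compare the two runs, the elapsed times can be totalled directly from the accounting data. A minimal sketch, assuming `sacct` is available and the jobs are still in the accounting database; `sum_elapsed` is a hypothetical helper that reads one Elapsed value per line:

```shell
# Sum Elapsed values ([D-]HH:MM:SS, one per line on stdin) and print seconds.
sum_elapsed() {
    awk '{
        days = 0
        t = $1
        # Split off an optional leading day count ("D-HH:MM:SS").
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        # Fold HH:MM:SS (or MM:SS) into seconds, left to right.
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        total += days * 86400 + secs
    } END { print total + 0 }'
}

# Example with the first three k21 tasks from the table above.
printf '%s\n' 00:16:23 00:12:41 00:11:30 | sum_elapsed

# On the cluster (-X keeps one line per task and drops the .batch steps):
# sacct -j 893970 -X --noheader --format=Elapsed | sum_elapsed
```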