Often, genome assembly projects have illumina whole genome sequencing reads available for the assembled individual.
The k-mer spectrum of this read set can be used for independently evaluating assembly quality without the need of a high quality reference.
Merqury provides a set of tools for this purpose.
Note that igvtools is no longer used. The .tdf
files are replaced with .wig
files, compatable to IGV and UCSC genome browser.
Only the latest Merqury version supports the compressed
option used in Verkko.
PATH
.
Download Meryl release: https://github.com/marbl/meryl/releases/tag/v1.4.1
tar -xJf meryl-1.4.*.tar.xz
cd meryl-1.4.1/bin
export PATH=$pwd:$PATH
If the binary doesn't work, download the source and compile:
git clone https://github.com/marbl/meryl.git
cd meryl/src
make -j 24
export PATH=/path/to/meryl/…/bin:$PATH
See if we get help message with meryl
.
git clone https://github.com/marbl/merqury.git
cd merqury
export MERQURY=$PWD
Other dependency:
The v1.3 Merqury is compatable with Meryl v1.3, however does not support homopolymer-compressed kmers. In addition, multiple issues were fixed (e.g. use of large k-mers, better memory utilization, minor bugs in the logs etc.) since then. Therefore, we recommend to use the latest Meryl and Merqury. The Conda version below is currently deploying v1.3.
Download Meryl release: https://github.com/marbl/meryl/releases/tag/v1.3
tar -xJf meryl-1.3.*.tar.xz
cd meryl-1.3/bin
export PATH=$pwd:$PATH
If the binary doesn't work, download the source and compile:
cd meryl/src
make -j 24
export PATH=/path/to/meryl/…/bin:$PATH
See if we get help message with meryl
.
$MERQURY
wget https://github.com/marbl/merqury/archive/v1.3.tar.gz
tar -zxvf v1.3.tar.gz
cd merqury-1.3
export MERQURY=$PWD
Add the “export” part to your environment for meryl
and MERQURY
(~/.bash_profile or ~/.profile).
Add installation dir paths for bedtools
and samtools
to your environment.
source
it.
Thanks to @EdHarry, a conda recipe is now available: https://anaconda.org/bioconda/merqury
On a new conda environment, run:
conda install -c conda-forge -c bioconda merqury
Or, if you have a different version of jdk installed or want to have a separate conda environnment for merqury:
conda create -n merqury -c conda-forge -c bioconda merqury openjdk=11
You will then need to activate the merqury environment before using it with:
conda activate merqury
Test running
Rscript $MERQURY/plot/plot_spectra_cn.R --help
In case R complains for version mismatches of the R packages, try
conda update --all
It seems like R in conda isn't maintained anymore. Try to modify channel priority in .condarc
.
.meryl
. !!On a single machine:
ln -s $MERQURY/merqury.sh # Link merqury
./merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
Usage: merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
<read-db.meryl> : k-mer counts of the read set
<mat.meryl> : k-mer counts of the maternal haplotype (ex. mat.only.meryl or mat.hapmer.meryl)
<pat.meryl> : k-mer counts of the paternal haplotype (ex. pat.only.meryl or pat.hapmer.meryl)
<asm1.fasta> : Assembly fasta file (ex. pri.fasta, hap1.fasta or maternal.fasta)
[asm2.fasta] : Additional fasta file (ex. alt.fasta, hap2.fasta or paternal.fasta)
*asm1.meryl and asm2.meryl will be generated. Avoid using the same names as the hap-mer dbs
<out> : Output prefix
< >
: required
[ ]
: optional
Below is showing examples how to run Merqury using the prebuilt meryl dbs on a. thaliana F1 hybrid. The fasta files are the trio-binned assemblies from Koren et al.
### Download assemblies ###
wget https://gembox.cbcb.umd.edu/triobinning/athal_COL.fasta
wget https://gembox.cbcb.umd.edu/triobinning/athal_CVI.fasta
### Download prebuilt meryl dbs ###
# read.meryl of the F1 hybrid between COL-0 and CVI-0
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.k18.meryl.tar.gz
# hap-mers for COL-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.col0.hapmer.meryl.tar.gz
# hap-mers for CVI-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.cvi0.hapmer.meryl.tar.gz
# Untar
for gz in *.tar.gz
do
tar -zxf $gz
done
# Run merqury
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test
# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta test-1
# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta test-1
# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta athal_CVI.fasta test-2
# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test-2
Merqury starts with eval/spectra_cn.sh
.
When hap-mers are provided, merqury runs modules under trio/
in addition to eval/spectra_cn.sh
.
The following can run at the same time. Modules with dependency are followed by arrows (->).
eval/spectra_cn.sh
-> trio/spectra_hap.sh
trio/hap_blob.sh
trio/phase_block.sh
per assembly -> trio/block_n_stats.sh
Meryl, the k-mer counter inside, uses the maximum cpus available.
Set OMP_NUM_THREADS=24
for example to use 24 threads.
On slurm environment, simply run:
ln -s $MERQURY/_submit_merqury.sh # Link merqury
./_submit_merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
Change the sbatch
to match your environment. (ex. partition)
eval/spectra_cn.sh
: k-mer completeness, qv, spectra-cn and spectra-asm plots, asm-only .bed
and .tdf
for tracking errorseval/qv.sh
: just get the qv stats and quit.trio/spectra_hap.sh
: hap-mer level spectra-cn plots, hap-mer completenesstrio/hap_blob.sh
: blob plots of the hap-mers in each contg/scaffoldtrio/phase_block.sh
: phase block statistics, phase block N* plots, hap-mer tracks (.bed
and .tdf
files)trio/block_n_stats.sh
: continuity plots (phase block N or NG plots, phase block vs. contig/scaffold plots)trio/switch_error.sh
: this is run part of phase_blck.sh
, however can be re-run with desired short-range switch parameters. Run trio/block_n_stats.sh
along with it to get the associated plots.Run each script without any parameters if not sure what to do.
For example, ./trio/switch_error.sh
will give a help message and quit.
Following wiki pages have more detailed examples.
Meryl dbs from Illumina WGS and hapmers are available here for
Please use this paper to cite Merqury:
Rhie, A., Walenz, B.P., Koren, S. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020). https://doi.org/10.1186/s13059-020-02134-9