Manual epiGBS2

Prerequisites for running the pipeline

Preparation to run the pipeline

The flowcell name can be found in the FASTQ headers of the read file. For example, @ST-E00317:403:H53KHCCXY:5:1101:5660:1309 1:N:0:NCAATCAC follows the pattern @ST-E00317:403:FLOWCELL:LANE-NUMBER:1101:5660:1309 1:N:0:NCAATCAC, so the flowcell is H53KHCCXY and the lane is 5. ENZ_R1 and ENZ_R2 expect the names of the restriction enzymes, and Wobble_R1 and Wobble_R2 give the length of the unique molecular identifier ("Wobble") sequence (usually 3). The restriction enzyme names must be spelled correctly: they end in a capital i (I), not a lowercase L (l) or the digit 1.
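As a minimal sketch of the header layout described above, the following Python snippet extracts the flowcell name and lane number from a header line. It assumes the colon-separated Illumina layout shown in the example. The function name `parse_flowcell_and_lane` is made up for illustration and is not part of the pipeline.

```python
from typing import Tuple


def parse_flowcell_and_lane(header: str) -> Tuple[str, int]:
    """Return (flowcell, lane) from an Illumina-style FASTQ header line."""
    # Drop the leading '@' and the comment part after the first space.
    read_id = header.strip().lstrip("@").split(" ", 1)[0]
    fields = read_id.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected FASTQ header layout: {header!r}")
    # Assumed field order: instrument, run, flowcell, lane, tile, x, y
    return fields[2], int(fields[3])


header = "@ST-E00317:403:H53KHCCXY:5:1101:5660:1309 1:N:0:NCAATCAC"
print(parse_flowcell_and_lane(header))  # ('H53KHCCXY', 5)
```

The two returned values are what the Flowcell and Lane columns of the barcode file expect.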

Processing multiple lanes at the same time is currently not supported; the limiting step is demultiplexing. To analyse more than one lane, first run demultiplexing for each lane separately, merge the demultiplexed files, and then run the rest of the pipeline.
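One possible way to do the merge step is sketched below in Python. It relies on the fact that concatenated gzip files form a valid gzip stream. The directory layout (one directory of demultiplexed `.fq.gz` files per lane, with identical file names across lanes) and the function name `merge_lanes` are assumptions for illustration, not something the pipeline prescribes.

```python
import shutil
from pathlib import Path


def merge_lanes(lane_dirs, merged_dir):
    """Concatenate same-named .fq.gz files from several lane directories."""
    merged_dir = Path(merged_dir)
    merged_dir.mkdir(parents=True, exist_ok=True)
    # Collect every file name that occurs in at least one lane directory.
    names = sorted({p.name for d in lane_dirs for p in Path(d).glob("*.fq.gz")})
    for name in names:
        with open(merged_dir / name, "wb") as out:
            # Lanes are appended in the given order for every file, so R1 and
            # R2 files stay in the same read order as long as each lane
            # directory holds both mates of a pair.
            for d in lane_dirs:
                part = Path(d) / name
                if part.exists():
                    with open(part, "rb") as src:
                        shutil.copyfileobj(src, out)
    return names


# Hypothetical usage (paths are placeholders):
# merge_lanes(["demux_lane5", "demux_lane6"], "demux_merged")
```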

# barcodes.tsv
Flowcell        Lane    Barcode_R1      Barcode_R2      Sample  history Country PlateName       Row     Column  ENZ_R1  ENZ_R2  Wobble_R1       Wobble_R2       Species
H53KHCCXY       5       AACT    CCAG    BUXTON_178      C       BUXTON  BUXTON_WUR_AseI_NsiI_final_run1 1       2       AseI    NsiI    3       3       Scabiosa columbaria
H53KHCCXY       5       CCTA    CCAG    WUR_178 C       WUR     BUXTON_WUR_AseI_NsiI_final_run1 2       2       AseI    NsiI    3       3       Scabiosa columbaria
H53KHCCXY       5       TTAC    CCAG    BUXTON_169      C       BUXTON  BUXTON_WUR_AseI_NsiI_final_run1 3       2       AseI    NsiI    3       3       Scabiosa columbaria
H53KHCCXY       5       AGGC    CCAG    WUR_169 C       WUR     BUXTON_WUR_AseI_NsiI_final_run1 4       2       AseI    NsiI    3       3       Scabiosa columbaria
H53KHCCXY       5       GAAGA   CCAG    BUXTON_175      SD      BUXTON  BUXTON_WUR_AseI_NsiI_final_run1 5       2       AseI    NsiI    3       3       Scabiosa columbaria
H53KHCCXY       5       CCTTC   CCAG    WUR_175 SD      WUR     BUXTON_WUR_AseI_NsiI_final_run1 6       2       AseI    NsiI    3       3       Scabiosa columbaria
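Before starting a run, it can help to check the barcode file for the mistakes mentioned above. The following Python sketch assumes a tab-separated file with the column names shown in the header row. The function name `check_barcodes` is hypothetical, and the example uses a reduced set of columns with one deliberately misspelled enzyme name.

```python
import csv
import io


def check_barcodes(handle):
    """Return a list of problems found in a tab-separated barcode file."""
    problems = []
    rows = list(csv.DictReader(handle, delimiter="\t"))
    # Only one lane can be processed per run.
    lanes = {(r["Flowcell"], r["Lane"]) for r in rows}
    if len(lanes) > 1:
        problems.append(f"more than one flowcell/lane: {sorted(lanes)}")
    for n, r in enumerate(rows, start=2):  # line 1 is the header
        for col in ("ENZ_R1", "ENZ_R2"):
            # The name should end in a capital I, not a lowercase L or a 1.
            if r[col][-1:] in ("l", "1"):
                problems.append(f"line {n}: {col}={r[col]!r} ends in 'l' or '1'")
        for col in ("Wobble_R1", "Wobble_R2"):
            if not r[col].isdigit():
                problems.append(f"line {n}: {col}={r[col]!r} is not an integer")
    return problems


# Reduced example; the second data row misspells AseI as "Asel" on purpose.
example = (
    "Flowcell\tLane\tBarcode_R1\tBarcode_R2\tSample\tENZ_R1\tENZ_R2\tWobble_R1\tWobble_R2\n"
    "H53KHCCXY\t5\tAACT\tCCAG\tBUXTON_178\tAseI\tNsiI\t3\t3\n"
    "H53KHCCXY\t5\tCCTA\tCCAG\tWUR_178\tAsel\tNsiI\t3\t3\n"
)
for problem in check_barcodes(io.StringIO(example)):
    print(problem)  # line 3: ENZ_R1='Asel' ends in 'l' or '1'
```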

Start the pipeline

Running the pipeline with --use-conda is important!

A description of all output files follows. Files that are important for downstream analysis are highlighted in bold. Files or directories in italics are specific to the de novo or the reference branch, respectively.

When not to run the pipeline?

Quality control or "How to discover errors?"

Recommendation: Run fastq-screen in bisulphite mode on the raw data to determine sources of contamination (e.g. from sharing a lane with other customers, human DNA, phiX, vectors, and adapters).

MultiQC report:

Fix errors

Clone removal


The clone_filter process does not proceed due to insufficient memory on the server



I have a very high percentage of clone reads




There are many reads that are lost due to polyG (GGGGG) stretches at the beginning or end of the reads
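To gauge how many reads are affected, the fraction of reads that start or end with a polyG stretch can be counted directly from a raw read file. The Python sketch below assumes a gzip-compressed FASTQ file with four lines per record. The function name `polyg_fraction` and the default run length of five G bases (taken from the GGGGG example above) are illustrative choices, not pipeline settings.

```python
import gzip


def polyg_fraction(fastq_path, min_run=5):
    """Fraction of reads starting or ending with at least min_run G bases."""
    run = "G" * min_run
    total = flagged = 0
    with gzip.open(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 != 1:  # the sequence is the second line of each record
                continue
            seq = line.strip().upper()
            total += 1
            if seq.startswith(run) or seq.endswith(run):
                flagged += 1
    return flagged / total if total else 0.0


# Hypothetical usage (file name is a placeholder):
# print(polyg_fraction("test_data_R1.fq.gz"))
```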



One or more samples have very few recovered reads, or read numbers differ greatly between samples.



The coverage in the methylation bed file is too low, and after filtering on coverage (>10) only a few positions remain.



The mapping percentage in de novo mode is low.


Example Data and Config Files

Example Data

An example data set and barcode file are available at Zenodo.

Config file using the de novo branch

# path to output directory
output_dir: "/home/maarten/test_data/epigbs2/output"

# input directory where raw reads are
input_dir       : "/home/maarten/test_data/data"

# name of sequence read files
Read1 : "test_data_R1.fq.gz"
Read2 : "test_data_R2.fq.gz"

# number of sequencing cycles (the same as read length in Illumina sequencing)
cycles        : 150

# barcode file (must be kept inside the input directory); the enzymes are specified in the barcode file
barcodes: "barcodes.tsv"

# the pipeline produces some temporary files. Please indicate the tmp location on your server (in most cases /tmp)
tmpdir        : "/tmp"

# mode of running pipeline (set denovo, reference or legacy. PLEASE NOTE: legacy is not supported)
mode: "denovo"

# genome directory (leave it blank in denovo mode)
ref_dir: ""

# genome name (leave it blank in denovo mode)
genome: ""

# advanced users can change the parameters below; leave them blank or write "default" to use the default values

# parameters in the denovo reference creation:
# identity: sequence identity in the last clustering step, written as a decimal fraction, e.g. 0.90 for 90% identity, default 0.97
# min-depth: minimal cluster depth in the first clustering step to include a cluster, default 10
# max-depth: maximal cluster depth in the first clustering step to include a cluster, default 10000
  identity: "0.97"
  min-depth: "10"
  max-depth: "10000"

Config file using the reference branch

# path to output directory
output_dir: "/home/maarten/test_data/epigbs2/output"

# input directory where raw reads are
input_dir       : "/home/maarten/test_data/data"

# name of sequence read files
Read1 : "test_data_R1.fq.gz"
Read2 : "test_data_R2.fq.gz"

# number of sequencing cycles (the same as read length in Illumina sequencing)
cycles        : 150

# barcode file (must be kept inside the input directory); the enzymes are specified in the barcode file
barcodes: "barcodes.tsv"

# the pipeline produces some temporary files. Please indicate the tmp location on your server (in most cases /tmp)
tmpdir        : "/tmp"

# mode of running pipeline (set denovo, reference or legacy. PLEASE NOTE: legacy is not supported)
mode: "reference"

# genome directory (leave it blank in denovo mode)
ref_dir: "/home/maarten/test_data/ref/"

# genome name (leave it blank in denovo mode)
genome: "ref.fa"

# advanced users can change the parameters below; leave them blank or write "default" to use the default values

# parameters in the denovo reference creation:
# identity: sequence identity in the last clustering step, written as a decimal fraction, e.g. 0.90 for 90% identity, default 0.97
# min-depth: minimal cluster depth in the first clustering step to include a cluster, default 10
# max-depth: maximal cluster depth in the first clustering step to include a cluster, default 10000
  identity: ""
  min-depth: ""
  max-depth: ""
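The rules stated in the config comments (which mode needs a genome, and how identity is written) can be checked before a run. The Python sketch below works on an already parsed config passed in as a plain dictionary. How the file is loaded is left open, and the function names `check_config` and `check_identity` are hypothetical helpers, not part of the pipeline.

```python
def check_config(cfg):
    """Return a list of problems with the mode and genome settings."""
    problems = []
    mode = cfg.get("mode", "")
    # "legacy" is listed in the config comment but is not supported.
    if mode not in ("denovo", "reference"):
        problems.append(f"mode={mode!r}: expected 'denovo' or 'reference'")
    ref_dir = cfg.get("ref_dir", "")
    genome = cfg.get("genome", "")
    if mode == "reference" and not (ref_dir and genome):
        problems.append("reference mode needs both ref_dir and genome")
    if mode == "denovo" and (ref_dir or genome):
        problems.append("denovo mode expects ref_dir and genome to be blank")
    return problems


def check_identity(value):
    """True if value is blank, "default", or a decimal fraction in (0, 1]."""
    if value in ("", "default"):
        return True
    try:
        return 0 < float(value) <= 1
    except ValueError:
        return False


print(check_config({"mode": "reference", "ref_dir": "", "genome": ""}))
# ['reference mode needs both ref_dir and genome']
print(check_identity("0.97"), check_identity("90"))  # True False
```

Writing 90 instead of 0.90 for the identity is the kind of slip that `check_identity` is meant to catch.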

List of used software and references



