nioo-knaw / epiGBS2

This is the epiGBS2 Snakemake pipeline, as published in a preprint.
MIT License

Manual epiGBS2

Prerequisites for running the pipeline

Preparation to run the pipeline

The flowcell name can be found in the FASTQ headers of the read file. For example, @ST-E00317:403:H53KHCCXY:5:1101:5660:1309 1:N:0:NCAATCAC follows the pattern @ST-E00317:403:FLOWCELL:LANE-NUMBER:1101:5660:1309 1:N:0:NCAATCAC, so the flowcell is H53KHCCXY and the lane is 5. ENZ_R1 and ENZ_R2 expect the names of the restriction enzymes, and Wobble_R1 and Wobble_R2 give the length of the unique molecular identifier ("Wobble") sequence (usually 3). The restriction enzyme names must be spelled correctly: they end in a capital i (I), not a lowercase L (l) or the digit one (1).
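As a minimal sketch of how the Flowcell and Lane values could be read from a header, the Python snippet below splits the example header into its colon-separated fields. The function name `parse_illumina_header` is hypothetical and not part of the pipeline, and the snippet assumes the seven-field header layout shown above.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina FASTQ header into its colon-separated fields."""
    # Drop the leading "@" and the comment part after the first space.
    name = header.lstrip("@").split(" ", 1)[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = fields
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": int(lane),
    }


header = "@ST-E00317:403:H53KHCCXY:5:1101:5660:1309 1:N:0:NCAATCAC"
info = parse_illumina_header(header)
print(info["flowcell"], info["lane"])  # H53KHCCXY 5
```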

Processing multiple lanes at the same time is currently not supported; the limiting step is the demultiplexing. To analyse more than one lane, first run the demultiplexing per lane, then merge the demultiplexed files, and then run the rest of the pipeline.
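The merging step could look roughly like the following Python sketch, which concatenates gzip-compressed FASTQ files byte for byte (concatenated gzip files remain valid gzip). The function `merge_gzip_files` and all file names are hypothetical; the names and locations of the demultiplexed files depend on the pipeline output and are not specified here.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def merge_gzip_files(parts, destination):
    """Concatenate gzip files; the result is a valid multi-member gzip file."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with two tiny stand-in files for one sample sequenced on two lanes
# (hypothetical file names, not the names the pipeline writes).
workdir = Path(tempfile.mkdtemp())
lane_files = []
for lane, read in ((5, "@r1\nACGT\n+\nIIII\n"), (6, "@r2\nTTGA\n+\nIIII\n")):
    path = workdir / f"sampleA_lane{lane}_R1.fq.gz"
    with gzip.open(path, "wt") as fh:
        fh.write(read)
    lane_files.append(path)

merged = workdir / "sampleA_R1.fq.gz"
merge_gzip_files(lane_files, merged)
with gzip.open(merged, "rt") as fh:
    merged_text = fh.read()
print(merged_text.count("\n") // 4, "reads in merged file")
```

For paired-end data, the R1 and R2 files of a sample should be merged in the same lane order so that read pairs stay in sync.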

# barcodes.tsv
Flowcell   Lane  Barcode_R1  Barcode_R2  Sample      history  Country  PlateName                        Row  Column  ENZ_R1  ENZ_R2  Wobble_R1  Wobble_R2  Species
H53KHCCXY  5     AACT        CCAG        BUXTON_178  C        BUXTON   BUXTON_WUR_AseI_NsiI_final_run1  1    2       AseI    NsiI    3          3          Scabiosa columbaria
H53KHCCXY  5     CCTA        CCAG        WUR_178     C        WUR      BUXTON_WUR_AseI_NsiI_final_run1  2    2       AseI    NsiI    3          3          Scabiosa columbaria
H53KHCCXY  5     TTAC        CCAG        BUXTON_169  C        BUXTON   BUXTON_WUR_AseI_NsiI_final_run1  3    2       AseI    NsiI    3          3          Scabiosa columbaria
H53KHCCXY  5     AGGC        CCAG        WUR_169     C        WUR      BUXTON_WUR_AseI_NsiI_final_run1  4    2       AseI    NsiI    3          3          Scabiosa columbaria
H53KHCCXY  5     GAAGA       CCAG        BUXTON_175  SD       BUXTON   BUXTON_WUR_AseI_NsiI_final_run1  5    2       AseI    NsiI    3          3          Scabiosa columbaria
H53KHCCXY  5     CCTTC       CCAG        WUR_175     SD       WUR      BUXTON_WUR_AseI_NsiI_final_run1  6    2       AseI    NsiI    3          3          Scabiosa columbaria
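Because misspelled enzyme names are easy to overlook, a small pre-flight check of the barcode file may help. The Python sketch below is not part of the pipeline: `check_barcode_rows` is a hypothetical helper, it assumes a tab-separated file with the column names shown above, and the enzyme test is only a heuristic that flags names ending in a lowercase L or the digit 1.

```python
import csv
import io


def check_barcode_rows(handle):
    """Return (line_number, message) tuples for suspicious barcode rows."""
    problems = []
    reader = csv.DictReader(handle, delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        for col in ("ENZ_R1", "ENZ_R2"):
            name = row[col].strip()
            if name.endswith(("l", "1")):
                problems.append(
                    (line_no, f"{col}: {name!r} ends in 'l' or '1', expected capital 'I'")
                )
        for col in ("Wobble_R1", "Wobble_R2"):
            if not row[col].strip().isdigit():
                problems.append((line_no, f"{col}: {row[col]!r} is not a whole number"))
        for col in ("Barcode_R1", "Barcode_R2"):
            barcode = row[col].strip()
            if not barcode or set(barcode) - set("ACGT"):
                problems.append((line_no, f"{col}: {barcode!r} is not an ACGT string"))
    return problems


# Reduced example table (a subset of the columns); the second row has a typo.
columns = ["Flowcell", "Lane", "Barcode_R1", "Barcode_R2", "Sample",
           "ENZ_R1", "ENZ_R2", "Wobble_R1", "Wobble_R2"]
rows = [
    ["H53KHCCXY", "5", "AACT", "CCAG", "BUXTON_178", "AseI", "NsiI", "3", "3"],
    ["H53KHCCXY", "5", "CCTA", "CCAG", "WUR_178", "Asel", "NsiI", "3", "3"],
]
table = "\n".join("\t".join(r) for r in [columns] + rows) + "\n"
problems = check_barcode_rows(io.StringIO(table))
for line_no, message in problems:
    print(f"line {line_no}: {message}")
```

To check a real file, the same function can be called with `open("barcodes.tsv", newline="")` instead of the in-memory table.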

Start the pipeline

Always run the pipeline with --use-conda, so that Snakemake installs and uses the software environments the pipeline steps rely on.

A description of all output files follows. Files that are important for downstream analysis are highlighted in bold. Files or directories in italics are specific to either the de novo or the reference branch.

When not to run the pipeline?

Quality control or "How to discover errors?"

Recommendation: Run fastq-screen in bisulphite mode on the raw data to determine sources of contamination (e.g. reads from other customers sharing the lane, human DNA, phiX, vectors and adapters). See https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html and https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/fastq_screen_documentation.html

MultiQC report:

Fix errors

Clone removal

Problem:

The clone_filter process does not proceed due to insufficient memory on the server.

Fix:

Problem:

I have a very high percentage of clone reads

Fix:

Demultiplexing

Problem:

Many reads are lost due to polyG (GGGGG) stretches at the beginning or end of the reads.
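To gauge how many reads are affected, the fraction of reads that start or end with a G stretch can be counted. The Python sketch below is a diagnostic only, not the fix; the helper names are hypothetical, and the threshold of five or more G is an assumption taken from the "GGGGG" example above.

```python
import re
from itertools import islice

# Five or more G at the start or at the end of a read (assumed threshold).
POLY_G = re.compile(r"^G{5,}|G{5,}$")


def fastq_sequences(lines):
    """Yield the sequence line of each four-line FASTQ record."""
    return (line.strip() for line in islice(lines, 1, None, 4))


def polyg_fraction(sequences):
    """Return the fraction of sequences with a polyG stretch at either end."""
    total = hits = 0
    for seq in sequences:
        total += 1
        if POLY_G.search(seq):
            hits += 1
    return hits / total if total else 0.0


records = [
    "@r1", "GGGGGGACGT", "+", "IIIIIIIIII",  # polyG at the start
    "@r2", "ACGTACGTAC", "+", "IIIIIIIIII",  # no polyG
    "@r3", "ACGTAGGGGG", "+", "IIIIIIIIII",  # polyG at the end
    "@r4", "ACGGGGGGTA", "+", "IIIIIIIIII",  # internal G stretch only
]
fraction = polyg_fraction(fastq_sequences(records))
print(fraction)  # 0.5
```

For a real read file, the lines could come from `gzip.open(path, "rt")` instead of the in-memory list.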

Fix:

Problem:

One or more samples have very few recovered reads, or read numbers differ strongly between samples.

Fix:

Problem:

The coverage in the methylation bed file is too low, and after filtering on coverage (>10) only a few positions remain.

Fix:

Problem:

The mapping percentage in de novo mode is low.

Fix:

Example Data and Config Files

Example Data

An example data set and barcode file are available at Zenodo.

Config file using the de novo branch

# path to output directory
output_dir: "/home/maarten/test_data/epigbs2/output"

# input directory where raw reads are
input_dir       : "/home/maarten/test_data/data"

# name of sequence read files
Read1 : "test_data_R1.fq.gz"
Read2 : "test_data_R2.fq.gz"

# number of sequencing cycles (the same as read length in Illumina sequencing)
cycles        : 150

# barcode file (must be kept inside the input directory); the enzymes are specified in the barcode file
barcodes: "barcodes.tsv"

# the pipeline produces some temporary files. Please indicate the tmp location on your server (in most cases /tmp)
tmpdir        : "/tmp"

# mode of running pipeline (set denovo, reference or legacy. PLEASE NOTE: legacy is not supported)
mode: "denovo"

# genome directory (leave it blank in denovo mode)
ref_dir: ""

# genome name (leave it blank in denovo mode)
genome: ""

# advanced users can change the parameters below; leave them blank or write "default" to use the default values

# parameters in the denovo reference creation:
# identity: percentage of sequence identity in the last clustering step, in decimal number e.g. for 90% identity write 0.90, default 0.97
# min-depth: minimal cluster depth in the first clustering step to include a cluster, default 10
# max-depth: maximal cluster depth in the first clustering step to include a cluster, default 10000
param_denovo:
  identity: "0.97"
  min-depth: "10"
  max-depth: "10000"

Config file using the reference branch

# path to output directory
output_dir: "/home/maarten/test_data/epigbs2/output"

# input directory where raw reads are
input_dir       : "/home/maarten/test_data/data"

# name of sequence read files
Read1 : "test_data_R1.fq.gz"
Read2 : "test_data_R2.fq.gz"

# number of sequencing cycles (the same as read length in Illumina sequencing)
cycles        : 150

# barcode file (must be kept inside the input directory); the enzymes are specified in the barcode file
barcodes: "barcodes.tsv"

# the pipeline produces some temporary files. Please indicate the tmp location on your server (in most cases /tmp)
tmpdir        : "/tmp"

# mode of running pipeline (set denovo, reference or legacy. PLEASE NOTE: legacy is not supported)
mode: "reference"

# genome directory (leave it blank in denovo mode)
ref_dir: "/home/maarten/test_data/ref/"

# genome name (leave it blank in denovo mode)
genome: "ref.fa"

# advanced users can change the parameters below; leave them blank or write "default" to use the default values

# parameters in the denovo reference creation:
# identity: percentage of sequence identity in the last clustering step, in decimal number e.g. for 90% identity write 0.90, default 0.97
# min-depth: minimal cluster depth in the first clustering step to include a cluster, default 10
# max-depth: maximal cluster depth in the first clustering step to include a cluster, default 10000
param_denovo:
  identity: ""
  min-depth: ""
  max-depth: ""
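The rules stated in the config comments can be checked before starting a run. The Python sketch below works on a plain dictionary that mirrors the YAML keys above; loading the YAML file itself would need a YAML parser and is left out. `check_config` is a hypothetical helper, and the rule that reference mode needs both ref_dir and genome is an assumption inferred from the example config, not a documented requirement.

```python
def check_config(cfg):
    """Return a list of messages for settings that contradict the config comments."""
    problems = []
    mode = cfg.get("mode", "")
    # "legacy" is listed in the comments but stated to be unsupported.
    if mode not in ("denovo", "reference"):
        problems.append(f"mode {mode!r} should be 'denovo' or 'reference'")
    if mode == "denovo" and (cfg.get("ref_dir") or cfg.get("genome")):
        problems.append("ref_dir and genome should be blank in denovo mode")
    # Assumption: reference mode needs both values to be filled in.
    if mode == "reference" and not (cfg.get("ref_dir") and cfg.get("genome")):
        problems.append("reference mode expects ref_dir and genome to be set")
    identity = (cfg.get("param_denovo") or {}).get("identity", "")
    if identity not in ("", "default"):
        try:
            value = float(identity)
        except ValueError:
            value = None
        if value is None or not 0 < value <= 1:
            problems.append(
                f"identity {identity!r} should be a decimal fraction such as 0.97"
            )
    return problems


denovo_cfg = {
    "mode": "denovo",
    "ref_dir": "",
    "genome": "",
    "param_denovo": {"identity": "0.97", "min-depth": "10", "max-depth": "10000"},
}
bad_cfg = {
    "mode": "denovo",
    "ref_dir": "/home/maarten/test_data/ref/",
    "genome": "ref.fa",
    "param_denovo": {"identity": "97"},  # percentage instead of decimal fraction
}
print(check_config(denovo_cfg))  # []
for message in check_config(bad_cfg):
    print(message)
```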

List of used software and references

Software
