This pipeline is built to analyse bulk long-reads DNA-seq data from PromethION (Oxford Nanopore Technologies). It can perform basecalling and methylation calling, Differential Methylated Regions analysis, germline and somatic variants identification (SNV, SV, long CNV) and phasing.
POD5 directory or BAM files (aligned BAM) are accepted as input.
This pipeline is designed to perform :
Go to pipeline detailed description to see details about each pipeline step and watch the complete rules execution order on the germline and somatic rulegraphs.
The pipeline and environments are already installed on the Flamingo cluster of Gustave Roussy.
It is localised at /mnt/beegfs/pipelines/bigr_long-reads_bulk/<VERSION>
. You can check Usage section.
cd /path_to_pipeline_installation_dir/
VERSION="2.0.1"
git clone https://github.com/gustaveroussy/bigr_long-reads_bulk.git ${VERSION}
Download all singularity images from Zenodo in the directory /path_to_pipeline_installation_dir/${VERSION}/envs/singularity/
of the cloned repository.
source /mnt/beegfs/software/miniconda/24.3.0/etc/profile.d/conda.sh
conda env create -f /path_to_pipeline_installation_dir/${VERSION}/envs/conda/snakemake.yaml --prefix=/path_to_pipeline_installation_dir/${VERSION}/envs/compiled_conda/snakemake -y
You are now ready to use the pipeline!
You need to create 2 mandatory input files: a configuration file and a design file. You can also add an optional file containing a list of genes of interest.
3 models of configuration files are available in config/
directory, depending on the genome of reference used during the analysis.
You can copy/paste and modify any model config file to fit your needs:
[!WARNING] We recommand using an Ensembl reference genome. Otherwise, please submit an issue if using another database reference fasta file triggers pipeline errors.
There are various fields of interest in the config file:
input_format: accepts pod5
(POD5 directory) or bam
(aligned BAM file).
basecalling_mode: accepts basic
or methylation
. Choose if you wish to perform basecalling only or methylation calling (5mCG and 5hmCG in Super Accuracy = SUP).
variant_calling_mode: accepts germline
or somatic
. Choose the variant calling mode if you want to perform variant calling.
steps: The configuration file allows you to choose which step(s) will be executed by the pipeline by setting any step name to true
or false
.
steps:
basecalling: true
alignment: true
differential_methylation_sample: true
differential_methylation_condition: true
snv_calling: true
sv_calling: true
phasing: true
cnv_calling: true
differential_methylation_sample:
DMR analysis between every pair of samples extracted from the design file.
differential_methylation_condition:
DMR analysis between two conditions/groups extracted from the design file.
spectre: Using a blacklist file is recommanded by Spectre documentation for long CNV calling, GRCh38 blacklist file provided by Spectre can be given to the config file as following:
spectre:
blacklist: "/mnt/beegfs/database/bioinfo/bigr_long-reads_bulk/REFERENCES/Spectre/blacklist/grch38_renamed_blacklist_spectre.bed"
[!WARNING]
- As we recommand using Ensembl references, the chromosomes in the given GRCh38 blacklist file are named "1", ..., "22", "X", "Y" and not "chr1", ..., "chr22", "chrX", "chrY". You can copy/paste this file and change chromosome names if needed by the reference file you have chosen.
- You can download your own blacklist on UCSC table browser as explained below.
You can change various fields in the config.yaml
file such as tools parameters (example: you can select the specific filters you want to apply after SNV calling).
It must be a comma separated file (.csv where comma is ",") and its path should be given in the config.yaml
file. Field names are the following and depend on the analysis you wish to perform:
case
and control
. It is required only if you set differential_methylation_condition
step as true
. variant_calling_mode
as somatic
.cnv_calling
step as true
.Example with POD5 directory input and all columns given:
sample_id,path_file,methyl_group,somatic_ctrl,cnv_cancer
sample1_Tumor,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample1_Tumor/,case,sample1_Normal,TRUE
sample1_Normal,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample1_Normal/,control,,FALSE
sample2_Tumor,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample2_Tumor/,case,sample2_Normal,TRUE
sample2_Normal,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample2_Normal/,control,,FALSE
Here, every step available in the config file can be executed and both variant analysis modes can be used.
[!NOTE]
- sample_id should not contain special characters, spaces or 'vs'.
- each aligned BAM file used as input must be sorted and have its bai index in the same directory than them.
- if you need to concatenate POD5 files for one sample, you only need to put them all in the same input directory that you give in the design file.
You can create a .txt file containing a list of genes of interest for your analysis, its path will be given in the config.yaml
file. It should have one gene name per line. The gene name should come from HUGO Gene Nomenclature Committee (HGNC) gene symbol. It can be used by some tools to plot supplementary graphs specifically for these genes (maftools lollipop plots).
TP53
ATM
BARD1
BRCA1
BRCA2
CHEK2
NF1
You need snakemake and singularity. For GR users, as they are already installed on Flamingo (snakemake via conda and singularity via module load
) just follow the example below.
Don't forget to change the version of the pipeline and the path to your configuration file.
Script example:
#!/bin/bash
#using: sbatch run.sh
#SBATCH --job-name=LR_analysis
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=250M
#SBATCH --partition=longq
source /mnt/beegfs/software/miniconda/24.3.0/etc/profile.d/conda.sh
conda activate /mnt/beegfs/pipelines/bigr_long-reads_bulk/<version>/envs/compiled_conda/snakemake
module load singularity
LR_pipeline="/mnt/beegfs/pipelines/bigr_long-reads_bulk/<version>/"
snakemake --profile ${LR_pipeline}/profiles/slurm \
-s ${LR_pipeline}/Snakefile \
--configfile /path_to_my_configuration_file/config.yaml
Step | Tool | Description |
---|---|---|
Basecalling | Dorado | Perform basecalling (SUP) |
Methylation calling | Dorado, Modkit | Perform methylation calling for 5mCG and 5hmCG modifications (SUP) and extract methylation information |
Step | Tool | Description |
---|---|---|
Filter | python script | Filter to keep only reads with quality score >Q10 and length >200 |
Alignment | Dorado | Align sample reads to the reference genome, remove secondary alignments |
Concatenate BAM | Samtools | As POD5 files were treated by batch, BAM concatenation is more convenient for next steps |
Sort, index | Samtools | Sort and index the just generated BAM files |
Step | Tool | Description |
---|---|---|
BAM QC | NanoPlot, Qualimap, Samtools | Generate general QC metrics and graphs regarding a BAM file |
BAM QC | Mosdepth, Samtools | Calculate the depth and coverage of a BAM file |
BAM to FASTQ | Samtools | Convert BAM file to FASTQ file for supplementary QC |
FASTQ QC | FastQC | Perform QC on FASTQ file |
FASTQ QC | FastQ-Screen | Perform library quality check on FASTQ file |
METHYLATION QC | R scripts | Perform Methylation QC |
QC report | MultiQC | Aggregates BAM and FASTQ QC results to generate a final QC report for all samples |
Step | Tool | Description |
---|---|---|
DMR by sample | GeneDMRs (R) | Differentially methylated regions analyses (Alu, Transcripts, CpG) for and between all samples |
DMR by condition | GeneDMRs (R) | Differentially methylated regions analyses (Alu, Transcripts, CpG) between two conditions |
Step | Tool | Description |
---|---|---|
SNV calling | Clair3, PEPPER-Margin-DeepVariant | Call Single-Nucleotide Variants |
SNV annotation | SnpEff, SnpSift | Annotate Single-Nucleotide Variants using ClinVar and/or dbNSFP |
SNV filtering | SnpSift | Filter depending on the parameters entered in the config file |
SNV graphs | vcf2maf, maftools (R) | Convert VCF to MAF file and generate general graphs by sample and graphs by gene if given in the config file |
SV calling | Sniffles2, cuteSV | Call Structural Variants |
SV annotation | AnnotSV | Annotate SV |
SV graphs | Sniffles2_plot | Generate general graphs by sample |
Long CNV calling | Spectre | Call long CNV >100kb |
Step | Tool | Description |
---|---|---|
SNV calling | ClairS | Call somatic Single-Nucleotide Variants |
SNV annotation | SnpEff, SnpSift | Annotate Single-Nucleotide Variants using ClinVar and/or dbNSFP |
SNV filtering | SnpSift | Filter depending on the parameters entered in the config file |
SNV graphs | vcf2maf, maftools (R) | Convert VCF to MAF file and generate general graphs by sample and graphs by gene if given in the config file |
SV calling | nanomonsv | Call somatic Structural Variants |
SV annotation | AnnotSV | Annotate SV |
Step | Tool | Description |
---|---|---|
Phasing | WhatsHap | Germline SNV phasing |
Haplotagging | WhatsHap | Assign Haplotype 1 or Haplotype 2 to the reads covering at least two heterozygous variants, generating a new BAM file, useful for IGV. |
This pipeline is still under active development to provide the most complete and accurate analysis experience.
🗨️ Feel free to submit issues either to report any bug or to suggest pipeline enhancements.