Full documentation here, salient usage details summarised below.
Cross-linking and immunoprecipitation followed by sequencing (CLIP) has allowed high resolution studies of RNA binding protein (RBP)-RNA interactions at transcriptomic scale.
This pipeline enables analysis of various forms of single-end CLIP data, including variants of iCLIP (e.g. irCLIP, iCLIP2, iiCLIP) and eCLIP (we have also achieved comparable results using this pipeline on reformatted paired-end eCLIP data). The pipeline does not currently support mutation calling and therefore might not be suitable for PAR-CLIP analysis, but we plan to add this at a future date.
This DSL2 CLIP-Seq pipeline is written and maintained by Goodwright in collaboration with Ule lab and the developers of the DSL1 nf-core/clipseq pipeline.
To test the pipeline, use the associated config file and run it with the profile test and the container engine you wish to use, e.g. docker. For example:
nextflow run main.nf -profile test,docker
Full dataset testing of 9 iCLIP samples can also be run using the profile test_full.
A test that skips all preparation of annotations and indexes can also be run using the profile test_no_prep_genome.
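The test invocations above differ only in the profile name. As a minimal sketch, assuming main.nf is in the current directory, the command lines could be assembled as follows. The helper name `build_test_command` is hypothetical and not part of the pipeline.

```python
import shlex
from typing import List

# Test profiles named in this README.
TEST_PROFILES = ("test", "test_full", "test_no_prep_genome")


def build_test_command(profile: str, engine: str = "docker") -> List[str]:
    """Build the argument list for a test run (hypothetical helper)."""
    if profile not in TEST_PROFILES:
        raise ValueError(f"unknown test profile: {profile!r}")
    # Nextflow takes a comma-separated list of profiles after -profile.
    return ["nextflow", "run", "main.nf", "-profile", f"{profile},{engine}"]


if __name__ == "__main__":
    # Print one command line per test profile.
    for profile in TEST_PROFILES:
        print(shlex.join(build_test_command(profile)))
```

Returning an argument list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting concerns.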
If you require all reference files (e.g. genomic indexes, filtered and segmented gtf...) to be generated, the minimal input is:
samplesheet: CSV file containing 4 columns: group,replicate,fastq_1,fastq_2. group is the sample name; replicate is currently unused by the pipeline, so filling it with '1' is acceptable; fastq_1 is your demultiplexed sample FASTQ file. Paired-end data is currently not supported, so please leave fastq_2 empty. E.g. './tests/data/samplesheets/small-single-sample-se.csv':

group | replicate | fastq_1 | fastq_2 |
---|---|---|---|
TDP43_1 | 1 | s3://nf-core-awsmegatests/clipseq/input_data/fastq/ERR1530360.fastq.gz | |
fasta: genome FASTA file, e.g. './tests/data/genome/homosapien-hg37-chr21.fa.gz'
smrna_fasta: FASTA file to be mapped to before the genome file, typically containing rRNA and tRNA sequences, e.g. './tests/data/genome/homosapiens_smallRNA.fa.gz'
gtf: annotation file for the genome FASTA, e.g. './tests/data/genome/gencode.v35.chr21.gtf.gz'

If you are providing all reference files then the following additional files must be provided (note that these are all produced automatically by the prepare_clipseq subworkflow if they are not provided to the pipeline):
fasta_fai
chrom_sizes
genome_index
smrna_index
smrna_fasta_fai
smrna_chrom_sizes
longest_transcript
filtered_gtf
seg_gtf
seg_filt_gtf
seg_resolved_gtf
seg_resolved_gtf_genic
regions_gtf
regions_filt_gtf
regions_resolved_gtf
regions_resolved_gtf_genic
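
As a rough illustration of the inputs described above, the sketch below checks samplesheet text against the four-column layout and reports which of the listed reference parameters are missing from a parameter mapping. The function names, the `params` dictionary and the placeholder file name are hypothetical and not part of the pipeline, and the pipeline's own validation may differ. Treating an empty value as missing is also an assumption.

```python
import csv
import io
from typing import Dict, List

SAMPLESHEET_COLUMNS = ["group", "replicate", "fastq_1", "fastq_2"]

# Reference parameters listed above; prepare_clipseq generates any not supplied.
REFERENCE_PARAMS = [
    "fasta_fai", "chrom_sizes", "genome_index", "smrna_index",
    "smrna_fasta_fai", "smrna_chrom_sizes", "longest_transcript",
    "filtered_gtf", "seg_gtf", "seg_filt_gtf", "seg_resolved_gtf",
    "seg_resolved_gtf_genic", "regions_gtf", "regions_filt_gtf",
    "regions_resolved_gtf", "regions_resolved_gtf_genic",
]


def check_samplesheet(text: str) -> List[str]:
    """Return a list of problems found in samplesheet CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != SAMPLESHEET_COLUMNS:
        return [f"expected header {','.join(SAMPLESHEET_COLUMNS)}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # line of the row just read (header is 1)
        if not row["group"]:
            problems.append(f"line {line_no}: group is empty")
        if not row["fastq_1"]:
            problems.append(f"line {line_no}: fastq_1 is empty")
        if row["fastq_2"]:
            # Paired-end input is not supported, so fastq_2 must stay blank.
            problems.append(f"line {line_no}: fastq_2 must be empty")
    return problems


def missing_reference_params(params: Dict[str, str]) -> List[str]:
    """List the reference parameters that are absent or empty in params."""
    return [name for name in REFERENCE_PARAMS if not params.get(name)]


if __name__ == "__main__":
    sheet = (
        "group,replicate,fastq_1,fastq_2\n"
        "TDP43_1,1,s3://nf-core-awsmegatests/clipseq/input_data/fastq/ERR1530360.fastq.gz,\n"
    )
    print(check_samplesheet(sheet))
    # 'genome.fa.fai' is a placeholder path used only for illustration.
    print(missing_reference_params({"fasta_fai": "genome.fa.fai"}))
```

An empty list from `missing_reference_params` would correspond to the case where every reference file is supplied, so nothing would need to be generated by the prepare_clipseq subworkflow.
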
When the full pipeline is run, output is organised into 7 folders:
00_genome contains all reference files produced when the prepare_clipseq subworkflow is run.
01_prealign contains pre-trimming FastQC reports, trimmed read files and post-trimming FastQC reports.
02_alignment contains two folders, one for the pre-mapping ("smrna") and one for the genomic mapping ("target"). Each contains alignment files, and the "target" folder also contains a samtools assessment of the BAM file.
03_filt_dedup contains de-duplicated genome-mapped BAMs along with statistics, and also two transcriptome-mapped de-duplicated BAMs. Those marked "filt" have been filtered to contain only alignments to the longest transcript for each gene.
04_crosslinks contains genomic crosslink BED, bedGraph and normalised bedGraph files (crosslinks divided by the total number of crosslinks in the sample and multiplied by one million, giving a crosslinks per million (CPM) value), as well as the same files for the transcriptome mapping filtered by longest transcript.
05_peak_calling contains Clippy, iCount and Paraclu peaks. The iCount folder also contains gene-, subtype- and type-level summaries of crosslink information, and metagene plots around transcript landmarks of interest in the rnamaps folder. PEKA output is also included.
06_reports contains various CLIP-specific QC metrics in tabular format in the clipqc folder. These are plotted, alongside other QC metrics, in the HTML report provided in the multiqc folder.
To raise any issues or comments with the pipeline you can (in order of preference):