CRITICAL NOTE We provide all code as Unix shell instructions that generally run on Unix/Linux distributions, allow line-by-line processing of large files and are typically very robust
Sub-setting Large fastq.gz files with command line
For sub-setting large files you can use the command line
Step 0 : Subsetting Large Zipped Fastq files
for sample in chip_dmel input_dmel; dozcat ${sample}.fastq.gz | sed -n 1,4000000p | gzip > ${sample}_sub.fastq.gzdone
The -n 1,4000000 will keep the fastq format intact; hence a single read in fastq file has 4 attributes
Quality control with Fastx
Assess the read quality by the average quality score. For raw sequencing files in fastq format, Fastx is used to do quality control.
we can do this in terminal:
Step 1: Sequence quality checkfor sample in chip_dmel input_dmel; do
Count the number of total and unique reads. In addition, it is worthwhile to check the number and
identity of the most abundant sequences, which might identify a high amount of linkers or other contaminants.for sample in chip_dmel input_dmel; do echo -en $sample"\t"
Number of unique reads and most repeated read
gunzip -c ${sample}_sub.fastq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' done
Step 3: Read length check
for sample in chip_dmel input_dmel; do echo -en $sample"\t"
Read mapping. Map reads uniquely to the reference genome allowing for mismatches. We recommend using the SAM output format and then convert the files to sorted BAM files (compressed binary version) and associated index (BAI) files using SAMTools. Convert BAM files to BED files using BEDTools.
CRITICAL STEP Reads from all compared samples should be mapped with the same settings in order to avoid bias in downstream analyses such as peak calling.
for sample in chip_dmel input_dmel; do gunzip -c ${sample}_36bp.fastq.gz | bowtie2 -x bowtie_index/dm3 - > ${sample}.sam
Count the number of mapped reads, unique read coordinates and the maximum of reads mapped to the same genomic position.for sample in chip_dmel input_dmel; do echo -en $sample"\t"
Number of raw reads
`raw=$(samtools view ${sample}.bam | wc -l)`
# Number of raw, unique and most repeated reads
samtools view -f 0x0004 ${sample}.bam | awk '{read=$10;total++;count[read]++}END{print "Total_non-mapped_reads",total;for(read in count){print read,count[read]+0}}' | sort -k2,2nr | head -11 done
Step 6: Peak calling.
For each immunoprecipitation sample and its corresponding input control sample, call peaks using MACS, with a stringent FDR threshold (e.g., FDR ≤ 1%) to identify confident peaks and with the default P value (10 − 5 ) to identify regions with nonrandom enrichments.for pair in chip_dmel-input_dmel; doecho -en $pair"\t"chip=$(echo $pair | sed 's/-.*//')input=$(echo $pair | sed 's/.*-//')macs2 callpeak -t ${chip}.bam -c ${input}.bam -n short -g dm -p 1e-2 --nomodel --extsize 36 --keep-dup alldone
Step 7: sort the narrowpeak file and convert to bed
This step will facilitate the count table step,hence the counting would use the sorted bam file it is recommended to sort the peaks to speed up the counting.for sample in chip_dmel input_dmel; docat short_peaks.narrowPeak | sort -k1,1 -k2,2n > narrowPeak.beddone
Step 8 : Count Table
will count how many reads are in those peaks using bedtools multicovCount by bedtoolsfor sample in chip_dmel input_dmel;dobedtools multicov -bams ${sample}.bam -bed narrowPeak.bed > counts_table.beddone
CRITICAL NOTE When dealing with merged bed file make sure to add the peak id to the fourth column . Here we do not need change the bed file as the peak id is the fourth column.
To do so for merged file
cat merge.bed | awk '{$3=$3"\t""peak_"NR}1' OFS="\t" > bed_for_multicov.bed
But we will skip this step
CRITICAL NOTE We provide all code as Unix shell instructions that generally run on Unix/Linux distributions, allow line-by-line processing of large files and are typically very robust
*You can download bash file from https://github.com/kacst-bioinfo-lab/labwork/blob/master/code/abdulmalek/ChipBash.sh We always suggest keeping large files in a compressed format (e.g., using gzip).*
Pre-Processing:
Sub-setting Large fastq.gz files with command line For sub-setting large files you can use the command line Step 0 : Subsetting Large Zipped Fastq files
for sample in chip_dmel input_dmel; do
zcat ${sample}.fastq.gz | sed -n 1,4000000p | gzip > ${sample}_sub.fastq.gz
done
The -n 1,4000000 will keep the fastq format intact; hence a single read in fastq file has 4 attributesQuality control with Fastx
Assess the read quality by the average quality score. For raw sequencing files in fastq format, Fastx is used to do quality control. we can do this in terminal:
Step 1: Sequence quality check
for sample in chip_dmel input_dmel; do
FASTX Statistics
FASTX quality score
FASTX nucleotide distribution
fastx_nucleotide_distribution_graph.sh -i quality/${sample}_sub_stats.txt -o quality/${sample}_nuc.png -t ${sample}
Remove intermediate file
rm quality/${sample}_sub_stats.txt
done
Step 2: Raw reads count
Count the number of total and unique reads. In addition, it is worthwhile to check the number and identity of the most abundant sequences, which might identify a high amount of linkers or other contaminants.
for sample in chip_dmel input_dmel; do
echo -en $sample"\t"
Number of unique reads and most repeated read
gunzip -c ${sample}_sub.fastq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
done
Step 3: Read length check
for sample in chip_dmel input_dmel; do
echo -en $sample"\t"
Read length
gunzip -c ${sample}_sub.fastq.gz | awk '((NR-2)%4==0){count[length($1)]++}END{for(len in count){print len}}'
# Truncate longer reads to 36bp (if necessary)LEN=36
gunzip -c ${sample}_sub.fastq.gz | awk -v LEN=$LEN '{if((NR-2)%2==0){print substr($1,1,LEN)}else{print $0}}' | gzip > ${sample}_36bp.fastq.gz
done
Read mapping
Read mapping. Map reads uniquely to the reference genome allowing for mismatches. We recommend using the SAM output format and then convert the files to sorted BAM files (compressed binary version) and associated index (BAI) files using SAMTools. Convert BAM files to BED files using BEDTools. CRITICAL STEP Reads from all compared samples should be mapped with the same settings in order to avoid bias in downstream analyses such as peak calling.
for sample in chip_dmel input_dmel; do
gunzip -c ${sample}_36bp.fastq.gz | bowtie2 -x bowtie_index/dm3 - > ${sample}.sam
done
for sample in chip_dmel input_dmel; do
Convert file from SAM to BAM format
done
Step 5: Mapped reads counts
Count the number of mapped reads, unique read coordinates and the maximum of reads mapped to the same genomic position.
for sample in chip_dmel input_dmel; do
echo -en $sample"\t"
Number of raw reads
bamToBed -i ${sample}.bam | awk -v RAW=$raw '{coordinates=$1":"$2"-"$3;total++;count[coordinates]++}END{for(coordinates in count){if(!max||count[coordinates]>max){max=count[coordinates];maxCoor= coordinates};if(count[coordinates]==1){unique++}};print RAW,total,total*100/RAW,unique,unique*100/total,maxCoor,count[maxCoor],count[maxCoor]*100/total}'
Total and top 10 of non-mapped reads
samtools view -f 0x0004 ${sample}.bam | awk '{read=$10;total++;count[read]++}END{print "Total_non-mapped_reads",total;for(read in count){print read,count[read]+0}}' | sort -k2,2nr | head -11
done
Step 6: Peak calling.
For each immunoprecipitation sample and its corresponding input control sample, call peaks using MACS, with a stringent FDR threshold (e.g., FDR ≤ 1%) to identify confident peaks and with the default P value (10 − 5 ) to identify regions with nonrandom enrichments.
for pair in chip_dmel-input_dmel; do
echo -en $pair"\t"
chip=$(echo $pair | sed 's/-.*//')
input=$(echo $pair | sed 's/.*-//')
macs2 callpeak -t ${chip}.bam -c ${input}.bam -n short -g dm -p 1e-2 --nomodel --extsize 36 --keep-dup all
done
Step 7: sort the narrowpeak file and convert to bed
This step will facilitate the count table step,hence the counting would use the sorted bam file it is recommended to sort the peaks to speed up the counting.
for sample in chip_dmel input_dmel; do
cat short_peaks.narrowPeak | sort -k1,1 -k2,2n > narrowPeak.bed
done
Step 8 : Count Table
will count how many reads are in those peaks using bedtools multicov Count by bedtools
for sample in chip_dmel input_dmel;do
bedtools multicov -bams ${sample}.bam -bed narrowPeak.bed > counts_table.bed
done
CRITICAL NOTE When dealing with merged bed file make sure to add the peak id to the fourth column . Here we do not need change the bed file as the peak id is the fourth column. To do so for merged filecat merge.bed | awk '{$3=$3"\t""peak_"NR}1' OFS="\t" > bed_for_multicov.bed
But we will skip this stepEnter R
R
You can download dmel_RChipseeker.R from https://github.com/kacst-bioinfo-lab/labwork/blob/master/code/abdulmalek/dmel_RChipseeker.R Then type**
source ("dmel_RChipseeker.R")