Genrich is a peak-caller for genomic enrichment assays (e.g. ChIP-seq, ATAC-seq). It analyzes alignment files generated following the assay and produces a file detailing peaks of significant enrichment.
Given:
sample.bam
(alignment file, sorted by queryname)Genrich
(compiled as described below)To produce a file listing regions of genomic enrichment:
$ ./Genrich -t sample.bam -o sample.narrowPeak -v
The software can be downloaded from GitHub.
A Makefile is provided for compilation with GCC, and zlib is also required. The program has been tested after compilation with GCC 5.4.0 and zlib 1.2.8.
To compile, run make
in the folder in which the software was downloaded. The executable Genrich
should be produced.
Usage: ./Genrich -t <file> -o <file> [optional arguments]
Required arguments:
-t <file> Input SAM/BAM file(s) for experimental sample(s)
-o <file> Output peak file (in ENCODE narrowPeak format)
Optional I/O arguments:
-c <file> Input SAM/BAM file(s) for control sample(s)
-f <file> Output bedgraph-ish file for p/q values
-k <file> Output bedgraph-ish file for pileups and p-values
-b <file> Output BED file for reads/fragments/intervals
-R <file> Output file for PCR duplicates (only with -r)
Filtering options:
-r Remove PCR duplicates
-e <arg> Comma-separated list of chromosomes to exclude
-E <file> Input BED file(s) of genomic regions to exclude
-m <int> Minimum MAPQ to keep an alignment (def. 0)
-s <float> Keep sec alns with AS >= bestAS - <float> (def. 0)
-y Keep unpaired alignments (def. false)
-w <int> Keep unpaired alns, lengths changed to <int>
-x Keep unpaired alns, lengths changed to paired avg
Options for ATAC-seq:
-j Use ATAC-seq mode (def. false)
-d <int> Expand cut sites to <int> bp (def. 100)
-D Skip Tn5 adjustments of cut sites (def. false)
Options for peak-calling:
-p <float> Maximum p-value (def. 0.01)
-q <float> Maximum q-value (FDR-adjusted p-value; def. 1)
-a <float> Minimum AUC for a peak (def. 200.0)
-l <int> Minimum length of a peak (def. 0)
-g <int> Maximum distance between signif. sites (def. 100)
Other options:
-X Skip peak-calling
-P Call peaks directly from a log file (-f)
-z Option to gzip-compress output(s)
-v Option to print status updates/counts to stderr
Here is an overview of the method used by Genrich to identify peaks (Fig. 1):
Genrich analyzes paired-end reads aligned to a reference genome. It correctly infers full fragments as spanning between the 5' ends of two properly paired alignments. By default, it does not consider unpaired alignments (including those from single-end reads), although there are three options for keeping such alignments, as described here.
An alternative analysis mode for ATAC-seq is also provided by Genrich, as described here.
Genrich analyzes reads/fragments that map to multiple locations in the genome by adding a fractional count to each location. This allows for peak detection in regions of the genome that are otherwise inaccessible to the assay. The input SAM/BAM file(s) must list secondary alignments for multimapping reads/fragments, such as those reported by the short read aligner Bowtie2 in -k <int>
mode or -a
mode. Additional information about the processing of secondary alignments by Genrich can be found in the description of the -s
parameter.
Genrich provides an option for removing PCR duplicates (-r
). In this process, it analyzes reads/fragments based on their alignments, in three separate groups (proper pairs, discordant pairs, and singletons), and removes those identified as duplicates from further analysis. One novel feature is that this evaluation takes into account reads/fragments with multiple alignments. Additional information on the duplicate identification procedure can be found here.
Genrich computes the genome length as the sum of the lengths of the chromosomes (reference sequences) in the header of the SAM/BAM file. The length is reduced if the user specifies chromosomes (-e
) or genomic regions (-E
) to be excluded, as described below. The calculated length is used for computing a background pileup value and q-values.
The background pileup value is calculated by dividing the total sequence information (sum of read/fragment/interval lengths) in the experimental sample by the calculated genome length. The net control pileup value at a particular genomic position is the maximum of the background pileup value and the pileup of the control sample at that position (if a control sample is specified). Note that control pileups are scaled to match the experimental, based on the total sequence information in each.
The p-value for each base of the genome is calculated assuming a null model with a log-normal distribution. The control/background pileup value is used as the parameter μ, and the σ parameter is 1.2×μ if μ ≤ 7, and 10×log10(μ) if μ > 7. These formulas, as well as the choice of the log-normal distribution, were determined from a comprehensive review of control samples from various ChIP-seq experiments (human, mouse, and worm) downloaded from SRA.
The q-value for each base of the genome is calculated from the p-value using the Benjamini-Hochberg procedure. The calculated genome length is used as the number of hypothesis tests (m).
Genrich calls peaks for multiple replicates collectively. First, it analyzes the replicates separately, with p-values calculated for each. At each genomic position, the multiple replicates' p-values are then combined by Fisher's method. The combined p-values are converted to q-values, and peaks are called. This obviates the need for IDR (you're welcome!).
-t <file> Input SAM/BAM file(s) for experimental sample(s)
samtools sort -n
).samtools merge
.stdin
with -t -
.-P
option).
-o <file> Output peak file (in ENCODE narrowPeak format)
As indicated, the output file is in ENCODE narrowPeak format. Here are details of the fields:
1. chrom | Name of the chromosome |
2. chromStart | Starting position of the peak (0-based) |
3. chromEnd | Ending position of the peak (not inclusive) |
4. name | peak_N , where N is the 0-based count |
5. score | Average AUC (total AUC / bp) × 1000, rounded to the nearest int (max. 1000) |
6. strand | . (no orientation) |
7. signalValue | Total area under the curve (AUC) |
8. pValue | Summit -log10(p-value) |
9. qValue | Summit -log10(q-value), or -1 if not available (e.g. without -q ) |
10. peak | Summit position (0-based offset from chromStart): the midpoint of the peak interval with the highest significance (the longest interval in case of ties) |
Here is the portion of the output file corresponding to the peaks called in Figure 1:
chr1 894446 894988 peak_10 402 . 217.824936 4.344683 1.946031 317
chr1 895834 896167 peak_11 343 . 114.331093 4.344683 1.946031 90
Genrich writes to stdout
with -o -
.
This file need not be specified when peak-calling is skipped (-X
).
-c <file> Input SAM/BAM file(s) for control sample(s)
null
.
-f <file> Output bedgraph-ish file for p/q values
*
) for each interval. Here is the portion of the log file corresponding to the beginning of peak_10
(Fig. 1):
chr1 894435 894436 33.000000 2.477916 3.183460 1.208321
chr1 894436 894442 34.000000 2.477916 3.231466 1.241843
chr1 894442 894446 35.000000 2.477916 3.278469 1.274561
chr1 894446 894447 36.000000 2.477916 3.324516 1.306471 *
chr1 894447 894450 39.000000 2.477916 3.457329 1.398035 *
chr1 894450 894451 40.000000 2.477916 3.499948 1.427253 *
chr1 894451 894460 41.000000 2.477916 3.541798 1.455938 *
-k
file, below) is called "bedgraph-ish" because it contains multiple dataValue
fields, which isn't strictly allowed in the bedGraph format. However, a simple application of awk
can produce the desired bedgraph files for visualization purposes (see this awk reference for a guide to printing specific fields of input records).-X
), the significance column is not produced.-P
option).
-k <file> Output bedgraph-ish file for pileups and p-values
# experimental file: <name>; control file: <name>
), followed by experimental/control pileups and a p-value for each interval. This is the way to examine pileup values with multiple replicates, since the -f
log file does not supply them in that case.
-b <file> Output BED file for reads/fragments/intervals
SRR5427886.59_2_E_0
.
-R <file> Output file for PCR duplicates (only with -r)
SRR5427886.5958 chr4:185201876-185201975 SRR5427886.4688 paired
SRR5427886.1826 chr12:34372610,+;chr1:91852878,- SRR5427886.2040 discordant
SRR5427886.10866 chr14:53438632,+ SRR5427886.4746 single
# experimental file #0: SRR5427886.bam
.getReads.py
, for example.
-r Remove PCR duplicates
-s
threshold are considered. If any of a read's/fragment's alignments is evaluated as a duplicate, then the whole read/fragment is classified as a duplicate, and all of its alignments are discarded. Note that no maximum of 10 alignments per read/fragment is imposed at this stage, and all alignments to skipped chromosomes (-e
) or genomic regions (-E
) are still evaluated. -e <arg> Comma-separated list of chromosomes to exclude
-x
), and the alignments are not printed to the -b
file.-e
chromosomes are considered for comparison purposes.-e
chromosomes are subtracted from the total genome length calculated by the program.
-E <file> Input BED file(s) of genomic regions to exclude
-x
), and the full fragments are listed in the -b
file.-f
, -k
), excluded regions have experimental/control pileup values of 0.0
and p-/q-values of NA
.findNs.py
produces a BED file of 'N' homopolymers in a fasta file (e.g. a reference genome). Its output should be given to Genrich via -E
.
-m <int> Minimum MAPQ to keep an alignment (def. 0)
MAPQ
less than the given value are eliminated. This is equivalent to filtering with samtools view -q <int>
.-s <float>
, below.
-s <float> Keep sec alns with AS >= bestAS - <float> (def. 0)
-s 20
causes Genrich also to keep secondary alignments whose scores are within 20 of the best.AS
. If not, all alignments are considered equivalent.N
alignments for a read/fragment is counted as 1/N
for the pileup.-s
threshold are subsampled based on the best alignment scores; in the case of ties, alignments appearing first in the SAM/BAM are favored.-k <int>
mode or -a
mode. The short read aligner BWA does not report secondary alignments.
By default, Genrich analyzes only properly paired alignments and infers the full fragments as spanning between the 5' ends of the two alignments (Fig. 2). It does not analyze unpaired alignments unless one of three options is selected:
-y Keep unpaired alignments (def. false)
-w <int> Keep unpaired alns, lengths changed to <int>
-x Keep unpaired alns, lengths changed to paired avg
-y
: unpaired alignments are kept, just as they appear in the SAM/BAM-w <int>
: unpaired alignments are kept, with their lengths changed to the given value (from their 5' ends)-x
: unpaired alignments are kept, with their lengths changed to the average length of fragments inferred from properly paired alignments (excluding those aligning to skipped chromosomes [-e
])ATAC-seq is a method for assessing genomic regions of open chromatin. Since only the ends of the DNA fragments indicate where the Tn5 transposase enzyme was able to insert into the chromatin, it may not be optimal to interpret alignments as shown above (Fig. 2). Genrich has an alternative analysis mode for ATAC-seq in which it creates intervals centered on transposase cut sites (Fig. 3).
-j Use ATAC-seq mode (def. false)
-d <int> Expand cut sites to <int> bp (def. 100)
-D Skip Tn5 adjustments of cut sites (def. false)
Unpaired alignments can be analyzed with -y
, though only one interval, centered on the read's 5' end, is inferred. Both -w <int>
and -x
are equivalent to -y
in ATAC-seq mode.
By default, Genrich centers the intervals at the ends of the reads/fragments, adjusted forward by 5bp to account for the Tn5 transposase occupancy. That is, for the 5' ends of fragments (or for reads aligning in a normal orientation), the position is increased by +5, and for the 3' ends of fragments (or for reads aligning in a reverse-complement orientation), the position is adjusted by -5. To avoid this position adjustment (e.g. for DNase-seq), one can use -D
.
For full fragments, when the two cut site intervals overlap, they are merged into a single interval. To get a BED file of cut sites, one can run -d 1 -b <file>
.
The remainder of the peak-calling process (calculating pileups and significance values) is identical to the default analysis mode. Note that the interval lengths (not the fragment lengths) are used to sum the total sequence information for the calculation of control/background pileup values.
-p <float> Maximum p-value (def. 0.01)
-q <float> Maximum q-value (FDR-adjusted p-value; def. 1)
-q
threshold is specified, the -p
threshold is ignored.-q
threshold is not specified, q-values are not calculated (reported as -1
).
-a <float> Minimum AUC for a peak (def. 200.0)
-p
over the length of the region (i.e. the Area Under the -log(p) Curve).-q
threshold is specified, the Area Under the -log(q) Curve is calculated (see Fig. 1).
-l <int> Minimum length of a peak (def. 0)
-g <int> Maximum distance between signif. sites (def. 100)
-X Skip peak-calling
-f
) can subsequently be used to call peaks directly (see -P
, below).-o
) is suspended, and no such file is produced.
-P Call peaks directly from a log file (-f)
-f
) produced by a previous Genrich run. One can modify the peak-calling parameters and explore the results without needing to run the full analysis (and requiring much less time and memory).-e
) and genomic regions (-E
) that were previously specified in the Genrich run that produced the log file are honored. Additional chromosomes and regions can be specified via the same command-line options, although the results may be slightly different than running the full analysis, due to differences in the calculations of genome length and sequence information. Note that there is no verification that an excluded region (-E
) does not extend past the end of a chromosome, unlike when the full analysis is run.-t
) is suspended, but the input log file (-f
) must be specified.-k
/-b
/-R
) cannot be produced with -P
.
-z Option to gzip-compress output(s)
-S Option to skip sort order check
-L <int> Set genome length to given value
Other options:
-v/--verbose Option to print status updates/counts to stderr
-h/--help Print the usage message and exit
-V/--version Print the version and exit
A sequencing run was downloaded from SRA. Its reads were adapter-trimmed by NGmerge and aligned to the human genome (hg19) by Bowtie2 with -k 20
. The SAM alignments from bowtie2 were piped through SAMtools to sort them by queryname and convert the file to the BAM format. The resulting alignment file SRR5427886.bam
was analyzed by Genrich with the options to remove PCR duplicates (-r
) and to keep unpaired alignments, extended to the average fragment length (-x
). All alignments to chrM and chrY were discarded (-e chrM,chrY
), as well as alignments to two sets of excluded intervals: regions of 'N' homopolymers in the hg19 genome (produced by findNs.py
) and high mappability regions (-E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz
). There was no control sample. Peaks were called using a maximum q-value of 0.05 (-q 0.05
) and a minimum AUC of 20.0 (-a 20.0
).
$ ./Genrich -t SRR5427886.bam -o SRR5427886.narrowPeak \
-f SRR5427886.log -r -x -q 0.05 -a 20.0 -v \
-e chrM,chrY -E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz
Processing experimental file #0: SRR5427886.bam
BAM records analyzed: 146277509
Unmapped: 2877550
To skipped refs: 2560688
(chrM,chrY)
Paired alignments: 137462998
secondary alns: 67724920
Unpaired alignments: 3376273
secondary alns: 2232056
PCR duplicates --
Paired aln sets: 35574220
duplicates: 4660723 (13.1%)
Discordant aln sets: 64679
duplicates: 7736 (12.0%)
Singleton aln sets: 1036778
duplicates: 475201 (45.8%)
Fragments analyzed: 31286234
Full fragments: 30616993
(avg. length: 226.4bp)
Half fragments: 669241
(from unpaired alns, extended to 226bp)
- control file #0 not provided -
Background pileup value: 2.477916
Peak-calling parameters:
Genome length: 2826865605bp
Significance threshold: -log(q) > 1.301
Min. AUC: 20.000
Max. gap between sites: 100bp
Peaks identified: 35114 (27918264bp)
The total time to execute the above command (analyzing a BAM of 146.3 million alignments and calling peaks) was 10.5min. It required 17.1GB of memory.
Once a log file (SRR5427886.log
) is produced, calling peaks with an alternative set of parameters is accomplished most easily with the -P
option. For example, with -p 0.01 -a 200
:
$ ./Genrich -P -f SRR5427886.log -o SRR5427886_p01_a200.narrowPeak -p 0.01 -a 200 -v
Peak-calling from log file: SRR5427886.log
Peak-calling parameters:
Genome length: 2826865605bp
Significance threshold: -log(p) > 2.000
Min. AUC: 200.000
Max. gap between sites: 100bp
Peaks identified: 47329 (62194983bp)
This required just 36sec and less than 0.1MB of memory.
However, if one wanted the alignments to be interpreted differently, such as ATAC-seq mode (-j
), the original command (with -t SRR5427886.bam
) would need to be rerun.
In verbose mode, Genrich may print one or more warnings to stderr
:
Read N prevented from extending below 0 on <chrom>
: This may occur due to extending unpaired alignments (-w <int>
, -x
) or in ATAC-seq mode (-j
).Read N prevented from extending past <int> on <chrom>
: This also may occur due to extending unpaired alignments (-w <int>
, -x
) or in ATAC-seq mode (-j
). A maximum of 128 warning messages of these types (Read N prevented...
) are printed per SAM/BAM.Large scaling may mask true signal
: This is printed if the scaling factor for the control pileup is greater than 5.BED interval ignored - located off end of reference
: An excluded BED interval (-E
) whose start coordinate is past the end of the reference sequence is ignored. One should ensure that the genome version that produced the BED intervals matches that of the SAM/BAM.BED interval extends past end of ref. - edited to <loc>
: An excluded BED interval (-E
) whose end coordinate is past the end of the reference sequence is adjusted as indicated. Again, one should ensure that the genome version that produced the BED intervals matches that of the SAM/BAM.No paired alignments to calculate avg frag length -- Printing unpaired alignments "as is"
: When there are no properly paired alignments and the -x
average extension option is selected, the unpaired alignments are printed as they appear in the SAM/BAM.Read N, alignment at <loc> skipped due to overflow
: The maximum difference in pileup values from one genomic position to the next is +32767, and additional read alignments are skipped due to this limitation. Removing PCR duplicates (-r
) may help reduce this issue.Read N, alignment at <loc> skipped due to underflow
: The minimum difference in pileup values from one genomic position to the next is -32768, and additional read alignments are skipped due to this limitation. Removing PCR duplicates (-r
) may help reduce this issue.Read N has more than 128 alignments
: If a read has more than 128 alignments in the SAM/BAM, only the first 128 are considered. As described above, the best 10 alignments (at most) are ultimately analyzed by Genrich.All q-values are 1
: When no p-values are smaller than the reciprocal of the calculated genome length, all q-values are calculated to be 1. If peaks are expected to be called, one should consider using a p-value threshold instead (-p
).Skipping chromosome N -- Reads aligning to it were used...
: This occurs when peak-calling directly from a log file (-P
), and the log file indicates that reads/fragments aligned to a chromosome to be skipped (-e
). When Genrich created the log file, the chromosome was not skipped. Therefore, the reads/fragments that aligned to it were used in the background pileup calculation, and its length was included in the genome length calculation. Those calculations affected the statistics present in the log file, so the called peaks may be slightly different than those called when running the full analysis.Skipping given BED regions -- Reads aligning to them were used...
: This occurs when peak-calling directly from a log file (-P
), and the log file indicates that reads/fragments aligned to one or more genomic regions to be skipped (-E
). When Genrich created the log file, the regions were not skipped. Therefore, the reads/fragments that aligned to them were used in the background pileup calculation, and their lengths were included in the genome length calculation. Those calculations affected the statistics present in the log file, so the called peaks may be slightly different than those called when running the full analysis."orphan" alns
: An "orphan" alignment is one that the SAM/BAM indicates is properly paired, but its pair could not be found. This could be due to a poorly formatted SAM/BAM, or possibly (but unlikely) a bug in alignment parsing by Genrich. This warning appears in the accounting of alignments analyzed.Genrich runs very quickly but uses a considerable amount of memory. For starters, it requires 3 bytes for every base-pair of the reference genome, i.e. ~9GB for a human sample. The number of input files has little effect on memory, but certain analysis options (especially the option to remove PCR duplicates) can greatly increase the memory usage, particularly with large SAM/BAM input files. See above for an example.
Genrich is not multi-threaded.
Genrich
Copyright © 2018 John M. Gaspar (jsh58@wildcats.unh.edu)
Any question, concern, or bug report about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see if your issue has been addressed already. Also, please follow these good GitHub practices.