jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
185 stars 27 forks source link

Genrich: detecting sites of genomic enrichment

Table of Contents

Introduction

Genrich is a peak-caller for genomic enrichment assays (e.g. ChIP-seq, ATAC-seq). It analyzes alignment files generated following the assay and produces a file detailing peaks of significant enrichment.

Quick start

Given:

To produce a file listing regions of genomic enrichment:

$ ./Genrich  -t sample.bam  -o sample.narrowPeak  -v

Software compilation

The software can be downloaded from GitHub.

A Makefile is provided for compilation with GCC, and zlib is also required. The program has been tested after compilation with GCC 5.4.0 and zlib 1.2.8.

To compile, run make in the folder in which the software was downloaded. The executable Genrich should be produced.

Usage message

Usage: ./Genrich  -t <file>  -o <file>  [optional arguments]
Required arguments:
  -t  <file>       Input SAM/BAM file(s) for experimental sample(s)
  -o  <file>       Output peak file (in ENCODE narrowPeak format)
Optional I/O arguments:
  -c  <file>       Input SAM/BAM file(s) for control sample(s)
  -f  <file>       Output bedgraph-ish file for p/q values
  -k  <file>       Output bedgraph-ish file for pileups and p-values
  -b  <file>       Output BED file for reads/fragments/intervals
  -R  <file>       Output file for PCR duplicates (only with -r)
Filtering options:
  -r               Remove PCR duplicates
  -e  <arg>        Comma-separated list of chromosomes to exclude
  -E  <file>       Input BED file(s) of genomic regions to exclude
  -m  <int>        Minimum MAPQ to keep an alignment (def. 0)
  -s  <float>      Keep sec alns with AS >= bestAS - <float> (def. 0)
  -y               Keep unpaired alignments (def. false)
  -w  <int>        Keep unpaired alns, lengths changed to <int>
  -x               Keep unpaired alns, lengths changed to paired avg
Options for ATAC-seq:
  -j               Use ATAC-seq mode (def. false)
  -d  <int>        Expand cut sites to <int> bp (def. 100)
  -D               Skip Tn5 adjustments of cut sites (def. false)
Options for peak-calling:
  -p  <float>      Maximum p-value (def. 0.01)
  -q  <float>      Maximum q-value (FDR-adjusted p-value; def. 1)
  -a  <float>      Minimum AUC for a peak (def. 200.0)
  -l  <int>        Minimum length of a peak (def. 0)
  -g  <int>        Maximum distance between signif. sites (def. 100)
Other options:
  -X               Skip peak-calling
  -P               Call peaks directly from a log file (-f)
  -z               Option to gzip-compress output(s)
  -v               Option to print status updates/counts to stderr

Attributes

Peak-calling method

Here is an overview of the method used by Genrich to identify peaks (Fig. 1):

Peak-calling by Genrich
Figure 1. Peak-calling by Genrich. Information about the sample and the Genrich command can be found here. Visualization by IGV.



Alignment parsing

Genrich analyzes paired-end reads aligned to a reference genome. It correctly infers full fragments as spanning between the 5' ends of two properly paired alignments. By default, it does not consider unpaired alignments (including those from single-end reads), although there are three options for keeping such alignments, as described here.

An alternative analysis mode for ATAC-seq is also provided by Genrich, as described here.

Multimapping reads

Genrich analyzes reads/fragments that map to multiple locations in the genome by adding a fractional count to each location. This allows for peak detection in regions of the genome that are otherwise inaccessible to the assay. The input SAM/BAM file(s) must list secondary alignments for multimapping reads/fragments, such as those reported by the short read aligner Bowtie2 in -k <int> mode or -a mode. Additional information about the processing of secondary alignments by Genrich can be found in the description of the -s parameter.

PCR duplicate removal

Genrich provides an option for removing PCR duplicates (-r). In this process, it analyzes reads/fragments based on their alignments, in three separate groups (proper pairs, discordant pairs, and singletons), and removes those identified as duplicates from further analysis. One novel feature is that this evaluation takes into account reads/fragments with multiple alignments. Additional information on the duplicate identification procedure can be found here.

Genome length calculation

Genrich computes the genome length as the sum of the lengths of the chromosomes (reference sequences) in the header of the SAM/BAM file. The length is reduced if the user specifies chromosomes (-e) or genomic regions (-E) to be excluded, as described below. The calculated length is used for computing a background pileup value and q-values.

Control/background pileup calculation

The background pileup value is calculated by dividing the total sequence information (sum of read/fragment/interval lengths) in the experimental sample by the calculated genome length. The net control pileup value at a particular genomic position is the maximum of the background pileup value and the pileup of the control sample at that position (if a control sample is specified). Note that control pileups are scaled to match the experimental, based on the total sequence information in each.

p-value calculation

The p-value for each base of the genome is calculated assuming a null model with a log-normal distribution. The control/background pileup value is used as the parameter μ, and the σ parameter is 1.2×μ if μ ≤ 7, and 10×log10(μ) if μ > 7. These formulas, as well as the choice of the log-normal distribution, were determined from a comprehensive review of control samples from various ChIP-seq experiments (human, mouse, and worm) downloaded from SRA.

q-value calculation

The q-value for each base of the genome is calculated from the p-value using the Benjamini-Hochberg procedure. The calculated genome length is used as the number of hypothesis tests (m).

Multiple replicates

Genrich calls peaks for multiple replicates collectively. First, it analyzes the replicates separately, with p-values calculated for each. At each genomic position, the multiple replicates' p-values are then combined by Fisher's method. The combined p-values are converted to q-values, and peaks are called. This obviates the need for IDR (you're welcome!).

I/O files and options

Required files

  -t  <file>       Input SAM/BAM file(s) for experimental sample(s)
  -o  <file>       Output peak file (in ENCODE narrowPeak format)

Optional files

  -c  <file>       Input SAM/BAM file(s) for control sample(s)
  -f  <file>       Output bedgraph-ish file for p/q values
  -k  <file>       Output bedgraph-ish file for pileups and p-values
  -b  <file>       Output BED file for reads/fragments/intervals
  -R  <file>       Output file for PCR duplicates (only with -r)

Filtering options

  -r               Remove PCR duplicates

  -e  <arg>        Comma-separated list of chromosomes to exclude
  -E  <file>       Input BED file(s) of genomic regions to exclude
  -m  <int>        Minimum MAPQ to keep an alignment (def. 0)

  -s  <float>      Keep sec alns with AS >= bestAS - <float> (def. 0)

Unpaired alignments

By default, Genrich analyzes only properly paired alignments and infers the full fragments as spanning between the 5' ends of the two alignments (Fig. 2). It does not analyze unpaired alignments unless one of three options is selected:

  -y               Keep unpaired alignments (def. false)
  -w  <int>        Keep unpaired alns, lengths changed to <int>
  -x               Keep unpaired alns, lengths changed to paired avg
Alignment analysis
Figure 2. Analysis of alignments by Genrich. The alignment file example.bam has both properly paired alignments (top left) and unpaired alignments (top right). By default, Genrich infers the full fragments from the paired alignments and discards the unpaired alignments. Unpaired alignments can be kept via -y, -w <int>, or -x, as described above.



ATAC-seq mode

ATAC-seq is a method for assessing genomic regions of open chromatin. Since only the ends of the DNA fragments indicate where the Tn5 transposase enzyme was able to insert into the chromatin, it may not be optimal to interpret alignments as shown above (Fig. 2). Genrich has an alternative analysis mode for ATAC-seq in which it creates intervals centered on transposase cut sites (Fig. 3).

  -j               Use ATAC-seq mode (def. false)
  -d  <int>        Expand cut sites to <int> bp (def. 100)
  -D               Skip Tn5 adjustments of cut sites (def. false)
ATAC-seq mode
Figure 3. ATAC-seq mode of Genrich. Genrich analyzes intervals centered on cut sites (both ends of full fragments, as well as the 5' ends of unpaired alignments if -y is set, adjusted forward by 5bp). The lengths of the intervals can be changed from the default of -d 100.



Unpaired alignments can be analyzed with -y, though only one interval, centered on the read's 5' end, is inferred. Both -w <int> and -x are equivalent to -y in ATAC-seq mode.

By default, Genrich centers the intervals at the ends of the reads/fragments, adjusted forward by 5bp to account for the Tn5 transposase occupancy. That is, for the 5' ends of fragments (or for reads aligning in a normal orientation), the position is increased by +5, and for the 3' ends of fragments (or for reads aligning in a reverse-complement orientation), the position is adjusted by -5. To avoid this position adjustment (e.g. for DNase-seq), one can use -D.

For full fragments, when the two cut site intervals overlap, they are merged into a single interval. To get a BED file of cut sites, one can run -d 1 -b <file>.

The remainder of the peak-calling process (calculating pileups and significance values) is identical to the default analysis mode. Note that the interval lengths (not the fragment lengths) are used to sum the total sequence information for the calculation of control/background pileup values.

Peak-calling parameters

  -p  <float>      Maximum p-value (def. 0.01)
  -q  <float>      Maximum q-value (FDR-adjusted p-value; def. 1)
  -a  <float>      Minimum AUC for a peak (def. 200.0)
  -l  <int>        Minimum length of a peak (def. 0)
  -g  <int>        Maximum distance between signif. sites (def. 100)

Miscellaneous

  -X               Skip peak-calling

  -P               Call peaks directly from a log file (-f)
  -z               Option to gzip-compress output(s)
  -S               Option to skip sort order check
  -L <int>         Set genome length to given value

Other options:

  -v/--verbose     Option to print status updates/counts to stderr
  -h/--help        Print the usage message and exit
  -V/--version     Print the version and exit

Full analysis example

A sequencing run was downloaded from SRA. Its reads were adapter-trimmed by NGmerge and aligned to the human genome (hg19) by Bowtie2 with -k 20. The SAM alignments from bowtie2 were piped through SAMtools to sort them by queryname and convert the file to the BAM format. The resulting alignment file SRR5427886.bam was analyzed by Genrich with the options to remove PCR duplicates (-r) and to keep unpaired alignments, extended to the average fragment length (-x). All alignments to chrM and chrY were discarded (-e chrM,chrY), as well as alignments to two sets of excluded intervals: regions of 'N' homopolymers in the hg19 genome (produced by findNs.py) and high mappability regions (-E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz). There was no control sample. Peaks were called using a maximum q-value of 0.05 (-q 0.05) and a minimum AUC of 20.0 (-a 20.0).

$ ./Genrich  -t SRR5427886.bam  -o SRR5427886.narrowPeak  \
  -f SRR5427886.log  -r  -x  -q 0.05  -a 20.0  -v  \
  -e chrM,chrY  -E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz 
Processing experimental file #0: SRR5427886.bam
  BAM records analyzed:  146277509
    Unmapped:              2877550
    To skipped refs:       2560688
      (chrM,chrY)
    Paired alignments:   137462998
      secondary alns:     67724920
    Unpaired alignments:   3376273
      secondary alns:      2232056
  PCR duplicates --
    Paired aln sets:      35574220
      duplicates:          4660723 (13.1%)
    Discordant aln sets:     64679
      duplicates:             7736 (12.0%)
    Singleton aln sets:    1036778
      duplicates:           475201 (45.8%)
  Fragments analyzed:     31286234
    Full fragments:       30616993
      (avg. length: 226.4bp)
    Half fragments:         669241
      (from unpaired alns, extended to 226bp)
- control file #0 not provided -
  Background pileup value: 2.477916
Peak-calling parameters:
  Genome length: 2826865605bp
  Significance threshold: -log(q) > 1.301
  Min. AUC: 20.000
  Max. gap between sites: 100bp
Peaks identified: 35114 (27918264bp)

The total time to execute the above command (analyzing a BAM of 146.3 million alignments and calling peaks) was 10.5min. It required 17.1GB of memory.

Once a log file (SRR5427886.log) is produced, calling peaks with an alternative set of parameters is accomplished most easily with the -P option. For example, with -p 0.01 -a 200:

$ ./Genrich  -P  -f SRR5427886.log  -o SRR5427886_p01_a200.narrowPeak  -p 0.01  -a 200  -v
Peak-calling from log file: SRR5427886.log
Peak-calling parameters:
  Genome length: 2826865605bp
  Significance threshold: -log(p) > 2.000
  Min. AUC: 200.000
  Max. gap between sites: 100bp
Peaks identified: 47329 (62194983bp)

This required just 36sec and less than 0.1MB of memory.

However, if one wanted the alignments to be interpreted differently, such as ATAC-seq mode (-j), the original command (with -t SRR5427886.bam) would need to be rerun.

Warning messages

In verbose mode, Genrich may print one or more warnings to stderr:

Computational requirements

Genrich runs very quickly but uses a considerable amount of memory. For starters, it requires 3 bytes for every base-pair of the reference genome, i.e. ~9GB for a human sample. The number of input files has little effect on memory, but certain analysis options (especially the option to remove PCR duplicates) can greatly increase the memory usage, particularly with large SAM/BAM input files. See above for an example.

Genrich is not multi-threaded.

Contact

Genrich

Copyright © 2018 John M. Gaspar (jsh58@wildcats.unh.edu)

Any question, concern, or bug report about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see if your issue has been addressed already. Also, please follow these good GitHub practices.