HALPER (halLiftover Postprocessing for the Evolution of Regulatory Elements)

Running HALPER

Introduction

HALPER is designed for constructing contiguous orthologs from the outputs of halLiftover (https://github.com/ComparativeGenomicsToolkit/hal). While it was originally designed for constructing orthologs of transcription factor ChIP-seq and open chromatin peaks, it can be applied to any genomic regions of interest. Since HALPER relies on halLiftover, the assemblies of the query and target genomic regions must be in a Cactus alignment hal file.

Dependencies

Tips for Installing the HAL Format API

Program Parameters

Example Run of HALPER

Running these examples requires the files in the examples directory and 10plusway-master.hal, a Cactus alignment with 12 mammals that can be obtained from the authors of the paper describing Cactus (see "Relevant Publications" below). One can compare the outputs of each step to the files with the corresponding names in the examples directory.

Running hal and HALPER with one script

The script halper_map_peak_orthologs.sh runs halLiftover and postprocesses the results with HALPER in a single step. This is equivalent to running steps 1-4 in "Running steps manually" below. This script requires installing the dependencies in their own conda environment called "hal" and modifying paths as described in https://github.com/pfenninglab/halLiftover-postprocessing/blob/master/hal_install_instructions.md.

To use halper_map_peak_orthologs.sh on a slurm cluster:

sbatch \
    -p [partition] \
    --array 1-[number of target species] \
    halper_map_peak_orthologs.sh \
    -b [path to input .bed or .narrowPeak file] \
    -o [path to output directory] \
    -s [source species, e.g. Homo_sapiens] \
    -t [comma-separated list of target species, e.g. Mus_musculus,Macaca_mulatta] \
    -c [path to cactus alignment file]

Using the --array flag above will instruct the slurm scheduler to map orthologs for each target species in parallel. If you omit the --array flag, the target species will be processed sequentially. To generate the error and output files, this needs to be run from a directory that contains a sub-directory called "logs."
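The difference between the array and sequential modes can be illustrated with a short Python sketch. It is not taken from halper_map_peak_orthologs.sh; the function name `targets_for_task` is hypothetical, and the only assumption about slurm is that it sets the `SLURM_ARRAY_TASK_ID` environment variable for each task of an array job.

```python
import os


def targets_for_task(target_list, env=None):
    """Return the target species that one job should process.

    Hypothetical helper: with --array 1-N, task i handles the i-th species
    in the comma-separated list; without --array, one job handles them all.
    """
    env = os.environ if env is None else env
    targets = [t for t in target_list.split(",") if t]
    task_id = env.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        # No array job: process every target species sequentially.
        return targets
    index = int(task_id) - 1  # --array 1-N is 1-based
    if not 0 <= index < len(targets):
        raise IndexError("array task %s has no matching target species" % task_id)
    return [targets[index]]
```

This also shows why the upper bound of --array should equal the number of target species: a task ID beyond the end of the list has nothing to process.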

If you are not running on a slurm cluster, you can submit the script with bash:

bash halper_map_peak_orthologs.sh \
    -b [path to input .bed or .narrowPeak file] \
    -o [path to output directory] \
    -s [source species, e.g. Homo_sapiens] \
    -t [comma-separated list of target species, e.g. Mus_musculus,Macaca_mulatta] \
    -c [path to cactus alignment file]

Running steps manually

To run only HALPER (not halLiftover), go directly to step 4.

  1. Run halLiftover on the file from the query species (example is in narrowPeak format, so columns not in standard bed format are first removed) to obtain the regions' orthologs in the target species:
    [directory with hal]/hal/bin/halLiftover --bedType 4 [directory with Cactus alignment]/10plusway-master.hal Human [directory with halLiftover-postprocessing]/halLiftover-postprocessing/examples/hg38Peaks.bed Mouse hg38Peaks_halLiftovermm10.bed
  2. Get the peak summits (example is for a narrowPeak file, see "Preparing Histone Modification Data for HALPER" below for how to do this for histone modification ChIP-seq peaks or genomic regions without summits):
    awk 'BEGIN{OFS="\t"}{print $1, $2+$10, $2+$10+1, $4}' [directory with halLiftover-postprocessing]/halLiftover-postprocessing/examples/hg38Peaks.bed > hg38Peaks_summits.bed
  3. Run halLiftover on the peak summits to obtain their orthologs in the target species:
    [directory with hal]/hal/bin/halLiftover [directory with Cactus alignment]/10plusway-master.hal Human hg38Peaks_summits.bed Mouse hg38Peaks_summits_halLiftovermm10.bed
  4. Run HALPER (note that there is only one '-' for the parameter names):
    python [directory with halLiftover-postprocessing]/orthologFind.py -max_len 1000 -min_len 50 -protect_dist 5 -qFile [directory with halLiftover-postprocessing]/halLiftover-postprocessing/examples/hg38Peaks.bed -tFile hg38Peaks_halLiftovermm10.bed -sFile  hg38Peaks_summits_halLiftovermm10.bed -oFile hg38Peaks_halLiftovermm10_summitExtendedMin50Max1000Protect5.bed -mult_keepone
    • Examples of output without -narrowPeak option:
      chr8    55609305    55610335    55609835    peak0   1031    1019    530 500
      chr8    55609305    55610335    55609437    peak1   1031    1019    132 898
    • Examples of output with -narrowPeak option (columns 5-9 do not have meaningful values):
      chr8    55609305    55610335    peak0   -1  .   -1  -1  -1  530
      chr8    55609305    55610335    peak1   -1  .   -1  -1  -1  132
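The summit calculation in step 2 can also be sketched in Python. The function below mirrors the awk command for a single tab-delimited narrowPeak line: the summit is the region start (column 2) plus the summit offset (column 10). The function name and the example line in the usage note are hypothetical, and the check for a negative offset is an addition that the awk command does not perform (narrowPeak uses -1 when no summit was called).

```python
def narrowpeak_to_summit(line):
    """Convert one narrowPeak line into a 1-bp summit interval in bed format."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start = int(fields[1])
    name = fields[3]
    offset = int(fields[9])  # column 10: summit offset from the region start
    if offset < 0:
        # Added safeguard, not part of the awk command above.
        raise ValueError("no summit recorded for %s" % name)
    summit = start + offset
    return "\t".join([chrom, str(summit), str(summit + 1), name])
```

For a made-up line with start 100 and summit offset 250, this returns the interval 350-351 with the peak name in the fourth column.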

Output Files Produced by HALPER

Preparing Histone Modification Data for HALPER

Starting ortholog construction from the target-species orthologs of peak summits is sub-optimal for histone modification ChIP-seq data because, in these data, TFs are thought to bind not where read counts are highest but in the valleys between the high-read parts of a region. A reasonable starting point for histone modification data is therefore the position within the region that is aligned in the largest number of species, as this is likely to be an important part of the region. Multiple positions often share this maximum; in that case, choosing the one closest to the center makes sense because the centers of histone modification regions tend to be more important than their edges. The same approach can be used for other genomic regions that do not have summits.
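The selection rule described above can be written as a minimal Python sketch. It is an illustration, not the code of getMaxScorePositionFromBedgraph.py: the function name is hypothetical, the per-base depths are assumed to already be in a list, and breaking exact ties toward the leftmost position is an assumption of this sketch.

```python
def best_start_position(region_start, depths):
    """Pick the position with the highest alignment depth, closest to the center.

    depths[i] is the number of species aligned at position region_start + i.
    """
    if not depths:
        raise ValueError("region has no positions")
    max_depth = max(depths)
    center = (len(depths) - 1) / 2.0
    candidates = [i for i, depth in enumerate(depths) if depth == max_depth]
    # min() keeps the first of equally distant candidates, i.e. the leftmost.
    best_offset = min(candidates, key=lambda i: abs(i - center))
    return region_start + best_offset
```

With depths `[1, 3, 2, 3, 3, 1, 3]` and a region starting at 1000, four positions share the maximum depth of 3, and the one at offset 3 lies exactly at the center, so the function returns 1003.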

Here are the dependencies required for making an -sFile using this process:

Here is how to make an -sFile using this process:

  1. Get the alignment depth for the query species:

    [directory with hal]/hal/bin/halAlignmentDepth --outWiggle [alignmentDepthFileName] [cactusFileName] [speciesName]

    This can require up to 8 gigabytes for a hal file with 35 species. Running this on 35 species can take over a week, and the output files can be at least a few gigabytes. For a larger hal file, one can run halAlignmentDepth on each genomic region instead of on the entire genome.

  2. Convert the alignment depth file from a wig file to a bigwig file:

    wigToBigWig [alignmentDepthFileName] [chromSizesFileName] [alignmentDepthBigwigFileName]

    This can require up to 64 gigabytes for an alignment depth file produced from a full genome. Note that the chromosome naming conventions used in the alignment depth file and the chrom sizes file need to be the same, which might require converting the chromosome names in the chrom sizes file.

  3. Convert the alignment depth bigwig file to a bedgraph file:

    bigWigToBedGraph [alignmentDepthBigwigFileName] [alignmentDepthBedgraphFileName]
  4. Sort the bedgraph file by chromosome, start, end:

    sort -k1,1 -k2,2n -k3,3n [alignmentDepthBedgraphFileName] > [sortedAlignmentDepthBedgraphFileName]

    The bedgraph files can be gzipped so that they take up less space.

  5. Get the file that will be used for starting the ortholog extension for each region using the scores in the bedgraph file:

    python [directory with halLiftover-postprocessing]/getMaxScorePositionFromBedgraph.py --bedFileName [file with regions you will be getting scores for, will be -qFile for next step] --bedgraphFileName [sortedAlignmentDepthBedgraphFileName] --highestScoreLocationFileName [where the positions with the highest scores will be recorded, you can map this with halLiftover to create -sFile for the next step] --gz

    This program requires the bed file and the bedgraph file to be sorted and not contain duplicated entries. Leave out --gz if the file with the regions and the alignment depth bedgraph file are not gzipped. Note that this program is compatible with both python version 2 and python version 3, while orthologFind.py is compatible with only python version 3.

Alternatively, steps 2-5 can be replaced with the following script that combines them:

python [directory with halLiftover-postprocessing]/getMaxScorePositionFromWig.py --bedFileName [file with regions you will be getting scores for, will be -qFile for next step] --wigFileName [alignmentDepthFileName] --chromSizesFileName [chromSizesFileName] --highestScoreLocationFileName [where the positions with the highest scores will be recorded, you can map this with halLiftover to create -sFile for the next step] --gz

This program requires the bed file to be sorted and not contain duplicated rows. Leave out --gz if the bed file is not gzipped. This program is compatible with both python version 2 and python version 3. Note that this script runs UCSC tools internally that sometimes fail silently; therefore, check the sorted bedgraph file when the script finishes, and re-run the script with more memory allotted if that file is smaller than expected.

  6. Use halLiftover to map the positions where the highest scores are recorded to the target species. This will create your -sFile for orthologFind.py.
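Because both getMaxScorePosition scripts require sorted input without duplicates, it can help to check a file before running them. The following is a minimal sketch of such a check, not part of HALPER. It assumes the expected order is the one produced by the sort command in step 4 (chromosome, then numeric start, then numeric end), compares chromosome names as plain strings (which can differ from `sort` under some locales), and treats two rows with identical coordinates as duplicates. The function name is hypothetical.

```python
def find_order_problems(lines):
    """Return messages for rows that are out of order or duplicated."""
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        key = (fields[0], int(fields[1]), int(fields[2]))
        if previous is not None:
            if key < previous:
                problems.append("line %d is out of order" % number)
            elif key == previous:
                problems.append("line %d duplicates the previous region" % number)
        previous = key
    return problems
```

An empty list means the rows passed both checks; for a gzipped file, the lines can be read with `gzip.open(path, "rt")` in python 3 before being passed to the function.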

Additional Utilities

Citing HALPER

Contributors