Run python orthologFind.py using python3 (see "Example Run of HALPER" below for an example).

HALPER is designed for constructing contiguous orthologs from the outputs of halLiftover (https://github.com/ComparativeGenomicsToolkit/hal). While it was originally designed for constructing orthologs of transcription factor ChIP-seq and open chromatin peaks, it can be applied to any genomic regions of interest. Since HALPER relies on halLiftover, the assembly of the query and target genomic regions must be in a Cactus alignment hal file.
orthologFind.py requires matplotlib and numpy.
-qFile: BED file with query regions (used as input to halLiftover) containing (at least) the following information: chromosome_name, start, end, region name
-tFile: BED file of the file specified in -qFile mapped to the target species using halLiftover (no modifications to the output from halLiftover are necessary)
chr_name peak_start peak_end peak_name
(the output file from halLiftover should conform to this format)
chr8 55610267 55610335 peak0
chr8 55610240 55610267 peak0
chr8 55610220 55610240 peak0
chr8 55610191 55610220 peak0
chr8 55610183 55610190 peak0
-sFile: BED file of the summits of the peaks specified in -qFile, mapped to the target species using halLiftover
chr_name peak_start peak_end peak_name
(the output file from halLiftover should conform to this format)
chr8 55609835 55609836 peak0
chr8 55609437 55609438 peak1
chr8 55591653 55591654 peak2
chr8 55592205 55592206 peak4
chr8 55536703 55536704 peak6
chr8 55499203 55499204 peak8
chr8 55473539 55473540 peak9
-narrowPeak: output files in narrowPeak format (optional argument)
-oFile: output file name (the chromosome name and all positions in the file specified in -oFile are from the target species)
chr_name
target_ortholog_start
target_ortholog_end
summit_position
peakname
target_ortholog_length
query_ortholog_length
target_summit_to_target_ortholog_start_length
target_summit_to_target_ortholog_end_length
-mult_keepone: if a region's summit maps to multiple positions in the target species, use the first position in the file specified in -sFile; without this option, such regions are discarded
-max_len: ortholog length must be less than or equal to max_len
-max_frac: ortholog length must be less than or equal to max_frac * region length
-min_len: ortholog length must be greater than or equal to min_len
-min_frac: ortholog length must be greater than or equal to min_frac * region length
-protect_dist: the ortholog must extend at least protect_dist in each direction from the ortholog of the summit (see the sketch after this argument list for how these thresholds combine)
-keepChrPrefix: if passed, then only keep mapped peaks where the new chromosome name starts with this prefix. For example, -keepChrPrefix chr will keep only mapped peaks where the chromosome name starts with chr.
-preserve: This is for narrowPeak file inputs. This preserves the requested column from the narrowPeak file in the output. Options are "signal" (column 7), "pValue" (column 8), and "qValue" (column 9).
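The following is a minimal sketch of how the length and distance thresholds above combine. It is illustrative only, not the actual orthologFind.py implementation; it assumes all five thresholds are given, and the function and variable names are made up for this example.

def ortholog_passes(o_start, o_end, summit, query_len,
                    min_len, max_len, min_frac, max_frac, protect_dist):
    # o_start/o_end: target ortholog coordinates; summit: the target ortholog
    # of the summit; query_len: length of the query region.
    o_len = o_end - o_start
    if not (min_len <= o_len <= max_len):
        return False  # fails -min_len / -max_len
    if not (min_frac * query_len <= o_len <= max_frac * query_len):
        return False  # fails -min_frac / -max_frac
    if summit - o_start < protect_dist or o_end - summit < protect_dist:
        return False  # fails -protect_dist
    return True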
Running these examples requires the files in the examples directory and 10plusway-master.hal, a Cactus alignment with 12 mammals that can be obtained from the authors of the paper describing Cactus (see "Relevant Publications" below). One can compare the outputs of each step to the files with the corresponding names in the examples directory.
The script halper_map_peak_orthologs.sh runs halLiftover and postprocesses the results with HALPER, all in one script. This is equivalent to running steps 1-4 in "Running steps manually" below. This script requires installing the dependencies in their own conda environment called "hal" and modifying paths as described in https://github.com/pfenninglab/halLiftover-postprocessing/blob/master/hal_install_instructions.md.
To use halper_map_peak_orthologs.sh on a slurm cluster:
sbatch \
-p [partition] \
--array 1-[number of target species] \
halper_map_peak_orthologs.sh \
-b [path to input .bed or .narrowPeak file] \
-o [path to output directory] \
-s [source species, e.g. Homo_sapiens] \
-t [comma-separated list of target species, e.g. Mus_musculus,Macaca_mulatta] \
-c [path to cactus alignment file]
Using the --array flag above will instruct the slurm scheduler to map orthologs for each target species in parallel. If you omit the --array flag, the target species will be processed sequentially.
To generate the error and output files, this needs to be run from a directory that contains a sub-directory called "logs."
If you are not running on a slurm cluster, you can run the script with bash:
bash halper_map_peak_orthologs.sh \
-b [path to input .bed or .narrowPeak file] \
-o [path to output directory] \
-s [source species, e.g. Homo_sapiens] \
-t [comma-separated list of target species, e.g. Mus_musculus,Macaca_mulatta] \
-c [path to cactus alignment file]
To run only HALPER (not halLiftover), go directly to #4.
[directory with hal]/hal/bin/halLiftover --bedType 4 [directory with Cactus alignment]/10plusway-master.hal Human [directory with halLiftover-postprocessing]/halLiftover-postprocessing/examples/hg38Peaks.bed Mouse hg38Peaks_halLiftovermm10.bed
awk 'BEGIN{OFS="\t"}{print $1, $2+$10, $2+$10+1, $4}' [directory with halLiftover-postprocessing]/halLiftover-postprocessing/examples/hg38Peaks.bed > hg38Peaks_summits.bed
[directory with hal]/hal/bin/halLiftover [directory with Cactus alignment]/10plusway-master.hal Human hg38Peaks_summits.bed Mouse hg38Peaks_summits_halLiftovermm10.bed
python [directory with halLiftover-postprocessing]/orthologFind.py -max_len 1000 -min_len 50 -protect_dist 5 -qFile [directory with halLiftover-postprocessing]/halLiftover-postprocessing/examples/hg38Peaks.bed -tFile hg38Peaks_halLiftovermm10.bed -sFile hg38Peaks_summits_halLiftovermm10.bed -oFile hg38Peaks_halLiftovermm10_summitExtendedMin50Max1000Protect5.bed -mult_keepone
chr8 55609305 55610335 55609835 peak0 1031 1019 530 500
chr8 55609305 55610335 55609437 peak1 1031 1019 132 898

If -narrowPeak is used, the same orthologs are instead written in narrowPeak format, with the summit's offset from the ortholog start in the last column:

chr8 55609305 55610335 peak0 -1 . -1 -1 -1 530
chr8 55609305 55610335 peak1 -1 . -1 -1 -1 132
Starting the construction of target species orthologs from the target species orthologs of peak summits is sub-optimal for histone modification ChIP-seq data because, for such data, TFs are thought to bind not where read counts are highest but in the valleys between the high-read-count parts of regions. A reasonable starting location for histone modification data, therefore, is the position within the region that is aligned to the largest number of species in the alignment, as this is likely to be an important part of the region. If there are multiple such positions, which often happens, choosing the one closest to the center of the region makes sense because the centers of histone modification regions tend to be more important than their edges. The same approach can be used for other genomic regions that do not have summits.
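This selection rule can be sketched as follows. The sketch is illustrative only, not the actual getMaxScorePositionFromBedgraph.py implementation, and it assumes per-base alignment depths for a single region are already in hand.

def pick_start_position(region_start, region_end, depths):
    # depths[i] is the alignment depth at position region_start + i.
    # Pick the position with the highest depth; among ties, pick the
    # position closest to the center of the region.
    center = (region_start + region_end) / 2.0
    best_pos, best_key = None, None
    for offset, depth in enumerate(depths):
        pos = region_start + offset
        key = (depth, -abs(pos - center))
        if best_key is None or key > best_key:
            best_key, best_pos = key, pos
    return best_pos

# Example: depths 1,3,3,3,1 over positions 100-104 -> position 102.
print(pick_start_position(100, 105, [1, 3, 3, 3, 1]))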
Making an -sFile using this process requires the UCSC Genome Browser utilities wigToBigWig and bigWigToBedGraph (used in the steps below).
Here is how to make an -sFile using this process:
Get the alignment depth for the query species:
[directory with hal]/hal/bin/halAlignmentDepth --outWiggle [alignmentDepthFileName] [cactusFileName] [speciesName]
This can require up to 8 gigabytes for a hal file with 35 species. Running this on 35 species can take over a week, and the output files can be at least a few gigabytes. For a larger hal file, one can run halAlignmentDepth on each genomic region instead of on the entire genome.
Convert the alignment depth file from a wig file to a bigWig file:
wigToBigWig [alignmentDepthFileName] [chromSizesFileName] [alignmentDepthBigwigFileName]
This can require up to 64 gigabytes for an alignment depth file produced from a full genome. Note that the chromosome naming conventions used in the alignment depth file and the chrom sizes file need to be the same, which might require converting the chromosome names in the chrom sizes file.
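If the conventions differ (for example, the chrom sizes file uses names like "chr1" while the alignment uses "1"), a small script along the following lines can rename the entries in the chrom sizes file. The file names and the renaming rule here are hypothetical; adjust them to whatever convention your hal file actually uses.

# Hypothetical example: strip a leading "chr" from chromosome names in a
# chrom sizes file so they match an alignment that uses names like "1" and "X".
with open("hg38.chrom.sizes") as infile, open("hg38.renamed.chrom.sizes", "w") as outfile:
    for line in infile:
        chrom, size = line.split()
        if chrom.startswith("chr"):
            chrom = chrom[len("chr"):]
        outfile.write(chrom + "\t" + size + "\n")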
Convert the alignment depth bigwig file to a bedgraph file:
bigWigToBedGraph [alignmentDepthBigwigFileName] [alignmentDepthBedgraphFileName]
Sort the bedgraph file by chromosome, start, end:
sort -k1,1 -k2,2n -k3,3n [alignmentDepthBedgraphFileName] > [sortedAlignmentDepthBedgraphFileName]
The bedgraph files can be gzipped so that they take up less space.
Get the file that will be used for starting the ortholog extension for each region using the scores in the bedgraph file:
python [directory with halLiftover-postprocessing]/getMaxScorePositionFromBedgraph.py --bedFileName [file with regions you will be getting scores for; will be -qFile for the next step] --bedgraphFileName [sortedAlignmentDepthBedgraphFileName] --highestScoreLocationFileName [where the positions with the highest scores will be recorded; you can map this with halLiftover to create the -sFile for the next step] --gz
This program requires the bed file and the bedgraph file to be sorted and not contain duplicated entries. Leave out --gz if the file with the regions and the alignment depth bedgraph file are not gzipped. Note that this program is compatible with both python version 2 and python version 3, while orthologFind.py is compatible with only python version 3.
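For example, a run on gzipped inputs might look like the following (the file names here are hypothetical):

python [directory with halLiftover-postprocessing]/getMaxScorePositionFromBedgraph.py --bedFileName hg38Peaks_sorted.bed.gz --bedgraphFileName alignmentDepth_sorted.bedgraph.gz --highestScoreLocationFileName hg38Peaks_maxScorePositions.bed --gz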
Alternatively, steps 2-5 can be replaced with the following script that combines them:
python [directory with halLiftover-postprocessing]/getMaxScorePositionFromWig.py --bedFileName [file with regions you will be getting scores for; will be -qFile for the next step] --wigFileName [alignmentDepthFileName] --chromSizesFileName [chromSizesFileName] --highestScoreLocationFileName [where the positions with the highest scores will be recorded; you can map this with halLiftover to create the -sFile for the next step] --gz
This program requires the bed file to be sorted and not contain duplicated rows. Leave out --gz if the bed file is not gzipped. This program is compatible with both python version 2 and python version 3. Note that this script runs UCSC tools internally that sometimes fail silently; therefore, check the sorted bedgraph file when it finishes and re-run the script with more memory allotted if that file is not large.