Pastrami is a novel, scalable computational algorithm for rapid human ancestry estimation at population-, subcontinental- and continental-levels. Pastrami works on two key methodologies: exact haplotype matching and non-negative least square (NNLS) optimization.
Codebase stage: development
Developers and maintainers: Andrew Conley, Lavanya Rishishwar, Shivam Sharma
Testers: Lavanya Rishishwar, Shivam Sharma, Emily Norris
We are working on improving the user experience with Pastrami. This section will be updated in near future with easier installation procedure for pastrami and it's dependencies.
Pastrami requires the following OS/programs/modules to run:
Plink2 installation is quite straightforward. Select the appropriate binary for your processor architecture from this page:https://www.cog-genomics.org/plink/2.0/. An example installation is shown below for Intel 64-bit architecture:
# Download the zip file
# Get the latest version from plink: https://www.cog-genomics.org/plink/2.0/
wget [http://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20210203.zip](https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20240818.zip)
# Unzip the zip file
unzip plink2_linux_x86_64_20240818.zip
# Optionally, create a local bin and add it to your PATH (ignore these steps if you know what you are doing)
mkdir -p ~/bin
echo "PATH=$HOME/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
# Move the file to your local bin
mv plink2 ~/bin
# Make sure there is an executable by the name plink
ln -s ~/bin/plink2 ~/bin/plink
Assuming Plink/Plink2 is available system-wide, the following commands will install the python modules and clone the git repo
# Install python modules
pip install pathos numpy scipy pandas
# Download Pastrami
git clone https://github.com/healthdisparities/pastrami
# Give it executable permissions
cd pastrami
chmod +x pastrami.py
# Run Pastrami
./pastrami.py
Assuming Plink/Plink2 is available system-wide, the following commands will install the python modules and clone the git repo
# Create a conda environment with Python version = 3.8
conda create -n pastrami python=3.8
# Activate the newly created conda environment
conda activate pastrami
# Install core packages
conda install -c conda-forge pathos numpy pandas r-base
# Install R packages
Rscript -e 'install.packages("optparse", repos="https://cloud.r-project.org")'
# Download Pastrami
git clone https://github.com/healthdisparities/pastrami
# Give it executable permissions
cd pastrami
chmod +x pastrami.py
# Run Pastrami
./pastrami.py
# Deactivate environment, when not using pastrami
conda deactivate pastrami
This section will be populated in near future with small example datasets and commands for analyzing them.
Pastrami is a five step process that can be run as a single command or using a set of sequential commands. Each of the five step within Pastrami can be accessed through the following subcommands:
usage: pastrami.py [--help] {all,hapmake,build,query,coanc,aggregate} ...
pastrami.py - population scale haplotype copying
optional arguments:
--help, -h, --h
pastrami.py commands:
{all,hapmake,build,query,coanc,aggregate}
all Perform full analysis
hapmake Create haplotypes
build Build reference set
query Query reference set
coanc Individual v. individual co-ancestry
aggregate Aggregate results and estimate ancestries
These steps are all grouped within the all subcommand.
The genotype and individual information is expected in TPED/TFAM format. These formats are described at length within Plink's documentation and will not be reiterated here for brevity's sake.
If your files are in VCF format, you can convert them into TPED/TFAM format using the following plink command:
plink --vcf input.vcf.gz --recode transpose --out output_prefix
The all subcommand is the one that you will use if you have a set of reference individuals and a set of query individuals (presumably with some admixture) and you are interested in performing a one-off ancestry estimation analysis:
# Consolidated workflow:
./pastrami.py all --reference-prefix reference --query-prefix query --pop-group pop2group.txt --haplotype chrom.hap --out-prefix full_run -v --threads 20
The full description of each commandline option is provided below and can be access from within the script by running the "all" subcommand without any arguments:
usage: pastrami.py all [-h] [--reference-prefix <PREFIX>] [--query-prefix <TSV>] [--haplotypes <TSV>]
[--pop-group pop2group.txt] [--out-prefix out-prefix] [--map-dir maps/] [--min-snps MinSNPs]
[--max-snps MaxSNPs] [--max-rate MaxRate] [--per-individual] [--log-file run.log] [--threads N]
[--verbose]
optional arguments:
-h, --help show this help message and exit
Required Input options:
--reference-prefix <PREFIX> Prefix for the reference TPED/TFAM input files
--query-prefix <TSV> Prefix for the query TPED/TFAM input files
--haplotypes <TSV> File of haplotype positions
--pop-group pop2group.txt File containing population to group (e.g., tribes to region) mapping
Output options:
--out-prefix out-prefix Output prefix (default: pastrami) for creating following sets of files.
<prefix>.pickle, <prefix>_query.tsv, <prefix>.tsv, <prefix>.fam,
<prefix>_fractions.Q, <prefix>_paintings.Q, <prefix>_estimates.Q,
<prefix>_fine_grain_estimates.Q
Optional arguments for hapmake stage:
--map-dir maps/ Directory containing genetic maps: chr1.map, chr2.map, etc
--min-snps MinSNPs Minimum number of SNPs in a haplotype block (default: 7)
--max-snps MaxSNPs Maximum number of SNPs in a haplotype block (default: 20)
--max-rate MaxRate Maximum recombination rate (default: 0.3)
Runtime options:
--per-individual Generate per-individual copying rather than per-population copying
--log-file run.log, -l run.log, --l run.log File containing log information (default: run.log)
--threads N, -t N Number of concurrent threads (default: 4)
--verbose, -v Print program progress information on screen
Consider an example scenario where you want to assess the European ancestry estimates of a group of query individuals. If your European reference files are named euro_ref.tped, euro_ref.tfam and your query individual files are named my_query.tped, my_query.tfam, and your population to group mapping is stored within the pop2group.txt file (described more below), your all subcommand will look as follows:
./pastrami.py all --reference-prefix euro_ref --query-prefix my_query --pop-group pop2group.txt --haplotype chrom.hap --out-prefix full_run -v --threads 20
I prefer to run pastrami the the verbose flag and my hardware can support 20 threads. Feel free to turn off the verbose flag if you so desire, and change the number of threads in accordance with your hardware capability. This command will create the following outputs (listed in order of importance):
Pastrami is designed in a way that the reference files can be processed once and utilized many times in future. For the following use-case, consider that your reference individuals are African in origin and are stored in afro_ref.tped and afro_ref.tfam files. Your query individuals are still stored in my_query.tped and my_query.tfam files. Your population-to-group mapping is stored in a file named pop2group.txt. The following series of commands will allow you the pipeline in steps and save the intermediate outputs.
# step 1: build a haplotype file
./pastrami.py hapmake --map-dir map_data --haplotypes african.hap -v
# This step shouldn't take more than a few seconds, the remaining step will be time consuming in nature
# step 2: build database (pickle file)
./pastrami.py build --reference-pickle african.pickle --reference-prefix afro_ref --haplotypes african.hap -v --threads 20
# african.pickle is the output file
# step 3: querying the database
./pastrami.py query --reference-pickle african.pickle --query-prefix my_query --combined-out african.tsv --query-out african_query.tsv -v --threads 20
# step 4: aggregate the results
## combine the individual list for aggregate subcommand
cat afro_ref.tfam my_query.tfam > input.tfam
## run the subcommand
./pastrami.py aggregate --pop-group pop2group.txt --pastrami-output african.tsv --pastrami-fam input.tfam --out-prefix aggregate_output -v --threads 20
Based on general expectation and our experience in this area, deconvoluting population-level ancestry is very hard (and at times not possible) when populations are genetically extremely close. In these cases, we strongly recommend grouping closely related populations into groups. E.g., grouping African tribes into subcontinental African groups. This information can be supplied to Pastrami in a simple tab-separated file (pop2group.txt file in the previous examples). The structure of the file looks like this:
#Population Group
GWD Western
MSL Western
Yacouba Western
Ahizi Western
Fon Nigerian
Bariba Nigerian
Yoruba Nigerian
The whitespace separating the column is tab (shown as spaces here for formatting reasons). Any lines starting with # are ignored.