hammerlab / prohlatype

Probabilistic HLA typing
Apache License 2.0
35 stars 4 forks source link

Build Status Coverage Status

Probabilistic HLA Typing

Paper: Prohlatype: A Probabilistic Framework for HLA Typing 1

This project provides a set of tools to calculate the full posterior distribution of HLA types given read data.

Instead of:

    A1      A2      B1      B2      C1      C2      Reads   Objective
0   A*31:01 A*02:01 B*45:01 B*15:03 C*16:01 C*02:10 538.0   513.79

one can calculate:

Allele 1 Allele 2 Log P P
A*02:05:01:01 A*30:114 -23046.81 0.5000
A*02:05:01:01 A*30:01:01 -23046.81 0.5000
A*02:05:01:01 A*30:106 -23103.15 0.0000
A*02:05:01:02 A*30:114 -23146.35 0.0000
...
B*07:36 B*57:03:01:02 -13717.33 0.5000
B*07:36 B*57:03:01:01 -13717.33 0.5000
B*07:36 B*57:03:03 -13804.74 0.0000
B*27:157 B*57:03:01:02 -13816.17 0.0000
...
C*06:103 C*18:10 -11936.35 0.3338
C*06:103 C*18:02 -11936.36 0.3331
C*06:103 C*18:01 -11936.36 0.3331
C*15:102 C*18:02 -11951.72 0.0000

How:

There are three options to obtain the software:

  1. If you are running on Linux, standalone binaries are available with each release.
  2. Use the linked Docker image.
  3. Build the software from source:

    a. Install opam.

    b. Make sure that the opam packages are up to date:

      $ opam update

    c. Make sure that you're on the relevant compiler:

      $ opam switch 4.06.0
      $ eval `opam config env`

    d. Get source:

      $ git clone https://github.com/hammerlab/prohlatype.git prohlatype
      $ cd prohlatype

    e. Install the dependent packages:

      $ make setup

    f. Build the programs (afterwards they'll be in _build/default/src/apps):

      $ make

Make sure that you have IMGT/HLA available:

$ git clone https://github.com/ANHIG/IMGTHLA.git imgthla

"Prohla"-typing:

  1. Create an imputed HLA reference sequence via align2fasta. This step makes sure that all alleles have sequence information that spans the entire locus. This way, reads that originate from a region for which we normally do not have sequence information will still align (in the next filtering step), albeit poorly:

      $ align2fasta path-to-imgthla/alignments -o imputed_hla_class_I

    This step needs to be performed only once, per each IMGT version. Run $align2fasta --help for further information.

  2. Filter your data against the reference, by first aligning. Ex:

      $ bwa mem imputed_hla_class_I.fasta ${SAMPLE}.fastq | \
          samtools view -F 4 -bT imputed_hla_class_I.fasta -o ${SAMPLE}.bam

    While fundamentally, the algorithms here are alignment based. They're too slow to run for all sequences. Sequences that do not originate from the HLA-region would just act as background noice.

  3. and then convert aligned reads back to FASTQ:

      $ samtools fastq ${SAMPLE}.bam > ${SAMPLE}_filtered.fastq
  4. Infer types (see $ multi_par --help for further details):

      $ multi_par path-to-imgthla/aignments ${SAMPLE}_filtered.fastq -o ${SAMPLE}_output.tsv

Note: The script src/scripts/run-example-docker.sh provides an end-to-end example of the above. It depends only on docker, wget, and git; it fetches the data and runs everything in a docker container (see sh src/scripts/run-example-docker.sh help).

1: All versions of this software after 0.8.0 incorporate an important coverage likelihood that is not described in the previous paper. At the moment a short addendum describing the approach is in limbo, please contact me by email for a reference.