
Probabilistic HLA Typing

Paper: Prohlatype: A Probabilistic Framework for HLA Typing ¹

This project provides a set of tools to calculate the full posterior distribution of HLA types given read data.

Instead of:

    A1      A2      B1      B2      C1      C2      Reads   Objective
0   A*31:01 A*02:01 B*45:01 B*15:03 C*16:01 C*02:10 538.0   513.79

one can calculate:

Allele 1	Allele 2	Log P	P
A*02:05:01:01	A*30:114	-23046.81	0.5000
A*02:05:01:01	A*30:01:01	-23046.81	0.5000
A*02:05:01:01	A*30:106	-23103.15	0.0000
A*02:05:01:02	A*30:114	-23146.35	0.0000
...
B*07:36	B*57:03:01:02	-13717.33	0.5000
B*07:36	B*57:03:01:01	-13717.33	0.5000
B*07:36	B*57:03:03	-13804.74	0.0000
B*27:157	B*57:03:01:02	-13816.17	0.0000
...
C*06:103	C*18:10	-11936.35	0.3338
C*06:103	C*18:02	-11936.36	0.3331
C*06:103	C*18:01	-11936.36	0.3331
C*15:102	C*18:02	-11951.72	0.0000
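
To make the relationship between the two numeric columns concrete, here is a minimal sketch. It assumes that `P` is obtained by normalizing the `Log P` values of the candidate allele pairs within one locus; that reading is an assumption about the table above, not a description of the program's internals. The function name `posterior_from_log_likelihoods` is hypothetical. Subtracting the maximum before exponentiating avoids floating-point underflow, since `exp(-23046.81)` is far too small to represent directly.

```python
import math


def posterior_from_log_likelihoods(log_ps):
    """Normalize log-likelihoods into probabilities that sum to 1.

    Assumption: every candidate pair is weighted equally a priori, so the
    posterior is simply the normalized likelihood.
    """
    m = max(log_ps)  # shift so the largest term becomes exp(0) = 1
    weights = [math.exp(lp - m) for lp in log_ps]
    total = sum(weights)
    return [w / total for w in weights]


# Rounded Log P values copied from the four A-locus rows shown above.
log_ps = [-23046.81, -23046.81, -23103.15, -23146.35]
for lp, p in zip(log_ps, posterior_from_log_likelihoods(log_ps)):
    print(f"{lp:.2f}\t{p:.4f}")
```

For these four rows the sketch prints 0.5000, 0.5000, 0.0000 and 0.0000, in agreement with the table. Because the table is truncated and its log values are rounded to two decimals, other loci (for example C) will not reproduce the printed probabilities to four digits.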

How:

There are three options to obtain the software:

1. If you are running on Linux, standalone binaries are available with each release.
2. Use the linked Docker image.
3. Build the software from source:

a. Install opam.

b. Make sure that the opam packages are up to date:
```
  $ opam update
```
c. Make sure that you're on the relevant compiler:
```
  $ opam switch 4.06.0
  $ eval `opam config env`
```
d. Get source:
```
  $ git clone https://github.com/hammerlab/prohlatype.git prohlatype
  $ cd prohlatype
```
e. Install the dependent packages:
```
  $ make setup
```
f. Build the programs (afterwards they'll be in _build/default/src/apps):
```
  $ make
```

Make sure that you have IMGT/HLA available:

  $ git clone https://github.com/ANHIG/IMGTHLA.git imgthla

"Prohla"-typing:

Create an imputed HLA reference sequence via align2fasta. This step makes sure that all alleles have sequence information that spans the entire locus. This way, reads that originate from a region for which we normally do not have sequence information will still align (in the next filtering step), albeit poorly:
```
  $ align2fasta path-to-imgthla/alignments -o imputed_hla_class_I
```
This step needs to be performed only once per IMGT/HLA version. Run $ align2fasta --help for further information.
Filter your data against the reference by first aligning it. For example:
```
  $ bwa mem imputed_hla_class_I.fasta ${SAMPLE}.fastq | \
      samtools view -F 4 -bT imputed_hla_class_I.fasta -o ${SAMPLE}.bam
```
Although the algorithms here are fundamentally alignment-based, they are too slow to run on all sequences, and sequences that do not originate from the HLA region would just act as background noise.

Then convert the aligned reads back to FASTQ:

  $ samtools fastq ${SAMPLE}.bam > ${SAMPLE}_filtered.fastq

Infer types (see $ multi_par --help for further details):

  $ multi_par path-to-imgthla/alignments ${SAMPLE}_filtered.fastq -o ${SAMPLE}_output.tsv

Note: The script src/scripts/run-example-docker.sh provides an end-to-end example of the above. It depends only on docker, wget, and git; it fetches the data and runs everything in a docker container (see sh src/scripts/run-example-docker.sh help).
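
Once typing has finished, the output can be summarized per locus. The following is a rough sketch only: it assumes the TSV has a header row and the same tab-separated columns as the example table near the top of this page (`Allele 1`, `Allele 2`, `Log P`, `P`), which may not match the actual layout written by multi_par. The function name `credible_pairs` and the `min_p` threshold are hypothetical.

```python
import csv
import io
from collections import defaultdict


def credible_pairs(tsv_text, min_p=0.05):
    """Group rows by locus and keep allele pairs with P >= min_p.

    Assumed columns (taken from the example table, not from the tool's
    documentation): 'Allele 1', 'Allele 2', 'Log P', 'P'.
    """
    by_locus = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        locus = row["Allele 1"].split("*")[0]  # 'A*02:05:01:01' -> 'A'
        p = float(row["P"])
        if p >= min_p:
            by_locus[locus].append((row["Allele 1"], row["Allele 2"], p))
    return dict(by_locus)


# Rows copied from the example table above.
example = (
    "Allele 1\tAllele 2\tLog P\tP\n"
    "A*02:05:01:01\tA*30:114\t-23046.81\t0.5000\n"
    "A*02:05:01:01\tA*30:01:01\t-23046.81\t0.5000\n"
    "A*02:05:01:01\tA*30:106\t-23103.15\t0.0000\n"
    "C*06:103\tC*18:10\t-11936.35\t0.3338\n"
    "C*06:103\tC*18:02\t-11936.36\t0.3331\n"
    "C*06:103\tC*18:01\t-11936.36\t0.3331\n"
    "C*15:102\tC*18:02\t-11951.72\t0.0000\n"
)

for locus, pairs in credible_pairs(example).items():
    for allele1, allele2, p in pairs:
        print(f"{locus}\t{allele1}\t{allele2}\t{p:.4f}")
```

To run it on a real result, read the file first, for instance `credible_pairs(open("sample_output.tsv").read())`, where the file name stands in for your own `${SAMPLE}_output.tsv`. Reporting every pair above a threshold, rather than only the single best row, keeps the ties visible (such as the two A-locus pairs at 0.5000), which is the point of computing a full posterior.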

1: All versions of this software after 0.8.0 incorporate an important coverage likelihood that is not described in the paper. At the moment a short addendum describing the approach is in limbo; please contact me by email for a reference.