mskcc / lohhla

Fork from https://bitbucket.org/mcgranahanlab/lohhla/src, modified for MSKCC needs
29 stars 10 forks source link

README

Immune evasion is a hallmark of cancer. Losing the ability to present productive tumor neoantigens could facilitate evasion from immune predation. An integral part of neoantigen presentation is the HLA class I molecule, which presents epitopes to T-cells on the cell surface. Thus, loss of an HLA allele, resulting in HLA homozygosity, may be a mechanism of immune escape. However, the polymorphic nature of the HLA locus precludes accurate copy number calling using conventional copy number tools.

Here, we present LOHHLA, Loss Of Heterozygosity in Human Leukocyte Antigen, a computational tool to evaluate HLA loss using next-generation sequencing data.

LICENCE

THIS LOHHLA TOOL IS PROTECTED BY COPYRIGHT AND IS SUBJECT TO PATENT APPLICATION PCT/GB2018/052004. THE TOOL IS PROVIDED “AS IS” FOR INTERNAL NON-COMMERCIAL ACADEMIC RESEARCH PURPOSES ONLY. THIS LOHHLA TOOL IS PROVIDED TO YOU AS A PERSONAL COPY SUBJECT TO THESE TERMS AND YOU SHALL NOT FORWARD OR PROVIDE A COPY OF THE LOHHLA TOOL TO ANY OTHER PERSON OR PARTY. NO RESPONSIBILITY IS ACCEPTED FOR ANY LIABILITY ARISING FROM USE OF THE LOHHLA TOOL OR ANY RESULTS ARISING THEREON BY YOU OR ANY THIRD PARTY. COMMERCIAL USE OF THIS TOOL FOR ANY PURPOSE IS NOT PERMITTED.
ALL COMMERCIAL USE OF THE TOOL OR ANY MODIFICATION OR DERIVATIVE OF THE TOOL, INCLUDING TRANSFER, SALE OR LICENCE TO A COMMERCIAL THIRD PARTY OR USE ON BEHALF OF A COMMERCIAL THIRD PARTY (INCLUDING BUT NOT LIMITED TO USE AS PART OF A SERVICE SUPPLIED TO ANY THIRD PARTY FOR FINANCIAL REWARD) REQUIRES A LICENCE. FOR FURTHER INFORMATION PLEASE EMAIL Eileen Clark eileen.clark@crick.ac.uk.

What do I need to install to run LOHHLA?

Please ensure a number of dependencies are first installed. These include:

Within R, the following packages are required:

LOHHLA also requires an HLA fasta file. This can be obtained from Polysolver (http://archive.broadinstitute.org/cancer/cga/polysolver)

How do I install LOHHLA?

To install LOHHLA, simply clone the repository:

git clone https://bitbucket.org/mcgranahanlab/lohhla.git

How do I run LOHHLA?

LOHHLA is coded in R, and can be executed from the command line (Terminal, in Linux/UNIX/OSX, or Command Prompt in MS Windows) directly, or using a shell script (see example below).

Running LOHHLA with no arguments prints the usage information.

USAGE: Rscript /location/of/LOHHLA/script [OPTIONS]

OPTIONS:

-id CHARACTER, --patientId=CHARACTER
    patient ID

-o CHARACTER, --outputDir=CHARACTER
    location of output directory

-nBAM CHARACTER, --normalBAMfile=CHARACTER
    normal BAM file
    can be FALSE to run without normal sample

-BAM CHARACTER, --BAMDir=CHARACTER
    location of all BAMs to test

-hla CHARACTER, --hlaPath=CHARACTER
    location to patient HLA calls

-hlaLoc CHARACTER, --HLAfastaLoc=CHARACTER
    location of HLA FASTA [default=~/lohhla/data/hla_all.fasta]

-cn CHARACTER, --CopyNumLoc=CHARACTER
    location to patient purity and ploidy output
    can be FALSE to only estimate allelic imbalance

-ov CHARACTER, --overrideDir=CHARACTER
    location of flagstat information if already run [default= FALSE]

-mc CHARACTER, --minCoverageFilter=CHARACTER
    minimum coverage at mismatch site [default= 30]

-kmer CHARACTER, --kmerSize=CHARACTER
    size of kmers to fish with [default= 50]

-mm CHARACTER, --numMisMatch=CHARACTER
    number of mismatches allowed in read to map to HLA allele [default= 1]

-ms CHARACTER, --mappingStep=CHARACTER
    does mapping to HLA alleles need to be done [default= TRUE]

-fs CHARACTER, --fishingStep=CHARACTER
    if mapping is performed, also look for fished reads matching kmers of size kmerSize [default= TRUE]

-ps CHARACTER, --plottingStep=CHARACTER
    are plots made [default= TRUE]

-cs CHARACTER, --coverageStep=CHARACTER
    are coverage differences analyzed [default= TRUE]

-cu CHARACTER, --cleanUp=CHARACTER
    remove temporary files [default= TRUE]

-no CHARACTER, --novoDir=CHARACTER
    path to novoalign executable [default= ]

-ga CHARACTER, --gatkDir=CHARACTER
    path to GATK executable [default= ]

-ex CHARACTER, --HLAexonLoc=CHARACTER
    HLA exon boundaries for plotting [default=~/lohhla/data/hla.dat]

-w CHARACTER, --ignoreWarnings=CHARACTER
    continue running with warnings [default= TRUE]

-h, --help
    Show this help message and exit            

What is the output of LOHHLA?

LOHHLA produces multiple different files (see correct-example-out for an example). To determine HLA LOH in a given sample, the most relevant output is the file which ends '.HLAlossPrediction CI.xls'. The most relavant columns are:

HLA_A_type1                          - the identity of allele 1
HLA_A_type2                          - the identity of allele 2
Pval_unique                          - this is a p-value relating to allelic imbalance 
LossAllele                           - this corresponds to the HLA allele that is subject to loss
KeptAllele                           - this corresponds to the HLA allele that is not subject to loss
HLA_type1copyNum_withBAFBin          - the estimated raw copy number of HLA (allele 1)
HLA_type2copyNum_withBAFBin          - the estimated raw copy number of HLA (allele 2)

For a full definition of the columns, see below, in each case whether the column should be used [use], or can be ignored [legacy]is indicated:

region                               - the region or tumor sample [use]
HLA_A_type1                          - the identity of allele 1 [use]
HLA_A_type2                          - the identity of allele 2 [use]
HLAtype1Log2MedianCoverage           - the median LogR coverage across allele 1 [use] 
HLAtype2Log2MedianCoverage           - the median LogR coverage across allele 2 [use]
HLAtype1Log2MedianCoverageAtSites    - the median LogR coverage across allele 1, restricted to mismatch sites [use]
HLAtype2Log2MedianCoverageAtSites    - the median LogR coverage across allele 2, restricted to mismatch sites [use]
HLA_type1copyNum_withoutBAF          - estimated copy number of allele 1, without using BAF [legacy] 
HLA_type1copyNum_withoutBAF_lower    - lower 95% confidence interval of estimated copy number of allele 1, without using BAF [legacy] 
HLA_type1copyNum_withoutBAF_upper    - upper 95% confidence interval of estimated copy number of allele 1, without using BAF [legacy] 
HLA_type1copyNum_withBAF             - estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type1copyNum_withBAF_lower       - lower 95% confidence interval of estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type1copyNum_withBAF_upper       - upper 95% confidence interval of estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type2copyNum_withoutBAF          - estimated copy number of allele 2 without using BAF  [legacy] 
HLA_type2copyNum_withoutBAF_lower    - lower 95% confidence interval of estimated copy number of allele 2, without using BAF [legacy] 
HLA_type2copyNum_withoutBAF_upper    - upper 95% confidence interval of estimated copy number of allele 2, without using BAF [legacy] 
HLA_type2copyNum_withBAF             - estimated copy number of allele 2 using BAF, without binning sites [legacy] 
HLA_type2copyNum_withBAF_lower       - lower 95% confidence interval of estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type2copyNum_withBAF_upper       - upper 95% confidence interval of estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type1copyNum_withoutBAFBin       - estimated copy number of allele 1 using binning, but without BAF [legacy]  
HLA_type1copyNum_withoutBAFBin_lower - lower 95% confidence interval of estimated copy number of allele 1 using binning, but without BAF [legacy]   
HLA_type1copyNum_withoutBAFBin_upper - upper 95% confidence interval of estimated copy number of allele 1 using binning, but without BAF [legacy]   
HLA_type1copyNum_withBAFBin          - estimated copy number of allele 1 using binning and BAF [use] 
HLA_type1copyNum_withBAFBin_lower    - lower 95% confidence interval of estimated copy number of allele 1 using binning and BAF [use] 
HLA_type1copyNum_withBAFBin_upper    - upper 95% confidence interval of estimated copy number of allele 1 using binning and BAF [use]  
HLA_type2copyNum_withoutBAFBin       - estimated copy number of allele 2 using binning, but without BAF [legacy]  
HLA_type2copyNum_withoutBAFBin_lower - lower 95% confidence interval of estimated copy number of allele 2 using binning, but without BAF [legacy]   
HLA_type2copyNum_withoutBAFBin_upper - upper 95% confidence interval of estimated copy number of allele 2 using BAF, without binning sites [legacy]     
HLA_type2copyNum_withBAFBin          - estimated copy number of allele 2 using binning and BAF [use] 
HLA_type2copyNum_withBAFBin_lower    - lower 95% confidence interval of estimated copy number of allele 2 using binning and BAF [use] 
HLA_type2copyNum_withBAFBin_upper    - upper 95% confidence interval of estimated copy number of allele 2 using binning and BAF [use
PVal                                 - p-value relating to difference in logR between allele 1 and allele 2 (paired t-test)[legacy]
UnPairedPval                         - p-value relating to difference in logR between allele 1 and allele 2 (unpaired t-test)[legacy]
PVal_unique                          - p-value relating to difference in logR between allele 1 and allele 2, ensuring each read only contributes once (paired t-test) [use]
UnPairedPval_unique                  - p-value relating to difference in logR between allele 1 and allele 2, ensuring each read only contributes once (unpaired t-test) [use]
LossAllele                           - HLA allele that is present at lower frequency (potentially subject to loss) [use]
KeptAllele                           - HLA allele that is present at higher frequency (potentially not subject to loss) [use]
numMisMatchSitesCov                  - number of mismatch sites with sufficient coverage [use]
propSupportiveSites                  - proportion of missmatch sites that are consistent with loss or allelic imbalance [use]

How can I test if LOHHLA is working?

Example data is included in the LOHHLA repository. To run LOHHLA on the example dataset, alter the "example.sh" script to match your local file structure and ensure the requisite dependencies are available / loaded. The --HLAfastaLoc, --gatkDir, and --novoDir file paths should also be updated to the corresponding locations. File paths must be full paths. Run "example.sh" and the output should match that found in the "correct-example-out" directory provided. All BAM files (normal and tumour) should be found in or linked to the same directory.

Who do I talk to?

If you have any issues with lohhla, please send an email to lohhla@gmail.com

How do I cite LOHHLA ?

If you use LOHHLA in your research, please cite the following paper:

McGranahan et al., Allele-Specific HLA Loss and Immune Escape in Lung Cancer Evolution, Cell (2017), https://doi.org/10.1016/j.cell.2017.10.001