griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

How to speed up the pvacseq computation? #104

Closed ShixiangWang closed 6 years ago

ShixiangWang commented 6 years ago

Hello,

I would like help speeding up the pvacseq computation.

I use the NetMHC method with an epitope length of 9 and all HLA alleles valid for NetMHC. It takes about 1 hour to finish one sample. I have hundreds of samples to process, so I need to speed up the computation.

I read the advice on speeding up runs at http://pvactools.readthedocs.io/en/latest/pvacseq/frequently_asked_questions.html and have the following questions:

  1. How do I split the VCF into smaller subsets and process them in parallel? By default pvacseq appears to split the variants into multiple files after converting the VCF to a TSV file, but it processes them one by one.
  2. How should I set --fasta-size? There seems to be no default value. Since I use a local installation of IEDB, should I set a larger fasta-size? What size would be appropriate for an epitope length of 9?
  3. How should I set --downstream-sequence-length? There also seems to be no default value. What value would be appropriate for an epitope length of 9?
  4. I have already put the other suggestions into practice. Are there any other suggestions that can effectively reduce the run time?

Best wishes, Shixiang


YAML file:

$ cat ./log/inputs.yml 
additional_report_columns: sample_name
alleles:
- HLA-A*01:01
- HLA-A*02:01
- HLA-A*02:02
- HLA-A*02:03
- HLA-A*02:06
- HLA-A*02:11
- HLA-A*02:12
- HLA-A*02:16
- HLA-A*02:17
- HLA-A*02:19
- HLA-A*02:50
- HLA-A*03:01
- HLA-A*11:01
- HLA-A*23:01
- HLA-A*24:02
- HLA-A*24:03
- HLA-A*25:01
- HLA-A*26:01
- HLA-A*26:02
- HLA-A*26:03
- HLA-A*29:02
- HLA-A*30:01
- HLA-A*30:02
- HLA-A*31:01
- HLA-A*32:01
- HLA-A*32:07
- HLA-A*32:15
- HLA-A*33:01
- HLA-A*66:01
- HLA-A*68:01
- HLA-A*68:02
- HLA-A*68:23
- HLA-A*69:01
- HLA-A*80:01
- HLA-B*07:02
- HLA-B*08:01
- HLA-B*08:02
- HLA-B*08:03
- HLA-B*14:02
- HLA-B*15:01
- HLA-B*15:02
- HLA-B*15:03
- HLA-B*15:09
- HLA-B*15:17
- HLA-B*18:01
- HLA-B*27:05
- HLA-B*27:20
- HLA-B*35:01
- HLA-B*35:03
- HLA-B*38:01
- HLA-B*39:01
- HLA-B*40:01
- HLA-B*40:02
- HLA-B*40:13
- HLA-B*42:01
- HLA-B*44:02
- HLA-B*44:03
- HLA-B*45:01
- HLA-B*46:01
- HLA-B*48:01
- HLA-B*51:01
- HLA-B*53:01
- HLA-B*54:01
- HLA-B*57:01
- HLA-B*58:01
- HLA-B*73:01
- HLA-B*83:01
- HLA-C*03:03
- HLA-C*04:01
- HLA-C*05:01
- HLA-C*06:02
- HLA-C*07:01
- HLA-C*07:02
- HLA-C*08:02
- HLA-C*12:03
- HLA-C*14:02
- HLA-C*15:02
- HLA-E*01:01
binding_threshold: 500
downstream_sequence_length: 1000
epitope_lengths:
- 9
exclude_NAs: false
expn_val: 1
fasta_size: 200
gene_expn_file: null
iedb_executable: /home/diviner-wsx/ProjectsManager/biotools/mhc/mhc_i/src/predict_binding.py
iedb_retries: 5
input_file: /home/diviner-wsx/ProjectsManager/cacheData/luad_mutect_neoantigens//annot_dir/TCGA-05-4244-01_annotated_filterd.vcf
input_file_type: vcf
keep_tmp_files: false
minimum_fold_change: 0
net_chop_method: cterm
net_chop_threshold: 0.5
netmhc_stab: true
normal_cov: 5
normal_indels_coverage_file: null
normal_snvs_coverage_file: null
normal_vaf: 2
output_dir: /home/diviner-wsx/ProjectsManager/cacheData/luad_mutect_neoantigens/pvacseq_dir/MHC_Class_I
peptide_sequence_length: 21
prediction_algorithms:
- NetMHC
pvactools_version: 1.0.2
sample_name: TCGA-05-4244-01
tdna_cov: 10
tdna_indels_coverage_file: null
tdna_snvs_coverage_file: null
tdna_vaf: 40
tmp_dir: /home/diviner-wsx/ProjectsManager/cacheData/luad_mutect_neoantigens/pvacseq_dir/MHC_Class_I/tmp
top_result_per_mutation: true
top_score_metric: median
transcript_expn_file: null
trna_cov: 10
trna_indels_coverage_file: null
trna_snvs_coverage_file: null
trna_vaf: 40

susannasiebert commented 6 years ago

Hi Shixiang,

To speed up your processing, I would suggest limiting the list of alleles to only those that match the HLA type of each sample. To determine the HLA types of your samples, you can run HLA typing software such as OptiType or HLAminer.

You can also start multiple pVACseq runs in parallel for your individual samples.
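As a rough illustration of the two suggestions above (per-sample allele lists and parallel per-sample runs), here is a minimal Python sketch. It is not part of pVACtools. The helper names `build_command` and `run_all`, the worker count, and the sample dictionaries are hypothetical. The positional argument order and the `-e` option are assumed from the pVACseq 1.x usage text, so check `pvacseq run --help` for your version and append any further options you need (for example, those for a local IEDB install).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(vcf, sample, alleles, out_dir,
                  algorithms=("NetMHC",), epitope_length=9):
    """Assemble a `pvacseq run` argument list for one sample.

    Assumed positional order: input file, sample name, comma-separated
    alleles, prediction algorithm(s), output directory.
    """
    return [
        "pvacseq", "run",
        vcf, sample, ",".join(alleles), *algorithms, out_dir,
        "-e", str(epitope_length),
    ]


def run_all(samples, max_workers=4, runner=subprocess.run):
    """Start one pVACseq process per sample, at most `max_workers` at a time.

    Threads are sufficient here because each worker only waits on an
    external process. `runner` can be swapped out for a dry run.
    """
    commands = [build_command(**sample) for sample in samples]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

Each entry in `samples` would be a dictionary with the keys `vcf`, `sample`, `alleles`, and `out_dir`, where `alleles` holds only the HLA types called for that sample (for instance by OptiType). Passing `runner=print` shows the commands without executing them.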

To respond to your individual questions:

  1. We don't currently parallelize the IEDB predictions for the fasta subsets. This is on our to-do list but is further down our list of priorities.
  2. The default --fasta-size is 200. If you are using a local IEDB install, the files don't necessarily need to be subset, so you can increase this value if desired. I'm not sure this will speed things up very much, though.
  3. The --downstream-sequence-length is set to 1000 by default. This value is really up to you. For a frameshift, the whole downstream tail is novel, so ideally you would want to make predictions for all of the epitopes in the downstream sequence. However, the longer the sequence, the longer it takes IEDB to make predictions for it. We've found that 1000 is a good number that still returns results in a reasonable amount of time while containing a large number of novel epitopes. You can reduce this number if you are OK with potentially missing some novel epitopes.
  4. The biggest time saving will come from reducing the number of alleles. Assuming 6 class I alleles for a person's HLA type, you would be making 6 calls to IEDB instead of 78, so your processing time would decrease by roughly 90% (assuming everything else stays the same).
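The numbers in the answers above can be checked with a little arithmetic. The sketch below uses two hypothetical helpers, `kmer_count` and `call_reduction`. It assumes that run time scales linearly with the number of alleles, which is the "everything else stays the same" assumption from the answer, and it counts every 9-mer window in a downstream sequence as a candidate epitope.

```python
def kmer_count(sequence_length, k=9):
    """Number of length-k windows in a sequence of the given length."""
    return max(sequence_length - k + 1, 0)


def call_reduction(all_alleles, sample_alleles):
    """Fraction of per-allele IEDB calls saved by using only the sample's alleles."""
    return 1 - sample_alleles / all_alleles


# A 1000-residue downstream tail yields 992 candidate 9-mers per frameshift.
print(kmer_count(1000))                  # 992

# Going from 78 alleles to 6 removes about 92% of the calls.
print(round(call_reduction(78, 6), 3))   # 0.923
```

Halving --downstream-sequence-length roughly halves the number of 9-mers scored per frameshift, at the cost of possibly missing novel epitopes further downstream.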

ShixiangWang commented 6 years ago

I couldn't agree more. Thanks @susannasiebert.