Would you be able to share the output directory for this run with us? This contains the temporary file that is causing this error so it would help us to figure out what the contents of the file are that it's trying to parse.
@susannasiebert Thanks for your reply. Please find attached both the VCF and the output directory of a sample that gave the same error (I am not sure this corresponds to the exact run that produced the output above, but the files are definitely from a run that failed with the same error). mysample_pvacseq.zip
It looks like for some of the tmp output files for MHCflurry, the percentile column doesn't contain any values. I haven't encountered this before. I'm not sure what might be causing it. Could you execute the following command and post what it returns for you: mhcflurry-predict --alleles HLA-C*01:02 --peptides GFGPRDAD
@susannasiebert I executed the command as you suggested from within the pvactools singularity container and this was the output:
Forcing tensorflow backend.
2021-08-18 20:07:13.764487: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-08-18 20:07:13.800209: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2600000000 Hz
2021-08-18 20:07:13.801002: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fec50000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-08-18 20:07:13.801035: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-08-18 20:07:13.807433: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-18 20:07:13.807466: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-18 20:07:13.807492: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (int000): /proc/driver/nvidia/version does not exist
WARNING:root:No flanking information provided. Specify --no-flanking to silence this warning
Predicting processing.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.98s/it]
Predicting affinities.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.51s/it]
/usr/local/lib/python3.7/site-packages/mhcflurry/class1_affinity_predictor.py:1021: UserWarning: Allele HLA-C*01:02 has no percentile rank information
warnings.warn(msg)
allele,peptide,mhcflurry_affinity,mhcflurry_affinity_percentile,mhcflurry_processing_score,mhcflurry_presentation_score,mhcflurry_presentation_percentile
HLA-C*01:02,GFGPRDAD,27978.666327944356,,0.0005653205935232108,0.003611682294810309,99.28660326086957
Looks like this is the problem: /usr/local/lib/python3.7/site-packages/mhcflurry/class1_affinity_predictor.py:1021: UserWarning: Allele HLA-C*01:02 has no percentile rank information
I confirmed that this problem also occurs in the docker container. I think something might've gone wrong when I originally created it. I re-created the image from scratch, confirmed that the above command now works, and updated both 2.0.3 and latest to use the new image (new sha256:b2e70954e73cfab5a8e428c87e61861436d785548cfc193b820322afd74080f0). Please recreate your singularity image and test the above command again. If it now returns a percentile rank, I believe you can rerun your pVACseq runs; you will need to run them from scratch again.
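For reference, re-creating the Singularity image from the updated Docker Hub tags should look roughly like this (the .sif filenames are just examples):
# Pull a fresh image so the updated sha256 is picked up; --force overwrites an existing .sif.
singularity pull --force pvactools_2.0.3.sif docker://griffithlab/pvactools:2.0.3
# or, for the latest tag:
singularity pull --force pvactools_latest.sif docker://griffithlab/pvactools:latest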
Hi @susannasiebert, I re-pulled the new image as you suggested and I didn't get the error (the sample completed successfully). However, I am now getting this error for other samples:
An exception occured in thread 14: (<class 'Exception'>, An error occurred while calling MHCflurry:
2021-08-19 19:19:55.662375: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-08-19 19:19:55.691008: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2599945000 Hz
2021-08-19 19:19:55.693342: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7eff98000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-08-19 19:19:55.693374: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-08-19 19:19:55.707345: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-19 19:19:55.707372: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-19 19:19:55.707395: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ca091): /proc/driver/nvidia/version does not exist
WARNING:root:No flanking information provided. Specify --no-flanking to silence this warning
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:30<00:00, 30.86s/it]
100%|██████████| 1/1 [00:30<00:00, 30.86s/it]
0%| | 0/1 [00:00<?, ?it/s]).
Traceback (most recent call last):
File "/usr/local/bin/pvacseq", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/tools/pvacseq/main.py", line 95, in main
args[0].func.main(args[1])
File "/usr/local/lib/python3.7/site-packages/tools/pvacseq/run.py", line 122, in main
pipeline.execute()
File "/usr/local/lib/python3.7/site-packages/lib/pipeline.py", line 475, in execute
self.call_iedb(chunks)
File "/usr/local/lib/python3.7/site-packages/lib/pipeline.py", line 374, in call_iedb
p.print("Making binding predictions on Allele %s and Epitope Length %s with Method %s - File %s - Completed" % (a, epl, method, filename))
File "/usr/local/lib/python3.7/site-packages/pymp/__init__.py", line 148, in __exit__
raise exc_t(exc_val)
Exception: An error occurred while calling MHCflurry:
2021-08-19 19:19:55.662375: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-08-19 19:19:55.691008: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2599945000 Hz
2021-08-19 19:19:55.693342: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7eff98000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-08-19 19:19:55.693374: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-08-19 19:19:55.707345: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-19 19:19:55.707372: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-19 19:19:55.707395: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ca091): /proc/driver/nvidia/version does not exist
WARNING:root:No flanking information provided. Specify --no-flanking to silence this warning
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:30<00:00, 30.86s/it]
100%|██████████| 1/1 [00:30<00:00, 30.86s/it]
0%| | 0/1 [00:00<?, ?it/s]
slurmstepd: error: Detected 83 oom-kill event(s) in step 27430023.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Any idea what this might be due to?
I think the actual error here is this: slurmstepd: error: Detected 83 oom-kill event(s) in step 27430023.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler
which sounds like you ran out of memory. Can you try running on a machine with more memory?
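For what it's worth, the amount of memory a Slurm job can use is set with the --mem directive in the job script; a minimal skeleton might look like this (all values below are placeholders to adjust for your cluster and data):
#!/bin/bash
#SBATCH --job-name=pvacseq_mysample   # placeholder job name
#SBATCH --cpus-per-task=4             # match the --n-threads value passed to pvacseq
#SBATCH --mem=64G                     # raise this if the cgroup oom-killer still fires

# The existing singularity exec ... pvacseq run command from this thread goes here.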
@susannasiebert Thanks for the suggestion! I will try this out and let you know what I get. Quick question regarding the variant types considered by pVACseq: are stop-gain and/or stop-loss mutations (either SNVs or indels) excluded?
Stop-gain and stop-loss mutations will be processed by pVACseq as long as they are also annotated as an SNV, frameshift indel, or inframe indel.
@susannasiebert Regarding the MHCflurry error above (slurmstepd: error: Detected 83 oom-kill event(s)), it was indeed due to insufficient memory. Thanks for the suggestion!
Regarding the pvacseq parameters, is there a way to disable the binding affinity (IC50) filtering? It would be equivalent to setting it to a very high value (infinity), but I am not sure if there is a proper way to disable it, for example if I want to filter only on the percentile.
Also, is there a way to run NetMHCstabpan for all epitopes and not only the filtered ones?
Many thanks for your help!
Unfortunately, there is no outright way to disable binding affinity filtering while still filtering on the percentile. As you suggested, setting the binding threshold to a very large number would be the best course of action. Something like 1,000,000 should be large enough.
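As a rough sketch (the input VCF, sample name, allele, and output directory below are placeholders, and please double-check the option names against pvacseq run --help):
# A very high binding threshold effectively disables the IC50 filter;
# --percentile-threshold can then be used to filter on percentile rank instead.
pvacseq run \
    input.vcf SAMPLE HLA-A*02:01 NetMHCpan output_dir \
    --binding-threshold 1000000 \
    --percentile-threshold 2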
Right now there is no way to run NetMHCstabpan on the all epitopes file. However, we will be adding a standalone command in the next major version release to run just this step, which you can then use on your all epitopes file outside of the prediction runs. If you are interested in testing this out, I can create an alpha release docker container for you that has this command.
Hi @susannasiebert, I tried setting the binding affinity threshold to a very high value but I am now getting an empty .filtered.tsv file for all my samples. This is the command I ran for all my samples:
singularity exec -B /local/Reda/test_mysample:/home/test_mysample /local/Reda/pVACtools_installation_no_MHCflurry_bug/pvactools_latest.sif \
pvacseq run \
--iedb-install-directory /opt/iedb \
--keep-tmp-files \
--n-threads 4 \
--fasta-size 100000 \
--class-i-epitope-length 8,9,10 \
--binding-threshold 10000000 \
--net-chop-method cterm \
--netmhc-stab \
--run-reference-proteome-similarity \
/home/test_mysample/mysample.genotyped.vep.vcf \
mysample \
$(cat /local/Reda/haplotypes_reformatted_for_pvacseq/mysample.txt) \
NetMHCpan \
/home/test_mysample
Can you share your input VCF with me for further debugging?
@susannasiebert here's one example vcf. Thanks! mysample.genotyped.vep.vcf.zip
@susannasiebert I dug into this a bit and I think the issue of the empty .filtered.tsv files has to do with the inclusion of --net-chop-method cterm, --netmhc-stab, or --run-reference-proteome-similarity rather than with --binding-threshold 10000000. I am not sure what is wrong with these three parameters.
Are you sure you're running version 2.0.3? A problem with the same symptom was resolved in version 2.0.1 (https://pvactools.readthedocs.io/en/latest/releases/2_0.html#version-2-0-1). I'm running with all of these parameters on the 2.0.3 docker container and it's taking quite a bit of time to run these three steps, which wouldn't be the case if any of them returned an empty file.
From inside of your singularity container, what does pip show pvactools return?
Hi @susannasiebert, I ran pip show pvactools and this is what I got:
Name: pvactools
Version: 2.0.3
Summary: A cancer immunotherapy tools suite
Home-page: https://github.com/griffithlab/pVACtools
Author: Jasreet Hundal, Susanna Kiwala, Joshua McMichael, Yang-Yang Feng, Christopher A. Miller, Aaron Graubert, Amber Wollam, Connor Liu, Jonas Neichin, Megan Neveau, Jason Walker, Elaine R. Mardis, Obi L. Griffith, Malachi Griffith
Author-email: help@pvactools.org
License: BSD-3-Clause-Clear
Location: /usr/local/lib/python3.7/site-packages
Requires: mhcflurry, networkx, biopython, swagger-spec-validator, simanneal, PyVCF, pandas, tensorflow, Pillow, pysam, PyYAML, flask-cors, jsonschema, pymp-pypi, tornado, requests, connexion, bokeh, wget, vaxrank, mhcflurry, watchdog, mhcnuggets, mhcnuggets, py-postgresql, mock
Required-by:
I am indeed running the 2.0.3 version that contains the MHCflurry bug fix (the one you created after I reported the original error in this thread).
Did you manage to get results from the VCF I sent? I also noticed that when I run the command interactively (i.e. not as a submitted slurm job) it takes quite a bit of time. However, when the command is run as a submitted job, I always get an empty .filtered.tsv file. Do NetMHCstab, NetChop, and/or the reference proteome BLAST need an internet connection to run?
@susannasiebert I confirmed that the issue (i.e. getting an empty .filtered.tsv file containing only the header) occurs when I submit multiple singularity pvacseq jobs in parallel (one job per sample). I honestly struggle to understand why this happens. Could this be due to some NetMHCstabpan server restriction on the number of requests? Also, I had a look at the netmhc_stab.py script (https://github.com/griffithlab/pVACtools/blob/master/lib/netmhc_stab.py) and noticed that the config file referenced in line 67 does not exist within my pvactools singularity container; in fact, /var/www does not even exist. Is this normal?
I'm not sure, to be honest. Can you provide some more information about how you kick off your parallel jobs so I can try to reproduce this on my end?
RE the config file. This is submitted as a parameter to the NetMHCstabpan API and is a file on their server, not in the docker container.
@susannasiebert I use a bash loop to submit the same pvacseq script for each sample, so the jobs are submitted one after the other but effectively run in parallel. I have been trying to figure this out all weekend but I still struggle to understand why I get an empty .filtered.tsv whenever I launch my jobs this way. When I test individual samples, or even two, the returned .filtered.tsv is not empty and contains the NetMHCstabpan predictions.
And you execute your bash loop from inside of your docker container (as opposed to launching parallel docker run jobs)?
I execute the bash loop outside the singularity container, i.e. the bash loop launches the singularity exec command
@susannasiebert is there a way to run netMHCstabpan (and netChop) without making use of the web APIs? i.e. run them as local installations from within pvactools
Unfortunately, pVACseq only supports running these tools via their API.
And may I know what the exact command is to make a call to, say, NetMHCpan within pvactools?
The logic to call an IEDB algorithm can be found here: https://github.com/griffithlab/pVACtools/blob/master/lib/prediction_class.py#L53.
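If I'm reading that code correctly, with a local IEDB installation this essentially boils down to invoking the standalone predict_binding.py script from the install directory. Roughly (the mhc_i path, the netmhcpan_ba method label, and epitopes.fasta are assumptions about a typical /opt/iedb layout, so verify them inside your container):
# Sketch of a direct call to the IEDB standalone class I predictor.
# Method labels differ between IEDB releases (e.g. netmhcpan vs. netmhcpan_ba).
singularity exec pvactools_latest.sif \
    python /opt/iedb/mhc_i/src/predict_binding.py \
    netmhcpan_ba HLA-A*02:01 9 epitopes.fasta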
OK, I tried to replicate the issue you're describing as follows:
#test single run without NetMHCstabpan
docker run -v /Users/ssiebert/Documents/Work/pVACtools:/data griffithlab/pvactools:2.0.3 pvacseq run /data/pvacseq_example_data/input.vcf Test HLA-A*02:01 MHCflurry /data/test_out_10 -e1 9
#test single run with NetMHCstabpan
docker run -v /Users/ssiebert/Documents/Work/pVACtools:/data griffithlab/pvactools:2.0.3 pvacseq run /data/pvacseq_example_data/input.vcf Test HLA-A*02:01 MHCflurry /data/test_out_11 -e1 9 --netmhc-stab
#test multiple runs
for i in 12 13 14 15 16; do docker run -v /Users/ssiebert/Documents/Work/pVACtools:/data griffithlab/pvactools:2.0.3 pvacseq run /data/pvacseq_example_data/input.vcf Test HLA-A*02:01 MHCflurry /data/test_out_$i -e1 9 --netmhc-stab; done
#check outputs
wc -l test_out_1*/MHC_Class_I/Test.filtered.tsv
4 test_out_10/MHC_Class_I/Test.filtered.tsv
4 test_out_11/MHC_Class_I/Test.filtered.tsv
4 test_out_12/MHC_Class_I/Test.filtered.tsv
4 test_out_13/MHC_Class_I/Test.filtered.tsv
4 test_out_14/MHC_Class_I/Test.filtered.tsv
4 test_out_15/MHC_Class_I/Test.filtered.tsv
4 test_out_16/MHC_Class_I/Test.filtered.tsv
So doing that I'm unable to replicate the issue you're seeing. Can you share with me the script you use to launch your runs?
@susannasiebert Thanks very much for trying. Please find attached the scripts I use to launch my pvacseq runs. The launching command is the following:
bash launching_loop.sh -i /cluster/working/Reda/pvacseq_runs/vcf_paths.txt \
-o /cluster/working/Reda/pvacseq_runs/out \
-f /cluster/working/Reda/ref/hg19.fasta \
-t NetMHCpan \
-s false \
-c /cluster/working/Reda/pVACtools_installation_no_MHCflurry_bug/pvactools_latest.sif
where launching_loop.sh is the bash loop script that reads the input VCFs from vcf_paths.txt (each line in this file corresponds to one input VCF, i.e. one sample). Each loop iteration launches a slurm job whose script is main_pvacseq_script.sh (this script first annotates the VCF in the right format if necessary and then calls pvacseq).
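In simplified form, the loop does something like this (the argument handling here is illustrative, not my exact scripts):
# Illustrative only: one slurm job per VCF listed in vcf_paths.txt.
while read -r vcf; do
    sample=$(basename "$vcf" .genotyped.vep.vcf)
    sbatch --job-name="pvacseq_${sample}" \
        main_pvacseq_script.sh "$vcf" "/cluster/working/Reda/pvacseq_runs/out/${sample}"
done < /cluster/working/Reda/pvacseq_runs/vcf_paths.txt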
Thanks for your help.
I was able to replicate your issue, and as you suspected, NetMHCstabpan and NetChop limit the number of jobs you can run concurrently on their server. I get the following message:
You have reached the limit of queued and active jobs for this service from a single site.
In other words, you already have jobs submitted to our service.
Please wait until your current job(s) are finished and resubmit again.
I will try and work on a way to retry requests when this message is encountered.
Thanks! Great that this is confirmed. What I don't get is 1) why I do not get this message and 2) why pvacseq finishes its run without any error but with an empty .filtered.tsv file. It would be great to find a way around this. Probably the best long-term solution would be to call NetMHCstabpan and NetChop as local installations, without any CGI/API call whatsoever.
@susannasiebert On a side note, is there a way to know the maximum number of jobs/requests that the NetMHCstabpan/NetChop CGI accepts at any one time?
@susannasiebert Also, a way to mitigate this might be to increase the chunk size so that fewer requests are posted for the same (current) sample? https://github.com/griffithlab/pVACtools/blob/71cd12351cafc73057be87e8c717b1be59801ca1/lib/netmhc_stab.py#L43
You don't get this message because the API still returns a 200, so we assume the request succeeded and try to parse the output (which contains the above message instead of predictions). This inadvertently results in the empty output file as well. I will also address that part of the problem.
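For illustration, the kind of retry wrapper I have in mind would look roughly like this (a sketch only, not the actual fix; everything except the quoted limit message is hypothetical):
# Re-run a submission command while the server's job-limit message comes back
# in an otherwise successful (HTTP 200) response.
submit_with_retry () {
    local limit_msg="You have reached the limit of queued and active jobs"
    local response attempt
    for attempt in 1 2 3 4 5; do
        # "$@" is the full submission command (e.g. a curl call) to run.
        response=$("$@")
        if printf '%s' "$response" | grep -q "$limit_msg"; then
            # Server is at its concurrency limit; back off and try again.
            sleep $((60 * attempt))
        else
            printf '%s\n' "$response"
            return 0
        fi
    done
    return 1
}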
It looks like their API only allows one concurrent job.
I just made a new release (2.0.4) that should resolve the issue you're seeing with running multiple jobs in parallel while using NetMHCstabpan/NetChop. I'm closing this issue but please feel free to reopen should you still encounter problems.
@susannasiebert @malachig Could you please help me with the below issue? I get it while using MHCflurry.
I am getting the following error for some samples:
This is the command I ran (using singularity):
Full output:
For some other samples, the pipeline finished completely and successfully. Any idea as to why I am getting the error above?