Error executing process with NeoFuse

antaralabiba97 commented 2 years ago

Hi,

I came across your pipeline not long ago and found it would be great for my research so I am really keen to make sure I can get it running.

I set the pipeline up and did not input the HLA-HD file as I wanted to first make sure I could get the pipeline running to predict MHCI neoepitopes first. The only thing I did change was in the process.conf file where I changed the CPU usage for all processes to 40.

I have attached the HTML file detailing the error which stopped the pipeline running at the NeoFuse stage. I went and had a look at the "sample1_MHCI_final.log" file as detailed in the error and the ".command.sh" script file in the working directory stated in the pdf file of the error which I have attached. In the pdf file the first line of the error has been omitted when I converted from HTML to pdf but it stated "Error executing process > 'Neofuse (sample1)".

I was happy that it was running smoothly and the prepoccessing steps had been completed successfully however, unfortunaley an error arised. I am actually not sure how to resolve the error stated and would really appreciate any help.

Thank you in advance!

Nextflow Workflow Report.pdf sample1_MHCI_final.log

riederd commented 2 years ago

Hi, I'm sorry that you are running into these troubles.

In order to sort this out I'd need some more information:

Please tell us which version of nextNEOpi you are running
Post the .nextflow.log file from the failed run.
Post the .command.run file from the /data/SBCS-BessantLab/Antara/nextNEOpi/work/a9/f5d9ded3bc5bc9dfbb2f33b7f58074

Thanks

antaralabiba97 commented 2 years ago

Thank you for your prompt response.

I am using version nextNEOpi_v1.3.1 which I believe is the latest? I set this up on 21st of May using the latest documents and data available on GitHub. Also, I am running only one sample at the moment from the TESLA consortium data used to benchmark your pipeline. I have WES normal and tumor and tumor rnaseq data.

I have attached the files you have asked for. Just changed the extension from .run to .txt so I can attach it here.

Please let me know if you require any other info.

Thank you.

nextflow.log command.run.txt

riederd commented 2 years ago

Thanks for the information. We are checking the relevant code in NEOfuse and try to understand why you get this error. Can you help us meanwhile with the following:

post the following log file /data/SBCS-BessantLab/Antara/nextNEOpi/work/a9/f5d9ded3bc5bc9dfbb2f33b7f58074/sample1/LOGS/sample1_8_MHCFlurry.log
check if /data/SBCS-BessantLab/Antara/nextNEOpi/work/a9/f5d9ded3bc5bc9dfbb2f33b7f58074/sample1/NeoFuse/tmp/MHC_I/sample1_8_NEK11_ALDH1L1_1_8.tsv exists
rerun (use -resume) with a smaller number of CPUs in NEOfuse (e.g. 8)

antaralabiba97 commented 2 years ago

sample1_8_MHCFlurry.log

As you can see from the image the sample1_8_NEK11_ALDH1L1_1_8.tsv does not exist.

Also, just changed CPU parameters to 8 so will try and resume and let you know what happens.

Thank you.

riederd commented 2 years ago

Thanks! It seems that mhcflurry is not running or is being stopped/killed just shortly after it starts up.

Can you try to run it manually, e.g.:

$ singularity exec --no-home /data/SBCS-BessantLab/Antara/nextNEOpi/work/singularity/apps-01.i-med.ac.at-images-singularity-NeoFuse_dev_0d1d4169.sif /bin/bash
singularity> mhcflurry-predict --affinity-only --alleles A*02:01 --peptides TPDPGAEV --out /tmp/test_1.txt --models /home/neofuse/.local/share/mhcflurry/4/2.0.0/models_class1_pan/models.combined
[...]
singularity> cat /tmp/test_1.txt

antaralabiba97 commented 2 years ago

Hi, I have run the above and have attached what the output looks like. Please let me know the next step. Thank you again! Really appreciate the help

riederd commented 2 years ago

Ok, it seems that in principle mhcflurry works. Let's see if your run with less CPUs completes.

antaralabiba97 commented 2 years ago

Hi,

So I was having issues with the jobs I had submitted to run on the cluster so decided to kill the previous runs and start fresh with the amended CPU 8 defined for NeoFuse. I set up the directory as I did before but noticed this time the link to the resources file on your GitHub is unreachable? So I used the resources folder I had already created. However, when I run the pipeline again there is no route to the host to pull the singularity image ...(attached screenshot). Added the nextflow log too.

I understand this is a different issue and if you would rather me post on a new thread please let me know! Very keen to get this pipeline working and looking forward to hopefully doing this soon!

Thank you!

nextflow.log

riederd commented 2 years ago

Hi, unfortunately we had an electricity issue last night which affects the server on which the resource is located. The bad thing is that there is a holiday and long weekend now so it might take until Monday to get this fixed since not all things involved are in our hands. We are sorry for this.

riederd commented 2 years ago

The resource download should work again.

antaralabiba97 commented 2 years ago

Ah great, will test the new run shortly, thank you. Will let you know if I encounter the same problems with MHCflurry (hopefully not!)

antaralabiba97 commented 2 years ago

Hello,

So I tried running again with changing the CPUs to 8 for NeoFuse but unfortunately I encountered the same error as before. I have added the .nextflow.log , sample1_8_MHCFlurry.log and the sample1_MHCI_final.log. The sample1_8_NEK11_ALDH1L1_1_8.tsv does not exist in the location specficed in the error. Also, tried the singularity command again which you posted above and got the same output.

Not sure what is causing the issue with this missing file but please do let me know on ways to get this sorted.

Thank you!

command.run.txt sample1_8_MHCFlurry.log sample1_MHCI_final.log nextflow.log

antaralabiba97 commented 2 years ago

Hello,

I was just wondering if you have had a chance to take a look at this error? Appreciate you may be busy but please let me know if there is any solution when you get the time! Thank you :)

riederd commented 2 years ago

Hi, we were out of office the last days. We continue to look into this and keep you updated. Meanwhile, can you try to do the following:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/30/21453b69c882f412b58dee6149538c
$ bash .command.run.sh

antaralabiba97 commented 2 years ago

Hi, no worries and thank you.

Also, tried this but there is no ".command.run.sh" file. Below are the files available in the directory you specified.

riederd commented 2 years ago

sorry I meant .command.run

antaralabiba97 commented 2 years ago

Hi, I ran the above and this is the error output...

riederd commented 2 years ago

Hi, it seems you are running into a resource limit. Can you post the output of:

$ ulimit -a

Can you do this on the head node of your cluster and on one of the compute nodes, in case you are running nextNEOpi on a cluster.

antaralabiba97 commented 2 years ago

Hello, I am just running on the head node and not submitting a job on the cluster, this is the output on the head node.

riederd commented 2 years ago

hmmm, this is stange, I do not see a big difference to our settings here. These two settings differ:

pending signals                 (-i) 12383285
max locked memory       (kbytes, -l) unlimited

But I don't think this is the problem.

What puzzles my also is that the featureCounts step takes so long in your case, more than 4 hrs. I would expect 10-20 min, as we see here in our environment.

Just to make sure if the Resource temporarily unavailable issue is really temporary can you please try once more to run

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/30/21453b69c882f412b58dee6149538c
$ bash .command.run

Thanks

antaralabiba97 commented 2 years ago

Hi, I ran again and now get the error below

riederd commented 2 years ago

So you really seem to hit some resource limit on your machine. Do you have many other processes running on that machine? Can you check with:

$ ps -eLf | wc -l
$ ps -eLf | grep hfy006 | wc -l

You can try to raise some limits:

$ ulimit -n 4096
$ ulimit -l unlimited
$ ulimit -u 8192

and then run the .command.run script again.

antaralabiba97 commented 2 years ago

I am unable to change "ulimit -l" to unlimited as it is locked however, I have changed the other two parameters and will re-run. The only other processes I had running were the mhcflurry-predict which did not fully terminate after the previous run exited.

Will keep you updated. Thank you!

antaralabiba97 commented 2 years ago

After doing the above running the ".command.run script" completed! I had to kill the previous processes which were still running from the nextflow run that exited with the MHCflurry error. Please let me know on how to proceed from the stage the pipeline exited. Thank you for all your help so far, glad to be one step closer!

riederd commented 2 years ago

That's good! Now I suggest to do the following:

create a file in the nextNEOpi bin/ directory named set_limits.sh with the following content:
```
ulimit -n 4096
ulimit -u 8192
```

Edit the conf/process.config file in the nexNEOpi directory and look for:

withName:Neofuse {
container = 'https://apps-01.i-med.ac.at/images/singularity/NeoFuse_dev_0d1d4169.sif'
cpus = 10
}

change it to:

withName:Neofuse {
beforeScript = 'source /data/SBCS-BessantLab/Antara/nextNEOpi/bin/set_limits.sh'
container = 'https://apps-01.i-med.ac.at/images/singularity/NeoFuse_dev_0d1d4169.sif'
cpus = 10
}

rerun the pipeline with the -resume option set.

antaralabiba97 commented 2 years ago

Hi,

So I did all the above and the Neofuse part of the run completed and I have the output folder for this in my results! Thank for the help on this part, it's really appreciated!

However, I now have an error during the pVACseq stage which is causing the process to exit. I have attached the files associated with the error.

nextflow.log command.run.txt command.sh.txt

Feels like I'm nearly there so I am very excited for the run to complete and then hopefully once I have a working pipeline I will be able to run my other samples!

riederd commented 2 years ago

Hi, can you also try to reduce the number of CPUs to 10 for pVACseq?

For a manual test you may do this by editing command.sh in /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a and set the threads parameter from -t 40 to -t 10. After this you may run:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a
$ bash .command.run

if this works, you may change the cpus in conf/process.config

antaralabiba97 commented 2 years ago

Hi, I tried doing the above, and the process aborts

riederd commented 2 years ago

can you try to do the following:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a
$ rm -rf ./MHC_Class*
$ bash .command.run

antaralabiba97 commented 2 years ago

The process exits with "Error: No command specified".

riederd commented 2 years ago

Hmmm, I think you hit an issue in pVACseq, which might be solved in the newest version. I'll prepare an updated image this evening. Meanwhile, can you send me a tar archive from that working directory, so that I can test locally. You would need to create it as follows:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09
$ tar -chvzf testdata.tar.gz 7305b172968e7a9bc25e0b59f2eb8a

Please send me a private e-mail with a download link for the resulting testdata.tar.gz.

antaralabiba97 commented 2 years ago

Sent the email, please let me know if you do not receive it. Thank you.

riederd commented 2 years ago

One more thing to try:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a
$ rm -rf ./MHC_Class*
$ singularity exec --no-mount hostfs -B /data/SBCS-BessantLab/Antara/nextNEOpi -B "$PWD" --no-home -B /data/SBCS-BessantLab/Antara/nextNEOpi/assets -B /data/SBCS-BessantLab/Antara/nextNEOpi/tmpDir -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/iedb:/opt/iedb -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/mhcflurry_data:/opt/mhcflurry_data /data/SBCS-BessantLab/Antara/nextNEOpi/work/singularity/apps-01.i-med.ac.at-images-singularity-pVACtools_3.0.0_icbi_5dfca363.sif /bin/bash
Singularity> bash .command.sh

riederd commented 2 years ago

...and in case you get an netMHCstab error please look for the following line in .command.sh:

--netmhc-stab

and remove it and re-run the commands above. NetMHCstab is run from a webservice which is not always working as expected. It can be disabled in nextNEOpi with the option --use_NetMHCstab false

antaralabiba97 commented 2 years ago

I didn't get the netMHCstab error but the same error as before "Error: No command specified".

riederd commented 2 years ago

Did you get pandas warnings with this?

antaralabiba97 commented 2 years ago

Yes I did, the same as before

riederd commented 2 years ago

This is interesting, you should not get those. Can you check the pandas version and path for me:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a
$ rm -rf ./MHC_Class*
$ singularity exec --no-mount hostfs -B /data/SBCS-BessantLab/Antara/nextNEOpi -B "$PWD" --no-home -B /data/SBCS-BessantLab/Antara/nextNEOpi/assets -B /data/SBCS-BessantLab/Antara/nextNEOpi/tmpDir -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/iedb:/opt/iedb -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/mhcflurry_data:/opt/mhcflurry_data /data/SBCS-BessantLab/Antara/nextNEOpi/work/singularity/apps-01.i-med.ac.at-images-singularity-pVACtools_3.0.0_icbi_5dfca363.sif /bin/bash
Singularity> pip show pandas

and

Singularity> python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> print(pd.__version__)

Can you then try the new test image that I prepared:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a
$ rm -rf ./MHC_Class*
$ singularity exec --no-mount hostfs -B /data/SBCS-BessantLab/Antara/nextNEOpi -B "$PWD" --no-home -B /data/SBCS-BessantLab/Antara/nextNEOpi/assets -B /data/SBCS-BessantLab/Antara/nextNEOpi/tmpDir -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/iedb:/opt/iedb -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/mhcflurry_data:/opt/mhcflurry_data https://apps-01.i-med.ac.at/images/singularity/pVACtools_3.0.1_icbi_test_20220609.sif /bin/bash
Singularity> pip show pandas

and then:

Singularity> bash .command.sh

antaralabiba97 commented 2 years ago

Did all the steps to check python and pandas version which are the same as yours.

Tried the new test image and it worked!

When I go and look in the folder with the outputs I don't see the "sample1_tumor.filtered.tsv" file but just the filtered results file for HLA-A02:01 "sample1_tumor_HLA-A02:01.filtered.tsv".

For now, I have not included "HLA-HD" in the pipeline so no MHC-II predictions are generated but once this run finishes completely I will go back to include it.

riederd commented 2 years ago

Cool. Thanks!

The final filtered result for the entire sample is generated by nextNEOpi after collecting the parallelized junks. So what you see is expected.

I did not disclose my pandas version ;-) so I don't think you can state that it is the same as yours. Would it be possible for you to post the output of:

$ cd /data/SBCS-BessantLab/Antara/nextNEOpi/work/09/7305b172968e7a9bc25e0b59f2eb8a
$ singularity exec --no-mount hostfs -B /data/SBCS-BessantLab/Antara/nextNEOpi -B "$PWD" --no-home -B /data/SBCS-BessantLab/Antara/nextNEOpi/assets -B /data/SBCS-BessantLab/Antara/nextNEOpi/tmpDir -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/iedb:/opt/iedb -B /data/SBCS-BessantLab/Antara/nextNEOpi/resources/databases/mhcflurry_data:/opt/mhcflurry_data /data/SBCS-BessantLab/Antara/nextNEOpi/work/singularity/apps-01.i-med.ac.at-images-singularity-pVACtools_3.0.0_icbi_5dfca363.sif /bin/bash
Singularity> pip show pandas

antaralabiba97 commented 2 years ago

Haha you're right I mean just the python version!

Here's the output of the above:

riederd commented 2 years ago

Thanks! This is very interesting, python from the singularity image is using the pandas package that is installed in your home directory, which is not working with pVACseq in the image. In principle you should not see anything from your home directory from within the container since we use the --no-home and --no-mount hostfs options to startup the container. This is working fine here, e.g. see what happens if I want to change to my home dir from within the container:

Singularity> cd ~
bash: cd: /home/rieder: No such file or directory

May I ask which version of singularity you use?

antaralabiba97 commented 2 years ago

Hmm, yes I understand just had a read around this. This is the version:

riederd commented 2 years ago

Thanks, I'll try to reproduce this

riederd commented 2 years ago

I think I get a clue what happens a your site. Can you please post the output of:

grep "bind path" /etc/singularity/singularity.conf

antaralabiba97 commented 2 years ago

Here is the output for the above command:

riederd commented 2 years ago

yes, here we go

bind path = /data

tells singularity to bind mount /data from the host to /data in the container. Now, your user home $HOME is located in /data i.e. /data/home/hfy006 this way, no matter if we tell singularity to not mount the user home (--no-home), it will still be present in the container because it gets mounted by default via the explicit bind path = /data directive in the global config.

When importing a library, Python is first looking in the user home under $HOME/.local/lib/... for a matching package and if finds one it will use it. Now, if the package has an incompatible version you will get warnings or errors or any sort of unexpected behavior.

So the quickest fix is to remove the bind path = /data from /etc/singularity/singularity.conf, since you will be likely to hit these package/library conflicts also with other singularity containers, it can happen not only with python packages but also - for example - with R libraries. However, I'm not sure if this is something you/your admin are/is concerned about and there maybe some important reasons why you/your admin set this configuration as it is.

I need to check if there is any other way to avoid this situation. Since this is not a specific nextNEOpi bug I'll close the issue for now, but feel free to reopen it.

Thanks a lot for all your input!

antaralabiba97 commented 2 years ago

Ok, I understand the issue now.

For now, I am running nextNEOpi on the cluster so I will ask the admin team to see if we can work around this. I do have my own custom-built PC arriving soon which is designed to run pipelines like nextNEOpi locally without memory or performance problems so may be able to avoid the issue above.

I will get back to you once I am able to work off the cluster and hopefully be able to run the pipeline smoothly! Thanks for all your help thus far :)

riederd commented 2 years ago

One thing, that may work would be to set a "fake home" in the params.conf which points to the tmpDir

e.g.

singularity {
    enabled = true
    autoMounts = true
    runOptions =  "--no-home" + " -H " + params.singularityTmpMount + " -B " +  params.singularityAssetsMount + " -B " + params.singularityTmpMount + " -B " + params.resourcesBaseDir + params.singularityHLAHDmount + " -B " + params.databases.IEDB_dir + ":/opt/iedb" + " -B " + params.databases.MHCFLURRY_dir + ":/opt/mhcflurry_data"
}

Might work, but this is untested, so I have no idea if other problems pop up with is hack.

antaralabiba97 commented 2 years ago

Ok, I will try and hope for the best 😅

Will let you know what happens.

icbi-lab / nextNEOpi

Error executing process with NeoFuse #5