Closed Kkaholoaa closed 1 year ago
Hey Kaiku,
woooow you really dug deep there!
Your first question: The PID (process ID) is the internal number that the operating system (at least Unix/Linux/Mac - bit different on Windows) uses to identify a process or program that you are running. It's just the way that programs are internally organized and addressed. If you want to end a process externally, for example (because it might be stuck, or something), you can usually call kill 1234, with the PID of that process in place of 1234.
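To make the PID idea concrete, here is a minimal Python sketch (Unix only) that starts a throwaway child process, prints its PID, and then ends it the same way kill would. The helper name pid_exists and the sleeping child are made up for illustration and are not part of grenepipe or Snakemake.

```python
import os
import signal
import subprocess
import sys


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (Unix only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False  # no process with that PID
    except PermissionError:
        return True  # the process exists, but belongs to another user
    return True


# Start a child process that would just sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("child PID:", child.pid)
print("exists before kill:", pid_exists(child.pid))

# Equivalent of calling `kill <PID>` in the shell: send SIGTERM to that PID.
os.kill(child.pid, signal.SIGTERM)
child.wait()  # collect the exit status so the OS can release the PID
print("exit status:", child.returncode)
```

A negative exit status here means the child was ended by a signal rather than finishing on its own. A "PID not found" style message is what a supervising program tends to report when the process it was watching has already gone away.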
Your second question: From what you wrote, I really have no idea where the problem is coming from. I don't think that filtering for existing PIDs in the slurm script is the way to go though. This seems to be an issue within the trimmomatic script itself. The cluster and slurm are not concerned with the internals of a rule once it is running. Or why did you come to the conclusion that slurm is involved at all?
Based on the error that you reported, I'd instead look into the trimmomatic rule, and see where and why it is breaking. The exact code for the rule execution is here, which you can copy into a script, and instead of having the rule execute the wrapper script, have it execute your copy of the script. You can then add debugging log messages to it to see how far it gets and where it breaks.
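As a rough sketch of what such debugging log messages could look like in a copy of the script: the helper below logs to stderr before and after each external command, and prints the captured stderr if the command fails. The name run_step and the placeholder commands are assumptions for illustration; the real trimmomatic call from the rule would go in their place.

```python
import logging
import subprocess
import sys

# Log to stderr so the messages end up next to the tool's own error output.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("rule_debug")  # hypothetical logger name


def run_step(name, cmd):
    """Run one external command, logging before and after it."""
    log.debug("starting step %r: %s", name, cmd)
    result = subprocess.run(cmd, capture_output=True, text=True)
    log.debug("step %r finished with exit code %d", name, result.returncode)
    if result.returncode != 0:
        log.error("stderr of step %r:\n%s", name, result.stderr)
    return result


# Placeholder commands standing in for the actual tool invocation.
run_step("sanity check", [sys.executable, "-c", "print('hello')"])
run_step("failing step", [sys.executable, "-c", "import sys; sys.exit(3)"])
```

The last log line that shows up before the crash then tells you how far the script got.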
As a pre-test though, to see if this is actually being caused by that particular rule, you could try to use another trimming tool, and see if that works. That would then hint at the trimmomatic script being the issue. If that however also fails with a similar error, then I'd say you are right in the assumption that the error is coming more from a snakemake/slurm thing.
Hope that helps, cheers Lucas
Oh wait, there is more: You could also try running the pipeline from within an interactive session on the cluster, using --cores instead of --profile, so that it runs "locally" within that interactive session, instead of submitting the rules as jobs. Failure or success of that would also give strong hints as to where this error is coming from.
Lucas
Hi Lucas,
Sorry for the late reply, I really wanted to try to troubleshoot this on my own but it seems I hit a wall - probably because of my lack of computational knowledge. The good news is I was able to identify that the issue seems to be with my wrappers, and I would really appreciate any advice you have for me to troubleshoot this...
To begin, I switched trimming tools from trimmomatic to adapterremoval, and it worked! At the same time, I also started receiving errors for fastqc:
By looking into the fastqc rule, I found yet another PID error:
Per your suggestion, I also ran grenepipe locally using --cores instead of --profile, and still came across the PID issues, meaning that it is probably not slurm that is causing this issue.
As it turns out, both trimmomatic and fastqc use snakemake wrappers, while adapter removal does not (it uses a shell script instead). This tells me that the error might have something to do with snakemake wrappers, or the way that slurm is interacting with them.
Here, the PID errors point to the snakemake wrapper in both Trimmomatic and Fastqc, but with adapter removal there is no snakemake wrapper and the trim was successful:
FastQC:
Trimmomatic:
Adapter Removal:
To explore this issue further, I attempted some debugging by adding print functions (I simply added echo statements to get more info on why the wrappers don't work... but I don't really understand debugging just yet), hoping to identify where in both wrappers (trimmomatic and fastqc) the errors occurred. Although my debugging didn't work at all, I did find that grenepipe also began to map my trimmed reads from adapter removal onto my reference genome. This then led to the discovery that BWA also does not work and, to no surprise, produces yet another PID error that points to a snakemake wrapper:
As a result, I am convinced the error is with snakemake wrappers, but google does not do a very good job of explaining the issue or any potential ways to resolve it, and this is where I need your help.
Here is some of my environment information. I noticed that the version of my snakemake wrapper utils is not the version recommended on the snakemake website, and I also noticed that everything in the grenepipe environment was created by conda-forge except for snakemake, which was created by bioconda.
I also found some links that may be helpful, where some people are saying that their cluster does not support internet access (which snakemake requires), and another post that describes that subworkflows have been deprecated in favor of snakemake modules (unsure if this is relevant here).
https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html
Finally, I thought I'd send this over just in case: here is the snakemake command that I am using to run grenepipe... maybe there is an error in the way I am running grenepipe??? For context, I use tmux to create a new session, then I activate the grenepipe environment, and finally I use the following command to run grenepipe from within the grenepipe directory, pointing to my analysis directory where both my config.yaml and my sample_table.tsv files are located.
Overall, do you have any advice for me at this time? Anything on how to resolve this snakemake wrapper issue would be appreciated...
Thank you! Kaiku
Hey Kaiku,
interesting, there is some progress there! Your thorough analysis helps, but it has one flaw, as seen in one of your screenshots:
The call to the Snakemake wrapper in the FastQC rule is actually commented out, and we are using a local script instead there.
Hence, your statement "As a result, I am convinced the error is with snakemake wrappers" might not be the case ;-) I think that the error is some part of the script crashing due to some other error, so that something stops running, and the PID error is just the visible effect of that, but not the cause.
What you could try: Following the Cluster Troubleshooting steps, locate the .err and .out slurm log files of the rule that crashed. In your case, for example from your screenshot, you'd want to look for the slurm log files for job IDs 19203847 and 19203864, and see if there are any hints in there. Maybe out of memory, or some other trouble with the program. It is admittedly weird that two different tools kind of crash with the same consequences (PID not found), but it might be that they both crash for unrelated reasons, and the PID thing is just the effect that this has on Snakemake. Not sure at this point. If you could post those .err and .out files, we might find more hints.
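If there are many log files, a small script can help to find the relevant ones and pull out suspicious lines. The sketch below is only an illustration under several assumptions: that the logs end in .err or .out, that the job ID is part of the file name, that the logs live in a directory called slurm-logs (a made-up name), and that the keyword list catches the typical trouble (it is a guess, not an official list of Slurm messages).

```python
import re
from pathlib import Path

# Assumed keywords that often hint at a crashed job; extend as needed.
HINTS = re.compile(
    r"out of memory|oom[-_ ]kill|killed|cancelled|time limit|traceback|error",
    re.IGNORECASE,
)


def scan_logs(log_dir, job_ids):
    """Map each matching .err/.out file to its suspicious (line number, line) pairs."""
    root = Path(log_dir)
    if not root.is_dir():
        return {}
    hits = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in {".err", ".out"}:
            continue
        # Assumption: the slurm job ID is part of the log file name.
        if not any(job_id in path.name for job_id in job_ids):
            continue
        lines = path.read_text(errors="replace").splitlines()
        hits[path] = [(no, line) for no, line in enumerate(lines, 1) if HINTS.search(line)]
    return hits


# "slurm-logs" is a hypothetical directory; the job IDs are the two from above.
for path, matches in scan_logs("slurm-logs", ["19203847", "19203864"]).items():
    print(path)
    for line_no, line in matches:
        print(f"  {line_no}: {line}")
```

A file that is listed without any matching lines is a hint in itself: the job left a log, but nothing in it looks like a direct error from the tool.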
Cheers Lucas
Hi Lucas, apologies for not making it clear, but the PID errors are from the slurm logs (see below), and the normal logs are coming up as empty.
Slurm log 19203864:
Slurm log 19203847:
Logs > FastQC is filled with a bunch of empty logs:
I agree that it seems like something weird is happening on the back end, and hopefully there is somewhere else that I could look for hints?
Please let me know what you think, Thank you! Kaiku
Hm, so, this is not an issue with the wrappers (as per my answer above), and not caused by a tool in the scripts or wrappers failing directly with any usable error... It could be an issue with the cluster itself - some misconfiguration or limitation that we are not aware of. Could you try running it on the Carnegie cluster instead?
Other than that, I have no good idea as of now, unfortunately. We can debug together once we are both back in the same room ;-)
Hello, after running this on the Carnegie cluster instead of Stanford's Sherlock Cluster, the pipeline now works!!
Although it works on the Carnegie Cluster, I will attempt to work with the Stanford IT staff to get this issue resolved on the Sherlock cluster, and will post our findings here just in case any other clusters out there may have the same issue...
Thank you so much! Best, Kaiku
Awesome, so then this does not seem to be a grenepipe issue per-se, but more related to some cluster infrastructure or configuration. I'll close the issue here then - but feel free to add further findings. As you said, it might be relevant for others as well. And we can re-open should it turn out to be related to grenepipe after all ;-)
Thanks again for your persistence, and curious to hear more on this story, if you ever find out what's going on there!
Hi Lucas,
Tl;dr: What is a PID, and how can I amend my slurm profile to fix it?
To begin, Thank you for your “troubleshooting” page on the grenepipe wiki, as it helped me identify a specific error that was happening across all my samples.
The error is regarding PID, which (after a lot of google searches) means that the process ID is off, and it is occurring during my paired end read trimming step. Here is the error I got from the slurm log:
And here is the rule for trim_reads_pe for reference:
Problem Solving/Troubleshooting Attempt:
After trying to understand the problem, I came to the conclusion that the problem may be with the slurm profile, as snakemake may have some sort of issue working with Stanford’s server, Sherlock. Sherlock does in fact use slurm, and so I am convinced that I would need to amend grenepipe's provided slurm profile so that it can run properly. However, after doing some digging in grenepipe's slurm profile, including the three files below, I found that grenepipe does not have any instances where it describes a Process ID.
Additionally, I do not think it is in slurm_utils.py because it seems that file was not meant to be amended, so my gut tells me to leave that one alone.
Overall, my question for you is “What is a PID, and how can I amend my slurm profile to fix this bug?”
My best educated guess to resolve this is to add some code to one of those files that says something like: if PID == “not found”: PID = pid
What do you think? Do you have any thoughts on whether I’m heading in the right direction?
Thank you! Kaiku Kaholoa'a