Closed pedro-mmartins closed 3 months ago
Hi Pedro,
Can you share the slurm and other logs for a job that failed? I don't see any errors in the job logs that you posted.
Tim
Hi! I found this one:
```
Using shell: /usr/bin/bash
Provided cores: 160
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, cpus=1
Select jobs to execute...

[Fri Sep 1 13:03:09 2023]
rule run_phyloacc_gt:
    input: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg, /home/martins/PhyloACC/Astral_tree.tre
    output: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13_elem_lik.txt
    log: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log
    jobid: 0
    reason: Missing output files: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13_elem_lik.txt
    wildcards: gt_batch=13
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, partition=long, nodes=1, mem=1g, time=1:00:00, cpus=1

PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log
/usr/bin/bash: line 1: 651964 Segmentation fault (core dumped) PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log
[Fri Sep 1 13:03:09 2023]
Error in rule run_phyloacc_gt:
    jobid: 0
    input: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg, /home/martins/PhyloACC/Astral_tree.tre
    output: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13_elem_lik.txt
    log: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log (check log file(s) for error details)
    shell:
        PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
```
Some batches seem to work, but still, no elem file is ever generated.
How long have you waited to see if output is produced? The log you posted earlier, phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log, looks to me like it is still running (in fact it looks like it is just starting). Are you seeing jobs that slurm/snakemake report as complete that have both no errors and no output?
As far as the error is concerned, a segmentation fault can sometimes be caused by not having enough memory, but it can also have other, more complex causes. The first thing I'd try is giving the jobs more than 1 GB of RAM, which seems low (the default is 4 GB). You could try 8 GB to see if that works.
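One way to raise the per-job memory would be to re-generate the job files with a larger `-mem` value and then re-launch snakemake. This is a sketch based on your earlier command; the `-mem` value appears to map to GB, since `-mem 1` produced `mem=1g` in the snakemake logs, but check `phyloacc.py -h` on your version to confirm:

```shell
# Re-generate the snakemake job files, requesting roughly 8 GB per batch
# instead of 1 GB (paths and other options taken from the earlier command).
phyloacc.py -d /home/martins/PhyloACC/All_spp -m ave_noncons_named.mod \
    -l Astral_tree.tre -t Pazu -n 20 -j 20 \
    -mem 8 -part long -r gt -o PhyloPazu
```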
@gwct @HanY-H @xyz111131 any other ideas?
> How long have you waited to see if output is produced?

This was just an example run I did right now. But I've waited longer before, and I got the same problem.
> a segmentation fault can sometimes be caused by not having enough memory

I didn't realize that. I'll give it a try and I'll get back to you. Thanks for that idea!
Just to be clear, you are saying that you have seen jobs that Snakemake/slurm report as complete, with no errors in the log, but also no output produced?
Could you also share a log of a completed job that did not produce output so we can try to figure out what is going on there?
> Just to be clear, you are saying that you have seen jobs that Snakemake/slurm report as complete, with no errors in the log, but also no output produced?

Yes, that's it.
> Could you also share a log of a completed job that did not produce output so we can try to figure out what is going on there?

I'll try to find one, but I might have lost it. I'll do a run with 8 GB of memory, and I'll update you with the log files.
This is all kind of confusing, and can be hard to track down between all the log files. A couple of things to clarify:

1. For the jobs that appear to be running correctly, do they get cancelled when the other jobs error out? PhyloAcc-GT runs can take quite a while, and if they are still running, or get cancelled while running, the log file will look like the one you posted (phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log).
2. When you say "no elem file is ever generated", do you mean the elem file for that particular run, or the final elem file in PhyloPazu/results/? If you're referring to the latter, that will only be generated when you run phyloacc_post.py, which we shouldn't try until we resolve the other errors.

Otherwise, I agree with Tim that not enough memory is a likely cause for the segmentation faults. The SLURM log for any particular failed run would likely contain an OUT_OF_MEMORY flag if you can track down one of those.
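If your cluster has SLURM accounting enabled, `sacct` is a quick way to check whether a failed batch was killed for memory rather than digging through log files. A sketch (`<jobid>` is a placeholder for the job ID from the SLURM log):

```shell
# Query SLURM accounting for a finished job. A memory kill shows up as
# State=OUT_OF_MEMORY, or as a MaxRSS value near the requested memory limit.
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed
```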
> For the jobs that appear to be running correctly, do they get cancelled when the other jobs error out? PhyloAcc-GT runs can take quite a while, and if they are still running, or get cancelled while running, the log file will look like the one you posted

I always get this kind of output, so I think it gets cancelled.
> When you say "no elem file is ever generated", do you mean the elem file for that particular run, or the final elem file in PhyloPazu/results/?

I meant the ones for each run, the 13_elem_lik.txt, for example. I tried to run phyloacc_post.py, but it doesn't work.
I'm trying to do a run with more memory. I'll keep you posted.
But thanks for all the help so far!
No problem! If the other ones get cancelled when one errors out this would all make sense.
Another thing to try if increasing memory doesn't help is to run some of the jobs individually without snakemake, just to try and resolve the error:
```
PhyloAcc-GT PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg
```

Or whichever batch you want to run, just replacing the `13` with the batch number. This would be best because it removes snakemake and SLURM from the picture. But be careful, because this would be running on your login node, which might have limited resources to begin with. You could also submit the command above in a SLURM script to at least remove snakemake from the picture. After the error is resolved, you can run the rest with snakemake.
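For the "SLURM script without snakemake" variant, a minimal submission script might look like the following. The partition, memory, and time values here are guesses loosely based on the resources in your logs; adjust them for your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=phyloacc-gt-13
#SBATCH --partition=long
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=12:00:00
#SBATCH --output=phyloacc-gt-13.%j.out

# Run a single batch directly, bypassing snakemake entirely.
PhyloAcc-GT PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg
```

Save as, e.g., `run_batch_13.sh` and submit with `sbatch run_batch_13.sh`; the SLURM log it produces should show any OUT_OF_MEMORY or TIMEOUT state directly.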
That seems like a good idea. I'll definitely try it out and let you guys know what happens. Thanks!
Hello, again!
It took me a while to get back to you guys. It seems like doing it one-by-one will work.
But now I have a different problem. I'm using some new data, and the first script (phyloacc.py) seems to have an issue.
This is my command line:
```
phyloacc.py -d /home/martins/PhyloACC/Aln -m ave_noncons_named.mod -l Astral_tree.tre -t Pazu -n 20 -p 20 -j 20 -mem 20 -part long -r gt -o PhyloPazu
```
And this is the error message I see on the screen:
```
# 09.22.2023 09:19:01 Reading input FASTA files Success: 3851 files read 1.26477 0.8165 58.52344 16803.53125
# 09.22.2023 09:19:02 Calculating alignment stats In progress...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/site-packages/phyloacc_lib/seq.py", line 270, in locusAlnStats
    aln_len = len(aln[list(aln.keys())[0]]);
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/martins/anaconda3/envs/phylo/bin/phyloacc.py", line 145, in <module>
    globs = SEQ.alnStats(globs);
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/site-packages/phyloacc_lib/seq.py", line 338, in alnStats
    for result in pool.imap(locusAlnStats, ((locus, globs['alns'][locus], globs['aln-skip-chars']) for locus in globs['alns'])):
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
IndexError: list index out of range
```
I figure it has something to do with my alignments, but do you know what might be causing this?
Thanks!
Hmm, so the ones that were crashing or leaving no output when running through snakemake/slurm are running fine when you just run them individually in the shell (no SLURM)? That means the culprit was likely a lack of time or memory when the jobs were submitted through SLURM, but unfortunately, without seeing the SLURM logs of one that errored out, it's hard to tell.
For the new error, it does look like it's having trouble reading at least one alignment. From just that error message, I would guess one of the alignments is empty, but something else could be going on. A quick way to check whether any of the files are empty would be:

```
find [path to directory with alignments] -type f -empty
```
If nothing comes up, I'll need to see some of your alignments to see if anything stands out.
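A completely empty file is not the only way to end up with an empty alignment dict and that IndexError: a file that has content but no FASTA headers would parse to zero sequences too. A quick shell sketch to flag both cases, assuming your alignments use `>` headers (the `ALN_DIR` default is just the path from the earlier command; point it at your own directory):

```shell
# Flag alignment files that are empty, or that exist but contain no FASTA
# headers ('>' lines) -- either case would leave the parsed alignment empty.
# ALN_DIR is the directory you pass to phyloacc.py with -d.
ALN_DIR="${ALN_DIR:-/home/martins/PhyloACC/Aln}"

for f in "$ALN_DIR"/*; do
    [ -f "$f" ] || continue
    if [ ! -s "$f" ]; then
        echo "EMPTY FILE: $f"
    elif ! grep -q '^>' "$f"; then
        echo "NO SEQUENCES: $f"
    fi
done
```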
Yep! They seem to be running just fine. I'll test the new set of alignments and let you know what happens. Thanks!
Thanks for the tip, but that doesn't seem to be the cause. What would be the best way for me to send you some of the alignments? May I email them to you?
I hope they don't take too long to run individually!
Go ahead and send them to my email: gthomas [at] g [dot] harvard [dot] edu
Hello,
I've encountered some issues I have not been able to solve when using PhyloAcc.
This is my command line:
```
phyloacc.py -d /home/martins/PhyloACC/All_spp -m ave_noncons_named.mod -l Astral_tree.tre -t Pazu -n 20 -j 20 -mem 1 -part long -r gt -o PhyloPazu
```
The first output seems to be OK:

![unnamed](https://github.com/phyloacc/PhyloAcc/assets/70378525/e1e665cc-98e7-4a57-8642-db961f965ac7)
Yet, when I run the snakemake command:

```
snakemake -p -s /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/snakemake/run_phyloacc.smk --configfile /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/snakemake/phyloacc-config.yaml --profile /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/snakemake/profiles/slurm_profile --cores 20
```

![unnamed](https://github.com/phyloacc/PhyloAcc/assets/70378525/3d98e35e-76ca-41ad-849b-10db6d4047ee)
I always have an issue with the phyloacc_gt part. This is what comes up: these "run_phyloacc_gt" files never seem to be generated. When I perform a trial run without the `--dryrun` option and read the log files, I see that some inputs are read and some aren't, but in all cases the jobs seem to be incomplete.
This is what appears on my screen when trying to run:
This is an example of a log file (phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log):
This is the slurm log for the same batch:
Are you able to help me?
Thanks!