phyloacc / PhyloAcc

PhyloAcc is software to detect changes in the conservation of genomic regions
GNU General Public License v3.0

phyloacc_gt issues #53

Closed: pedro-mmartins closed this issue 3 months ago

pedro-mmartins commented 10 months ago

Hello,

I've encountered some issues I have not been able to solve when using PhyloAcc.

This is my command line: phyloacc.py -d /home/martins/PhyloACC/All_spp -m ave_noncons_named.mod -l Astral_tree.tre -t Pazu -n 20 -j 20 -mem 1 -part long -r gt -o PhyloPazu

The first output seems to be OK (screenshot attached).

Yet, when I run the snakemake command

snakemake -p -s /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/snakemake/run_phyloacc.smk --configfile /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/snakemake/phyloacc-config.yaml --profile /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/snakemake/profiles/slurm_profile --cores 20

I always have an issue with the phyloacc_gt part. This is what comes up (screenshot attached).

These "run_phyloacc_gt" files never seem to be generated. When I perform a trial run withou the "--dryrun" option and I read the log files,. I see that some inputs are read, some aren't, but in all cases the jobs seems to be incomplete.

This is what appears on my screen when trying to run:

Error in rule run_phyloacc_gt:
    jobid: 23
    input: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/23-gt.cfg, /home/martins/PhyloACC/Astral_tree.tre
    output: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/23-phyloacc-gt-out/23_elem_lik.txt
    log: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/23-phyloacc-gt-out/23-phyloacc.log (check log file(s) for error details)
    shell:

        PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/23-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/23-phyloacc-gt-out/23-phyloacc.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error executing rule run_phyloacc_gt on cluster (jobid: 23, external: 651526, jobscript: /home/martins/PhyloACC/.snakemake/tmp.hjp809xj/snakejob.run_phyloacc_gt.23.sh). For error details see the cluster log and the log files of the involved rule(s).

This is an example of a log file (phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log):

Loading input data and running parameters......
Loading program configurations from /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/37-gt.cfg......
  # total length = 20351 (50). # Species = 11. # elements = 50. Mean gene set size = 407.0200.
# Burn-ins = 500. # MCMC Updates = 1000. # thin = 1.  RND SEED = 1.
# Threads = 1

Loading phylogenetic tree from /home/martins/PhyloACC/ave_noncons_named.mod......
Loading phylogenetic tree in coalescent unit from /home/martins/PhyloACC/Astral_tree.tre...
The species in profile and tree match perfectly. Reorder the species in profile matrix by the tree.

InitPhyloTree finished
50 elements to be computed
element 0, number of base pair=256

This is the slurm log for the same batch:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 160
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, cpus=1
Select jobs to execute...

[Fri Sep  1 13:03:09 2023]
rule run_phyloacc_gt:
    input: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/37-gt.cfg, /home/martins/PhyloACC/Astral_tree.tre
    output: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/37-phyloacc-gt-out/37_elem_lik.txt
    log: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log
    jobid: 0
    reason: Missing output files: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/37-phyloacc-gt-out/37_elem_lik.txt
    wildcards: gt_batch=37
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, partition=long, nodes=1, mem=1g, time=1:00:00, cpus=1

        PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/37-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log

Are you able to help me?

Thanks!

tsackton commented 10 months ago

Hi Pedro,

Can you share the slurm and other logs for a job that failed? I don't see any errors in the job logs that you posted.

Tim

pedro-mmartins commented 10 months ago

Hi! I found this one:

Using shell: /usr/bin/bash
Provided cores: 160
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, cpus=1
Select jobs to execute...

[Fri Sep  1 13:03:09 2023]
rule run_phyloacc_gt:
    input: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg, /home/martins/PhyloACC/Astral_tree.tre
    output: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13_elem_lik.txt
    log: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log
    jobid: 0
    reason: Missing output files: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13_elem_lik.txt
    wildcards: gt_batch=13
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, partition=long, nodes=1, mem=1g, time=1:00:00, cpus=1

        PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log

/usr/bin/bash: line 1: 651964 Segmentation fault      (core dumped) PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log
[Fri Sep  1 13:03:09 2023]
Error in rule run_phyloacc_gt:
    jobid: 0
    input: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg, /home/martins/PhyloACC/Astral_tree.tre
    output: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13_elem_lik.txt
    log: /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log (check log file(s) for error details)
    shell:

        PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg &> /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/phyloacc-output/13-phyloacc-gt-out/13-phyloacc.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.

Some batches seem to work, but still, no elem file is ever generated.

tsackton commented 10 months ago

How long have you waited to see if output is produced? The log you posted earlier, phyloacc-output/37-phyloacc-gt-out/37-phyloacc.log, looks to me like it is still running (in fact it looks like it is just starting). Are you seeing jobs that slurm/snakemake report as complete that have both no errors and no output?

As far as the error is concerned, a segmentation fault can sometimes be caused by not having enough memory, but can also have other complex causes. I guess the first thing I'd try is giving the jobs more than 1 GB of RAM, which seems low (the default is 4 GB). You could probably try 8 GB to see if that works.
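For example, re-running the setup step with a larger per-job memory request would look something like this (it's just your original command with -mem raised from 1 to 8; I'm assuming everything else stays the same):

phyloacc.py -d /home/martins/PhyloACC/All_spp -m ave_noncons_named.mod -l Astral_tree.tre -t Pazu -n 20 -j 20 -mem 8 -part long -r gt -o PhyloPazu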

@gwct @HanY-H @xyz111131 any other ideas?

pedro-mmartins commented 10 months ago

"How long have you waited to see if output is produced?" This was just an example run I did right now. But I've waited longer before, and I got the same problem.

"A segmentation fault can sometimes be caused by not having enough memory." I didn't realize that. I'll give it a try and get back to you. Thanks for that idea!

tsackton commented 10 months ago

Just to be clear, you are saying that you have seen jobs that Snakemake/slurm report as complete, with no errors in the log, but also no output produced?

Could you also share a log of a completed job that did not produce output so we can try to figure out what is going on there?

pedro-mmartins commented 10 months ago

"Just to be clear, you are saying that you have seen jobs that Snakemake/slurm report as complete, with no errors in the log, but also no output produced?" Yes, that's it.

"Could you also share a log of a completed job that did not produce output so we can try to figure out what is going on there?" I'll try to find one, but I might have lost it. I'll do a run with 8 GB of memory and update you with the log files.

gwct commented 10 months ago

This is all kind of confusing, and it can be hard to track down between all the log files. So far we have:

- some batches that error out with a segmentation fault, and
- other batches that seem to run but for which no elem_lik output has appeared yet.

A couple of things to clarify:

- For the jobs that appear to be running correctly, do they get cancelled when the other jobs error out? PhyloAcc-GT runs can take quite a while, and if they are still running, or get cancelled while running, the log file will look like the one you posted.
- When you say "no elem file is ever generated", do you mean the elem file for that particular run, or the final elem file in PhyloPazu/results/?

Otherwise, I agree with Tim that not enough memory is a likely cause for the segmentation faults. The SLURM log for any particular failed run would likely leave an OUT_OF_MEMORY flag if you can track down one of those.
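If you can track down the SLURM job ID of a failed batch (651526 was the external job ID in the error you posted earlier), sacct should show its state and memory usage, assuming job accounting is enabled on your cluster, e.g.:

sacct -j 651526 --format=JobID,State,ExitCode,MaxRSS,ReqMem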

pedro-mmartins commented 10 months ago

"For the jobs that appear to be running correctly, do they get cancelled when the other jobs error out? PhyloAcc-GT runs can take quite a while, and if they are still running, or get cancelled while running, the log file will look like the one you posted." I always get this kind of output, so I think it gets cancelled.

When you say "no elem file is ever generated", do you mean the elem file for that particular run, or the final elem file in PhyloPazu/results/ I meant the ones for each run, the 13_elem_lik.txt, for example. I tried to run phyloacc_post.py, but it doesn't work.

I'm trying to do a run with more memory now. I'll keep you posted.

But thanks for all the help so far!

gwct commented 10 months ago

No problem! If the other ones get cancelled when one errors out this would all make sense.

Another thing to try if increasing memory doesn't help is to run some of the jobs individually without snakemake, just to try and resolve the error:

PhyloAcc-GT PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg

Or whichever batch you want to run, just replacing the 13 with the batch number. This would be best because it removes snakemake and SLURM from the picture. But be careful, because this would be running on your login node, which might have limited resources to begin with. You could also submit the command above in a SLURM script to at least remove snakemake from the picture (see the sketch below). Once the error is resolved, you can run the rest with snakemake.
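As a rough sketch, a minimal submission script for that could look like the following (the partition, memory, and time values are just placeholders; adjust them for your cluster):

#!/bin/bash
#SBATCH --partition=long
#SBATCH --mem=8G
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=1
#SBATCH --output=phyloacc-gt-13-%j.out

# Run a single PhyloAcc-GT batch directly, outside of snakemake
PhyloAcc-GT /home/martins/PhyloACC/PhyloPazu/phyloacc-job-files/cfgs/13-gt.cfg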

pedro-mmartins commented 10 months ago

That seems to be a good idea. I'll definitely try it out and let you guys know what happens. Thanks!

pedro-mmartins commented 9 months ago

Hello, again!

It took me a while to get back to you guys. It seems like running the batches one by one will work.

But now I have a different problem. I'm using some new data, and now the first script (phyloacc.py) seems to have an issue. This is my command line:

phyloacc.py -d /home/martins/PhyloACC/Aln -m ave_noncons_named.mod -l Astral_tree.tre -t Pazu -n 20 -p 20 -j 20 -mem 20 -part long -r gt -o PhyloPazu

And this is the error message I see on the screen:

# 09.22.2023  09:19:01  Reading input FASTA files               Success: 3851 files read                1.26477             0.8165          58.52344                 16803.53125
# 09.22.2023  09:19:02  Calculating alignment stats             In progress...                          multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/site-packages/phyloacc_lib/seq.py", line 270, in locusAlnStats
    aln_len = len(aln[list(aln.keys())[0]]);
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/martins/anaconda3/envs/phylo/bin/phyloacc.py", line 145, in <module>
    globs = SEQ.alnStats(globs);
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/site-packages/phyloacc_lib/seq.py", line 338, in alnStats
    for result in pool.imap(locusAlnStats, ((locus, globs['alns'][locus], globs['aln-skip-chars']) for locus in globs['alns'])):
  File "/home/martins/anaconda3/envs/phylo/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
IndexError: list index out of range

I figure it has something to do with my alignments, but do you know what might be causing this?

Thanks!

gwct commented 9 months ago

Hmm, so the ones that were crashing or leaving no output when running through snakemake/slurm are running fine when you just run them individually in the shell (no SLURM)? That means the culprit was likely a lack of time or memory when submitting the jobs via SLURM, but unfortunately, without seeing the SLURM logs of one that errored out, it's hard to tell.

For the new error, it does look like it's having trouble reading at least one alignment. From just that error message I would guess one of the alignments is empty, but something else could be going on. A quick way to check whether any of the files are empty would be:

find [path to directory with alignments] -type f -empty
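It's also worth checking for files that exist but contain no FASTA headers, since the IndexError suggests at least one parsed alignment ended up with zero sequences. grep's -L option lists files with no matching lines, so something like this would flag them (adjust the path for your setup):

grep -rL '>' [path to directory with alignments]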

If nothing comes up, I'll need to see some of your alignments to see if anything stands out.

pedro-mmartins commented 9 months ago

Yep! They seem to be running just fine. I'll test the new set of alignments and let you know what happens. Thanks!

Thanks for the tip, but that doesn't seem to be the cause. What would be the best way for me to send you some of the alignments? May I email them to you?

gwct commented 9 months ago

I hope they don't take too long to run individually!

Go ahead and send them to my email: gthomas [at] g [dot] harvard [dot] edu