gphocs-dev / G-PhoCS

G-PhoCS is a software package for inferring ancestral population sizes, population divergence times, and migration rates from individual genome sequences.

Segmentation fault (core dumped) #43

Open llanos-garrido opened 6 years ago

llanos-garrido commented 6 years ago

When I run G-PhoCS with a single locus containing my entire dataset of aligned SNPs, I get the following error:

Reading sequence data... 1 loci, as specified in sequence file
Reading loci (.=100 loci)...
Segmentation fault (core dumped).

Is there a problem with using "one locus" in this way? My problem is that I lost my .loci file during the variant calling process... Thank you for your help. Alex.

gphocs-dev commented 6 years ago

Running G-PhoCS with one locus is not recommended. The problem you're getting is likely a memory issue caused by too much data in a single locus. The expectation is that each locus will have up to a few hundred distinct site patterns. If you dump everything into one locus, you violate this assumption, and the results will be meaningless (even if you are able to run the program).
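To get a feel for the "few hundred distinct site patterns" expectation, one can count the distinct alignment columns in a locus before running the program. The following is only a rough sketch: it treats every distinct column as one pattern, which may differ from how G-PhoCS itself collapses patterns, and the function name and toy sequences are hypothetical.

```python
from collections import Counter


def count_site_patterns(sequences):
    """Count distinct alignment columns (site patterns) in one locus.

    `sequences` is a list of equal-length strings, one per sample.
    Assumption: each distinct column counts as one pattern.
    """
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences in a locus must have the same length")
    # zip(*sequences) walks the alignment column by column.
    return Counter("".join(column) for column in zip(*sequences))


# Toy locus with 3 samples and 6 sites (hypothetical data).
locus = ["ACGTAC", "ACGTAT", "ACCTAC"]
patterns = count_site_patterns(locus)
print(len(patterns), "distinct site patterns")
```

A locus whose pattern count runs far beyond a few hundred is a sign that too much data has been packed into it.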

FatihSarigol commented 5 years ago

Hello, Here is my version, what can you suggest?

Reading sequence data...  237290 loci, as specified in sequence file 6ORCAS.
Reading loci (.=100 loci)
...
...after running several hours
...
...
./RunGphocs: line 1: 27776 Segmentation fault      /ddn/data/cjqr89/GPhoCS/G-PhoCS/bin/G-PhoCS KillerWhales.ctl

I have 6 populations, each with a single sample. Each locus has 10,000 bases, and there are no indels in the samples.

Thank you! :)

FatihSarigol commented 5 years ago

Hello again, do you have any suggestions for my case from my last post? Was the sequence length of my loci too long? Or did I have too many loci? Thanks

gphocs-dev commented 5 years ago

Sorry, I missed your previous post. I'm not sure which of the two factors is contributing to the segmentation fault here. It could be a combination of both. Regardless, you should shorten your loci and thin them out. 10,000 bp per locus seems too long, because you're going to have many unmodeled recombination events per locus (the model assumes no recombination within each locus). So unless recombination rates are very low in killer whales, you should bring it down to 1,000 bp or less. You should also make sure that your loci are spread far enough apart, because the model assumes free recombination between loci. Note that a typical G-PhoCS analysis covers only a few percent of the genome (our human analysis covered ~40 Mb of sequence). You can even start with much less to get some quick results. 5,000-10,000 loci should be good enough for starters.
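The shortening and thinning described above could be sketched roughly as follows. This is an illustration only: the function name, the (chromosome, start, end) tuple layout, and especially the `min_gap` default are assumptions, since the thread gives no concrete minimum distance between loci.

```python
def thin_loci(candidates, locus_len=1000, min_gap=50_000, max_loci=10_000):
    """Trim candidate loci to `locus_len` bp and keep well-separated ones.

    `candidates` is a list of (chromosome, start, end) tuples.
    `min_gap` is a hypothetical spacing; choose it for your organism.
    """
    kept = []
    last_end = {}  # chromosome -> end coordinate of the last kept locus
    for chrom, start, end in sorted(candidates):
        if end - start < locus_len:
            continue  # too short to trim down to locus_len
        if chrom in last_end and start - last_end[chrom] < min_gap:
            continue  # too close to the previously kept locus
        kept.append((chrom, start, start + locus_len))
        last_end[chrom] = start + locus_len
    if len(kept) > max_loci:
        # Subsample evenly so the kept loci still span the whole genome.
        step = len(kept) / max_loci
        kept = [kept[int(i * step)] for i in range(max_loci)]
    return kept


# Ten back-to-back 10 kb windows on one chromosome (hypothetical input).
candidates = [("chr1", s, s + 10_000) for s in range(0, 100_000, 10_000)]
thinned = thin_loci(candidates, locus_len=1000, min_gap=20_000)
print(thinned)
```

Subsampling evenly rather than taking the first `max_loci` entries avoids concentrating the analysis on the first few chromosomes.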

FatihSarigol commented 5 years ago

Hello again. I first reduced my locus length to 1,000 bp and selected the top 50,000 loci; that run gave a segmentation fault. Next, I also reduced the number of loci to 1,000 to make sure nothing else was causing the problem, but that too gave a segmentation fault. Both runs started properly, like the initial run. Can you think of any way that the settings in my control file could make the model so complicated that it requires too much memory after some step, no matter how short or few the loci are? For example, the tau-initial values, or the number of migration bands between populations? I am sharing my full control file below; please let me know if you see anything I should change. Many thanks for your guidance!

GENERAL-INFO-START

        seq-file            first1Kloci1Kb
        trace-file          first1Kloci1Kb.log
        locus-mut-rate          CONST

        mcmc-iterations   5000
        iterations-per-log  50
        logs-per-line       10

        find-finetunes          FALSE
        finetune-coal-time      0.01
        finetune-mig-time       0.3
        finetune-theta          0.04
        finetune-mig-rate       0.02
        finetune-tau            0.0000008
        finetune-mixing         0.003
#   finetune-locus-rate 0.3

        tau-theta-print         10000.0
        tau-theta-alpha         1.0                     # for STD/mean ratio of 100%
        tau-theta-beta          10000.0         # for mean of 1e-4

        mig-rate-print          0.001
        mig-rate-alpha          0.002
        mig-rate-beta           0.00001

GENERAL-INFO-END

CURRENT-POPS-START

                POP-START
                                name            Norway
                                samples         Norway d
                POP-END

                POP-START
                                name            NPresident
                                samples         NPresident d
                POP-END

                POP-START
                                name            NPtransient
                                samples         NPtransient d
                POP-END

                POP-START
                                name            SouthAfrican
                                samples         SouthAfrican d
                POP-END

                POP-START
                                name            Antarctic
                                samples         Antarctic d
                POP-END

                POP-START
                                name            MarionIsland
                                samples         MarionIsland d
                POP-END

CURRENT-POPS-END

ANCESTRAL-POPS-START

                POP-START
                                name            KW1
                                children                Norway          NPresident
                                tau-initial             0.00001
                POP-END

                POP-START
                                name            KW2
                                children                KW1             NPtransient
                                tau-initial             0.00002
                POP-END

                POP-START
                                name            KW3
                                children                KW2             SouthAfrican
                                tau-initial             0.00003
                POP-END

                POP-START
                                name            KW4
                                children                KW3             Antarctic
                                tau-initial             0.00004
                POP-END

                POP-START
                                name            root
                                children                KW4             MarionIsland
                                tau-initial             0.00005
                POP-END

ANCESTRAL-POPS-END

MIG-BANDS-START

                BAND-START
                                source          Norway
                                target          NPresident
                BAND-END

                BAND-START
                                source          Norway
                                target          NPtransient
                BAND-END

                BAND-START
                                source          Norway
                                target          SouthAfrican
                BAND-END

                BAND-START
                                source          Norway
                                target          Antarctic
                BAND-END

                BAND-START
                                source          Norway
                                target          MarionIsland
                BAND-END

                BAND-START
                                source          NPresident
                                target          Norway
                BAND-END

                BAND-START
                                source          NPresident
                                target          NPtransient
                BAND-END

                BAND-START
                                source          NPresident
                                target          SouthAfrican
                BAND-END

                BAND-START
                                source          NPresident
                                target          Antarctic
                BAND-END

                BAND-START
                                source          NPresident
                                target          MarionIsland
                BAND-END

                BAND-START
                                source          NPtransient
                                target          Norway
                BAND-END

                BAND-START
                                source          NPtransient
                                target          NPresident
                BAND-END

                BAND-START
                                source          NPtransient
                                target          SouthAfrican
                BAND-END

                BAND-START
                                source          NPtransient
                                target          Antarctic
                BAND-END

                BAND-START
                                source          NPtransient
                                target          MarionIsland
                BAND-END

                BAND-START
                                source          SouthAfrican
                                target          Norway
                BAND-END

                BAND-START
                                source          SouthAfrican
                                target          NPresident
                BAND-END

                BAND-START
                                source          SouthAfrican
                                target          NPtransient
                BAND-END

                BAND-START
                                source          SouthAfrican
                                target          Antarctic
                BAND-END

                BAND-START
                                source          SouthAfrican
                                target          MarionIsland
                BAND-END

                BAND-START
                                source          Antarctic
                                target          Norway
                BAND-END

                BAND-START
                                source          Antarctic
                                target          NPresident
                BAND-END

                BAND-START
                                source          Antarctic
                                target          NPtransient
                BAND-END

                BAND-START
                                source          Antarctic
                                target          SouthAfrican
                BAND-END

                BAND-START
                                source          Antarctic
                                target          MarionIsland
                BAND-END

                BAND-START
                                source          MarionIsland
                                target          Norway
                BAND-END

                BAND-START
                                source          MarionIsland
                                target          NPresident
                BAND-END

                BAND-START
                                source          MarionIsland
                                target          NPtransient
                BAND-END

                BAND-START
                                source          MarionIsland
                                target          SouthAfrican
                BAND-END

                BAND-START
                                source          MarionIsland
                                target          Antarctic
                BAND-END

MIG-BANDS-END

gphocs-dev commented 5 years ago

The data settings look fine, so I don't think that they're the cause of the segmentation fault. The only thing I can think of is the number of migration bands. It could be that you're relaxing your model too much, and sampling then converges to a "corner" of parameter space in which divergence times do not really restrict the sampling. First, I would suggest running a version with no migration bands. This is often very useful because it gives you an idea of how much gene flow actually affects your other demographic estimates. When you add migration bands, maybe do it in groups to "weed out" bands that get near-zero rates. Another thing to look out for is migration between sister populations (Norway and NPresident in your case). With gene flow between sister populations, it is often very difficult to differentiate between a model with no gene flow and a model with gene flow and deeper divergence. So try versions with and without these bands and see if they're causing the problem.
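One way to follow this advice without hand-editing thirty BAND entries is to generate the MIG-BANDS section per run. The sketch below mirrors the layout of the control file posted above; the group size of 7 is an arbitrary choice, the helper name is hypothetical, and it assumes the exact amount of indentation is not significant to the parser.

```python
from itertools import permutations

POPS = ["Norway", "NPresident", "NPtransient",
        "SouthAfrican", "Antarctic", "MarionIsland"]
# Sister populations according to the tree in the posted control file.
SISTERS = {frozenset(("Norway", "NPresident"))}


def mig_bands_block(bands):
    """Render a MIG-BANDS section for a list of (source, target) pairs."""
    lines = ["MIG-BANDS-START", ""]
    for source, target in bands:
        lines += [
            "        BAND-START",
            f"                source          {source}",
            f"                target          {target}",
            "        BAND-END",
            "",
        ]
    lines.append("MIG-BANDS-END")
    return "\n".join(lines)


all_bands = list(permutations(POPS, 2))  # every directed pair
non_sister = [b for b in all_bands if frozenset(b) not in SISTERS]
# Arbitrary grouping: add bands to the model a few at a time.
groups = [non_sister[i:i + 7] for i in range(0, len(non_sister), 7)]

print(mig_bands_block([]))         # baseline run with no migration
print(mig_bands_block(groups[0]))  # first group of bands to test
```

Each rendered block would replace the MIG-BANDS section of a copy of the control file, giving one run per group plus the no-migration baseline.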

FatihSarigol commented 5 years ago

Thank you so much once again!

So I removed all of the migration bands by deleting those lines from my control file, so that the section became this:

MIG-BANDS-START

MIG-BANDS-END

Unfortunately, that too gave a segmentation fault on my file of 1,000 loci of 1 kb each, again after running for some time.

Do you think my population structure (perhaps having only 1 sample per population, or the tree itself) or the tau-initial values may also have an effect? Or anything else you can think of?

Thanks!

gphocs-dev commented 5 years ago

I can't see anything in your control file that could be causing this. The initial tau values look completely fine. My guess is that it has something to do with the format of your sequence file. Probably something mundane. If you wish to send your data file and control file to ilan.gronau@idc.ac.il, I can try to have someone look at them. However, it will likely take a few weeks for us to get to this.

FatihSarigol commented 5 years ago

Thanks, I have emailed you the files.

You may remember another error I faced in this thread: https://github.com/gphocs-dev/G-PhoCS/issues/62#issuecomment-520159775 To make sure the problem is not caused by my manual edits to the control file, I also tried taking the original control file, which I created on Windows with your Java program, and converting it to Unix format with dos2unix. The converted file gave the same errors I mentioned in the comment above. I later fixed that myself by making the file similar to the example control file.

As for the sequence file, I generated it with my own code, as you may remember from another thread: https://github.com/gphocs-dev/G-PhoCS/issues/51#issuecomment-502763618 As far as I can tell from comparing it with the example file, it has the same format. Strangely, the runs go well, but only up to some random point that differs each time. Thanks

avancise commented 5 years ago

Hello,

I am getting a similar error. After ~21 hours the program fails, and the last line of my log file reads:

/cm/local/apps/slurm/var/spool/job478857/slurm_script: line 15: 232301 Segmentation fault      bin/G-PhoCS Gmac_Gmel_Oorc_control.ctl -n 36

I have 5 samples and 18,754 1KB loci. At first I thought it might be a memory issue as you suggested earlier in this thread, but our HPC specialist confirmed that I didn't run out of memory. The program is successfully running through the penultimate MCMC iteration, and then fails on the final iteration and the trace file isn't readable by Tracer.

Have you had any success determining what may be causing this error in others' files?

Thank you!

FatihSarigol commented 5 years ago

Hello Amy, we figured out that in my specific case the problem was not with my files or settings but with my HPC's OpenMPI. I would suggest trying the same run with G-PhoCS on a single thread (maybe rebuild it first in default mode, without multithreading support). That's all I know about it : ) Good luck!

avancise commented 5 years ago

Hi Fatih,

Thanks for the quick reply :)

My first guess was that there were issues with the HPC's OpenMPI, but I talked with the HPC specialist about that and he wasn't able to find any problems there.

I'm hoping Ilan might be able to share further details that might help us pinpoint the issue.

Thanks again, Amy

gphocs-dev commented 5 years ago

Amy,

I'm on vacation now for two weeks. Will have a look when I return.

--Ilan

FatihSarigol commented 5 years ago

Amy, OpenMPI normally works on our HPC as well; perhaps there is a compatibility problem for some reason. For me it also starts running on multiple threads but eventually runs into the error. Edit: actually, I'm not sure whether it ever uses multiple threads or just sets that up at the beginning; I don't remember for sure if I checked that. Best

avancise commented 5 years ago

Fatih, do you have any additional information on how/when OpenMPI is being used by G-PhoCS, based on your earlier troubleshooting? Our HPC specialist told me that the program isn't actually calling OpenMPI, but rather OpenMP (I'm using a single node with 36 threads, so my understanding is that OpenMP is being used instead of OpenMPI). He said he didn't see any calls to OpenMPI in the G-PhoCS code. We're running now with a single core to troubleshoot, but it will be some time before it finishes. Thanks again!

FatihSarigol commented 5 years ago

Amy, you are right, I confused the two, sorry. When I run this:

echo | cpp -fopenmp -dM | grep -i openmp

I get this:

#define _OPENMP 201107

I also activated another version and got this:

#define _OPENMP 201511

In both cases my runs on multiple threads failed, while runs on a single thread succeeded. Perhaps the program starts using multiple threads after some point in the analysis, and that point triggers the segmentation fault? Best wishes..
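For readers unsure what these numbers mean: the `_OPENMP` macro expands to the release date (yyyymm) of the OpenMP specification that the compiler implements. A small lookup, sketched below with a hypothetical helper name, translates the preprocessor output into a specification version; the table covers only a handful of releases.

```python
# Release date (yyyymm) -> OpenMP specification version.
OPENMP_SPEC_BY_DATE = {
    "200505": "2.5",
    "200805": "3.0",
    "201107": "3.1",
    "201307": "4.0",
    "201511": "4.5",
    "201811": "5.0",
}


def openmp_spec(define_line):
    """Map a '#define _OPENMP yyyymm' line to a specification version."""
    # Tolerate a missing or detached '#', as in text copied from a web page.
    tokens = define_line.replace("#", " ").split()
    if "_OPENMP" not in tokens or tokens[-1] == "_OPENMP":
        raise ValueError(f"no _OPENMP value found in: {define_line!r}")
    date = tokens[tokens.index("_OPENMP") + 1]
    return OPENMP_SPEC_BY_DATE.get(date, "unknown")


print(openmp_spec("#define _OPENMP 201107"))
print(openmp_spec("#define _OPENMP 201511"))
```

Under this mapping, the two values reported above correspond to OpenMP 3.1 and 4.5.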

gphocs-dev commented 5 years ago

Amy, Fatih,

To be honest, I don't know for sure what might be causing these segmentation faults. I can't seem to reproduce them on any of my machines. If you're able to pinpoint the cause, we can try to fix it. I do have a few comments that you may find helpful:

1) G-PhoCS uses OpenMP for threading, as you mention in your posts.

2) Since the segmentation fault appears to occur close to the end of the run, you should still be able to get a usable trace file out of it. It could very well be that only the last line is corrupt (which is why you cannot open it with Tracer). I suggest examining the trace file as text and looking for possible issues (just do head trace.txt and tail trace.txt).
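Beyond eyeballing the head and tail, a short script can flag and drop rows that were cut off when the program crashed. This is a minimal sketch that assumes the trace is a whitespace-delimited table with one header line and purely numeric rows; the column names in the example are hypothetical.

```python
def _is_number(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_trace_lines(lines):
    """Keep the header plus every complete, fully numeric row.

    Returns (kept_lines, dropped_line_numbers), numbering lines from 1.
    """
    header = lines[0].split()
    kept, dropped = ["\t".join(header)], []
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.split()
        if len(fields) == len(header) and all(_is_number(f) for f in fields):
            kept.append("\t".join(fields))
        else:
            dropped.append(line_no)  # truncated or corrupt row
    return kept, dropped


# Hypothetical trace whose final row was cut off mid-write.
lines = [
    "Sample\ttheta_A\ttau_root\n",
    "0\t1.5\t0.2\n",
    "50\t1.4\t0.21\n",
    "100\t1.3",
]
kept, dropped = clean_trace_lines(lines)
print("dropped line numbers:", dropped)
```

For a real file, the lines could be read with `open(path).readlines()` and the kept rows written to a new file, leaving the original trace untouched.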

avancise commented 5 years ago

Hi Fatih and Ilan,

Thank you both for these responses, they have been helpful in guiding our next steps.

Can you tell us which compiler you used? We originally tried gcc, and are now testing the intel compiler, to see if the issue has anything to do with how the program is compiled.

As a test, I ran the three versions (single thread, multi-thread gcc, and multi-thread intel) on the sample data provided with the program. The single thread run was successful, and the multi-thread gcc-compiled run produced the same error I got using my own data. The multi-thread intel-compiled run produced a new error, and I am waiting to see if the same error is produced on my data which is also currently running on the intel-compiled version of the program.

Interestingly, I wasn't able to open the results from any of the test runs in Tracer, from either the sample input data or my own input data. The format of all files looks exactly the same, with no obvious corruption in the head or the tail of the multi-thread results. The only difference I notice is that the end values seem very different between runs. I tried deleting the last several lines, as you suggested, but this did not seem to change anything.

Fatih, have you been able to open the results of your single-thread run in Tracer?

Thank you, Amy

gphocs-dev commented 5 years ago

Amy,

I personally use gcc, and this is the default compiler defined in the Makefile. I think that the Intel compiler should also be fine, but I didn't test this with the latest version. I also cannot figure out any likely cause for the issues you're getting with Tracer. I'd gladly have a look if you send me via e-mail one of the trace files that you cannot open (the smallest one, if you can). Send it to ilan.gronau@idc.ac.il

FatihSarigol commented 5 years ago

Amy, yes, I could open my G-PhoCS MCMC results in Tracer (on Windows); I just had to give my file a ".log" extension, otherwise the program doesn't see the file. But I guess your problem is not that but the tail of the file? Best wishes

FatihSarigol commented 5 years ago

By the way, Ilan, a note that may be useful for other users who report errors in the future: my complete file with 237,290 loci of 10 kb each and 6 samples finished successfully with single-threaded G-PhoCS. It took about 5 days for 5,000 MCMC iterations, so a good run would probably take several months, but neither the 14 GB file size nor the locus length caused a problem. Best

avancise commented 5 years ago

Hi again, just following up for any others who come across this error. We discovered that I was running into two separate errors: 1) the segmentation fault, which caused the program to fail just before completion, and 2) problems opening trace files in the newest version of Tracer (v1.7.1). Regarding the first error, enough iterations are completed that the trace file is usable even after the segmentation fault. Using a dataset of ~18 loci for 5 individuals, the program completed 999,999 MCMC iterations in 4 days and 9 hours before failing with a segmentation fault. Regarding the second error, the trace files open in Tracer v1.6 and earlier, so using one of these versions is preferred.

Since the segmentation fault has only been generated for people using HPCs, our specialist suggested that one solution would be to reproduce the quasi-exact environment that the program was written in, including: OS [Kernel version, Glibc version], Compiler [Compiler version], and Cpu Model.

Ilan, if you are interested in pursuing this issue further and have the above information, I'd be happy to test this potential fix.

gphocs-dev commented 5 years ago

Thanks for the detailed explanations. I'm sure it will prove useful to future users.

1) We will try to figure out what's causing the issues with loading the trace file to new versions of Tracer. We'd like to stay compatible with the most updated version to help our users analyze their traces.

2) Regarding the segmentation fault, this is still a big mystery to us. Ideally, we would like to find the part in the code that triggers this issue and fix it. I don't quite understand the suggestion provided by your HPC specialist. I can specify the Glibc version and compiler version, but I suspect that these are not the causes. The OS could be an issue, but how would it help to specify it? Users would like to run the software on their own OS. Running it on a VM seems excessive to address this minor segfault issue. I would gladly follow up on this if you can elaborate more on the suggested solution.