nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0
171 stars · 50 forks

Segmentation fault with some input files #170

Closed Suncuss closed 6 years ago

Suncuss commented 8 years ago

Hi,

I'm getting segmentation faults with Gubbins when running with certain input files. It's been working fine with other input files, though.

I ran strace on my run, and here are the last few lines before it crashes. One thing that seems weird to me is that mmap is invoked with the file descriptor -1.

mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe0b1708000
lseek(3, 0, SEEK_SET) = 0
read(3, "##fileformat=VCFv4.2\n##contig=<I"..., 4096) = 4096
lseek(3, 4096, SEEK_SET) = 4096
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7fff56d26e98} ---
+++ killed by SIGSEGV (core dumped) +++

gubbins_strace.31071.txt

Thanks

svanhal71 commented 8 years ago

I'm having similar issues, see attached output: 1.txt. Gubbins runs 1 iteration and then fails. Happy to share the data file if that helps.

Thanks

andrewjpage commented 8 years ago

Hi, Are you running it over a whole genome or just a small section? Andrew


Suncuss commented 8 years ago

Yes, we are running it over a whole genome.

svanhal71 commented 8 years ago

It's over the whole genome, a mapped file.

Sebastian


saurabh-mk commented 8 years ago

I have a similar issue: I get a seg fault only with some files. Happy to share details if needed.

ramadatta commented 6 years ago

Dear Andrew,

I am facing a similar issue with Gubbins. It runs successfully for one set of FASTA files but not for another. I could not find much documentation to help trace and resolve this error. This is the step that throws the error:

gubbins -r -v seqnew.fa.gaps.vcf -a 100 -b 10000 -f test_gubbins/tmp/seqnew.fa -t seqnew.fa.iteration_1 -m 3 seqnew.fa.gaps.snp_sites.aln
Failed while running Gubbins. Please ensure you have enough free memory

I would be grateful for your advice. Many thanks in advance.

andrewjpage commented 6 years ago

Hi, If you could provide me with your file I can take a look. Regards, Andrew

amilesj commented 6 years ago

Hi, I have the same issue described above. I'm including a Google Drive link to my input file (it is ~200 MB; I tried smaller subsets of the data but couldn't reproduce the error). The file is a whole-genome alignment of 130 E. coli ST131 isolates and an ST131 reference genome; it was the full.aln output file from Snippy-core. When I remove three isolates that are divergent from the rest of the isolates but cluster tightly together (CP0650-A, SC0599-A, CP0634-A), Gubbins works fine. I'm not sure if this is a bug, or if those three isolates are just too divergent from the rest for Gubbins. Thank you for taking a look!

Best, Arianna

File: https://drive.google.com/open?id=1cVzzD1gAIfbk8JnEtQoCbtTiBdqLRM5q Command I ran: run_gubbins.py --threads 16 aln.fa

andrewjpage commented 6 years ago

I successfully ran your data using Gubbins; however, for those 3 isolates 10% of the genome was flagged as recombination, which is a bit high, and you were still left with a very long branch. I would recommend going back to the original raw reads and double-checking that you don't have contamination from closely related strains, bleeding in multiplex tags, etc.

amilesj commented 6 years ago

Hi Andrew,

Thanks for testing out the data and getting back to me. I will take a closer look at the raw reads for those samples.

Could you give me any more details about what command you ran and/or what kind of environment you're working in to help me pinpoint the issue on my end?

Best, Arianna

amilesj commented 6 years ago

Hi Andrew,

Feel free to ignore the above request, as I did get it working. Thanks!

In case it is helpful to others who encounter a similar issue: I was working on a machine that had sufficient memory and computing power available, but did not use a job scheduler. While I could allocate the number of cores using the --threads option, I could not allocate memory in my command. Gubbins would crash well before tapping into the available memory.

My problem was resolved when I ran the same commands and files on a cluster that used a PBS job scheduler, and was able to specifically allocate memory.
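For anyone in the same situation, a minimal PBS submission script of the kind described above might look like this sketch. The job name, queue defaults, resource values, and input file name are all hypothetical examples, not taken from this thread:

```shell
#!/bin/bash
# Hypothetical PBS job script: job name, resource values, and the
# input alignment file are illustrative placeholders.
#PBS -N gubbins_run
#PBS -l nodes=1:ppn=16        # 16 cores on a single node
#PBS -l mem=64gb              # explicitly request 64 GB of memory
#PBS -l walltime=24:00:00

cd "$PBS_O_WORKDIR"
run_gubbins.py --threads 16 aln.fa
```

Submitted with something like `qsub gubbins.pbs`; the point is the explicit `-l mem=...` request that the original machine's setup did not allow.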

I am not sure if this is a bug in Gubbins that results in it not accessing memory properly, or if it is a problem with the original machine I was using. I will note that root users on that machine got the same error I did.

Best, Arianna

ToniWestbrook commented 6 years ago

Hi Andrew,

We've also been experiencing a similar problem with segfaults (I'm submitting this for a colleague, but someone else in the department a few months back also had Gubbins runs fail for certain FASTAs). I've included a link to the FASTA currently having issues (~35 MB). I can confirm that memory usage stays below 1% of system memory before the segfault, so it's not a memory-related issue. The exact segfault is as follows:

gubbins[133842]: segfault at 7ffdd43e4fe8 ip 00007ff271d36fc4 sp 00007ffdd43e4ff0 error 6 in libgubbins.so.0.0.1[7ff271d2f000+e000]

And the command she ran is

run_gubbins.py ~/WGfastaAlignMafft.out --threads 10 -f 100

(though we've tried it with different threads/f parameters)

That fasta can be obtained here: https://unh.app.box.com/s/r53pd0ravr92t7gglqjaugrpe4saatn1

Thanks for your help! -Toni

ToniWestbrook commented 6 years ago

It turns out our issue was caused by ambiguous IUPAC characters in the FASTAs; converting these to Ns fixed the issue. Thanks
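As a sketch of that workaround, the following stdlib-only Python masks ambiguous IUPAC nucleotide codes with N in FASTA text. The function name and example sequences are illustrative; this is not part of Gubbins itself:

```python
# Sketch of the workaround described above: replace ambiguous IUPAC
# nucleotide codes (R, Y, S, W, K, M, B, D, H, V) with N, leaving
# FASTA header lines, gaps, and A/C/G/T/N untouched.
import re

AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")

def mask_ambiguous(fasta_text: str) -> str:
    """Return FASTA text with ambiguous bases replaced by N."""
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out_lines.append(line)               # header: keep as-is
        else:
            out_lines.append(AMBIGUOUS.sub("N", line))
    return "\n".join(out_lines)

print(mask_ambiguous(">seq1\nACGTRYacgtswN-\n>seq2\nKMBDHV"))
```

Reading the whole alignment into memory is fine at the file sizes discussed here; for much larger alignments you would stream line by line instead.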

marade commented 6 years ago

I'm seeing this issue as well. It happens with or without --threads, and I've also checked for ambiguous IUPAC characters. Any ideas?

ToniWestbrook commented 6 years ago

We are actually continuing to have this issue too. While removing ambiguous IUPAC characters from the FASTAs above fixed that case, we have many files without any ambiguous characters that still segfault. I've had 4-5 users run into this (both at UNH and at other institutions). Is there a common scenario or aspect of our sequences that causes this that we can avoid somehow? Thanks -

simonrharris commented 6 years ago

Would it be possible for someone with this issue to provide some data which is still causing the segfault, as well as some information on the versions of RAxML you are using. Also, please confirm you are using the version of fastML that is shipped with Gubbins, as there is a known problem in the original version with big trees. Thanks.

andrewjpage commented 6 years ago

Hi @twestbrookunh I've taken a look at your data and it appears to be a data-quality issue. Your bug's genome is 5 Mb, but the alignment has brought it to over 6 Mb. There are also over 250,000 SNP sites. Both of these point to a low-quality alignment. Unfortunately, in this case Gubbins crashed rather than telling you something useful, for which I apologise. Regards, Andrew
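A quick sanity check along the lines Andrew describes, comparing alignment length against the expected genome size and counting variable sites, can be sketched in plain Python. The parser and the SNP definition here are simplified assumptions, not Gubbins' own code:

```python
# Rough alignment QC: report alignment length and the number of
# variable (SNP) columns, counting only unambiguous A/C/G/T bases.

def read_fasta(text):
    """Parse FASTA text into {name: sequence}."""
    seqs, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.strip())
    return {n: "".join(parts).upper() for n, parts in seqs.items()}

def count_snp_sites(seqs):
    """Count columns containing more than one distinct A/C/G/T base."""
    n = 0
    for col in zip(*seqs.values()):
        bases = {b for b in col if b in "ACGT"}
        if len(bases) > 1:
            n += 1
    return n

aln = read_fasta(">a\nACGTA\n>b\nACTTA\n>c\nACGTN\n")
print(len(next(iter(aln.values()))), count_snp_sites(aln))  # -> 5 1
```

If the reported length is far above the known genome size, or the SNP count is in the hundreds of thousands for a supposedly clonal dataset, the alignment itself is the likely problem.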

saadleeshehreen commented 6 years ago

Hi, I'm trying to run Gubbins. It behaves OK initially and generates some files, but then it stops and gives an error message:

" Failed while running gubbins. Please ensure you have enough free memory"

It was running on the server and was tried with just 4 genomes. (When I tried 2, it gave an error message saying that the analysis needs 3 or more genomes.)

How do you ensure there is enough free memory on the server? Any opinion?

puethe commented 6 years ago

Hi @saadleeshehreen This is the default error message, so lack of memory is not necessarily the reason Gubbins breaks for you. Can you make your input files (alignment and, if used, starting tree) available to me, and tell me the exact command with which you run Gubbins? Also, please let me know the version of Gubbins you're using (get it by typing "run_gubbins.py --version").

saadleeshehreen commented 6 years ago

Hi,

Thanks for your reply. Here is my .fasta file after conversion.

https://drive.google.com/open?id=1WUelLlicR5u2zZ63BkeH5gtWvvWo8ewx

The version of gubbins is 2.3.1 I used the following command: run_gubbins.py t2.fasta

puethe commented 6 years ago

Thanks, @saadleeshehreen. I could reproduce the behaviour. I'll have a look at it and get back to you, but please allow 2 weeks.

saadleeshehreen commented 6 years ago

Hi, thanks. Take your time. I also tried truncating the FASTA file and running Gubbins on that; now I get a new error message, "Failed while building the tree." This is disappointing because I can't understand the problem. Is it a problem with the software or with my files? I tested Gubbins on the test file provided with the software and it works fine. However, I am using Gubbins for the first time, so I started with only four genomes. Eventually, I will have to run it on a file containing an alignment of 100-150 genomes. Can the software handle such big files? It would be very helpful if you could give me some tips (e.g. on conversion of .xmfa from progressiveMauve to .fasta).

simonrharris commented 6 years ago

Hi,

The problem you are having is not related to memory. If you give us some time, we will attempt to pin down the cause of the error message you are seeing, which we hope will also help other users. Gubbins is designed to deal with large numbers of genomes, so 150 should not be a problem. However, it is designed for closely related clones, so if your data are highly diverse the method will not give good results. Based on the alignment you sent, the method you are using to create the alignment from Mauve looks fine, although from experience you may find that some of the Mauve blocks that appear to be restricted to one sample actually align to each other. I have no idea why this happens, but it seems to be a problem with the Mauve algorithm.

Please bear with us while we try to resolve the bug you are experiencing and we'll get back to you ASAP.

Best wishes,

Simon


puethe commented 6 years ago

Hi @saadleeshehreen

I think I have found the problem. Just published a new version of gubbins (2.3.4) that hopefully resolves the issue. At least I managed to run gubbins successfully with the new version on the data you supplied.

There are currently two ways to use the new version:

Please also have a look at the INSTALL file.

We will try to make the new version available via bioconda as soon as possible. Do not use the Debian package (available via apt-get install gubbins), since that one is fairly outdated.

Please give us feedback on whether you could run Gubbins successfully - that would be extremely valuable for us.

Regards, Christoph

puethe commented 6 years ago

Gubbins 2.3.4 is now also available via bioconda. We have received independent confirmation that the new version of the program completes successfully on alignments on which the old version used to crash, so I regard this as solved and will close this issue.

saadleeshehreen commented 6 years ago

Hi,

Thanks a lot for the new version. I will test the new version with my data and let you know.

Cheers

Saadlee Shehreen


sreerampeela commented 8 months ago

Hi @puethe, I am using the Docker image for version 3.0.0 and the run gets killed. It's a core-SNP alignment from snippy-core with 200 samples of pneumococcus. I included options to remove duplicate sequences (detected in a previous run) and restarted it. In v3.0.0 there is no resume option to restart the analysis. Am I missing something?

There are no ambiguous characters, as the base frequencies add up to 1: A: 0.213 C: 0.288 G: 0.284 T: 0.215 (from STDOUT).

run_gubbins.py --prefix /data/jip_2024 --threads 10 --verbose --remove-identical-sequences --tree-builder iqtree --bootstrap 1000 --model-fitter iqtree /data/cleaned_jipmer.aln
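A rough way to reproduce that base-frequency check outside Gubbins, assuming a simple definition where an alignment is "clean" if every non-gap, non-N character is A/C/G/T, is the following sketch (the function and example data are illustrative):

```python
# Compute A/C/G/T frequencies over an alignment and flag whether any
# characters other than A, C, G, T, N, and '-' are present.
from collections import Counter

def base_frequencies(seqs):
    """Return ({base: frequency}, clean) where clean is True if every
    non-gap, non-N character in seqs is an unambiguous A/C/G/T."""
    counts = Counter()
    for s in seqs:
        counts.update(s.upper())
    acgt = {b: counts[b] for b in "ACGT"}
    total = sum(acgt.values())
    informative = sum(counts[b] for b in counts if b not in "-N")
    freqs = {b: c / total for b, c in acgt.items()}
    return freqs, total == informative

freqs, clean = base_frequencies(["ACGT-N", "ACGTRY"])
print(clean)  # -> False: R and Y are ambiguous characters
```

Note that frequencies summing to 1 over A/C/G/T alone does not rule out other failure modes (such as excessive divergence), so this is only a first-pass check.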

nickjcroucher commented 8 months ago

Please upgrade to a more recent version (e.g. >=3.3).

sreerampeela commented 8 months ago

I'm trying to install using conda, but it's not working; it takes too long solving the environment. Is there any faster method (other than mamba)?

nickjcroucher commented 8 months ago

You can follow the instructions in the repo for installing from source, but mamba is the best option. Unless it actually fails, it's best just to wait (or try a different machine).