sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Number of clusters (51652) exceeds limit (50000). Multifastas not created. Please check the spreadsheet for contamination from different species or increase the --group_limit parameter. #349

Closed cccsnd closed 6 years ago

cccsnd commented 7 years ago

Use of uninitialized value in require at /usr/lib64/perl5/vendor_perl/Encode.pm line 60.
Number of clusters (51652) exceeds limit (50000). Multifastas not created. Please check the spreadsheet for contamination from different species or increase the --group_limit parameter.
2017/09/07 12:15:03 Exiting early because number of clusters is too high

I have tried changing --group_limit in Roary (10 ---> 1000000), but it doesn't work. How can I fix this?

andrewjpage commented 7 years ago

What is the exact command you are running?

Also, if you have that many clusters, I would recommend you look at the results already produced to see if there is contamination in there (unless you are working on something like E.coli). It should be fairly obvious from the tree, accessory_binary_genes.fa.newick.
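As a complement to inspecting the tree, a minimal sketch of one way to look for contaminated samples in the results: count how many gene clusters occur in exactly one sample. It assumes a tab-delimited binary matrix such as Roary's gene_presence_absence.Rtab (first column the gene name, one 0/1 column per sample); whether that file is written before the early exit is an assumption worth checking in your output directory. The function name is hypothetical.

```python
import csv


def unique_genes_per_sample(rtab_lines):
    """Count, for each sample, the gene clusters present in that sample only.

    rtab_lines: iterable of tab-delimited lines, header first
    (assumed layout: Gene, sample_1, sample_2, ...).
    """
    reader = csv.reader(rtab_lines, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    counts = {sample: 0 for sample in samples}
    for row in reader:
        present = [s for s, value in zip(samples, row[1:]) if value == "1"]
        if len(present) == 1:
            counts[present[0]] += 1
    return counts


if __name__ == "__main__":
    import os

    path = "gene_presence_absence.Rtab"  # assumed file name and location
    if os.path.exists(path):
        with open(path, newline="") as handle:
            counts = unique_genes_per_sample(handle)
        # Samples with the most private clusters are the first ones to inspect.
        for sample, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
            print(sample, n)
```

A sample from a different species would be expected to carry far more private clusters than the rest, but a high count alone is only a hint, not proof of contamination.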

cccsnd commented 7 years ago

Yes, I used E.coli data, and the command was: roary -e --mafft -p 8 *.gff. There are 382 genome sequences in my data set. Can Roary not be applied to E.coli? Or could I modify the parameters in the scripts? Thank you!
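Note that the command quoted above does not pass a cluster limit at all, so the default of 50000 applies. A minimal sketch of building the same command with an explicit limit follows. It assumes the limit is set with the `-g INT` option listed in Roary's usage text (the error message calls it --group_limit); confirm with `roary -h` for your version. The helper name and the value 100000 are illustrative only.

```python
import shlex


def build_roary_command(gff_files, group_limit=50000, threads=8):
    """Return the argument list for a Roary run with an explicit cluster limit."""
    if group_limit <= 0:
        raise ValueError("group_limit must be a positive integer")
    # -e and --mafft are taken from the command quoted in the thread;
    # -g is assumed to be the cluster-limit option (check `roary -h`).
    cmd = ["roary", "-e", "--mafft", "-p", str(threads), "-g", str(group_limit)]
    cmd.extend(sorted(gff_files))
    return cmd


example = build_roary_command(["b.gff", "a.gff"], group_limit=100000)
print(shlex.join(example))
```

The list can be passed to `subprocess.run(cmd, check=True)` once Roary is installed. Raising the limit only removes the early exit; it does not address contamination or a very diverse data set.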

tseemann commented 7 years ago

@cccsnd Are these completed E.coli genomes or draft ones you have assembled and annotated yourself?
Have you checked that each assembly is roughly the same size and has a similar number of genes?
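A minimal sketch of that check, assuming Prokka-style GFF3 input (annotation rows followed by a `##FASTA` section holding the sequences). It reports total sequence length and CDS count per file and flags files that deviate from the median. The function names and the 20% tolerance are assumptions for illustration, not anything Roary provides.

```python
import statistics
from pathlib import Path


def summarise_gff(path):
    """Return (genome_length, cds_count) for a GFF3 file with an embedded ##FASTA section."""
    genome_length = 0
    cds_count = 0
    in_fasta = False
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("##FASTA"):
                in_fasta = True
            elif in_fasta:
                # Sequence lines count towards the length; '>' header lines do not.
                if not line.startswith(">"):
                    genome_length += len(line.strip())
            elif line and not line.startswith("#"):
                fields = line.split("\t")
                if len(fields) >= 3 and fields[2] == "CDS":
                    cds_count += 1
    return genome_length, cds_count


def flag_outliers(summaries, tolerance=0.2):
    """Return names whose length or CDS count differs from the median by more than tolerance."""
    if not summaries:
        return []
    median_length = statistics.median(s[0] for s in summaries.values())
    median_cds = statistics.median(s[1] for s in summaries.values())
    flagged = []
    for name, (length, cds) in sorted(summaries.items()):
        if (abs(length - median_length) > tolerance * median_length
                or abs(cds - median_cds) > tolerance * median_cds):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    summaries = {p.name: summarise_gff(p) for p in Path(".").glob("*.gff")}
    for name in flag_outliers(summaries):
        print(name, summaries[name])
```

Flagged files are candidates for a closer look (wrong species, fragmented assembly, poor annotation), not automatic exclusions.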

E.coli tends to be very plastic, and has a small core genome. Roary is designed to work with things that are genetically close. Also draft genomes tend to produce a lot of small hypothetical genes. You might want to filter those out. @andrewjpage could roary get a "-min-cds-len" param?
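Since the thread indicates Roary has no such parameter, the filtering would have to happen on the input files before running Roary. A minimal sketch, assuming Prokka-style GFF3 where each gene is a `CDS` row with 1-based inclusive start and end in columns 4 and 5. The function name and the 120-nucleotide default are hypothetical; choose a threshold that suits your data.

```python
def filter_short_cds(lines, min_cds_len=120):
    """Yield GFF3 lines, skipping CDS features shorter than min_cds_len nucleotides.

    Comment lines, non-CDS features and the ##FASTA section pass through unchanged.
    """
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if not in_fasta and not line.startswith("#"):
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 5 and fields[2] == "CDS":
                # GFF3 coordinates are 1-based and inclusive.
                length = int(fields[4]) - int(fields[3]) + 1
                if length < min_cds_len:
                    continue
        yield line


def filter_gff_file(src, dst, min_cds_len=120):
    """Write a copy of src to dst with short CDS features removed."""
    with open(src) as infile, open(dst, "w") as outfile:
        outfile.writelines(filter_short_cds(infile, min_cds_len))
```

Calling `filter_gff_file("sample.gff", "sample.filtered.gff")` on each input and then running Roary on the filtered copies keeps the originals intact. If the GFF also contains separate `gene` rows for the same loci, those are left in place by this sketch.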

cccsnd commented 7 years ago

@tseemann Thank you for your prompt reply. The genome data were downloaded from NCBI, where they were designated as complete sequences. Maybe several genomes are not annotated well. I will check the data again. I can't find any "-min-cds-len" param; are there any other ways to fix this? Thanks again!

tseemann commented 7 years ago

@cccsnd By "complete sequences", do you mean a single contig for each chromosome and plasmid? Stick to the RefSeq subset if possible, as they have been curated more. But once again, "E.coli" is a mythical "species" which has a tiny core genome and a huge pan genome. There are many Shigella which are closer to E.coli than other E.coli are :-)

andrewjpage commented 7 years ago

Ah the E.coli/Shigella debate, much like vi vs emacs in the computing world.

The E.coli RefSeq genomes vary in size from 3.9 to 5.8 Mb, which is quite a difference. It's probably easier to focus on a single ST (or close-by STs). A fantastic website for looking at E.coli is Enterobase. It pulls in all public data from the archives (WGS + reference genomes), along with lots of legacy MLST. It's great for fishing expeditions.

agavriilidou commented 6 years ago

I got the same error and I have a question on this. Do the genomes need to be roughly of the same size and number of genes? I am studying a whole family of bacteria and there are some draft genomes and some complete ones that I include in my study.

DuarteFD commented 1 year ago

Hello, I got the same error, I would like to know if there is any command that I can use to eliminate this error.

script Number of clusters (67208) exceeds limit (50000). Multifast not created. Please check the spreadsheet for contamination from different

andrewjpage commented 1 year ago

It's not an error. It's intentional functionality to catch poor quality input data. I would recommend QCing your dataset to ensure all the samples are what you think they are.
